Accelerating Approximate Nonnegative
Canonical Polyadic Decomposition using Extrapolation
Andersen ANG¹, Jeremy E. COHEN², Nicolas GILLIS¹
¹ Université de Mons, Rue de Houdain 9, 7000, Mons, Belgium
² Université de Rennes, INRIA, CNRS, IRISA, France
Abstract – In this work, we consider the problem of approximate Nonnegative Canonical Polyadic Decomposition (aNCPD) of third-order nonnegative tensors, which boils down to minimizing ‖T − (U ⊗ V ⊗ W) I_r‖²_F over element-wise nonnegative matrices U, V and W. We present an accelerated block coordinate descent algorithm that uses Nesterov-like extrapolation in the form U^{k+1} = U^k + β_k (U^k − U^{k−1}). Experimental results showcase the effectiveness of the proposed algorithm with respect to other block-coordinate descent algorithms on ill-conditioned synthetic data.
1 Introduction
Given a nonnegative three-dimensional tensor T ∈ ℝ₊^{I×J×K}, assume one wants to compute the approximation of T by a nonnegative tensor G = (U ⊗ V ⊗ W) I_r of given small nonnegative rank r, where G_{ijk} = Σ_{q=1}^{r} U_{iq} V_{jq} W_{kq}. Using the Frobenius norm as an error metric, the problem becomes:

    min_{U≥0, V≥0, W≥0} f(U, V, W) = (1/2) ‖T − (U ⊗ V ⊗ W) I_r‖²_F,  (1)

where the inequalities are meant element-wise.
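As an illustration, the objective in (1) can be evaluated in a few lines of numpy. This is a hypothetical sketch, not the authors' code; the function name is illustrative, and the rank-r reconstruction G_{ijk} = Σ_q U_{iq} V_{jq} W_{kq} is formed with einsum:

```python
import numpy as np

def ancpd_objective(T, U, V, W):
    """Evaluate f(U, V, W) = 0.5 * ||T - (U x V x W) I_r||_F^2 from (1).

    G[i, j, k] = sum_q U[i, q] * V[j, q] * W[k, q] is the rank-r
    reconstruction; the Frobenius norm of the residual is then taken.
    """
    G = np.einsum('iq,jq,kq->ijk', U, V, W)  # low-rank tensor G
    return 0.5 * np.linalg.norm(T - G) ** 2
```

For instance, a tensor built exactly from nonnegative rank-1 factors has objective value zero up to rounding.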
Problem (1) is often referred to as the approximate Nonnegative Canonical Polyadic Decomposition (aNCPD) of T, as long as r is smaller than the rank of T. This problem is the extension of the low-rank nonnegative approximation problem for matrices, namely nonnegative matrix factorization (NMF). Solving aNCPD is an important issue for mining information out of tensors collected in a wide variety of applications; see [7] and references therein. Note that (1) is well-posed. Indeed, a best low-rank approximation for tensors may not exist, but because the nonnegativity constraints are lower bounds for the parameters, degeneracy is prevented and the best nonnegative low-rank approximation always exists [9, 13].
NG acknowledges the support of the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47, and of the European Research Council (ERC starting grant no 679515).
Contribution In this paper, we are interested in algorithms featuring both low-cost iterations and fast convergence speed. Such fast and cheap algorithms are the backbone of modern machine learning, where both the data and the number of parameters to estimate can be very large. To this end, we first study some existing Block-Coordinate Descent (BCD) algorithms, and present an extrapolated BCD algorithm based on Hierarchical Alternating Least Squares (HALS). To be specific, we adopt the numerical scheme of [1], which has been found to significantly improve the convergence speed on the problem of NMF. Experimental results showcase that our approach is faster at producing accurate results than state-of-the-art BCD algorithms in all tested scenarios.
2 Existing block-coordinate descent algorithms for aNCPD
Let us first mention some existing algorithms to compute aNCPD. Problem (1) is non-convex and admits no closed-form solution. A baseline method called ANLS solves aNCPD by alternating minimization on the blocks {U, V, W}, as detailed below. However, since computing the gradient of f is simple, and projecting on the nonnegative orthant only requires clipping, some algorithms have been proposed based on projected first-order methods [6, 15]. Compression using Tucker models has also been explored [3].
As mentioned in the introduction, in what follows, we study algorithms with i) a low cost per iteration, and ii) a high convergence speed. First-order methods are typically slow to converge despite having low complexity. However, it is possible to significantly accelerate first-order methods using extrapolation, which is our main contribution described in Section 3 below. Note that other accelerations are of interest but outside the scope of this paper, such as online, mini-batch or randomized approaches. Furthermore, we turn towards a particular set of first-order algorithms coined as Block-Coordinate Descent (BCD), where only a subset of the parameters is updated at each iteration.
2.1 Two types of blocks for BCD
To compute approximate unconstrained tensor decomposition models such as the Canonical Polyadic Decomposition or the Tucker decomposition [7], the Alternating Least Squares (ALS) algorithm is considered a baseline algorithm. In what follows, we first describe the well-known Alternating Nonnegative Least Squares algorithm (ANLS), which is an adaptation of ALS for aNCPD.
2.1.1 A block-coordinate framework: ANLS
In ANLS, the cost function f is minimized over each factor matrix U, V and W alternately until convergence. For factor matrix U, we solve

    min_{U≥0} ‖T_[1] − U (V ⊙ W)^T‖²_F,  (2)

where matrix T_[1] is the first mode unfolding of tensor T as defined in [4] and ⊙ is the Khatri-Rao product. Note that (2) is simply an equivalent rewriting of (1), with fixed V and W. It now appears clearly that f is quadratic with respect to each block U, V and W. In other words, problem (2) is a nonnegative least squares (NLS) problem in a matrix format. Clearly, ANLS is a BCD algorithm where the blocks are {U, V, W}, in this order.
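The identity behind (2), namely that the mode-1 unfolding of (U ⊗ V ⊗ W) I_r equals U (V ⊙ W)^T, can be checked numerically. The sketch below is illustrative: the column ordering of the unfolding is an assumed convention (a plain numpy reshape of the trailing axes), which may differ from the one in [4] by a permutation.

```python
import numpy as np

def unfold1(T):
    """Mode-1 unfolding T_[1] of an I x J x K tensor, of size I x JK.

    Here the trailing axes (j, k) are flattened with k varying fastest,
    an assumed convention chosen to match khatri_rao below.
    """
    I, J, K = T.shape
    return T.reshape(I, J * K)

def khatri_rao(V, W):
    """Column-wise Khatri-Rao product V (x) W, of size JK x r."""
    J, r = V.shape
    K, _ = W.shape
    # Row j*K + k holds V[j, :] * W[k, :], matching unfold1's ordering.
    return (V[:, None, :] * W[None, :, :]).reshape(J * K, r)
```

With these two helpers, unfold1(G) and U @ khatri_rao(V, W).T agree for any G built from factors (U, V, W).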
2.1.2 Solving ANLS with BCD: HALS
There are two categories of algorithms to compute NLS. A first category of algorithms produces the exact solution of the NLS in a finite number of steps, such as the historical active set solution by Lawson and Hanson [8]. The number of steps is however not known in advance and can be quite large. Also, each iteration in the active set algorithm requires solving a least squares problem, which can be costly. A second category asymptotically converges towards the optimal solution. These algorithms are only interesting if they converge in a shorter time than the exact ones. Among those, Hierarchical Alternating Least Squares (HALS) is state of the art [5] when dealing with several alternated NLS problems.
In this paper, we call HALS-NLS the BCD algorithm solving the NLS problem that alternates on the columns of U in (2) while fixing the others. The solution for the i-th column u_i of U requires to solve

    min_{u_i ≥ 0} ‖T_[1] − U_{\i} (V_{\i} ⊙ W_{\i})^T − u_i (v_i ⊗ w_i)^T‖²_F,  (3)

and has the following closed-form solution

    u_i = [ (T_[1] − U_{\i} (V_{\i} ⊙ W_{\i})^T) (v_i ⊗ w_i) / ‖v_i ⊗ w_i‖²₂ ]₊,  (4)

since (3) boils down to the coordinate-wise minimization of second order polynomial functions over a half space. In (3), matrix M_{\i} stands for a matrix M with the i-th column removed, while ⊗ stands for the Kronecker product. Operator [x]₊ clips negative values in x to zero. The NLS problem is solved by alternately using update rule (4) on the columns of the matrix being updated in the outer ANLS algorithm until convergence. In what follows, we refer to the whole procedure "ANLS with NLS solved by HALS-NLS" as HALS.
Several important remarks can be made at this stage:
– HALS can itself be seen as a BCD algorithm to solve (1), where the blocks are {u_1, u_2, . . . , u_r, u_1, . . . , u_r, v_1, v_2, . . . , w_1, w_2, . . .}.
– Because updating the columns of U requires the computation of T_[1] (V ⊙ W), which is costly, it is computationally interesting to update the columns of U several times before switching to the columns of V.
– The product (V ⊙ W)^T (v_i ⊗ w_i) can be computed efficiently as (V^T v_i ∗ W^T w_i), where ∗ is the element-wise product.
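The remarks above can be condensed into a sweep over the columns of U. The following numpy sketch is hypothetical (names and the inner-loop count are illustrative, not the authors' implementation): the costly product T_[1](V ⊙ W) is computed once, the Gram matrix (V^T V) ∗ (W^T W) caches the products (V^T v_i ∗ W^T w_i), and the update (4) is rearranged into this cached form.

```python
import numpy as np

def hals_nls_update_U(T1, U, V, W, n_inner=10):
    """Several HALS-NLS sweeps (4) over the columns of U for problem (2).

    T1 is the mode-1 unfolding (I x JK); U is modified in place.
    """
    r = U.shape[1]
    # (V kr W)^T (V kr W) = (V^T V) * (W^T W), elementwise product.
    gram = (V.T @ V) * (W.T @ W)                         # r x r
    kr = (V[:, None, :] * W[None, :, :]).reshape(-1, r)  # Khatri-Rao, JK x r
    TKR = T1 @ kr                                        # costly product, once
    for _ in range(n_inner):          # update the columns several times
        for i in range(r):
            # Numerator of (4): (T_[1] - U_{\i}(V_{\i} kr W_{\i})^T)(v_i x w_i),
            # expanded so only cached quantities appear.
            num = TKR[:, i] - U @ gram[:, i] + U[:, i] * gram[i, i]
            # Denominator ||v_i x w_i||^2 = gram[i, i]; guard against zero.
            U[:, i] = np.maximum(num / max(gram[i, i], 1e-16), 0.0)
    return U
```

Each column update is an exact coordinate minimization, so the objective of (2) is non-increasing over the sweep.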
3 Extrapolation for ANLS
Let us now discuss the proposed extrapolation scheme for ANLS, which is an extension of the one proposed in [1]. In each step of the regular ANLS, we update a block of parameters using an NLS solver. For instance, when updating U, we compute U^{k+1} = NLS(V^k, W^k; U^k), where U^k is the initialization for U in an NLS solver such as HALS-NLS.
Extrapolated ANLS (E-ANLS) involves pairing variables Ũ, Ṽ and W̃ initialized equal to U, V and W. At each iteration, e.g., for U, we compute

    Update:       U^{k+1} = NLS(Ṽ^k, W̃^k; Ũ^k),  (5)
    Extrapolate:  Ũ^{k+1} = U^{k+1} + β_k (U^{k+1} − U^k),  (6)

where k is the current iteration index and β_k is the extrapolation parameter. Similar updates are carried out for V and W.
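Schematically, one E-ANLS step on the U block reads as follows. This is a sketch: the solver signature nls_solver(V, W, U0) is an assumption (any inexact NLS routine warm-started at U0, e.g. a few HALS-NLS sweeps, would fit).

```python
def e_anls_step_U(nls_solver, U, U_tilde, V_tilde, W_tilde, beta):
    """One E-ANLS step on the U block, following (5)-(6).

    nls_solver(V, W, U0) is any (possibly inexact) NLS routine
    warm-started at U0; its signature is an assumption of this sketch.
    Returns the new iterate U^{k+1} and the new pairing variable.
    """
    U_new = nls_solver(V_tilde, W_tilde, U_tilde)   # update (5)
    U_tilde_new = U_new + beta * (U_new - U)        # extrapolate (6)
    return U_new, U_tilde_new
```

Note that only the pairing variable U_tilde_new may leave the nonnegative orthant; the iterate U_new itself stays feasible.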
Intuitively, extrapolation makes use of the geometry of the cost function to predict estimates for a future iteration using previous estimates. In a convex optimization setting, Nesterov provided an optimal way to compute β_k for steps like (6) in terms of convergence speed [11]. He showed that for this choice of β_k, the error of the iterates of accelerated projected gradient methods converges in O(1/k²) instead of O(1/k). However, as aNCPD (1) is non-convex, and because we are using extrapolation with the same β_k for different blocks, there is a priori no closed-form solution to update β_k. Thus, the critical part of the update is the parameter β_k, which we now discuss.
3.1 Update of β
The extrapolation scheme [1] has 5 parameters, two of them indexed by the iteration index k: β_k, β̄_k, γ, γ̄ and η, where β_k ∈ [0, 1] is the extrapolation variable and β̄_k ∈ [0, 1] is the ceiling parameter of β_k. Parameters γ (resp. η) control the way β grows (resp. decays). Lastly, γ̄ is the growth parameter for β̄_k. We grow or decay β_k after each iteration as follows: let e_k denote the error at iteration k after the extrapolation, then

    β_{k+1} = min{γ β_k, β̄_k}  if e_{k+1} ≤ e_k,    β_{k+1} = β_k / η  otherwise.  (7)
It is important to note that extrapolation may increase the objective function value e_k due to a bad parameter β_k. If this occurs, we abandon the extrapolation at that iteration and simply set the pairing variable Ũ^{k+1} = U^{k+1}. In other words, when the error increases, we perform a restart [12]. In addition, an increase of e_k means the current β_k is too large; we then decrease β_k as in (7). To avoid β_{k+1} growing back to the large value which caused the increase of the error, we update β̄_k by

    β̄_{k+1} = min{γ̄ β̄_k, 1}  if e_{k+1} ≤ e_k,    β̄_{k+1} = β_k  otherwise.  (8)

The relations between the parameters are 1 < γ̄ < γ < η.
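The β logic of (7)-(8), including the restart decision, can be condensed into one small routine. This is an illustrative sketch following the scheme of [1]; the names are hypothetical and the default values are the calibrated ones used in the experiments below.

```python
def update_beta(beta, beta_bar, e_new, e_old,
                gamma=1.1, gamma_bar=1.001, eta=2.0):
    """Adapt the extrapolation parameter as in (7)-(8).

    On error decrease, beta grows (capped by beta_bar) and the ceiling
    grows slowly; on error increase, the caller restarts, beta shrinks
    by eta, and the ceiling is pulled down to the value that failed.
    Returns (new beta, new beta ceiling, restart flag).
    """
    if e_new <= e_old:                          # successful extrapolation
        new_beta = min(gamma * beta, beta_bar)
        new_beta_bar = min(gamma_bar * beta_bar, 1.0)
        restart = False
    else:                                       # error went up: restart
        new_beta = beta / eta
        new_beta_bar = beta                     # cap at the failed value
        restart = True
    return new_beta, new_beta_bar, restart
```

The restart flag tells the outer loop to reset the pairing variables (Ũ^{k+1} = U^{k+1}, and likewise for V and W).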
3.2 An existing line-search approach for ANLS
It is often reported in the literature on computing approximate CPD that ALS is a viable algorithm only when using "line search" [2]. Although it is not stated explicitly in that reference, what is actually done is extrapolation¹. It is also reported that "line search" speeds up NLS for aNCPD considerably. The scheme is the following heuristic:

    Update:       U^{k+1/2} = NLS(V^k, W^k; U^k),  (9)
    Extrapolate:  U^{k+1} = U^{k+1/2} + (k^{1/h(k)} − 1) (U^{k+1/2} − U^k),  (10)

where h(k) is a recursive function such that h(k + 1) = h(k) if the error has not increased for more than four iterations, h(k + 1) = 1 + h(k) otherwise, and h(1) = 3. Also, it is reported that for constrained optimization, "line search" should be discarded in the first few iterations (4 in the tests below) because of instability issues. The main difference between this approach, which we refer to as "Bro" after the name of its author, and the extrapolation scheme we propose is that, on top of being based on the Nesterov acceleration, the variables in our scheme are always nonnegative (only the pairing variables may have negative entries).
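The heuristic weight of (10) and its h(k) recursion can be written down directly. This is a sketch of the rule as described above (function names are illustrative, not from [2]):

```python
def bro_factor(k, h):
    """Extrapolation weight k**(1/h(k)) - 1 used in the heuristic (10)."""
    return k ** (1.0 / h) - 1.0

def update_h(h, iters_since_error_increase):
    """h(k+1) = h(k) if the error has not increased for more than four
    iterations, h(k+1) = 1 + h(k) otherwise; h(1) = 3."""
    return h if iters_since_error_increase > 4 else h + 1
```

As k grows, bro_factor grows slowly (sublinearly), and every error increase makes future steps more conservative by raising h.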
4 Experiments
Let us now compare various accelerated BCD techniques for aNCPD, namely Projected Gradient (PG), HALS (baseline), HALS with extrapolation using Bro's β tuning procedure, E-HALS (proposed), and Accelerated Projected Gradient [15] (APG).

¹. Line search as usually understood in smooth optimization has been studied for approximate CPD in [14].

FIGURE 1 – The error curves e_k − e_min of various algorithms (all curves and median curves, against computational time in seconds), where e_min is the lowest e among all trials of all algorithms.
Notice that the HALS-based algorithms have inner loops within the HALS, while the PG-based algorithms have no inner loop. For fair comparisons we also consider PG and APG with inner loops, namely PG-r and APG-r. We compare these methods in three scenarios described below. In all tests, the rank r is small enough for uniqueness of the factor matrices to hold up to trivial permutations and scaling.
Set up We form the ground truth tensor T_true by sampling the elements of matrices U_true, V_true and W_true from the uniform distribution U[0, 1]. To make the problem more difficult, we increase the condition number of the mode-1 matrix U by replacing U_true(:, 1) with 0.01 U_true(:, 1) + 0.99 U_true(:, 2) (using Matlab array notations) and we add i.i.d. Gaussian noise to the tensor with variance σ² = 10⁻⁴. Note that some values in the noisy tensor may be negative due to the noise. We run all the algorithms with the same number of iterations (500), a maximum run time of 10 seconds, the same number of inner loops (50), and random initializations U_ini, V_ini, W_ini sampled from U[0, 1]. For E-HALS, we set β̄₀ = 1, β₀ = 0.4, γ = 1.1, γ̄ = 1.001 and η = 2 after calibration. We repeat the whole process 20 times. We perform the following three tests:
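The data generation just described can be sketched in numpy as follows. This is a hypothetical reimplementation (not the authors' Matlab code); note the 0-based indexing, so the Matlab columns U_true(:, 1) and U_true(:, 2) become U[:, 0] and U[:, 1].

```python
import numpy as np

def make_test_tensor(I=50, J=50, K=50, r=10, sigma2=1e-4, seed=0):
    """Ground-truth tensor of the experiments: uniform factors, one
    nearly collinear column in U to raise its condition number, plus
    i.i.d. Gaussian noise (so some entries of T may become negative).
    """
    rng = np.random.default_rng(seed)
    U = rng.random((I, r))
    V = rng.random((J, r))
    W = rng.random((K, r))
    U[:, 0] = 0.01 * U[:, 0] + 0.99 * U[:, 1]   # ill-conditioning of mode 1
    T = np.einsum('iq,jq,kq->ijk', U, V, W)
    T += np.sqrt(sigma2) * rng.standard_normal(T.shape)
    return T, (U, V, W)
```

The factors stay nonnegative; only the noisy tensor T may have negative entries.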
1. Low rank, balanced sizes We set I = J = K = 50, R = 10. Fig. 1 shows the cost function values for each algorithm against iteration and computational time. Table 1 provides the median reconstruction error of each mode compared to the ground truth: we define RE_final as the relative error between U^k and U_true in percent, after column normalization and matching. The unpredictable behavior of Bro (the error may increase significantly) shows that it is a priori not well-designed for aNCPD, even though its final error is sometimes on par with other algorithms. APG-r performs better than HALS, while E-HALS is faster than the other methods.
2. Low rank, balanced sizes, ill-conditioned We use the same setting as in the previous test. However, here the condition number of factor matrix U is severely increased using U = U(I_r + 1_r), where I_r is the identity matrix of size r and 1_r is the r-by-r all-ones matrix. Due to U being very ill-conditioned, method Bro diverges (it therefore does not show on Fig. 2). On the other hand, E-HALS is not only the fastest method, see Fig. 2, but also provides much more accurate factor estimates, see Table 1. Unexpectedly, E-HALS seems to perform even better in this second test than in the previous one.

FIGURE 2 – Median value of e_k − e_min of HALS, E-HALS and APG-r for test 2.

FIGURE 3 – Median value of e_k − e_min of HALS, E-HALS and APG-r for test 3.

3. Medium rank, unbalanced sizes Here I = 150, J = 10³, K = 35 and R = 20. Fig. 3 shows similar results to the previous tests: E-HALS outperforms all other BCD approaches.
TABLE 1 – Median RE_final in % of U, V, W (* means ≥ 40)

Algo    | Test 1        | Test 2         | Test 3
PGD     | 15, 1.8, 1.8  | 4.3, *, *      | *, *, *
PGD-r   | 0.2, 1.8, 1.6 | 4.1, *, *      | *, *, *
HALS    | 0.2, 1.6, 1.6 | 2.2, 22, 23    | 4.7, 5.2, 5.2
E-HALS  | 0.2, 1.8, 1.8 | 0.04, 0.3, 0.3 | 0.4, 0.8, 0.8
Bro     | 0.2, 1.5, 1.5 | 0.2, 2.4, 2.4  | *, *, *
APG     | 0.5, 3.0, 2.9 | 4.3, *, *      | *, *, *
APG-r   | 0.2, 1.3, 1.2 | 2.7, *, 28     | 10, 11, 11
5 Conclusion
In this paper, we presented an algorithm to compute the approximate Nonnegative Canonical Polyadic Decomposition that employs extrapolation to accelerate the convergence of a block-coordinate scheme. The extrapolation is of the form U^{k+1} = U^k + β_k (U^k − U^{k−1}), where k is the iteration number and β_k ≥ 0 is the extrapolation parameter. Experimental results show that the proposed scheme can significantly speed up the algorithm and is much faster than benchmark algorithms in several ill-conditioned settings. Future works will further improve on speed and adapt the proposed method to other decompositions such as the approximate nonnegative Tucker decomposition. Also, we will adapt other acceleration methods designed in the unconstrained case (here we only compared with Bro [2]), such as the one in [10].

References
[1] A. M. S. Ang and N. Gillis. Accelerating nonnegative matrix factorization algorithms using extrapolation. Neural Computation, 31(2):417–439, 2019.
[2] R. Bro. Multi-way Analysis in the Food Industry: Models, Algorithms, and Applications. PhD thesis, University of Amsterdam, The Netherlands, 1998.
[3] J. E. Cohen, R. Cabral-Farias, and P. Comon. Fast decomposition of large nonnegative tensors. IEEE Signal Processing Letters, 22(7):862–866, July 2015.
[4] J. E. Cohen. About notations in multiway array processing. arXiv preprint arXiv:1511.01306, 2015.
[5] N. Gillis and F. Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Computation, 24(4):1085–1105, 2012.
[6] K. Huang, N. D. Sidiropoulos, and A. P. Liavas. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Transactions on Signal Processing, 64(19):5052–5065, 2016.
[7] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.
[8] C. L. Lawson and R. J. Hanson. Solving Least Squares Problems, volume 15. SIAM, 1995.
[9] L.-H. Lim and P. Comon. Nonnegative approximations of nonnegative tensors. J. Chemom., 23(7-8):432–441, 2009.
[10] D. Mitchell, N. Ye, and H. De Sterck. Nesterov acceleration of alternating least squares for canonical tensor decomposition. arXiv preprint arXiv:1810.05846, 2018.
[11] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
[12] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.
[13] Y. Qi, P. Comon, and L.-H. Lim. Uniqueness of nonnegative tensor approximations. IEEE Trans. Inf. Theory, 62(4):2170–2183, April 2016. arXiv:1410.8129.
[14] M. Rajih, P. Comon, and R. A. Harshman. Enhanced line search: A novel method to accelerate PARAFAC. SIAM Journal on Matrix Analysis and Applications, 30(3):1128–1147, 2008.
[15] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.