
Accelerating Approximate Nonnegative Canonical Polyadic Decomposition using Extrapolation

Andersen ANG¹, Jeremy E. COHEN², Nicolas GILLIS¹∗

¹Université de Mons, Rue de Houdain 9, 7000, Mons, Belgium
²Université de Rennes, INRIA, CNRS, IRISA, France

manshun.ang@umons.ac.be jeremy.cohen@irisa.fr nicolas.gillis@umons.ac.be

Résumé – In this article, we consider the problem of the approximate Nonnegative Canonical Polyadic Decomposition (aNCPD) of a nonnegative third-order tensor. This problem consists in minimizing $\|\mathcal{T} - (U \otimes V \otimes W) I_r\|_F^2$ with respect to nonnegative matrices $U$, $V$ and $W$. We propose a block-coordinate descent algorithm using a Nesterov-like extrapolation of the form $U^{k+1} = U^k + \beta_k (U^k - U^{k-1})$. Experimental results demonstrate the efficiency of the proposed algorithm compared to other block-coordinate descent approaches on ill-conditioned synthetic data.

Abstract – In this work, we consider the problem of approximate Nonnegative Canonical Polyadic Decomposition (aNCPD) of third-order nonnegative tensors, which boils down to minimizing $\|\mathcal{T} - (U \otimes V \otimes W) I_r\|_F^2$ over element-wise nonnegative matrices $U$, $V$ and $W$. We present an accelerated block-coordinate descent algorithm that uses Nesterov-like extrapolation of the form $U^{k+1} = U^k + \beta_k (U^k - U^{k-1})$. Experimental results showcase the effectiveness of the proposed algorithm with respect to other block-coordinate descent algorithms on ill-conditioned synthetic data.

1 Introduction

Given a nonnegative three-dimensional tensor $\mathcal{T} \in \mathbb{R}_+^{I \times J \times K}$, assume one wants to compute the approximation of $\mathcal{T}$ by a nonnegative tensor $\mathcal{G} = (U \otimes V \otimes W) I_r$ of given small nonnegative rank $r$, where $\mathcal{G}_{ijk} = \sum_{q=1}^{r} U_{iq} V_{jq} W_{kq}$. Using the Frobenius norm as an error metric, the problem becomes:

$$\min_{U \geq 0,\, V \geq 0,\, W \geq 0} f(U, V, W) = \frac{1}{2} \|\mathcal{T} - (U \otimes V \otimes W) I_r\|_F^2, \qquad (1)$$

where the inequalities are meant element-wise.
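The objective in (1) can be written down directly from the entry-wise model $\mathcal{G}_{ijk} = \sum_q U_{iq} V_{jq} W_{kq}$. A minimal NumPy sketch (the function names are ours, not from the paper):

```python
import numpy as np

def cpd_reconstruct(U, V, W):
    # G_ijk = sum_q U_iq * V_jq * W_kq : the rank-r CP model (U ⊗ V ⊗ W) I_r
    return np.einsum('iq,jq,kq->ijk', U, V, W)

def objective(T, U, V, W):
    # f(U, V, W) = 0.5 * || T - G ||_F^2, the cost in (1)
    return 0.5 * np.linalg.norm(T - cpd_reconstruct(U, V, W)) ** 2
```

For a tensor built exactly from nonnegative rank-$r$ factors, the objective evaluates to zero, which gives a quick sanity check of the model.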

Problem (1) is often referred to as the approximate Nonnegative Canonical Polyadic Decomposition (aNCPD) of $\mathcal{T}$, as long as $r$ is smaller than the rank of $\mathcal{T}$. This problem is the extension of the low-rank nonnegative approximation problem for matrices, namely nonnegative matrix factorization (NMF). Solving aNCPD is an important issue for mining information out of tensors collected in a wide variety of applications; see [7] and references therein. Note that (1) is well-posed. Indeed, while a best low-rank approximation for tensors may not exist in general, the nonnegativity constraints are lower bounds for the parameters; this prevents degeneracy, and the best nonnegative low-rank approximation always exists [9, 13].

∗NG acknowledges the support of the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47, and of the European Research Council (ERC starting grant no 679515).

Contribution In this paper, we are interested in algorithms featuring both low-cost iterations and fast convergence. Such fast and cheap algorithms are the backbone of modern machine learning, where both the data and the number of parameters to estimate can be very large. To this end, we first study some existing Block-Coordinate Descent (BCD) algorithms, and present an extrapolated BCD algorithm based on Hierarchical Alternating Least Squares (HALS). To be specific, we adopt the numerical scheme of [1], which has been found to significantly improve convergence speed on the NMF problem. Experimental results show that our approach produces accurate results faster than state-of-the-art BCD algorithms in all tested scenarios.

2 Existing block-coordinate descent algorithms for aNCPD

Let us first mention some existing algorithms to compute aNCPD. Problem (1) is non-convex and admits no closed-form solution. A baseline method called ANLS solves aNCPD by alternating minimization over the blocks $\{U, V, W\}$, as detailed below. However, since computing the gradient of $f$ is simple, and projecting onto the nonnegative orthant only requires clipping, some algorithms based on projected first-order methods have been proposed [6, 15]. Compression using Tucker models has also been explored [3].

As mentioned in the introduction, in what follows we study algorithms with i) a low cost per iteration, and ii) a high convergence speed. First-order methods are typically slow to converge despite having low complexity. However, it is possible to significantly accelerate first-order methods using extrapolation, which is our main contribution, described in Section 3 below. Note that other accelerations are of interest but outside the scope of this paper, such as online, mini-batch or randomized algorithms.

Furthermore, we turn towards a particular set of first-order algorithms known as Block-Coordinate Descent (BCD), where only a subset of the parameters is updated at each iteration.

2.1 Two types of blocks for BCD

To compute approximate unconstrained tensor decomposition models such as the Canonical Polyadic Decomposition or the Tucker decomposition [7], the Alternating Least Squares (ALS) algorithm is considered a baseline. In what follows, we first describe the well-known Alternating Nonnegative Least Squares (ANLS) algorithm, which is an adaptation of ALS for aNCPD.

2.1.1 A block-coordinate framework: ANLS

In ANLS, the cost function $f$ is minimized over each factor matrix $U$, $V$ and $W$ alternately until convergence. For factor matrix $U$, we solve

$$\min_{U \geq 0} \|T_{[1]} - U (V \odot W)^T\|_F^2 \qquad (2)$$

where the matrix $T_{[1]}$ is the first-mode unfolding of the tensor $\mathcal{T}$ as defined in [4], and $\odot$ is the Khatri-Rao product. Note that (2) is simply an equivalent rewriting of (1), with fixed $V$ and $W$. It now appears clearly that $f$ is quadratic with respect to each block $U$, $V$ and $W$. In other words, problem (2) is a nonnegative least squares (NLS) problem in matrix format. Clearly, ANLS is a BCD algorithm where the blocks are $\{U, V, W\}$, in this order.
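The ingredients of (2) can be sketched in NumPy. A minimal illustration, assuming one common unfolding convention in which entry $(i, j, k)$ maps to row $i$ and column $jK + k$ (the convention of [4] may order columns differently; function names are ours):

```python
import numpy as np

def unfold1(T):
    # mode-1 unfolding: entry (i, j, k) goes to row i, column j*K + k
    I, J, K = T.shape
    return T.reshape(I, J * K)

def khatri_rao(V, W):
    # column-wise Kronecker product, shape (J*K, r), consistent with unfold1
    J, r = V.shape
    K = W.shape[0]
    return (V[:, None, :] * W[None, :, :]).reshape(J * K, r)
```

With these two conventions matched, the CP model satisfies $T_{[1]} = U (V \odot W)^T$ exactly.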

2.1.2 Solving ANLS with BCD: HALS

There are two categories of algorithms to solve NLS. A first category produces the exact solution of the NLS problem in a finite number of steps, such as the historical active-set method of Lawson and Hanson [8]. The number of steps is however not known in advance and can be quite large. Also, each iteration of the active-set algorithm requires solving a least squares problem, which can be costly. A second category asymptotically converges towards the optimal solution. These algorithms are only interesting if they converge in a shorter time than the exact ones. Among those, Hierarchical Alternating Least Squares (HALS) is state of the art [5] when dealing with several alternated NLS problems.

In this paper, we call HALS-NLS the BCD algorithm for the NLS problem that alternates over the columns of $U$ in (2) while fixing the others. The solution for the $i$-th column $u_i$ of $U$ requires solving

$$\min_{u_i \geq 0} \|T_{[1]} - U_{\setminus i} (V_{\setminus i} \odot W_{\setminus i})^T - u_i (v_i \otimes w_i)^T\|_F^2 \qquad (3)$$

which has the following closed-form solution

$$u_i = \left[ \frac{\left( T_{[1]} - U_{\setminus i} (V_{\setminus i} \odot W_{\setminus i})^T \right) (v_i \otimes w_i)}{\|v_i\|_2^2 \, \|w_i\|_2^2} \right]_+ \qquad (4)$$

since (3) boils down to the coordinate-wise minimization of second-order polynomial functions over a half space. In (3), the matrix $M_{\setminus i}$ stands for the matrix $M$ with the $i$-th column removed, while $\otimes$ stands for the Kronecker product. The operator $[x]_+$ clips the negative values of $x$ to zero. The NLS problem is solved by repeatedly applying update rule (4) to the columns of the matrix being updated in the outer ANLS algorithm, until convergence. In what follows, we refer to the whole procedure "ANLS with NLS solved by HALS-NLS" as HALS.

Several important remarks can be made at this stage:

— HALS can itself be seen as a BCD algorithm to solve (1), where the blocks are $\{u_1, u_2, \ldots, u_r, u_1, \ldots, u_r, v_1, v_2, \ldots, w_1, w_2, \ldots\}$.

— Because updating the columns of $U$ requires the computation of $T_{[1]} (V \odot W)$, which is costly, it is computationally interesting to update the columns of $U$ several times before switching to the columns of $V$.

— The product $(V \odot W)^T (v_i \otimes w_i)$ can be computed efficiently as $(V^T v_i) \ast (W^T w_i)$, where $\ast$ is the element-wise product.
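Putting (4) and the remarks above together, one sweep of HALS-NLS over the columns of $U$ can be sketched as follows (a minimal NumPy sketch under our assumed unfolding convention; function names are ours). It precomputes $T_{[1]}(V \odot W)$ and the Gram matrix $(V \odot W)^T (V \odot W)$ once per sweep, so the numerator of (4) costs only a matrix-vector product per column:

```python
import numpy as np

def khatri_rao(V, W):
    J, r = V.shape
    K = W.shape[0]
    return (V[:, None, :] * W[None, :, :]).reshape(J * K, r)

def hals_sweep_U(T1, V, W, U):
    # One HALS-NLS sweep applying the closed form (4) to each column of U.
    KR = khatri_rao(V, W)
    TK = T1 @ KR          # costly product T[1](V ⊙ W), computed once per sweep
    G = KR.T @ KR         # Gram matrix; G[i, i] = ||v_i||^2 * ||w_i||^2
    U = U.copy()
    for i in range(U.shape[1]):
        # (T[1] - U_{\i}(V_{\i} ⊙ W_{\i})^T)(v_i ⊗ w_i)
        #   = TK[:, i] - U @ G[:, i] + U[:, i] * G[i, i]
        numer = TK[:, i] - U @ G[:, i] + U[:, i] * G[i, i]
        U[:, i] = np.maximum(numer / G[i, i], 0.0)
    return U
```

Each column update is the exact minimizer of (3), so a sweep never increases the error.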

3 Extrapolation for ANLS

Let us now discuss the proposed extrapolation scheme for ANLS, which is an extension of the one proposed in [1]. In each step of the regular ANLS, we update a block of parameters using an NLS solver. For instance, when updating $U$, we compute $U^{k+1} = \mathrm{NLS}(V^k, W^k; U^k)$, where $U^k$ is the initialization for $U$ in an NLS solver such as HALS-NLS.

Extrapolated ANLS (E-ANLS) involves pairing variables $\tilde{U}$, $\tilde{V}$ and $\tilde{W}$, initialized equal to $U$, $V$ and $W$. At each iteration, e.g., for $U$, we compute

Update: $U^{k+1} = \mathrm{NLS}(\tilde{V}^k, \tilde{W}^k; \tilde{U}^k)$ (5)
Extrapolate: $\tilde{U}^{k+1} = U^{k+1} + \beta_k (U^{k+1} - U^k)$, (6)

where $k$ is the current iteration index and $\beta_k$ is the extrapolation parameter. Similar updates are carried out for $V$ and $W$.
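One E-ANLS update of the block $U$, steps (5)-(6), can be sketched as follows. The inner solver here is a hypothetical stand-in (a clipped least squares that ignores its warm start); in the paper this role is played by HALS-NLS warm-started at $\tilde{U}^k$:

```python
import numpy as np

def khatri_rao(V, W):
    J, r = V.shape
    K = W.shape[0]
    return (V[:, None, :] * W[None, :, :]).reshape(J * K, r)

def nls_stub(T1, V, W, init):
    # Hypothetical stand-in NLS solver: unconstrained least squares, then clip.
    # The paper instead runs HALS-NLS warm-started at `init`.
    A = khatri_rao(V, W)
    X, *_ = np.linalg.lstsq(A, T1.T, rcond=None)
    return np.maximum(X.T, 0.0)

def e_anls_step_U(T1, U_prev, U_tilde, V_tilde, W_tilde, beta):
    U_new = nls_stub(T1, V_tilde, W_tilde, init=U_tilde)   # update (5)
    U_tilde_new = U_new + beta * (U_new - U_prev)          # extrapolation (6)
    return U_new, U_tilde_new
```

Note that $U^{k+1}$ stays nonnegative, while the pairing variable $\tilde{U}^{k+1}$ may have negative entries.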

Intuitively, extrapolation makes use of the geometry of the cost function to predict estimates for a future iteration using previous estimates. In a convex optimization setting, Nesterov provided a way to compute $\beta_k$ for steps such as (6) that is optimal in terms of convergence speed. He showed that for this choice of $\beta_k$, the error of the iterates of accelerated projected gradient methods converges in $O(1/k^2)$ instead of $O(1/k)$ [11].

However, as aNCPD (1) is non-convex, and because we are using extrapolation with the same $\beta_k$ for different blocks, there is a priori no closed-form way to update $\beta_k$. Thus, the critical part of the update is the parameter $\beta_k$, which we now discuss.

3.1 Update of β

The extrapolation scheme [1] has 5 parameters, two of them indexed by the iteration index $k$: $\beta_k$, $\bar{\beta}_k$, $\gamma$, $\bar{\gamma}$ and $\eta$, where $\beta_k \in [0, 1]$ is the extrapolation parameter and $\bar{\beta}_k \in [0, 1]$ is the ceiling parameter of $\beta_k$. The parameter $\gamma$ (resp. $\eta$) controls the way $\beta$ grows (resp. decays). Lastly, $\bar{\gamma}$ is the growth parameter for $\bar{\beta}_k$. We grow or decay $\beta_k$ after each iteration as follows: let $e_k$ denote the error at iteration $k$ after the extrapolation; then

$$\beta_{k+1} = \begin{cases} \min\{\gamma \beta_k, \bar{\beta}_k\} & \text{if } e_{k+1} \leq e_k, \\ \beta_k / \eta & \text{otherwise.} \end{cases} \qquad (7)$$

It is important to note that extrapolation may increase the objective function value $e_k$ due to a bad parameter $\beta_k$. If this occurs, we abandon the extrapolation at that iteration and simply set the pairing variable $\tilde{U}^{k+1} = U^{k+1}$. In other words, when the error increases, we perform a restart [12]. In addition, an increase of $e_k$ means the current $\beta_k$ is too large; we then decrease $\beta_k$ following (7). To avoid $\beta_{k+1}$ growing back to the large value which caused the increase of the error, we update $\bar{\beta}_k$ by

$$\bar{\beta}_{k+1} = \begin{cases} \min\{\bar{\gamma} \bar{\beta}_k, 1\} & \text{if } e_{k+1} \leq e_k \text{ and } \bar{\beta}_k < 1, \\ \beta_k & \text{otherwise.} \end{cases} \qquad (8)$$

The relations between the parameters are

$$\forall k \geq 0, \quad 0 \leq \beta_k \leq \bar{\beta}_k \leq 1 < \bar{\gamma} \leq \gamma \leq \eta.$$
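Rules (7) and (8) translate directly into code. A minimal sketch, with default values of $\gamma$, $\bar{\gamma}$ and $\eta$ taken from the calibration reported in Section 4:

```python
def update_beta(beta, beta_bar, e_next, e_prev,
                gamma=1.1, gamma_bar=1.001, eta=2.0):
    # Rules (7)-(8): grow beta on success (capped by its ceiling beta_bar),
    # shrink it on failure; on failure the ceiling drops to the failing beta.
    if e_next <= e_prev:
        beta_new = min(gamma * beta, beta_bar)
        beta_bar_new = min(gamma_bar * beta_bar, 1.0) if beta_bar < 1.0 else beta_bar
    else:
        beta_new = beta / eta
        beta_bar_new = beta
    return beta_new, beta_bar_new
```

The caller is responsible for the restart itself, i.e. discarding the extrapolated pairing variable when the error goes up.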

3.2 An existing line-search approach for ANLS

It is often reported in the literature on computing approximate CPD that ALS is a viable algorithm only when using "line search" [2]. Although it is not stated explicitly in that reference, what is actually done is extrapolation¹. It is also reported that "line search" speeds up NLS for aNCPD considerably. The scheme is the following heuristic:

Update: $U^{k+\frac{1}{2}} = \mathrm{NLS}(V^k, W^k; U^k)$ (9)
Extrapolate: $U^{k+1} = U^{k+\frac{1}{2}} + \left(k^{1/h(k)} - 1\right) \left(U^{k+\frac{1}{2}} - U^k\right)$ (10)

where $h(k)$ is a recursive function such that $h(k+1) = h(k)$ if the error has not increased for more than four iterations, $h(k+1) = 1 + h(k)$ otherwise, and $h(1) = 3$. Also, it is reported that for constrained optimization, "line search" should be discarded in the first few iterations (4 in the tests below) because of instability issues. The main difference between this approach, which we refer to as "Bro" after the name of its author, and the extrapolation scheme we propose is that, on top of being based on the Nesterov acceleration, the variables in our scheme are always nonnegative (only the pairing variables may have negative entries).
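The heuristic weight $k^{1/h(k)} - 1$ in (10) and the update of $h$ can be sketched as follows (our own function names; the bookkeeping of how many iterations the error has gone without increasing is left to the caller):

```python
def bro_weight(k, h):
    # extrapolation weight in (10): k^(1/h(k)) - 1
    return k ** (1.0 / h) - 1.0

def update_h(h, iters_without_increase):
    # h(k+1) = h(k) if the error has not increased for more than four
    # iterations, and h(k+1) = 1 + h(k) otherwise; h(1) = 3.
    return h if iters_without_increase > 4 else h + 1
```

For instance, with $h = 3$ the weight at iteration $k = 8$ is $8^{1/3} - 1 = 1$, so the extrapolated step doubles the progress of the ALS update.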

4 Experiments

Let us now compare various accelerated BCD techniques for aNCPD, namely Projected Gradient Descent (PGD), HALS (baseline), HALS with extrapolation using Bro's β tuning procedure ("Bro"), E-HALS (proposed), and Accelerated Projected Gradient (APG) [15]. Notice that the HALS-based algorithms have inner loops within the HALS, while the PGD-based algorithms have no inner loop. For fair comparison, we also consider PGD and APG with inner loops, namely PGD-r and APG-r. We compare these methods in three scenarios described below. In all tests, the rank $r$ is small enough for the uniqueness of the factor matrices to hold up to trivial permutations and scaling.

1. Line search, as usually understood in smooth optimization, has been studied for approximate CPD in [14].

[Figure: four panels plotting $e - e_{\min}$ against iteration and computational time (sec), for all curves and for the median curves, for PGD, PGD-r, HALS, E-HALS, Bro, APG and APG-r.]

FIGURE 1 – The error curves $e_k - e_{\min}$ of various algorithms, where $e_{\min}$ is the lowest $e$ among all trials of all algorithms.

Set up We form the ground-truth tensor $T_{\mathrm{true}}$ by sampling the elements of the matrices $U_{\mathrm{true}}$, $V_{\mathrm{true}}$ and $W_{\mathrm{true}}$ from the uniform distribution $U[0, 1]$. To make the problem more difficult, we increase the condition number of the mode-1 matrix $U$ by replacing $U_{\mathrm{true}}(:, 1)$ with $0.01\, U_{\mathrm{true}}(:, 1) + 0.99\, U_{\mathrm{true}}(:, 2)$ (using Matlab array notation), and we add i.i.d. Gaussian noise with variance $\sigma^2 = 10^{-4}$ to the tensor. Note that some values in the noisy tensor may be negative due to the noise. We run all the algorithms with the same number of iterations (500), a maximum run time of 10 seconds, the same number of inner loops (50), and random initializations $U_{\mathrm{ini}}$, $V_{\mathrm{ini}}$, $W_{\mathrm{ini}}$ sampled from $U[0, 1]$. For E-HALS, we set $\bar{\beta}_0 = 1$, $\beta_0 = 0.4$, $\gamma = 1.1$, $\bar{\gamma} = 1.001$ and $\eta = 2$ after calibration. We repeat the whole process 20 times. We perform the following three tests:

1. Low rank, balanced sizes We set $I = J = K = 50$ and $r = 10$. Fig. 1 shows the cost function values for each algorithm against iteration and computational time. Table 1 provides the median reconstruction error of each mode compared to the ground truth: we define RE_final as the relative error between $U^k$ and $U_{\mathrm{true}}$ in percent, after column normalization and matching. The unpredictable behavior of Bro (the error may increase significantly) shows that it is a priori not well-designed for aNCPD, even though its final error is sometimes on par with the other algorithms. APG-r performs better than HALS, while E-HALS is faster than the other methods.
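The synthetic setup described above can be reproduced as follows (a minimal NumPy sketch for test 1; the seed is arbitrary, and Matlab's 1-based columns become 0-based here):

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, r = 50, 50, 50, 10

# ground-truth factors sampled from U[0, 1]
U_true = rng.uniform(size=(I, r))
V_true = rng.uniform(size=(J, r))
W_true = rng.uniform(size=(K, r))

# worsen the conditioning: first column nearly collinear with the second
U_true[:, 0] = 0.01 * U_true[:, 0] + 0.99 * U_true[:, 1]

# noisy tensor: sigma^2 = 1e-4, i.e. sigma = 1e-2
T = np.einsum('iq,jq,kq->ijk', U_true, V_true, W_true)
T = T + rng.normal(scale=1e-2, size=T.shape)
```

As noted in the setup, the noise can push some entries of the tensor below zero.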

2. Low rank, balanced sizes, ill-conditioned We use the same setting as in the previous test. However, here the condition number of the factor matrix $U$ is severely increased using $U = U(I_r + 1_r)$, where $I_r$ is the identity matrix of size $r$ and $1_r$ is the $r$-by-$r$ all-ones matrix. Due to $U$ being very ill-conditioned, the method Bro diverges (it therefore does not show on Fig. 2). On the other hand, E-HALS is not only the fastest method, see Fig. 2, but also provides much more accurate factor estimates, see Table 1. Unexpectedly, E-HALS seems to perform even better in this second test than in the previous one.

[Figure: $e_k - e_{\min}$ against iteration and computational time for HALS, E-HALS and APG-r.]

FIGURE 2 – Median value of $e_k - e_{\min}$ of HALS, E-HALS and APG-r for test 2.

3. Medium rank, unbalanced sizes Here $I = 150$, $J = 10^3$, $K = 35$ and $r = 20$. Fig. 3 shows results similar to the previous tests: E-HALS outperforms all other BCD approaches.

[Figure: $e_k - e_{\min}$ against iteration and computational time for HALS, E-HALS and APG-r.]

FIGURE 3 – Median value of $e_k - e_{\min}$ of HALS, E-HALS and APG-r for test 3.
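The ill-conditioning construction of test 2 can be reproduced as follows (a minimal sketch; the factor size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
r = 10
U = rng.uniform(size=(50, r))

# test-2 conditioning: U <- U (I_r + 1_r) mixes the sum of all columns into
# each column, making the columns nearly collinear
M = np.eye(r) + np.ones((r, r))
cond_before = np.linalg.cond(U)
U_ill = U @ M
cond_after = np.linalg.cond(U_ill)
```

Since each column of $U(I_r + 1_r)$ equals $u_i$ plus the sum of all columns, the columns differ only by a comparatively small term, which drives the condition number up.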

TABLE 1 – Median RE_final in % of U, V, W (* means ≥ 40)

Algo     Test 1           Test 2           Test 3
PGD      15, 1.8, 1.8     4.3, *, *        *, *, *
PGD-r    0.2, 1.8, 1.6    4.1, *, *        *, *, *
HALS     0.2, 1.6, 1.6    2.2, 22, 23      4.7, 5.2, 5.2
E-HALS   0.2, 1.8, 1.8    0.04, 0.3, 0.3   0.4, 0.8, 0.8
Bro      0.2, 1.5, 1.5    0.2, 2.4, 2.4    *, *, *
APG      0.5, 3.0, 2.9    4.3, *, *        *, *, *
APG-r    0.2, 1.3, 1.2    2.7, *, 28       10, 11, 11

5 Conclusion

In this paper, we presented an algorithm to compute the approximate Nonnegative Canonical Polyadic Decomposition that employs extrapolation to accelerate the convergence of a block-coordinate scheme. The extrapolation is of the form $U^{k+1} = U^k + \beta_k (U^k - U^{k-1})$, where $k$ is the iteration number and $\beta_k \geq 0$ is the extrapolation parameter. Experimental results show that the proposed scheme can significantly speed up the algorithm and is much faster than benchmark algorithms in several ill-conditioned settings. Future work will further improve on speed and adapt the proposed method to other decompositions, such as approximate nonnegative Tucker. We will also adapt other acceleration methods designed for the unconstrained case (here we only compared with Bro [2]), such as the one in [10].

References

[1] A. M. S. Ang and N. Gillis. Accelerating nonnegative matrix factorization algorithms using extrapolation. Neural Computation, 31(2):417–439, 2019.

[2] R. Bro. Multi-way Analysis in the Food Industry: Models, Algorithms, and Applications. PhD thesis, University of Amsterdam, The Netherlands, 1998.

[3] J. E. Cohen, R. Cabral-Farias, and P. Comon. Fast decomposition of large nonnegative tensors. IEEE Signal Processing Letters, 22(7):862–866, July 2015.

[4] J. E. Cohen. About notations in multiway array processing. arXiv preprint arXiv:1511.01306, 2015.

[5] N. Gillis and F. Glineur. Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization. Neural Computation, 24(4):1085–1105, 2012.

[6] K. Huang, N. D. Sidiropoulos, and A. P. Liavas. A flexible and efficient algorithmic framework for constrained matrix and tensor factorization. IEEE Transactions on Signal Processing, 64(19):5052–5065, 2016.

[7] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 51(3):455–500, September 2009.

[8] C. L. Lawson and R. J. Hanson. Solving Least Squares Problems, volume 15. SIAM, 1995.

[9] L.-H. Lim and P. Comon. Nonnegative approximations of nonnegative tensors. J. Chemom., 23(7-8):432–441, 2009.

[10] D. Mitchell, N. Ye, and H. De Sterck. Nesterov acceleration of alternating least squares for canonical tensor decomposition. arXiv preprint arXiv:1810.05846, 2018.

[11] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.

[12] B. O'Donoghue and E. Candès. Adaptive restart for accelerated gradient schemes. Foundations of Computational Mathematics, 15(3):715–732, 2015.

[13] Y. Qi, P. Comon, and L.-H. Lim. Uniqueness of nonnegative tensor approximations. IEEE Trans. Inf. Theory, 62(4):2170–2183, April 2016. arXiv:1410.8129.

[14] M. Rajih, P. Comon, and R. A. Harshman. Enhanced line search: A novel method to accelerate PARAFAC. SIAM Journal on Matrix Analysis and Applications, 30(3):1128–1147, 2008.

[15] Y. Xu and W. Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on Imaging Sciences, 6(3):1758–1789, 2013.