
IEICE TRANS. FUNDAMENTALS, VOL.Exx–??, NO.xx XXXX 200x

INVITED PAPER  Special Section on Signal Processing

Fast Local Algorithms for Large Scale Nonnegative Matrix and Tensor Factorizations

Andrzej CICHOCKI†a), Member, and Anh-Huy PHAN††b), Nonmember

SUMMARY  Nonnegative matrix factorization (NMF) and its extensions such as Nonnegative Tensor Factorization (NTF) have become prominent techniques for blind source separation (BSS), analysis of image databases, data mining, and other information retrieval and clustering applications. In this paper we propose a family of efficient algorithms for NMF/NTF, as well as for sparse nonnegative coding and representation, that have many potential applications in computational neuroscience, multi-sensory processing, compressed sensing and multidimensional data analysis. We have developed a class of optimized local algorithms which are referred to as Hierarchical Alternating Least Squares (HALS) algorithms. For these purposes, we have performed sequential constrained minimization on a set of squared Euclidean distances. We then extend this approach to robust cost functions using the Alpha and Beta divergences and derive flexible update rules. Our algorithms are locally stable and work well for NMF-based BSS, not only in the over-determined case but also in the under-determined (over-complete) case (i.e., for a system which has fewer sensors than sources), provided the data are sufficiently sparse. The NMF learning rules are extended and generalized to N-th order nonnegative tensor factorization (NTF). Moreover, these algorithms can be tuned to different noise statistics by adjusting a single parameter. Extensive experimental results confirm the accuracy and computational performance of the developed algorithms, especially when used with the multi-layer hierarchical NMF approach [3].

key words: Nonnegative matrix factorization (NMF), nonnegative tensor factorization (NTF), nonnegative PARAFAC, model reduction, feature extraction, compression, denoising, multiplicative local learning (adaptive) algorithms, Alpha and Beta divergences.

1. Introduction

Recent years have seen a surge of interest in nonnegative and sparse matrix and tensor factorizations: decompositions which provide physically meaningful latent (hidden) components or features. Nonnegative Matrix Factorization (NMF) and its extension Nonnegative Tensor Factorization (NTF), multidimensional models with nonnegativity constraints, have recently been proposed as sparse and efficient representations of signals, images, and natural signals/data in general. From the signal processing and data analysis point of view, NMF/NTF are very attractive because they take into account spatial and temporal correlations between variables and usually provide sparse common factors or hidden (latent) nonnegative components with physical or physiological meaning and interpretation [1]–[5].

In fact, NMF and NTF are emerging techniques for data mining, dimensionality reduction, pattern recognition, object detection, classification, gene clustering, sparse nonnegative representation and coding, and blind source separation (BSS) [5]–[14]. For example, NMF/NTF have already found a wide spectrum of applications in positron emission tomography (PET), spectroscopy, chemometrics and environmental science, where the matrices have clear physical meanings and some normalization or constraints are imposed on them [12],[13],[15].

This paper introduces several alternative approaches and improved local learning rules (in the sense that vectors and rows of matrices are processed sequentially, one by one) for solving nonnegative matrix and tensor factorization problems. Generally, tensors (i.e., multi-way arrays) are denoted by underlined capital boldface letters, e.g., Y ∈ R^{I_1×I_2×···×I_N}. The order of a tensor is the number of modes, also known as ways or dimensions. In contrast, matrices are denoted by boldface capital letters, e.g., Y; vectors are denoted by boldface lowercase letters, e.g., the columns of the matrix A by a_j; and scalars are denoted by lowercase letters, e.g., a_ij. The i-th entry of a vector a is denoted by a_i, and the (i, j) element of a matrix A by a_ij. Analogously, element (i, k, q) of a third-order tensor Y ∈ R^{I×K×Q} is denoted by y_ikq. Indices typically range from 1 to their capital version, e.g., i = 1, 2, . . . , I; k = 1, 2, . . . , K; q = 1, 2, . . . , Q. Throughout this paper, standard notations and basic tensor operations are used as indicated in Table 1.

Table 1  Basic tensor operations and notations [16]

  ◦              outer product
  ⊙              Khatri-Rao product
  ⊗              Kronecker product
  ⊛              Hadamard product
  ⊘              element-wise division
  [U]_j          j-th column vector of U
  U^(n)          the n-th factor
  u_j^(n)        j-th column vector of U^(n)
  {u_j}          the set {u_j^(1), u_j^(2), . . . , u_j^(N)}
  Y              tensor
  ×_n            n-mode product of a tensor and a matrix
  ×̄_n            n-mode product of a tensor and a vector
  Y_(n)          n-mode matricized version of Y
  U^⊙            U^(N) ⊙ U^(N−1) ⊙ ··· ⊙ U^(1)
  U^{⊙ −n}       U^(N) ⊙ ··· ⊙ U^(n+1) ⊙ U^(n−1) ⊙ ··· ⊙ U^(1)
  U^⊛            U^(N) ⊛ U^(N−1) ⊛ ··· ⊛ U^(1)
  U^{⊛ −n}       U^(N) ⊛ ··· ⊛ U^(n+1) ⊛ U^(n−1) ⊛ ··· ⊛ U^(1)
  SIR(a, b)      10 log_10(‖a‖^2 / ‖a − b‖^2)
  PSNR           20 log_10(Range of Signal / RMSE)
  Fit(Y, Ŷ)      100 (1 − ‖Y − Ŷ‖_F^2 / ‖Y − E(Y)‖_F^2)

Manuscript received July 30, 2008. Manuscript revised November 11, 2008. Final manuscript received December 12, 2008.
† RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, 351-0198 Saitama, Japan, and Warsaw University of Technology and Systems Research Institute, Polish Academy of Science, Poland.
†† RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, 351-0198 Saitama, Japan.
a) E-mail: cia@brain.riken.jp
b) E-mail: phan@brain.riken.jp
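The notation of Table 1 can be made concrete with a small numpy sketch (our own illustration, not part of the paper: the function names khatri_rao and unfold are our choices). The check below verifies the standard identity Y_(n) = U^(n) (U^{⊙ −n})^T for a third-order tensor built from three factor matrices.

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: [a_1 ⊗ b_1, ..., a_J ⊗ b_J]
    J = A.shape[1]
    return np.einsum('ij,kj->ikj', A, B).reshape(-1, J)

def unfold(Y, n):
    # n-mode matricization Y_(n): mode-n fibers become columns
    # (Fortran ordering of the remaining modes)
    return np.reshape(np.moveaxis(Y, n, 0), (Y.shape[n], -1), order='F')

# A third-order tensor with J = 3 rank-one terms: Y = sum_j u1_j ∘ u2_j ∘ u3_j
rng = np.random.default_rng(0)
U1, U2, U3 = rng.random((4, 3)), rng.random((5, 3)), rng.random((6, 3))
Y = np.einsum('ir,jr,kr->ijk', U1, U2, U3)

# Mode-1 identity: Y_(1) = U^(1) (U^(3) ⊙ U^(2))^T
assert np.allclose(unfold(Y, 0), U1 @ khatri_rao(U3, U2).T)
```

The ordering convention matters here: with U^{⊙ −n} = U^(N) ⊙ ··· ⊙ U^(1) (mode n omitted), the matching unfolding must enumerate the remaining modes with the earliest mode varying fastest, hence the Fortran-order reshape.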

2. Models and Problem Statements

Fig. 1  Illustration of a third-order tensor factorization using standard NTF. The objective is to estimate the nonnegative vectors u_j^(n) for j = 1, 2, . . . , J and n = 1, 2, 3.

In this paper, we consider first a simple NMF model, described as

    Y = AX + E = AB^T + E,    (1)

where Y = [y_ik] ∈ R^{I×K} is a known input data matrix, A = [a_1, a_2, . . . , a_J] ∈ R_+^{I×J} is an unknown basis (mixing) matrix with nonnegative vectors a_j ∈ R_+^I, X = B^T = [x_1^T, x_2^T, . . . , x_J^T]^T ∈ R_+^{J×K} is a matrix representing the unknown nonnegative components x_j, and E = [e_ik] ∈ R^{I×K} represents errors or noise. For simplicity, we also use the matrix B = X^T = [b_1, b_2, . . . , b_J] ∈ R_+^{K×J}, which allows us to work only with column vectors. Our primary objective is to estimate the vectors a_j of the mixing (basis) matrix A and the sources x_j = b_j^T (rows of the matrix X, or columns of B), subject to nonnegativity constraints†.

The simple NMF model (1) can be naturally extended to the NTF (or nonnegative PARAFAC) model as follows: "For a given N-th order tensor Y ∈ R^{I_1×I_2×···×I_N}, perform a nonnegative factorization (decomposition) into a set of N unknown matrices U^(n) = [u_1^(n), u_2^(n), . . . , u_J^(n)] ∈ R_+^{I_n×J}, (n = 1, 2, . . . , N), representing the common (loading) factors", i.e., [11],[16]

    Y = Σ_{j=1}^{J} u_j^(1) ◦ u_j^(2) ◦ ··· ◦ u_j^(N) + E,    (2)

where ◦ denotes the outer product of vectors†† and Ŷ := Σ_{j=1}^{J} u_j^(1) ◦ u_j^(2) ◦ ··· ◦ u_j^(N) is the estimated (approximating) tensor (see Fig. 1). For simplicity, we use the following notation for the parameters of the estimated tensor: Ŷ := Σ_{j=1}^{J} ⟦u_j^(1), u_j^(2), . . . , u_j^(N)⟧ = ⟦{U}⟧ [16]. A residuum tensor defined as E = Y − Ŷ represents noise or errors, depending on the application. This model can be referred to as the nonnegative version of CANDECOMP proposed by Carroll and Chang, or equivalently nonnegative PARAFAC proposed independently by Harshman and Kruskal. In practice, we usually need to normalize the vectors u_j^(n) ∈ R^{I_n} to unit length, i.e., with ‖u_j^(n)‖_2 = 1 for n = 1, 2, . . . , N − 1, ∀ j = 1, 2, . . . , J, or alternatively apply a Kruskal model:

    Y = Ŷ + E = Σ_{j=1}^{J} λ_j (u_j^(1) ◦ u_j^(2) ◦ ··· ◦ u_j^(N)) + E,    (3)

where λ = [λ_1, λ_2, . . . , λ_J]^T ∈ R_+^J are scaling factors and the factor matrices U^(n) = [u_1^(n), u_2^(n), . . . , u_J^(n)] have all vectors u_j^(n) normalized to unit-length columns, in the sense ‖u_j^(n)‖_2^2 = u_j^(n)T u_j^(n) = 1, ∀ j, n. Generally, the scaling factors could be derived as λ_j = ‖u_j^(N)‖_2. However, we often assume that the weight vector λ can be absorbed into the (non-normalized) factor matrix U^(N), and therefore the model can be expressed in the simplified form (2). The objective is to estimate the nonnegative component matrices U^(n), or equivalently the set of vectors u_j^(n), (n = 1, 2, . . . , N, j = 1, 2, . . . , J), assuming that the number of factors J is known or can be estimated.

It is easy to check that for N = 2, with U^(1) = A and U^(2) = B = X^T, the NTF model simplifies to the standard NMF. However, in order to avoid tedious and quite complex notations, we will derive most algorithms first for the NMF problem and then attempt to generalize them to the NTF problem, presenting the basic concepts in a clear and easily understandable form.

† Usually, a sparsity constraint is naturally and intrinsically provided due to the nonlinear projection approach (e.g., a half-wave rectifier or adaptive nonnegative shrinkage with gradually decreasing threshold [17]).
†† For example, the outer product of two vectors a ∈ R^I, b ∈ R^J builds up a rank-one matrix A = a ◦ b = ab^T ∈ R^{I×J}, and the outer product of three vectors a ∈ R^I, b ∈ R^K, c ∈ R^Q builds up a third-order rank-one tensor Y = a ◦ b ◦ c ∈ R^{I×K×Q}, with entries defined as y_ikq = a_i b_k c_q.

Most known algorithms for the NMF/NTF models are based on alternating least squares (ALS) minimization of the squared Euclidean distance (Frobenius norm) [13],[16],[18]. In particular, for NMF we minimize the following cost function:

    D_F(Y ‖ Ŷ) = ½ ‖Y − AX‖_F^2,   Ŷ = AX,    (4)

and for the NTF model (2)

    D_F(Y ‖ Ŷ) = ½ ‖Y − Σ_{j=1}^{J} (u_j^(1) ◦ u_j^(2) ◦ ··· ◦ u_j^(N))‖_F^2,    (5)

subject to nonnegativity constraints and often additional constraints such as sparsity or smoothness [10]. Such formulated problems can be considered as a natural extension of the extensively studied NNLS (Nonnegative Least Squares) problem, formulated as the following optimization problem: "Given a matrix A ∈ R^{I×J} and a set of observed values given by a vector y ∈ R^I, find a nonnegative vector x ∈ R^J which minimizes the cost function J(x) = ½ ‖y − Ax‖_2^2, i.e.,

    min_x J(x) = ½ ‖y − Ax‖_2^2,    (6)

subject to x ≥ 0" [13].

A basic approach to the above formulated optimization problems (4)-(5) is alternating minimization or alternating projection: the specified cost function is alternately minimized with respect to two sets of parameters, each time optimizing one set of arguments while keeping the other fixed. It should be noted that the cost function (4) is convex with respect to the entries of A or of X, but not with respect to both. Alternating minimization of the cost function (4) leads to the nonnegative fixed-point ALS algorithm, which can be described briefly as follows:

1. Initialize A randomly, or by using the recursive application of Perron-Frobenius theory to the SVD [13].
2. Estimate X from the matrix equation A^T A X = A^T Y by solving min_X D_F(Y ‖ AX) = ½ ‖Y − AX‖_F^2, with fixed A.
3. Set all negative elements of X to zero or to a small positive value.
4. Estimate A from the matrix equation X X^T A^T = X Y^T by solving min_A D_F(Y ‖ AX) = ½ ‖Y^T − X^T A^T‖_F^2, with fixed X.
5. Set all negative elements of A to zero or to a small positive value ε.

The above ALS algorithm can be written in the following explicit form:

    X ← max{ε, (A^T A)^{−1} A^T Y} := [A^† Y]_+ ,    (7)
    A ← max{ε, Y X^T (X X^T)^{−1}} := [Y X^†]_+ ,    (8)

where A^† is the Moore-Penrose pseudo-inverse of A and ε is a small constant (typically 10^{−16}) used to enforce positive entries. Note that the max operator is applied component-wise to the entries of the matrices. Various additional constraints on A and X can be imposed [19].
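The projected ALS iteration (7)-(8) can be sketched in a few lines of numpy (our own illustrative implementation; the random initialization, iteration count and variable names are our choices, not the paper's):

```python
import numpy as np

def nmf_als(Y, J, n_iter=100, eps=1e-16, seed=0):
    # Projected ALS, update rules (7)-(8): least-squares solve, then
    # replace non-positive entries by a small constant eps
    rng = np.random.default_rng(seed)
    A = rng.random((Y.shape[0], J))
    for _ in range(n_iter):
        X = np.maximum(eps, np.linalg.lstsq(A, Y, rcond=None)[0])        # (7)
        A = np.maximum(eps, np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T)  # (8)
    return A, X
```

Here lstsq plays the role of the Moore-Penrose pseudo-inverse, i.e., it computes A^† Y and (via the transposed problem) Y X^†.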

For large-scale NMF problems with J ≪ I and J ≪ K, the data matrix Y is usually of low rank, and in such cases we do not need to process all elements of the matrix in order to estimate the factor matrices A and X (see Fig. 2). In fact, instead of performing the large-scale factorization (1), we can consider alternating factorizations of much smaller dimensions:

    Y_r = A_r X + E_r,   for fixed (known) A_r,    (9)
    Y_c = A X_c + E_c,   for fixed (known) X_c,    (10)

where Y_r ∈ R_+^{R×K} and Y_c ∈ R_+^{I×C} are matrices constructed from preselected rows and columns of the data matrix Y, respectively. Analogously, we construct the reduced-dimension matrices A_r ∈ R^{R×J} and X_c ∈ R^{J×C} by using, respectively, the same row indices as were used to build Y_r and the same column indices as were used to build Y_c. There are several strategies for choosing the columns and rows of the input data matrix. The simplest scenario is to choose the first R rows and the first C columns of the data matrix Y. Alternatively, we can select them randomly, or uniformly spaced, i.e., every N-th row and column. Another option is to choose the rows and columns that have the largest ℓ_p-norm values. For noisy data with uncorrelated noise, we can construct new columns and rows as local averages (mean values) of a specific number of columns and rows of the raw data. For example, the first selected column is created as the average of the first M columns, the second column as the average of the next M columns, and so on. The same procedure is applied to the rows. Another approach is to cluster all columns and rows into C and R clusters, respectively, and to select one column and one row from each cluster. In practice, it is sufficient to choose J < R ≤ 4J and J < C ≤ 4J.

Fig. 2  Conceptual illustration of the processing of data for large-scale NMF. Instead of processing the whole matrix Y ∈ R^{I×K}, we process much smaller block matrices Y_c ∈ R^{I×C} and Y_r ∈ R^{R×K} and the corresponding factor matrices X_c ∈ R^{J×C} and A_r ∈ R^{R×J}, with C ≪ K and R ≪ I. For simplicity, we have assumed that the first R rows and the first C columns of the matrices Y, A, X are chosen, respectively.

In the special case of the squared Euclidean distance (Frobenius norm), instead of alternately minimizing the cost function

    D_F(Y ‖ AX) = ½ ‖Y − AX‖_F^2,

we can minimize sequentially two cost functions:

    D_F(Y_r ‖ A_r X) = ½ ‖Y_r − A_r X‖_F^2,   for fixed A_r,
    D_F(Y_c ‖ A X_c) = ½ ‖Y_c − A X_c‖_F^2,   for fixed X_c.

Minimization of these cost functions with respect to X and A, subject to nonnegativity constraints, leads to simple ALS update formulas for large-scale NMF:

    A ← [Y_c X_c^†]_+ = [Y_c X_c^T (X_c X_c^T)^{−1}]_+ ,    (11)
    X ← [A_r^† Y_r]_+ = [(A_r^T A_r)^{−1} A_r^T Y_r]_+ .    (12)
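One sweep of the block-wise updates (11)-(12) can be sketched as follows (a hypothetical numpy implementation using the simplest selection strategy from the text, the first R rows and first C columns; all names are ours):

```python
import numpy as np

def large_scale_als_step(Y, X, R, C, eps=1e-16):
    # Block-wise ALS sweep: update A from the column block (11),
    # then X from the row block (12)
    Yr, Yc = Y[:R, :], Y[:, :C]
    Xc = X[:, :C]
    A = np.maximum(eps, Yc @ Xc.T @ np.linalg.inv(Xc @ Xc.T))   # (11)
    Ar = A[:R, :]
    X = np.maximum(eps, np.linalg.solve(Ar.T @ Ar, Ar.T @ Yr))  # (12)
    return A, X
```

Only the R×K and I×C blocks of Y are touched per sweep, which is the point of the construction illustrated in Fig. 2.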

The nonnegative ALS algorithm can be generalized to the NTF problem (2) [16]:

    U^(n) ← [Y_(n) U^{⊙ −n} ((U^{⊙ −n})^T U^{⊙ −n})^{−1}]_+
          = [Y_(n) U^{⊙ −n} ({U^T U}^{⊛ −n})^{−1}]_+ ,   n = 1, . . . , N,    (13)

where Y_(n) ∈ R_+^{I_n × I_1···I_{n−1} I_{n+1}···I_N} is the n-mode unfolded matrix of the tensor Y ∈ R_+^{I_1×I_2×···×I_N} and {U^T U}^{⊛ −n} = (U^(N)T U^(N)) ⊛ ··· ⊛ (U^(n+1)T U^(n+1)) ⊛ (U^(n−1)T U^(n−1)) ⊛ ··· ⊛ (U^(1)T U^(1)).

At present, ALS algorithms for NMF and NTF are considered to be "workhorse" approaches; however, they may take many iterations to converge. Moreover, they are not guaranteed to converge to a global minimum or even a stationary point, but only to a solution where the cost functions cease to decrease [13],[16]. However, the ALS method can be considerably improved, and its computational complexity reduced, as will be shown in this paper.


In fact, in this paper we use a different and more sophisticated approach. Instead of minimizing one or two cost functions, we minimize a set of local cost functions with the same global minima (e.g., squared Euclidean distances and Alpha or Beta divergences with a single parameter alpha or beta). The majority of known algorithms for NMF work only if the assumption K ≫ I ≥ J is satisfied, where J is the number of nonnegative components. The NMF algorithms developed in this paper are also suitable for the under-determined case, i.e., for K > J > I, if the sources are sparse enough. Moreover, the proposed algorithms are robust with respect to noise and suitable for large-scale problems. Furthermore, in this paper we consider the extension of our approach to NMF/NTF models with optional sparsity and smoothness constraints.

3. Derivation of Fast HALS NMF Algorithms

Denoting the columns A = [a_1, a_2, . . . , a_J] and B = [b_1, b_2, . . . , b_J], we can express the squared Euclidean cost function as

    J(a_1, . . . , a_J, b_1, . . . , b_J) = ½ ‖Y − AB^T‖_F^2 = ½ ‖Y − Σ_{j=1}^{J} a_j b_j^T‖_F^2.    (14)

The basic idea is to define the residues

    Y^(j) = Y − Σ_{p≠j} a_p b_p^T = Y − AB^T + a_j b_j^T
          = (Y − AB^T + a_{j−1} b_{j−1}^T) − a_{j−1} b_{j−1}^T + a_j b_j^T    (15)

(so that Y^(j) = Y^(j−1) − a_{j−1} b_{j−1}^T + a_j b_j^T can be updated recursively)

for j = 1, 2, . . . , J, and to minimize alternately the set of cost functions (with respect to the sets of parameters {a_j} and {b_j}):

    D_A^(j)(a_j) = ½ ‖Y^(j) − a_j b_j^T‖_F^2,   for fixed b_j,    (16)
    D_B^(j)(b_j) = ½ ‖Y^(j) − a_j b_j^T‖_F^2,   for fixed a_j,    (17)

for j = 1, 2, . . . , J, subject to a_j ≥ 0 and b_j ≥ 0, respectively.

In other words, we alternately minimize the set of cost functions

    D_F^(j)(Y^(j) ‖ a_j b_j^T) = ½ ‖Y^(j) − a_j b_j^T‖_F^2,    (18)

for j = 1, 2, . . . , J, subject to a_j ≥ 0 and b_j ≥ 0, respectively.

The gradients of the local cost functions (18) with respect to the unknown vectors a_j and b_j (assuming that the other vectors are fixed) are expressed by

    ∂D_F^(j)(Y^(j) ‖ a_j b_j^T) / ∂a_j = a_j b_j^T b_j − Y^(j) b_j,    (19)
    ∂D_F^(j)(Y^(j) ‖ a_j b_j^T) / ∂b_j = b_j a_j^T a_j − Y^(j)T a_j.    (20)

Algorithm 1  HALS for NMF: Given Y ∈ R_+^{I×K}, estimate A ∈ R_+^{I×J} and X = B^T ∈ R_+^{J×K}

 1: Initialize the nonnegative matrices A and/or X = B^T using ALS
 2: Normalize the vectors a_j (or b_j) to unit ℓ_2-norm length
 3: E = Y − AB^T
 4: repeat
 5:   for j = 1 to J do
 6:     Y^(j) ⇐ E + a_j b_j^T
 7:     b_j ⇐ [Y^(j)T a_j]_+
 8:     a_j ⇐ [Y^(j) b_j]_+
 9:     a_j ⇐ a_j / ‖a_j‖_2
10:     E ⇐ Y^(j) − a_j b_j^T
11:   end for
12: until the convergence criterion is reached

By equating the gradient components to zero, and assuming that we enforce the nonnegativity constraints with a simple "half-wave rectifying" nonlinear projection, we obtain a simple set of sequential learning rules:

    b_j ← (1 / (a_j^T a_j)) [Y^(j)T a_j]_+ ,   a_j ← (1 / (b_j^T b_j)) [Y^(j) b_j]_+ ,    (21)

for j = 1, 2, . . . , J. We refer to these update rules as the HALS algorithm, which we first introduced in [3]. The same or similar update rules for NMF have been proposed or rediscovered independently in [20]–[23]. However, our practical implementations of the HALS algorithm are quite different and allow various extensions to sparse and smooth NMF, and also to N-th order NTF.

First of all, it follows from formula (15) that we do not need to compute the residue matrix Y^(j) explicitly in each iteration step; we can simply update it incrementally [24].

It is interesting to note that such nonlinear projections can be imposed individually for each source x_j and/or vector a_j, so the algorithm can be directly extended to a semi-NMF or semi-NTF model in which some parameters are relaxed to be bipolar (by removing the half-wave rectifying operator [·]_+ where necessary). Furthermore, in practice it is necessary to normalize the column vectors a_j and/or b_j to unit length at each iteration step (in the sense of the ℓ_p-norm, p = 1, 2, . . . , ∞). In the special case of the ℓ_2-norm, the above algorithm can be further simplified by ignoring the denominators in (21) and normalizing the vectors after each iteration step. The standard HALS local updating rules can then be written in a simplified scalar form:

    b_kj ← [ Σ_{i=1}^{I} a_ij y_ik^(j) ]_+ ,   a_ij ← [ Σ_{k=1}^{K} b_kj y_ik^(j) ]_+ ,    (22)

with a_ij ← a_ij / ‖a_j‖_2, where y_ik^(j) = [Y^(j)]_ik = y_ik − Σ_{p≠j} a_ip b_kp. An efficient implementation of the HALS algorithm (22) is illustrated by the detailed pseudo-code given in Algorithm 1.
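Algorithm 1 translates almost line by line into numpy (our own sketch; for brevity we initialize with plain random matrices rather than ALS):

```python
import numpy as np

def hals_nmf(Y, J, n_iter=100, seed=0):
    # HALS (Algorithm 1): sequential updates of b_j, a_j with an
    # incrementally maintained residue E = Y - A B^T
    rng = np.random.default_rng(seed)
    I, K = Y.shape
    A, B = rng.random((I, J)), rng.random((K, J))
    A /= np.linalg.norm(A, axis=0)                     # unit l2-norm columns
    E = Y - A @ B.T
    for _ in range(n_iter):
        for j in range(J):
            Yj = E + np.outer(A[:, j], B[:, j])        # Y^(j), line 6
            B[:, j] = np.maximum(0, Yj.T @ A[:, j])    # line 7 (a_j^T a_j = 1)
            A[:, j] = np.maximum(0, Yj @ B[:, j])      # line 8
            nrm = np.linalg.norm(A[:, j])
            if nrm > 0:
                A[:, j] /= nrm                         # line 9
            E = Yj - np.outer(A[:, j], B[:, j])        # line 10
    return A, B
```

Because each a_j is kept at unit ℓ_2-norm, the denominators of (21) can be dropped, exactly as the text describes.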

3.1 Extensions and Practical Implementations of Fast HALS

The above simple algorithm can be further extended or improved (with respect to convergence rate and performance, and by imposing additional constraints such as sparsity and smoothness). First of all, different cost functions can be used for the estimation of the rows of the matrix X = B^T and of the columns of the matrix A (possibly with various additional regularization terms [19],[25]). Furthermore, the columns of A can be estimated simultaneously, instead of one by one. For example, by minimizing the set of local cost functions (18) with respect to b_j, and simultaneously the global cost function (4) with respect to A, with normalization of the columns a_j to unit ℓ_2-norm, we obtain a very efficient NMF learning algorithm in which the individual vectors of B are updated locally (column by column) while the matrix A is updated globally using nonnegative ALS (all columns a_j simultaneously) (see also [19]):

    b_j ← [Y_r^(j)T ã_j]_+ / (ã_j^T ã_j),   A ← [Y_c X_c^T (X_c X_c^T)^{−1}]_+ ,    (23)

where ã_j is the j-th column of the reduced matrix A_r ∈ R_+^{R×J}. The matrix A needs to be normalized after each iteration so that its columns have unit ℓ_2-norm length.

Alternatively, an even more efficient approach is to perform a factor-by-factor procedure, instead of updating column-by-column vectors [24]. From (21), we obtain the following update rule for b_j = x_j^T:

    b_j ← Y^(j)T a_j / (a_j^T a_j) = (Y − AB^T + a_j b_j^T)^T a_j / (a_j^T a_j)
        = (Y^T a_j − B A^T a_j + b_j a_j^T a_j) / (a_j^T a_j)
        = ([Y^T A]_j − B [A^T A]_j + b_j a_j^T a_j) / (a_j^T a_j),    (24)

with b_j ← [b_j]_+. Since ‖a_j‖_2^2 = 1, the learning rule for b_j takes the simplified form

    b_j ← [ b_j + [Y^T A]_j − B [A^T A]_j ]_+ .    (25)

Analogously to equation (24), the learning rule for a_j is given by

    a_j ← [ a_j b_j^T b_j + [Y B]_j − A [B^T B]_j ]_+ ,    (26)
    a_j ← a_j / ‖a_j‖_2 .    (27)

Based on these expressions, we have designed and implemented the improved and modified HALS algorithm, given below in pseudo-code as Algorithm 2. For large-scale data and a block-wise strategy, the fast HALS learning rule for b_j is rewritten from (24) as follows:

    b_j ← [ b_j + [Y_r^T A_r]_j / ‖ã_j‖_2^2 − B [A_r^T A_r]_j / ‖ã_j‖_2^2 ]_+
        = [ b_j + [Y_r^T A_r D_{A_r}]_j − B [A_r^T A_r D_{A_r}]_j ]_+ ,    (28)

where D_{A_r} = diag(‖ã_1‖_2^{−2}, ‖ã_2‖_2^{−2}, . . . , ‖ã_J‖_2^{−2}) is a diagonal matrix. The learning rule for a_j has a similar form:

has a similar form

a

j

←

a

j

+

Y

c

B

c

D

B

c

j

− A

h

B

T

c

B

c

D

B

c

i

j

+

(29)

where D

B

c

= diag(k

˜

b

1

k

−2

2

, k

˜

b

2

k

−2

2

, . . . , k

˜

b

J

k

−2

2

) and

˜

b

j

is the

j-th vector of the reduced matrix B

c

= X

T

c

∈ R

C×J

+

.

Algorithm 2  FAST HALS for NMF: Y ≈ AB^T

 1: Initialize the nonnegative matrices A and/or B using ALS
 2: Normalize the vectors a_j (or b_j) to unit ℓ_2-norm length
 3: repeat
 4:   % Update B
 5:   W = Y^T A
 6:   V = A^T A
 7:   for j = 1 to J do
 8:     b_j ⇐ [b_j + w_j − B v_j]_+
 9:   end for
10:   % Update A
11:   P = Y B
12:   Q = B^T B
13:   for j = 1 to J do
14:     a_j ⇐ [a_j q_jj + p_j − A q_j]_+
15:     a_j ⇐ a_j / ‖a_j‖_2
16:   end for
17: until the convergence criterion is reached
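For comparison, here is a compact numpy rendering of Algorithm 2 (again our own sketch, with random initialization in place of ALS):

```python
import numpy as np

def fast_hals_nmf(Y, J, n_iter=100, seed=0):
    # FAST HALS (Algorithm 2): residue-free column updates (25)-(27)
    rng = np.random.default_rng(seed)
    I, K = Y.shape
    A, B = rng.random((I, J)), rng.random((K, J))
    A /= np.linalg.norm(A, axis=0)
    for _ in range(n_iter):
        W, V = Y.T @ A, A.T @ A                  # lines 5-6
        for j in range(J):
            B[:, j] = np.maximum(0, B[:, j] + W[:, j] - B @ V[:, j])            # line 8
        P, Q = Y @ B, B.T @ B                    # lines 11-12
        for j in range(J):
            A[:, j] = np.maximum(0, A[:, j] * Q[j, j] + P[:, j] - A @ Q[:, j])  # line 14
            nrm = np.linalg.norm(A[:, j])
            if nrm > 0:
                A[:, j] /= nrm                   # line 15
    return A, B
```

Note that, unlike Algorithm 1, no I×K residue matrix is formed; only the J-column products Y^T A, Y B and the small Gram matrices A^T A and B^T B are recomputed per sweep.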

3.2 HALS NMF Algorithm with Sparsity and Smoothness Constraints

In order to impose sparseness and smoothness constraints on the vectors b_j (source signals), we can minimize the following set of cost functions:

    D_F^(j)(Y^(j) ‖ a_j b_j^T) = ½ ‖Y^(j) − a_j b_j^T‖_F^2 + α_sp ‖b_j‖_1 + α_sm ‖ϕ(L b_j)‖_1 ,    (30)

for j = 1, 2, . . . , J, subject to a_j ≥ 0 and b_j ≥ 0, where α_sp > 0 and α_sm > 0 are regularization parameters controlling the levels of sparsity and smoothness, respectively, L is a suitably designed matrix (the Laplace operator) which measures the smoothness (by estimating the differences between neighboring samples of b_j)†, and ϕ : R → R is an edge-preserving function applied component-wise. Although this edge-preserving nonlinear function may take various forms [26]:

    ϕ(t) = |t|^α / α,   1 ≤ α ≤ 2,    (31)
    ϕ(t) = sqrt(α + t^2),    (32)
    ϕ(t) = 1 + |t|/α − log(1 + |t|/α),   α > 0,    (33)

we restrict ourselves to the simple cases where ϕ(t) = |t|^α / α for α = 1 or 2, and L is the derivative operator of the first or second order. For example, the first-order derivative operator L with K points can take the form

        ⎡ 1  −1              ⎤
    L = ⎢     1  −1          ⎥
        ⎢         ⋱   ⋱     ⎥    (34)
        ⎣             1  −1  ⎦

and the cost function (30) becomes similar to total-variation (TV) regularization (which is often used in signal and image recovery), but with additional sparsity constraints:

† In the special case of L = I_K and ϕ(t) = |t|, the smoothness regularization term becomes a sparsity term.


    D_F^(j)(Y^(j) ‖ a_j b_j^T) = ½ ‖Y^(j) − a_j b_j^T‖_F^2 + α_sp ‖b_j‖_1 + α_sm Σ_{k=1}^{K−1} |b_kj − b_{(k+1)j}| .    (35)

Another important case assumes that ϕ(t) = ½ |t|^2 and that L is the second-order derivative operator with K points. In such a case, we obtain a Tikhonov-like regularization:

    D_F^(j)(Y^(j) ‖ a_j b_j^T) = ½ ‖Y^(j) − a_j b_j^T‖_F^2 + α_sp ‖b_j‖_1 + ½ α_sm ‖L b_j‖_2^2 .    (36)

In this case the update rule for a_j is the same as in (21), whereas the update rule for b_j is given by

    b_j ← (I + α_sm L^T L)^{−1} (Y^(j)T a_j − α_sp 1_K),    (37)

where 1_K ∈ R^K is the vector of all ones. This learning rule is robust to noise; however, it involves a rather high computational cost due to the calculation of the inverse of a large matrix at each iteration. To circumvent this problem, and to considerably reduce the complexity of the algorithm, we present a second-order smoothing operator L in the following form:

        ⎡ −2   2                ⎤         ⎡ 0  2              ⎤
    L = ⎢  1  −2   1            ⎥ = −2I + ⎢ 1  0  1           ⎥
        ⎢      ⋱   ⋱   ⋱      ⎥         ⎢    ⋱  ⋱  ⋱      ⎥    (38)
        ⎢          1  −2   1    ⎥         ⎢       1  0  1     ⎥
        ⎣              2  −2    ⎦         ⎣          2  0     ⎦

      = −2I + 2S.

However, instead of computing L b_j = −2I b_j + 2S b_j directly, in the second term we replace b_j by its estimate b̂_j obtained from the previous update. Hence, a new smoothing regularization term with ϕ(t) = t^2/8 takes the simplified and computationally more efficient form

    J_sm = ‖ϕ(−2 b_j + 2S b̂_j)‖_1 = ½ ‖b_j − S b̂_j‖_2^2 .    (39)

Finally, the learning rule of the regularized HALS algorithm takes the following form:

    b_j ← [ Y^(j)T a_j − α_sp 1_K + α_sm S b̂_j ]_+ / (a_j^T a_j + α_sm)
        = [ Y^(j)T a_j − α_sp 1_K + α_sm S b̂_j ]_+ / (1 + α_sm) .    (40)
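The regularized update (40) can be sketched as follows (our own illustration; the helper smoothing_matrix builds S from the splitting L = −2I + 2S in (38), and the unit-norm assumption on a_j is taken from the surrounding text):

```python
import numpy as np

def smoothing_matrix(K):
    # S from L = -2I + 2S: (S b)_k averages the neighbours of b_k
    S = np.zeros((K, K))
    S[0, 1] = 1.0
    S[-1, -2] = 1.0
    for k in range(1, K - 1):
        S[k, k - 1] = S[k, k + 1] = 0.5
    return S

def regularized_b_update(Yj, a_j, b_prev, alpha_sp, alpha_sm):
    # Update (40): b_j <- [Yj^T a_j - a_sp 1_K + a_sm S b_prev]_+ / (1 + a_sm),
    # assuming ||a_j||_2 = 1
    K = Yj.shape[1]
    num = Yj.T @ a_j - alpha_sp * np.ones(K) + alpha_sm * (smoothing_matrix(K) @ b_prev)
    return np.maximum(0, num) / (1.0 + alpha_sm)
```

With α_sp = α_sm = 0 the rule reduces to the plain HALS update (21) for a unit-norm a_j.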

Alternatively, for a relatively small dimension of the matrix A, an efficient solution is based on the combination of a local learning rule for the vectors of B and a global one for A, based on the nonnegative ALS algorithm:

    b_j ← [ Y^(j)T a_j − α_sp 1_K + α_sm S b̂_j ]_+ / (1 + α_sm),
    A ← [ Y_c X_c^T (X_c X_c^T)^{−1} ]_+ ,    (41)

with normalization (scaling) of the columns of A to unit ℓ_2-norm length.

An important open problem is the optimal choice of the regularization parameter α_sm. The selection of appropriate regularization parameters plays a key role. Similarly to the Tikhonov-like regularization approach, we select an optimal α_sm by applying the L-curve technique [27] to estimate a corner of the L-curve. However, in NMF, since both matrices A and X are unknown, the procedure is slightly different: first, we initialize α_sm = 0 and perform a preliminary update to obtain A and X; next, we set α_sm by the L-curve corner based on the preliminary estimate of the matrix A; then, we continue updating until convergence is achieved.

4. Fast HALS NTF Algorithm Using Squared Euclidean Distances

The above approaches can be relatively easily extended to the NTF problem. Let us consider the sequential minimization of a set of local cost functions:

    D_F^(j)(Y^(j) ‖ Ŷ^(j)) = ½ ‖Y^(j) − u_j^(1) ◦ u_j^(2) ◦ ··· ◦ u_j^(N)‖_F^2    (42)
                           = ½ ‖Y_(n)^(j) − u_j^(n) ({u_j}^{⊙ −n})^T‖_F^2 ,    (43)

for j = 1, 2, . . . , J, subject to the nonnegativity constraints, where Ŷ^(j) = u_j^(1) ◦ u_j^(2) ◦ ··· ◦ u_j^(N), ({u_j}^{⊙ −n})^T = [u_j^(N)]^T ⊙ ··· ⊙ [u_j^(n+1)]^T ⊙ [u_j^(n−1)]^T ⊙ ··· ⊙ [u_j^(1)]^T, and

    Y^(j) = Y − Σ_{p≠j} u_p^(1) ◦ u_p^(2) ◦ ··· ◦ u_p^(N)    (44)
          = Y − Σ_{p=1}^{J} (u_p^(1) ◦ ··· ◦ u_p^(N)) + (u_j^(1) ◦ ··· ◦ u_j^(N))
          = Y − Ŷ + ⟦{u_j}⟧ ,    (45)

where ⟦{u_j}⟧ = u_j^(1) ◦ ··· ◦ u_j^(N) is a rank-one tensor. Note that (43) is the n-mode matricized (unfolded) version of (42). The gradients of (43) with respect to the elements u_j^(n) are given by

    ∂D_F^(j) / ∂u_j^(n) = −Y_(n)^(j) {u_j}^{⊙ −n} + u_j^(n) ({u_j}^{⊙ −n})^T ({u_j}^{⊙ −n})    (46)
                        = −Y_(n)^(j) {u_j}^{⊙ −n} + γ_j^(n) u_j^(n) ,    (47)

where the scaling coefficients γ_j^(n) can be computed as follows:

    γ_j^(n) = ({u_j}^{⊙ −n})^T ({u_j}^{⊙ −n}) = {u_j^T u_j}^{⊛ −n} = {u_j^T u_j}^⊛ / (u_j^(n)T u_j^(n))
            = (u_j^(N)T u_j^(N)) / (u_j^(n)T u_j^(n))
            = { u_j^(N)T u_j^(N),   n ≠ N,
                1,                  n = N,    (48)

where the last two equalities hold for unit-length vectors u_j^(m), m = 1, 2, . . . , N − 1.

Hence, a new HALS NTF learning rule for u_j^(n), (j = 1, 2, . . . , J; n = 1, 2, . . . , N), is obtained by equating the gradient (47) to zero:

    u_j^(n) ← Y_(n)^(j) {u_j}^{⊙ −n} .    (49)

Note that the scaling factors γ_j^{(n)} have been ignored due to the normalization u_j^{(n)} = u_j^{(n)} / ||u_j^{(n)}||_2 performed after each iteration step for n = 1, 2, . . . , N − 1. The learning rule (49) can be written in an equivalent form expressed by the n-mode multiplication of a tensor by vectors:

u_j^{(n)} ← Y^{(j)} ×_1 u_j^{(1)} ··· ×_{n−1} u_j^{(n−1)} ×_{n+1} u_j^{(n+1)} ··· ×_N u_j^{(N)}
          := Y^{(j)} ×_{−n} {u_j},   j = 1, . . . , J;  n = 1, . . . , N.                                (50)
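The contraction in (50) multiplies the tensor by the vectors u_j^{(m)} in every mode m ≠ n. A minimal NumPy sketch of this operation, under a C-order (row-major) unfolding convention (function and variable names are ours, for illustration only):

```python
import numpy as np

def ttv_all_but_n(Y, vecs, n):
    """Contract tensor Y with vectors vecs[m] in all modes m != n;
    returns a vector of length Y.shape[n]."""
    out = Y
    # contract the highest modes first so the remaining axis numbers stay valid
    for m in reversed(range(Y.ndim)):
        if m != n:
            out = np.tensordot(out, vecs[m], axes=([m], [0]))
    return out

rng = np.random.default_rng(1)
Y = rng.random((4, 5, 6))
u = [rng.random(s) for s in Y.shape]
v = ttv_all_but_n(Y, u, 1)
# equivalent to unfolding mode 1 (C-order) and multiplying by the Kronecker product:
ref = np.moveaxis(Y, 1, 0).reshape(5, -1) @ np.kron(u[0], u[2])
assert np.allclose(v, ref)
```

This also makes the cost comment below concrete: the multi-mode contraction touches every entry of the tensor for every component and mode, which is why the Khatri-Rao reformulation that follows is attractive for large-scale problems.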

For simplicity, we use here the short notation Y^{(j)} ×_{−n} {u_j} introduced by Kolda and Bader [28] to indicate the multiplication of the tensor Y^{(j)} by vectors in all modes but mode n. The above updating formula is elegant and relatively simple but involves a rather high computational cost for large-scale problems. In order to derive a more efficient (faster) algorithm, we exploit basic properties of the Khatri-Rao and Kronecker products of two vectors:

[U^{(1)} ⊙ U^{(2)}]_j = [u_1^{(1)} ⊗ u_1^{(2)}, . . . , u_J^{(1)} ⊗ u_J^{(2)}]_j = u_j^{(1)} ⊙ u_j^{(2)},

or in a more general form:

{u_j}^{⊙−n} = [U^{⊙−n}]_j.                                                                              (51)

Hence, by replacing the Y^{(j)}_{(n)} term in (49) by that in (45), and taking into account (51), the update learning rule (49) can be expressed as

u_j^{(n)} ← Y_{(n)} [U^{⊙−n}]_j − Ŷ_{(n)} [U^{⊙−n}]_j + ⟦{u_j}⟧_{(n)} {u_j}^{⊙−n}
          = [Y_{(n)} U^{⊙−n}]_j − U^{(n)} [(U^{⊙−n})^T U^{⊙−n}]_j + u_j^{(n)} ({u_j}^{⊙−n})^T {u_j}^{⊙−n}
          = [Y_{(n)} U^{⊙−n}]_j − U^{(n)} [{U^T U}^{⊛−n}]_j + γ_j^{(n)} u_j^{(n)}
          = [Y_{(n)} U^{⊙−n}]_j − U^{(n)} [{U^T U}^{⊛} ⊘ (U^{(n)T} U^{(n)})]_j + γ_j^{(n)} u_j^{(n)},    (52)

subject to the normalization of the vectors u_j^{(n)} for n = 1, 2, . . . , N − 1 to unit length. In combination with a componentwise nonlinear half-wave rectifying operator, we finally obtain a new algorithm, referred to as the Fast HALS NTF algorithm:

u_j^{(n)} ← [ γ_j^{(n)} u_j^{(n)} + [Y_{(n)} U^{⊙−n}]_j − U^{(n)} [{U^T U}^{⊛} ⊘ (U^{(n)T} U^{(n)})]_j ]_+ .   (53)

The detailed pseudo-code of this algorithm is given in Algorithm 3. In the special case of N = 2, the Fast HALS NTF algorithm reduces to the Fast HALS NMF algorithm described in the previous section.

† For a 3-way tensor, a direct trilinear decomposition could be used as initialization.
†† In practice, the vectors u_j^{(n)} often have a fixed sign before rectifying.

Algorithm 3 FAST-HALS NTF
1: Nonnegative random or nonnegative ALS initialization of U^{(n)} †
2: Normalize all u_j^{(n)} for n = 1, . . . , N − 1 to unit length
3: T_1 = (U^{(1)T} U^{(1)}) ⊛ ··· ⊛ (U^{(N)T} U^{(N)})
4: repeat
5:   γ = diag(U^{(N)T} U^{(N)})
6:   for n = 1 to N do
7:     γ = 1 if n = N
8:     T_2 = Y_{(n)} U^{⊙−n}
9:     T_3 = T_1 ⊘ (U^{(n)T} U^{(n)})
10:    for j = 1 to J do
11:      u_j^{(n)} ← [γ_j u_j^{(n)} + [T_2]_j − U^{(n)} [T_3]_j]_+ ††
12:      u_j^{(n)} = u_j^{(n)} / ||u_j^{(n)}||_2 if n ≠ N
13:    end for
14:    T_1 = T_3 ⊛ (U^{(n)T} U^{(n)})
15:  end for
16: until convergence criterion is reached
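Algorithm 3 maps directly onto a few dozen lines of NumPy. The sketch below is our illustrative implementation for a dense tensor under the C-order unfolding convention, not the authors' original code; the names T1, T2, T3 mirror the pseudo-code, and `unfold`/`khatri_rao` are our helpers:

```python
import numpy as np

def unfold(Y, n):
    """n-mode matricization Y_(n) under the C-order (row-major) convention."""
    return np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)

def khatri_rao(A, B):
    """Columnwise Kronecker (Khatri-Rao) product of A (I x J) and B (K x J)."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def fast_hals_ntf(Y, J, n_iter=100, eps=1e-12):
    """Sketch of Algorithm 3 (Fast HALS NTF) for a dense nonnegative tensor Y."""
    N = Y.ndim
    rng = np.random.default_rng(0)
    U = [np.abs(rng.standard_normal((s, J))) for s in Y.shape]
    for n in range(N - 1):                                # step 2: normalize modes 1..N-1
        U[n] /= np.linalg.norm(U[n], axis=0, keepdims=True)
    T1 = np.ones((J, J))                                  # step 3: Hadamard product of Grams
    for Un in U:
        T1 = T1 * (Un.T @ Un)
    for _ in range(n_iter):
        gamma = np.diag(U[N - 1].T @ U[N - 1]).copy()     # step 5
        for n in range(N):
            if n == N - 1:
                gamma = np.ones(J)                        # step 7
            others = [U[m] for m in range(N) if m != n]   # Khatri-Rao over all modes but n
            KR = others[0]
            for Um in others[1:]:
                KR = khatri_rao(KR, Um)
            T2 = unfold(Y, n) @ KR                        # step 8
            T3 = T1 / np.maximum(U[n].T @ U[n], eps)      # step 9
            for j in range(J):                            # steps 10-13: HALS sweep
                u = gamma[j] * U[n][:, j] + T2[:, j] - U[n] @ T3[:, j]
                u = np.maximum(u, 0)
                if n != N - 1:
                    u = u / max(np.linalg.norm(u), eps)
                U[n][:, j] = u
            T1 = T3 * (U[n].T @ U[n])                     # step 14
    return U
```

Note that the j-th diagonal entry of T3 equals γ_j^{(n)}, so the term γ_j u_j^{(n)} added in step 11 exactly cancels the self-contribution inside U^{(n)}[T_3]_j; this is what makes each column update a proper HALS step.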

5. Flexible Local Algorithms Using Alpha Divergence

The algorithms derived in the previous sections can be extended to more robust algorithms by applying a family of generalized Alpha and Beta divergences.

For the NMF problem (1) we define the Alpha divergence as follows (similarly to [14], [18], [25], [29]):

D_α^{(j)}([Y^{(j)}]_+ || a_j x_j) =

  Σ_{ik} [ (z_{ik}^{(j)} / (α(α+1))) ( (z_{ik}^{(j)} / y_{ik}^{(j)})^α − 1 ) − (z_{ik}^{(j)} − y_{ik}^{(j)}) / (α+1) ],   α ≠ −1, 0,   (54a)

  Σ_{ik} [ z_{ik}^{(j)} ln(z_{ik}^{(j)} / y_{ik}^{(j)}) − z_{ik}^{(j)} + y_{ik}^{(j)} ],   α = 0,   (54b)

  Σ_{ik} [ y_{ik}^{(j)} ln(y_{ik}^{(j)} / z_{ik}^{(j)}) + z_{ik}^{(j)} − y_{ik}^{(j)} ],   α = −1,   (54c)

where y_{ik}^{(j)} = [Y]_{ik} − Σ_{p≠j} a_{ip} x_{pk} and z_{ik}^{(j)} = a_{ij} x_{jk} = a_{ij} b_{kj} for j = 1, 2, . . . , J.

The choice of the parameter α ∈ R depends on the statistical distributions of the noise and the data. In the special cases of the Alpha divergence for α = {1, −0.5, −2}, we obtain respectively the Pearson chi-squared, Hellinger, and Neyman chi-squared distances, while for the cases α = 0 and α = −1 the divergence has to be defined by the limits of (54a) as α → 0 and α → −1, respectively. When these limits are evaluated, for α → 0 we obtain the generalized Kullback-Leibler divergence defined by Eq. (54b), whereas for α → −1 we have the dual generalized Kullback-Leibler divergence given in Eq. (54c) [1], [14], [19], [25].

The gradient of the Alpha divergence (54a) for α ≠ −1, 0 with respect to a_{ij} and b_{kj} can be expressed in the compact form:

∂D_α^{(j)} / ∂b_{kj} = (1/α) Σ_i a_{ij} [ (z_{ik}^{(j)} / y_{ik}^{(j)})^α − 1 ],   (55)

∂D_α^{(j)} / ∂a_{ij} = (1/α) Σ_k b_{kj} [ (z_{ik}^{(j)} / y_{ik}^{(j)})^α − 1 ].   (56)


By equating the gradients to zero, we obtain a new multiplicative local α-HALS algorithm:

b_j ← ( ([Y^{(j)T}]_+^{.[α]} a_j) / (a_j^T a_j^{.[α]}) )^{.[1/α]},   a_j ← ( ([Y^{(j)}]_+^{.[α]} b_j) / (b_j^T b_j^{.[α]}) )^{.[1/α]},   (57)

where the "rise to a power" operations x^{.[α]} are performed componentwise. The above algorithm can be generalized to the following form:

b_j ← Ψ^{−1}( (Ψ([Y^{(j)T}]_+) a_j) / (a_j^T Ψ(a_j)) ),   a_j ← Ψ^{−1}( (Ψ([Y^{(j)}]_+) b_j) / (b_j^T Ψ(b_j)) ),   (58)

where Ψ(x) is a suitably chosen function, for example, Ψ(x) = x^{.[α]}, applied componentwise†.
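As a concrete illustration of the updates (57), the following is a minimal NumPy sketch of one local α-HALS NMF loop (our illustrative code; the explicit residue bookkeeping and all names are ours, and the nonnegativity floor `eps` is an implementation convenience):

```python
import numpy as np

def alpha_hals_nmf(Y, J, alpha=1.5, n_iter=200, eps=1e-12):
    """Sketch of the local alpha-HALS NMF updates (57): Y (I x K) ~ A @ B.T."""
    rng = np.random.default_rng(0)
    I, K = Y.shape
    A = np.abs(rng.standard_normal((I, J)))
    B = np.abs(rng.standard_normal((K, J)))
    for _ in range(n_iter):
        E = Y - A @ B.T                                      # global residue
        for j in range(J):
            # [Y^(j)]_+ : residue with component j added back, rectified
            Yj = np.maximum(E + np.outer(A[:, j], B[:, j]), 0)
            a, b = A[:, j], B[:, j]
            # (57): componentwise powers .[alpha] and .[1/alpha]
            b = ((Yj.T ** alpha @ a) / max(a @ a ** alpha, eps)) ** (1 / alpha)
            b = np.maximum(b, eps)
            a = ((Yj ** alpha @ b) / max(b @ b ** alpha, eps)) ** (1 / alpha)
            a = np.maximum(a, eps)
            A[:, j], B[:, j] = a, b
            E = Yj - np.outer(a, b)                          # refresh residue
    return A, B
```

With α = 1 the componentwise powers disappear and the update reduces to the plain HALS rule based on the squared Euclidean distance.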

In a similar way, novel learning rules for the N-th order NTF problem (2) can be derived. For this purpose, we consider the n-mode matricized (unfolded) version of the tensor Y:

Y_{(n)} = U^{(n)} (U^{⊙−n})^T.                                                                           (59)

Actually, this can be considered as an NMF model with A ≡ U^{(n)} and B ≡ U^{⊙−n}. From (51), we have

b_j = [U^{⊙−n}]_j = {u_j}^{⊙−n}.                                                                         (60)

Applying the learning rule (58) directly to the model (59) gives

u_j^{(n)} ← Ψ^{−1}( (Ψ([Y^{(j)}_{(n)}]_+) b_j) / (b_j^T Ψ(b_j)) ),                                        (61)

where Y^{(j)}_{(n)} is the n-mode matricized version of Y^{(j)} in (45):

Y^{(j)}_{(n)} = Y_{(n)} − Ŷ_{(n)} + u_j^{(n)} b_j^T
             = Y_{(n)} − Ŷ_{(n)} + u_j^{(n)} ({u_j}^{⊙−n})^T
             = Y_{(n)} − Ŷ_{(n)} + ⟦{u_j}⟧_{(n)}.                                                        (62)

For a specific nonlinear function Ψ(·) (Ψ(x) = x^α), we have

Ψ(b_j) = Ψ({u_j}^{⊙−n})
       = Ψ(u_j^{(N)}) ⊙ ··· ⊙ Ψ(u_j^{(n+1)}) ⊙ Ψ(u_j^{(n−1)}) ⊙ ··· ⊙ Ψ(u_j^{(1)})
       = {Ψ(u_j)}^{⊙−n},                                                                                 (63)

and the denominator in (61) can be simplified as

b_j^T Ψ(b_j) = ({u_j}^{⊙−n})^T {Ψ(u_j)}^{⊙−n} = {u_j^T Ψ(u_j)}^{⊛−n}.                                    (64)

This completes the derivation of a flexible Alpha-HALS NTF update rule, which in tensor form is given by

u_j^{(n)} ← [ Ψ^{−1}( (Ψ([Y^{(j)}]_+) ×_{−n} {u_j}) / {u_j^T Ψ(u_j)}^{⊛−n} ) ]_+,                        (65)

where all nonlinear operations are performed componentwise††.

† For α = 0, instead of Ψ(x) = x^α we use Ψ(x) = ln(x) [18].
†† In practice, instead of half-wave rectifying we often use different transformations, e.g., the real part of Ψ(x), or an adaptive nonnegative shrinkage function with a gradually decreasing threshold down to the noise variance σ²_noise.

Algorithm 4 Alpha-HALS NTF
1: ALS or random initialization of all nonnegative vectors u_j^{(n)}
2: Normalize all u_j^{(n)} for n = 1, 2, . . . , N − 1 to unit length
3: Compute the residue tensor E = Y − ⟦{U}⟧ = Y − Ŷ
4: repeat
5:   for j = 1 to J do
6:     Compute Y^{(j)} = E + u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}
7:     for n = 1 to N do
8:       Update u_j^{(n)} as in (65)
9:       Normalize u_j^{(n)} to unit length if n ≠ N
10:    end for
11:    Update E = Y^{(j)} − u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}
12:  end for
13: until convergence criterion is reached

6. Flexible HALS Algorithms Using Beta Divergence

Beta divergence can be considered as a flexible and complementary cost function to the Alpha divergence. In order to obtain local NMF algorithms, we introduce the following definition of the Beta divergence (similarly to [14], [18], [30]):

D_β^{(j)}([Y^{(j)}]_+ || a_j x_j) =

  Σ_{ik} [ [y_{ik}^{(j)}]_+ ( ([y_{ik}^{(j)}]_+^{β} − z_{ik}^{(j) β}) / β ) − ( [y_{ik}^{(j)}]_+^{β+1} − z_{ik}^{(j) β+1} ) / (β+1) ],   β > 0,   (66a)

  Σ_{ik} [ [y_{ik}^{(j)}]_+ ln( [y_{ik}^{(j)}]_+ / z_{ik}^{(j)} ) − [y_{ik}^{(j)}]_+ + z_{ik}^{(j)} ],   β = 0,   (66b)

  Σ_{ik} [ ln( z_{ik}^{(j)} / [y_{ik}^{(j)}]_+ ) + [y_{ik}^{(j)}]_+ / z_{ik}^{(j)} − 1 ],   β = −1,   (66c)

where y_{ik}^{(j)} = y_{ik} − Σ_{p≠j} a_{ip} b_{kp} and z_{ik}^{(j)} = a_{ij} x_{jk} = a_{ij} b_{kj}

for j = 1, 2, . . . , J. The choice of the real-valued parameter β ≥ −1 depends on the statistical distribution of the data, and the Beta divergence corresponds to the Tweedie models [14], [19], [25], [30]. For example, if we consider the Maximum Likelihood (ML) approach (with no a priori assumptions), the optimal estimation consists of the minimization of the Beta divergence measure when the noise is Gaussian, with β = 1. For the Gamma distribution β = −1, for the Poisson distribution β = 0, and for the compound Poisson distribution β ∈ (−1, 0). However, the ML estimation is not optimal in the sense of a Bayesian approach, where a priori information about the sources and the mixing matrix (sparsity, nonnegativity) can be imposed. It is interesting to note that the Beta divergence includes as special cases the standard squared Euclidean distance (for β = 1), the Itakura-Saito distance (β = −1), and the generalized Kullback-Leibler divergence (β = 0).

In order to derive a local learning algorithm, we compute the gradients of (66) with respect to the elements b_{kj} and a_{ij}:

Algorithm 5 Beta-HALS NTF
1: Initialize randomly all nonnegative factors U^{(n)}
2: Normalize all u_j^{(n)} for n = 1, . . . , N − 1 to unit length
3: Compute the residue tensor E = Y − ⟦{U}⟧ = Y − Ŷ
4: repeat
5:   for j = 1 to J do
6:     Compute Y^{(j)} = E + u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}
7:     for n = 1 to N − 1 do
8:       u_j^{(n)} ← [Y^{(j)} ×_{−n} {Ψ(u_j)}]_+
9:       Normalize u_j^{(n)} to unit length
10:    end for
11:    u_j^{(N)} ← [ (Y^{(j)} ×_{−N} {Ψ(u_j)}) / {Ψ(u_j)^T u_j}^{⊛−N} ]_+
12:    Update E = Y^{(j)} − u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}
13:  end for
14: until convergence criterion is reached

∂D_β^{(j)} / ∂b_{kj} = Σ_i ( z_{ik}^{(j) β} − [y_{ik}^{(j)}]_+ z_{ik}^{(j) β−1} ) a_{ij},   (67)

∂D_β^{(j)} / ∂a_{ij} = Σ_k ( z_{ik}^{(j) β} − [y_{ik}^{(j)}]_+ z_{ik}^{(j) β−1} ) b_{kj}.   (68)

By equating the gradient components to zero, we obtain a set of simple HALS updating rules, referred to as the Beta-HALS algorithm:

b_{kj} ← (1 / Σ_{i=1}^{I} a_{ij}^{β+1}) Σ_{i=1}^{I} a_{ij}^{β} [y_{ik}^{(j)}]_+ ,   (69)

a_{ij} ← (1 / Σ_{k=1}^{K} b_{kj}^{β+1}) Σ_{k=1}^{K} b_{kj}^{β} [y_{ik}^{(j)}]_+ .   (70)

The above update rules can be written in a generalized compact vector form as

b_j ← ( [Y^{(j)T}]_+ Ψ(a_j) ) / ( Ψ(a_j)^T a_j ),   a_j ← ( [Y^{(j)}]_+ Ψ(b_j) ) / ( Ψ(b_j)^T b_j ),   (71)

where Ψ(b) is a suitably chosen convex function (e.g., Ψ(b) = b^{.[β]}) and the nonlinear operations are performed element-wise.

The above learning rules can be generalized for the N-th order NTF problem (2) (using a similar approach as for the Alpha-HALS NTF):

u_j^{(n)} ← ( [Y^{(j)}_{(n)}]_+ Ψ(b_j) ) / ( Ψ(b_j)^T b_j ),                                             (72)

where b_j = {u_j}^{⊙−n}, and Y^{(j)}_{(n)} is defined in (62) and (45).

By taking into account (63), the learning rule (72) can be written as follows:

u_j^{(n)} ← ( [Y^{(j)}_{(n)}]_+ {Ψ(u_j)}^{⊙−n} ) / ( ({Ψ(u_j)}^{⊙−n})^T {u_j}^{⊙−n} )
          = ( [Y^{(j)}]_+ ×_{−n} {Ψ(u_j)} ) / {Ψ(u_j)^T u_j}^{⊛−n}.                                      (73)

Actually, the update rule (73) can be simplified to reduce the computational cost by normalizing the vectors u_j^{(n)} for n = 1, . . . , N − 1 to unit length after each iteration step:

u_j^{(n)} ← [Y^{(j)} ×_{−n} {Ψ(u_j)}]_+,   u_j^{(n)} ← u_j^{(n)} / ||u_j^{(n)}||_2.                      (74)

The detailed pseudo-code of the Beta-HALS NTF algorithm is given in Algorithm 5. Once again, this algorithm can be rewritten in a fast form as follows:

u_j^{(n)} ← [ γ_j^{(n)} u_j^{(n)} + [Y_{(n)} {Ψ(U)}^{⊙−n}]_j − U^{(n)} [{Ψ(U)^T U}^{⊛−n}]_j ]_+,         (75)

where γ_j^{(n)} = {Ψ(u_j)^T u_j}^{⊛−n}, n = 1, . . . , N. The Fast HALS NTF algorithm is a special case with Ψ(x) = x.

In order to avoid local minima, we have also developed simple heuristic hierarchical Alpha- and Beta-HALS NTF algorithms combined with multi-start initializations using the ALS, as follows:

1. Perform the factorization of a tensor for any value of the α or β parameter (preferably, set the value of the parameter to unity, due to the simplicity and high speed of the algorithm for this value).
2. If the algorithm has converged but has not achieved the desired fit value (FIT max), restart the factorization, keeping the previously estimated factors as the initial matrices for the ALS initialization.
3. If the algorithm does not converge, alter the values of the α or β parameters incrementally; this may help to overstep local minima.
4. Repeat the procedure until the desired fit value is reached, or there is a negligible or no change in the fit value or in the factor matrices, or the value of the cost function becomes negligible or zero.

7. Simulation Results

Extensive simulations were performed for synthetic and real-world data on a 2.66 GHz Quad-Core Windows 64-bit PC with 8 GB memory. For tensor factorization, the results were compared with some existing algorithms: the NMWF [31], the lsNTF [32], and also two efficient implementations of the general form of the PARAFAC ALS algorithm, by Kolda and Bader [16] (denoted as ALS K) and by Andersson and Bro [33] (denoted as ALS B). To make a fair comparison, we applied the same stopping criteria and conditions (maximum difference of the fit value), and we used three performance indexes: the Peak Signal-to-Noise Ratio (PSNR) for all frontal slices, the Signal-to-Interference Ratio (SIR)† for each column of the factors, and the explained variation ratio (i.e., how well the approximated tensor fits the input data tensor) for the whole tensor.

† The signal-to-interference ratio is defined as SIR(a_j, â_j) = 10 log(||a_j||_2^2 / ||a_j − â_j||_2^2) for normalized and matched vectors.
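For already matched (permutation- and sign-resolved) column pairs, the SIR index above can be computed with a few lines of NumPy; the helper below is our illustrative code (the normalization step inside it is an assumption, matching the "normalized vectors" condition of the definition):

```python
import numpy as np

def sir_db(a, a_hat):
    """Signal-to-Interference Ratio in dB:
    SIR(a, a_hat) = 10 log10(||a||^2 / ||a - a_hat||^2),
    evaluated after normalizing both vectors to unit length."""
    a = np.asarray(a, dtype=float)
    a_hat = np.asarray(a_hat, dtype=float)
    a = a / np.linalg.norm(a)
    a_hat = a_hat / np.linalg.norm(a_hat)
    return 10.0 * np.log10(np.sum(a ** 2) / np.sum((a - a_hat) ** 2))
```

A nearly exact estimate (equal up to positive scaling) yields a very large SIR, while a poor estimate yields a small one; in the experiments below, mean SIR values above roughly 30 dB indicate an essentially successful separation.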


7.1 Experiments for NMF

In Example 1 we compare our HALS algorithms with the multiplicative Lee-Seung algorithm [34] and the Projected Gradient (PG) algorithm of Chih-Jen Lin [35] for the benchmark Xspectra [36] (see Fig. 3(b)). Ten mixtures were randomly generated from 5 sources (Fig. 3(a)). We selected α = 1.5 for α-HALS and β = 2 for β-HALS in order to show the difference in performance in comparison to the standard generalized Kullback-Leibler (K-L) divergence. A Monte Carlo analysis was also performed with 100 trials, and the average values of SIR for X and the running time for each trial are summarized in Fig. 3(c). Fast HALS NMF, α-HALS and β-HALS achieved higher performance than the two other well-known NMF algorithms.

The simulation results for Example 2, presented in Fig. 4, were obtained for the synthetic benchmark (Fig. 4(a)) with 10 sparse (non-overlapping) nonnegative components. The sources were mixed by a randomly generated full column rank matrix A ∈ R_+^{2×10}, so only two mixed signals were available. The typical mixed signals are shown in Fig. 4(b). The components estimated by the new β-HALS NMF algorithm (69)-(71) with β = 0.1 are illustrated in Fig. 4(c). Moreover, the performance for different values of the parameter β is illustrated in Fig. 4(d) and 4(e), with an average Signal-to-Interference Ratio (SIR) level greater than 30 [dB]. Since the proposed algorithms (alternating technique) perform a non-convex optimization, the estimated components depend on the initial conditions. To estimate the performance in a statistical sense, we performed a Monte Carlo (MC) analysis. Figures 4(d) and 4(e) present the histograms of 100 mean-SIR samples for the estimated matrices A and X. We also conducted an experiment for a similar large-scale problem, in which we used 100 very sparse non-overlapping source signals and mixed them by a randomly generated full column rank mixing matrix A ∈ R_+^{2×100} (i.e., only two mixtures were used). Using the same algorithm but with 25 NMF layers, we were able to recover most of the sources with high probability. The performance is evaluated through the correlation matrix R_X = X̂ X^T, which should be a diagonal matrix for a perfect estimation (given in Fig. 5(a)), whereas the distribution of the SIR performance is shown in Fig. 5(b). Detailed results are omitted due to space limits.

In Example 3 we used five noisy mixtures of three smooth sources (benchmark signals X_5smooth [36]). The mixed signals were corrupted by additive Gaussian noise with SNR = 15 [dB] (Fig. 6(a)). Fig. 6(c) illustrates the efficiency of the HALS NMF algorithm with smoothness constraints using the update rules (41), including the Laplace operator L of the second order. The components estimated by the smooth HALS NMF using 3 layers [14] are depicted in Fig. 6(b), whereas the results of the same algorithm with the smoothness constraint, which achieved SIR_A = 29.22 [dB] and SIR_X = 15.53 [dB], are shown in Fig. 6(c).

7.2 Experiments for NTF

In Example 4, we applied the NTF to simple denoising of images. At first, a third-order tensor Y ∈ R_+^{51×51×40}, each layer of which was generated by the L-shaped membrane function (which creates the MATLAB logo), Y[:, :, k] = k·membrane(1, 25), k = 1, . . . , 40, was corrupted by additive Gaussian noise with SNR = 10 [dB] (Fig. 7(a)). Next, the noisy tensor data was approximated by the NTF model using our α-HALS and β-HALS algorithms, with a fit value of 96.1%. Figs. 7(a), 7(b) and 7(c) are surface visualizations of the 40-th noisy slice and its slices reconstructed by α- and β-HALS NTF (α = 2, β = 2), whereas Figs. 7(d), 7(e) and 7(f) are their iso-surface visualizations, respectively. In addition, the performance for different values of the parameters α and β is illustrated in Fig. 7(g) and 7(h), with PSNR on the left (blue) axis and the number of iterations on the right (red) axis.

In Example 5, we constructed a large-scale tensor of size 500 × 500 × 500, corrupted by additive Gaussian noise with SNR = 0 [dB], by using the three benchmarks X_spectra_sparse, ACPos24sparse10 and X_spectra [36] (see Fig. 8(a)), and successfully reconstructed the original sparse and smooth components using the α- and β-HALS NTF algorithms. The performance is illustrated via volume, iso-surface and factor visualizations, as shown in Fig. 8(b), 8(c) and 8(f), while the running time and the distributions of the SIR and PSNR performance factors are depicted in Fig. 8(g). Slice 10 and its reconstructed slice are displayed in Fig. 8(d) and 8(e). In comparison to the known NTF algorithms, the Fast HALS NTF algorithm provides a higher accuracy of factor estimation based on the SIR index, and a higher explained variation with a faster running time.

In Example 6, we tested the Fast HALS NTF algorithm on real-world data: the decomposition of amino acid fluorescence data (Fig. 9(a)) from five samples containing tryptophan, phenylalanine, and tyrosine (claus.mat) [33], [37]. The data tensor was additionally corrupted by Gaussian noise with SNR = 0 dB (Fig. 9(b)), and the factors were estimated with J = 3. The β-HALS NTF was run with β = 1.2, whereas for the α-HALS NTF we selected α = 0.9. All algorithms were set to process the data with the same number of iterations (100). The performances and running times are compared in Fig. 10 and in Table 3. In this example, we applied a smoothness constraint for the Fast NTF, α- and β-HALS NTF. Based on the fit ratio and the PSNR index, we see that the HALS algorithms usually exhibited better performance than the standard NTF algorithms. For example, the first recovered slice (Fig. 9(c)) is almost identical to the corresponding slice of the clean original tensor (99.51% fit value). In comparison, the NMWF, lsNTF, ALS K and ALS B produced some artifacts, as illustrated in Fig. 9(d), Fig. 9(e) and Fig. 9(f).

In Example 7 we used real EEG data: tutorialdataset2.zip [38], which was pre-processed by a complex Morlet wavelet. The tensor is represented by the inter-trial phase coherence (ITPC) for 14 subjects during a proprioceptive pull of the left and right hand (28 files), with size 64

ceptive pull of left and right hand (28 ﬁles) with size 64

CICHOCKI and PHAN: ALGORITHMS FOR NONNEGATIVE MATRIX AND TENSOR FACTORIZATIONS

11

Fig. 3  Comparison of the Fast HALS NMF, α-HALS, β-HALS, Lee-Seung and PG algorithms in Example 1 with the data set Xspectra: (a) observed mixed signals (10 mixtures of the dataset Xspectra), (b) original spectra (sources) reconstructed using the β-HALS algorithm (β = 2), (c) SIRs for the matrix X and computation times for the different NMF algorithms.

Fig. 4  Illustration of the performance of the β-HALS NMF algorithm: (a) 10 sparse sources assumed to be unknown, (b) two mixtures, (c) 10 sources estimated for β = 0.1, (d) and (e) SIR values for the matrix A and the sources X, respectively.