INVITED PAPER
Special Section on Signal Processing
Fast Local Algorithms for Large Scale Nonnegative Matrix and
Tensor Factorizations
Andrzej CICHOCKI†a), Member and Anh-Huy PHAN††b), Nonmember
SUMMARY Nonnegative matrix factorization (NMF) and its extensions such as Nonnegative Tensor Factorization (NTF) have become prominent techniques for blind source separation (BSS), analysis of image databases, data mining and other information retrieval and clustering applications. In this paper we propose a family of efficient algorithms for NMF/NTF, as well as sparse nonnegative coding and representation, that have many potential applications in computational neuroscience, multi-sensory processing, compressed sensing and multidimensional data analysis. We have developed a class of optimized local algorithms which are referred to as Hierarchical Alternating Least Squares (HALS) algorithms. For these purposes, we have performed sequential constrained minimization on a set of squared Euclidean distances. We then extend this approach to robust cost functions using the Alpha and Beta divergences and derive flexible update rules. Our algorithms are locally stable and work well for NMF-based blind source separation (BSS) not only for the over-determined case but also for an under-determined (over-complete) case (i.e., for a system which has fewer sensors than sources) if the data are sufficiently sparse. The NMF learning rules are extended and generalized for N-th order nonnegative tensor factorization (NTF). Moreover, these algorithms can be tuned to different noise statistics by adjusting a single parameter. Extensive experimental results confirm the accuracy and computational performance of the developed algorithms, especially with the usage of the multi-layer hierarchical NMF approach [3].
key words: Nonnegative matrix factorization (NMF), nonnegative tensor factorizations (NTF), nonnegative PARAFAC, model reduction, feature extraction, compression, denoising, multiplicative local learning (adaptive) algorithms, Alpha and Beta divergences.
1. Introduction
Recent years have seen a surge of interest in nonnegative and sparse matrix and tensor factorizations - decompositions which provide physically meaningful latent (hidden) components or features. Nonnegative Matrix Factorization (NMF) and its extension Nonnegative Tensor Factorization (NTF) - multidimensional models with nonnegativity constraints - have recently been proposed as sparse and efficient representations of signals, images and, in general, natural signals/data. From the signal processing and data analysis point of view, NMF/NTF are very attractive because they take into account spatial and temporal correlations between variables and usually provide sparse common factors or hidden
Manuscript received July 30, 2008.
Manuscript revised November 11, 2008.
Final manuscript received December 12, 2008.
† RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, 351-0198 Saitama, Japan, and Warsaw University of Technology and Systems Research Institute, Polish Academy of Science, Poland.
†† RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, 351-0198 Saitama, Japan.
a) E-mail: cia@brain.riken.jp
b) E-mail: phan@brain.riken.jp
Table 1  Basic tensor operations and notations [16]
∘   outer product
⊙   Khatri-Rao product
⊗   Kronecker product
⊛   Hadamard product
⊘   element-wise division
[U]_j   j-th column vector of U
U^{(n)}   the n-th factor
u_j^{(n)}   j-th column vector of U^{(n)}
{u_j}   ≜ {u_j^{(1)}, u_j^{(2)}, ..., u_j^{(N)}}
Y   tensor
×_n   n-mode product of tensor and matrix
×̄_n   n-mode product of tensor and vector
Y_{(n)}   n-mode matricized version of Y
U^{⊙} = U^{(N)} ⊙ U^{(N−1)} ⊙ ··· ⊙ U^{(1)}
U^{⊙−n} = U^{(N)} ⊙ ··· ⊙ U^{(n+1)} ⊙ U^{(n−1)} ⊙ ··· ⊙ U^{(1)}
U^{⊛} = U^{(N)} ⊛ U^{(N−1)} ⊛ ··· ⊛ U^{(1)}
U^{⊛−n} = U^{(N)} ⊛ ··· ⊛ U^{(n+1)} ⊛ U^{(n−1)} ⊛ ··· ⊛ U^{(1)}
SIR(a, b) = 10 log_10(||a||^2 / ||a − b||^2)
PSNR = 20 log_10(Range of Signal / RMSE)
Fit(Y, Ŷ) = 100 (1 − ||Y − Ŷ||_F^2 / ||Y − E(Y)||_F^2)
(latent) nonnegative components with physical or physiological meaning and interpretations [1]–[5].
In fact, NMF and NTF are emerging techniques for data mining, dimensionality reduction, pattern recognition, object detection, classification, gene clustering, sparse nonnegative representation and coding, and blind source separation (BSS) [5]–[14]. For example, NMF/NTF have already found a wide spectrum of applications in positron emission tomography (PET), spectroscopy, chemometrics and environmental science, where the matrices have clear physical meanings and some normalization or constraints are imposed on them [12],[13],[15].
This paper introduces several alternative approaches and improved local learning rules (in the sense that vectors and rows of matrices are processed sequentially, one by one) for solving nonnegative matrix and tensor factorization problems. Generally, tensors (i.e., multi-way arrays) are denoted by underlined capital boldface letters, e.g., Y ∈ R^{I_1×I_2×···×I_N}. The order of a tensor is the number of modes, also known as ways or dimensions. In contrast, matrices are denoted by boldface capital letters, e.g., Y; vectors are denoted by boldface lowercase letters, e.g., the columns of the matrix A by a_j; and scalars are denoted by lowercase letters, e.g., a_{ij}. The i-th entry of a vector a is denoted by a_i, and the (i, j) element of a matrix A by a_{ij}. Analogously, element (i, k, q) of a third-order tensor Y ∈ R^{I×K×Q} is denoted by y_{ikq}. Indices typically range from 1 to their capital version, e.g., i = 1, 2, ..., I; k = 1, 2, ..., K; q = 1, 2, ..., Q. Throughout this paper, standard notations and basic tensor operations are used as indicated in Table 1.
2. Models and Problem Statements
In this paper, we first consider a simple NMF model described as
Fig. 1  Illustration of a third-order tensor factorization using standard NTF; the objective is to estimate the nonnegative vectors u_j^{(n)} for j = 1, 2, ..., J and n = 1, 2, 3.
Y = AX + E = AB^T + E,   (1)

where Y = [y_{ik}] ∈ R^{I×K} is a known input data matrix, A = [a_1, a_2, ..., a_J] ∈ R_+^{I×J} is an unknown basis (mixing) matrix with nonnegative vectors a_j ∈ R_+^I, X = B^T = [x_1^T, x_2^T, ..., x_J^T]^T ∈ R_+^{J×K} is a matrix representing the unknown nonnegative components x_j, and E = [e_{ik}] ∈ R^{I×K} represents errors or noise. For simplicity, we also use the matrix B = X^T = [b_1, b_2, ..., b_J] ∈ R_+^{K×J}, which allows us to use only column vectors. Our primary objective is to estimate the vectors a_j of the mixing (basis) matrix A and the sources x_j = b_j^T (rows of the matrix X or columns of B), subject to nonnegativity constraints†.
The simple NMF model (1) can be naturally extended to the NTF (or nonnegative PARAFAC) as follows: “For a given N-th order tensor Y ∈ R^{I_1×I_2×···×I_N} perform a nonnegative factorization (decomposition) into a set of N unknown matrices U^{(n)} = [u_1^{(n)}, u_2^{(n)}, ..., u_J^{(n)}] ∈ R_+^{I_n×J}, (n = 1, 2, ..., N), representing the common (loading) factors”, i.e., [11],[16]
Y = Σ_{j=1}^{J} u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)} + E,   (2)
where ∘ means the outer product of vectors†† and Ŷ := Σ_{j=1}^{J} u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)} is an estimated or approximated (actual) tensor (see Fig. 1). For simplicity, we use the following notation for the parameters of the estimated tensor: Ŷ := Σ_{j=1}^{J} ⟦u_j^{(1)}, u_j^{(2)}, ..., u_j^{(N)}⟧ = ⟦{U}⟧ [16]. A residuum tensor defined as E = Y − Ŷ represents noise or errors, depending on the application. This model can be referred to as a nonnegative version of CANDECOMP proposed by Carroll and Chang, or equivalently nonnegative PARAFAC proposed independently by Harshman and Kruskal. In practice, we usually need to normalize the vectors u_j^{(n)} ∈ R^{I_n} to unit length, i.e.,
† Usually, a sparsity constraint is naturally and intrinsically provided due to the nonlinear projection approach (e.g., a half-wave rectifier or adaptive nonnegative shrinkage with gradually decreasing threshold [17]).
†† For example, the outer product of two vectors a ∈ R^I, b ∈ R^J builds up a rank-one matrix A = a ∘ b = ab^T ∈ R^{I×J}, and the outer product of three vectors a ∈ R^I, b ∈ R^K, c ∈ R^Q builds up a third-order rank-one tensor Y = a ∘ b ∘ c ∈ R^{I×K×Q}, with entries defined as y_{ikq} = a_i b_k c_q.
with ||u_j^{(n)}||_2 = 1 for n = 1, 2, ..., N−1, j = 1, 2, ..., J, or alternatively apply the Kruskal model:
Y = Ŷ + E = Σ_{j=1}^{J} λ_j (u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}) + E,   (3)
where λ = [λ_1, λ_2, ..., λ_J]^T ∈ R_+^J are scaling factors and the factor matrices U^{(n)} = [u_1^{(n)}, u_2^{(n)}, ..., u_J^{(n)}] have all vectors u_j^{(n)} normalized to unit-length columns in the sense ||u_j^{(n)}||_2^2 = u_j^{(n)T} u_j^{(n)} = 1, ∀j, n. Generally, the scaling vector λ could be derived as λ_j = ||u_j^{(N)}||_2. However, we often assume that the weight vector λ can be absorbed into the (non-normalized) factor matrix U^{(N)}, and therefore the model can be expressed in the simplified form (2). The objective is to estimate the nonnegative component matrices U^{(n)}, or equivalently the set of vectors u_j^{(n)}, (n = 1, 2, ..., N, j = 1, 2, ..., J), assuming that the number of factors J is known or can be estimated.
It is easy to check that for N = 2 and for U^{(1)} = A and U^{(2)} = B = X^T the NTF simplifies to the standard NMF. However, in order to avoid tedious and quite complex notations, we will derive most algorithms first for the NMF problem and then attempt to generalize them to the NTF problem, presenting the basic concepts in a clear and easily understandable form.
Most of the known algorithms for the NTF/NMF models are based on alternating least squares (ALS) minimization of the squared Euclidean distance (Frobenius norm) [13],[16],[18]. In particular, for NMF we minimize the following cost function:

D_F(Y || Ŷ) = (1/2) ||Y − AX||_F^2,   Ŷ = AX,   (4)
and for the NTF model (2)

D_F(Y || Ŷ) = (1/2) ||Y − Σ_{j=1}^{J} (u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)})||_F^2,   (5)
subject to nonnegativity constraints and often additional constraints such as sparsity or smoothness [10]. Problems formulated in this way can be considered as a natural extension of the extensively studied NNLS (Nonnegative Least Squares) problem formulated as the following optimization problem: “Given a matrix A ∈ R^{I×J} and a set of observed values given by a vector y ∈ R^I, find a nonnegative vector x ∈ R^J to minimize the cost function J(x) = (1/2)||y − Ax||_2^2, i.e.,

min_x J(x) = (1/2) ||y − Ax||_2^2,   (6)

subject to x ≥ 0” [13].
A basic approach to the above formulated optimization problems (4)-(5) is alternating minimization or alternating projection: the specified cost function is alternately minimized with respect to sets of parameters, each time optimizing one set of arguments while keeping the others fixed. It should be noted that the cost function (4) is convex with respect to the entries of A or X, but not both. Alternating minimization of the cost function (4) leads to a nonnegative fixed-point ALS algorithm which can be described briefly as follows:
1. Initialize A randomly or by using the recursive application of Perron-Frobenius theory to SVD [13].
2. Estimate X from the matrix equation A^T A X = A^T Y by solving min_X D_F(Y || AX) = (1/2) ||Y − AX||_F^2, with fixed A.
3. Set all negative elements of X to zero or a small positive value.
4. Estimate A from the matrix equation X X^T A^T = X Y^T by solving min_A D_F(Y || AX) = (1/2) ||Y^T − X^T A^T||_F^2, with fixed X.
5. Set all negative elements of A to zero or a small positive value ε.
The above ALS algorithm can be written in the following explicit form:

X ← max{ε, (A^T A)^{−1} A^T Y} := [A^† Y]_+,   (7)
A ← max{ε, Y X^T (X X^T)^{−1}} := [Y X^†]_+,   (8)

where A^† is the Moore-Penrose inverse of A and ε is a small constant (typically 10^{−16}) to enforce positive entries. Note that the max operator is performed component-wise on the entries of the matrices. Various additional constraints on A and X can be imposed [19].
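As a quick illustration, the projected ALS iteration (7)-(8) can be sketched in a few lines of NumPy. This is only a minimal sketch under our own assumptions (function name, random initialization, least-squares solves in place of explicit Moore-Penrose inverses), not the authors' implementation:

import numpy as np

def als_nmf(Y, J, n_iter=100, eps=1e-16, seed=0):
    # Projected ALS as in (7)-(8): alternate least-squares solves followed by
    # clipping negative entries to the small positive constant eps.
    rng = np.random.default_rng(seed)
    I, K = Y.shape
    A = rng.random((I, J))
    X = rng.random((J, K))
    for _ in range(n_iter):
        # X <- max{eps, A^+ Y}
        X = np.maximum(eps, np.linalg.lstsq(A, Y, rcond=None)[0])
        # A <- max{eps, Y X^+}
        A = np.maximum(eps, np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T)
    return A, X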
For the large scale NMF problem with J ≪ I and J ≪ K, the data matrix Y is usually of low rank, and in such cases we do not need to process all elements of the matrix in order to estimate the factor matrices A and X (see Fig. 2). In fact, instead of performing a large scale factorization of (1), we can consider the alternating factorization of problems of much smaller dimensions:

Y_r = A_r X + E_r,   for fixed (known) A_r,   (9)
Y_c = A X_c + E_c,   for fixed (known) X_c,   (10)
where Y_r ∈ R_+^{R×K} and Y_c ∈ R_+^{I×C} are matrices constructed from preselected rows and columns of the data matrix Y, respectively. Analogously, we construct the reduced-dimension matrices A_r ∈ R^{R×J} and X_c ∈ R^{J×C} by using the same indices for the columns and rows which were used for the construction of the matrices Y_c and Y_r, respectively. There are several strategies to choose the columns and rows of the input data matrix. The simplest scenario is to choose the first R rows and the first C columns of the data matrix Y. Alternatively, we can select them randomly, e.g., uniformly distributed, i.e., every N-th row and column. Another option is to choose the rows and columns that provide the largest ℓ_p-norm values. For noisy data with uncorrelated noise, we can construct new columns and rows as local averages (mean values) of some specific number of columns and rows of the raw data. For example, the first selected column is created as the average of the first M columns, the second column is the average of the next
Fig. 2  Conceptual illustration of the processing of data for a large scale NMF. Instead of processing the whole matrix Y ∈ R^{I×K}, we process the much smaller dimensional block matrices Y_c ∈ R^{I×C} and Y_r ∈ R^{R×K} and the corresponding factor matrices X_c ∈ R^{J×C} and A_r ∈ R^{R×J} with C ≪ K and R ≪ I. For simplicity, we have assumed that the first R rows and the first C columns of the matrices Y, A, X are chosen, respectively.
M columns, and so on. The same procedure is applied to the rows. Another approach is to cluster all columns and rows into C and R clusters and select one column and one row from each cluster, respectively. In practice, it is sufficient to choose J < R ≤ 4J and J < C ≤ 4J. In the special case of the squared Euclidean distance (Frobenius norm), instead of alternately minimizing the cost function

D_F(Y || AX) = (1/2) ||Y − AX||_F^2,

we can minimize sequentially two cost functions:

D_F(Y_r || A_r X) = (1/2) ||Y_r − A_r X||_F^2,   for fixed A_r,
D_F(Y_c || A X_c) = (1/2) ||Y_c − A X_c||_F^2,   for fixed X_c.
Minimization of these cost functions with respect to X and A, subject to nonnegativity constraints, leads to simple ALS update formulas for the large scale NMF:

A ← [Y_c X_c^†]_+ = [Y_c X_c^T (X_c X_c^T)^{−1}]_+,   (11)
X ← [A_r^† Y_r]_+ = [(A_r^T A_r)^{−1} A_r^T Y_r]_+.   (12)
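For concreteness, one sweep of the block-wise updates (11)-(12) might look as follows in NumPy, assuming the simplest selection strategy mentioned above (first R rows and first C columns); the function name and the fixed selection are our own illustrative choices, not part of the paper:

import numpy as np

def large_scale_als_sweep(Y, A, X, R, C, eps=1e-16):
    # Work only with R preselected rows and C preselected columns of Y.
    Yr, Yc = Y[:R, :], Y[:, :C]
    Ar = A[:R, :]                                   # reduced mixing matrix
    # X <- [(Ar^T Ar)^-1 Ar^T Yr]_+   (update (12))
    X = np.maximum(eps, np.linalg.solve(Ar.T @ Ar, Ar.T @ Yr))
    Xc = X[:, :C]                                   # reduced source matrix
    # A <- [Yc Xc^T (Xc Xc^T)^-1]_+   (update (11))
    A = np.maximum(eps, np.linalg.solve(Xc @ Xc.T, Xc @ Yc.T).T)
    return A, X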
The nonnegative ALS algorithm can be generalized for the NTF problem (2) [16]:

U^{(n)} ← [Y_{(n)} U^{⊙−n} (U^{⊙−n T} U^{⊙−n})^{−1}]_+ = [Y_{(n)} U^{⊙−n} ({U^T U}^{⊛−n})^{−1}]_+,   n = 1, ..., N,   (13)

where Y_{(n)} ∈ R_+^{I_n × I_1···I_{n−1} I_{n+1}···I_N} is the n-mode unfolded matrix of the tensor Y ∈ R_+^{I_1×I_2×···×I_N} and {U^T U}^{⊛−n} = (U^{(N)T} U^{(N)}) ⊛ ··· ⊛ (U^{(n+1)T} U^{(n+1)}) ⊛ (U^{(n−1)T} U^{(n−1)}) ⊛ ··· ⊛ (U^{(1)T} U^{(1)}).
At present, ALS algorithms for NMF and NTF are con-
sidered as “workhorse” approaches, however they may take
many iterations to converge. Moreover, they are also not
guaranteed to converge to a global minimum or even a sta-
tionary point, but only to a solution where the cost functions
cease to decrease [13],[16]. However, the ALS method can
be considerably improved and the computational complex-
ity reduced as will be shown in this paper.
In fact, in this paper, we use a different and more sophisticated approach. Instead of minimizing one or two cost functions, we minimize a set of local cost functions with the same global minima (e.g., squared Euclidean distances and Alpha or Beta divergences with a single parameter alpha or beta). The majority of the known algorithms for NMF work only if the assumption K ≫ I ≥ J is satisfied, where J is the number of nonnegative components. The NMF algorithms developed in this paper are also suitable for the under-determined case, i.e., for K > J > I, if the sources are sparse enough. Moreover, the proposed algorithms are robust with respect to noise and suitable for large scale problems. Furthermore, in this paper we consider the extension of our approach to NMF/NTF models with optional sparsity and smoothness constraints.
3. Derivation of Fast HALS NMF Algorithms
Denoting the columns by A = [a_1, a_2, ..., a_J] and B = [b_1, b_2, ..., b_J], we can express the squared Euclidean cost function as

J(a_1, ..., a_J, b_1, ..., b_J) = (1/2) ||Y − AB^T||_F^2 = (1/2) ||Y − Σ_{j=1}^{J} a_j b_j^T||_F^2.   (14)
The basic idea is to define the residues

Y^{(j)} = Y − Σ_{p≠j} a_p b_p^T = Y − AB^T + a_j b_j^T
        = (Y − AB^T + a_{j−1} b_{j−1}^T) − a_{j−1} b_{j−1}^T + a_j b_j^T   (15)
for j = 1, 2, ..., J and to minimize alternately the set of cost functions (with respect to the sets of parameters {a_j} and {b_j}):

D_A^{(j)}(a_j) = (1/2) ||Y^{(j)} − a_j b_j^T||_F^2,   for a fixed b_j,   (16)
D_B^{(j)}(b_j) = (1/2) ||Y^{(j)} − a_j b_j^T||_F^2,   for a fixed a_j,   (17)

for j = 1, 2, ..., J, subject to a_j ≥ 0 and b_j ≥ 0, respectively.
In other words, we minimize alternately the set of cost functions

D_F^{(j)}(Y^{(j)} || a_j b_j^T) = (1/2) ||Y^{(j)} − a_j b_j^T||_F^2,   (18)

for j = 1, 2, ..., J, subject to a_j ≥ 0 and b_j ≥ 0, respectively.
The gradients of the local cost functions (18) with respect to the unknown vectors a_j and b_j (assuming that the other vectors are fixed) are expressed by

∂D_F^{(j)}(Y^{(j)} || a_j b_j^T) / ∂a_j = a_j b_j^T b_j − Y^{(j)} b_j,   (19)
∂D_F^{(j)}(Y^{(j)} || a_j b_j^T) / ∂b_j = b_j a_j^T a_j − Y^{(j)T} a_j.   (20)
Algorithm 1 HALS for NMF: Given Y ∈ R_+^{I×K}, estimate A ∈ R_+^{I×J} and X = B^T ∈ R_+^{J×K}
1: Initialize nonnegative matrix A and/or X = B^T using ALS
2: Normalize the vectors a_j (or b_j) to unit ℓ_2-norm length
3: E = Y − AB^T
4: repeat
5:   for j = 1 to J do
6:     Y^{(j)} ← E + a_j b_j^T
7:     b_j ← [Y^{(j)T} a_j]_+
8:     a_j ← [Y^{(j)} b_j]_+
9:     a_j ← a_j / ||a_j||_2
10:    E ← Y^{(j)} − a_j b_j^T
11:  end for
12: until convergence criterion is reached

By equating the gradient components to zero and assuming that we enforce the nonnegativity constraints with a simple “half-wave rectifying” nonlinear projection, we obtain a simple set of sequential learning rules:
b_j ← (1 / (a_j^T a_j)) [Y^{(j)T} a_j]_+,   a_j ← (1 / (b_j^T b_j)) [Y^{(j)} b_j]_+,   (21)
for j = 1, 2, ..., J. We refer to these update rules as the HALS algorithm, which we first introduced in [3]. The same or similar update rules for the NMF have been proposed or rediscovered independently in [20]–[23]. However, our practical implementations of the HALS algorithm are quite different and allow various extensions to sparse and smooth NMF, and also to the N-order NTF.
First of all, from the formula (15) it follows that we do not need to compute explicitly the residue matrix Y^{(j)} in each iteration step but can just smartly update it [24].
It is interesting to note that such nonlinear projections can be imposed individually for each source x_j and/or vector a_j, so the algorithm can be directly extended to a semi-NMF or a semi-NTF model in which some parameters are relaxed to be bipolar (by removing the half-wave rectifying operator [·]_+, if necessary). Furthermore, in practice, it is necessary to normalize in each iteration step the column vectors a_j and/or b_j to unit length (in the sense of the ℓ_p-norm, p = 1, 2, ..., ∞). In the special case of the ℓ_2-norm, the above algorithm can be further simplified by ignoring the denominators in (21) and imposing the normalization of the vectors after each iteration step. The standard HALS local updating rules can
be written in a simplified scalar form:

b_{kj} ← [Σ_{i=1}^{I} a_{ij} y_{ik}^{(j)}]_+,   a_{ij} ← [Σ_{k=1}^{K} b_{kj} y_{ik}^{(j)}]_+,   (22)

with a_{ij} ← a_{ij} / ||a_j||_2, where y_{ik}^{(j)} = [Y^{(j)}]_{ik} = y_{ik} − Σ_{p≠j} a_{ip} b_{kp}. An efficient implementation of the HALS algorithm (22) is illustrated by the detailed pseudo-code given in Algorithm 1.
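A direct NumPy transcription of Algorithm 1 is sketched below (a minimal sketch with our own variable names; the residue E is updated incrementally exactly as in steps 6 and 10, and no convergence test is included):

import numpy as np

def hals_nmf(Y, A, B, n_iter=100, eps=1e-16):
    A = A / np.linalg.norm(A, axis=0)            # unit l2-norm columns of A
    E = Y - A @ B.T                              # residue matrix
    for _ in range(n_iter):
        for j in range(A.shape[1]):
            Yj = E + np.outer(A[:, j], B[:, j])          # Y^(j) = E + a_j b_j^T
            B[:, j] = np.maximum(eps, Yj.T @ A[:, j])    # b_j <- [Y^(j)T a_j]_+
            A[:, j] = np.maximum(eps, Yj @ B[:, j])      # a_j <- [Y^(j) b_j]_+
            A[:, j] /= np.linalg.norm(A[:, j])           # normalize a_j
            E = Yj - np.outer(A[:, j], B[:, j])          # smart residue update
    return A, B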
3.1 Extensions and Practical Implementations of Fast HALS
The above simple algorithm can be further extended or improved (with respect to convergence rate and performance, and by imposing additional constraints such as sparsity and smoothness). First of all, different cost functions can be used for the estimation of the rows of the matrix X = B^T and the columns of the matrix A (possibly with various additional regularization terms [19],[25]). Furthermore, the columns of A can be estimated simultaneously, instead of one by one. For example, by minimizing the set of cost functions in (4) with respect to b_j, and simultaneously the cost function (18) with normalization of the columns a_j to unit ℓ_2-norm, we obtain a very efficient NMF learning algorithm in which the individual vectors of B are updated locally (column by column) and the matrix A is updated globally using nonnegative ALS (all columns a_j simultaneously) (see also [19]):
b_j ← [Y_r^{(j)T} ã_j]_+ / (ã_j^T ã_j),   A ← [Y_c X_c^T (X_c X_c^T)^{−1}]_+,   (23)

where ã_j is the j-th vector of the reduced matrix A_r ∈ R_+^{R×J}. The matrix A needs to be normalized to unit-length column vectors in the ℓ_2-norm sense after each iteration.
Alternatively, an even more efficient approach is to perform a factor-by-factor procedure, instead of updating column-by-column vectors [24]. From (21), we obtain the following update rule for b_j = x_j^T:
b_j ← Y^{(j)T} a_j / (a_j^T a_j) = (Y − AB^T + a_j b_j^T)^T a_j / (a_j^T a_j)
    = (Y^T a_j − B A^T a_j + b_j a_j^T a_j) / (a_j^T a_j)
    = ([Y^T A]_j − B [A^T A]_j + b_j a_j^T a_j) / (a_j^T a_j),   (24)

with b_j ← [b_j]_+. Due to ||a_j||_2^2 = 1, the learning rule for b_j has the simplified form

b_j ← [b_j + [Y^T A]_j − B [A^T A]_j]_+.   (25)
Analogously to equation (24), the learning rule for a_j is given by

a_j ← [a_j b_j^T b_j + [Y B]_j − A [B^T B]_j]_+,   (26)
a_j ← a_j / ||a_j||_2.   (27)
Based on these expressions, we have designed and implemented the improved and modified HALS algorithm given below in the pseudo-code as Algorithm 2. For large scale data and a block-wise strategy, the fast HALS learning rule for b_j is rewritten from (24) as follows:

b_j ← [b_j + [Y_r^T A_r]_j / ||ã_j||_2^2 − B [A_r^T A_r]_j / ||ã_j||_2^2]_+
    = [b_j + [Y_r^T A_r D_{A_r}]_j − B [A_r^T A_r D_{A_r}]_j]_+,   (28)

where D_{A_r} = diag(||ã_1||_2^{−2}, ||ã_2||_2^{−2}, ..., ||ã_J||_2^{−2}) is a diagonal matrix. The learning rule for a_j has a similar form:

a_j ← [a_j + [Y_c B_c D_{B_c}]_j − A [B_c^T B_c D_{B_c}]_j]_+,   (29)

where D_{B_c} = diag(||b̃_1||_2^{−2}, ||b̃_2||_2^{−2}, ..., ||b̃_J||_2^{−2}) and b̃_j is the j-th vector of the reduced matrix B_c = X_c^T ∈ R_+^{C×J}.
Algorithm 2 FAST HALS for NMF: Y ≈ AB^T
1: Initialize nonnegative matrix A and/or B using ALS
2: Normalize the vectors a_j (or b_j) to unit ℓ_2-norm length
3: repeat
4:   % Update B
5:   W = Y^T A
6:   V = A^T A
7:   for j = 1 to J do
8:     b_j ← [b_j + w_j − B v_j]_+
9:   end for
10:  % Update A
11:  P = Y B
12:  Q = B^T B
13:  for j = 1 to J do
14:    a_j ← [a_j q_{jj} + p_j − A q_j]_+
15:    a_j ← a_j / ||a_j||_2
16:  end for
17: until convergence criterion is reached
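The following NumPy sketch mirrors Algorithm 2 under the same assumptions as before (illustrative names, no convergence test): the matrices W = Y^T A, V = A^T A, P = Y B, Q = B^T B are computed once per sweep, so no residue matrix Y^{(j)} is ever formed explicitly.

import numpy as np

def fast_hals_nmf(Y, A, B, n_iter=100, eps=1e-16):
    A = A / np.linalg.norm(A, axis=0)                  # unit-norm columns of A
    for _ in range(n_iter):
        W, V = Y.T @ A, A.T @ A                        # update B (steps 5-9)
        for j in range(B.shape[1]):
            B[:, j] = np.maximum(eps, B[:, j] + W[:, j] - B @ V[:, j])
        P, Q = Y @ B, B.T @ B                          # update A (steps 11-16)
        for j in range(A.shape[1]):
            A[:, j] = np.maximum(eps, A[:, j] * Q[j, j] + P[:, j] - A @ Q[:, j])
            A[:, j] /= np.linalg.norm(A[:, j])
    return A, B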
3.2 HALS NMF Algorithm with Sparsity and Smoothness Constraints
In order to impose sparseness and smoothness constraints on the vectors b_j (source signals), we can minimize the following set of cost functions:

D_F^{(j)}(Y^{(j)} || a_j b_j^T) = (1/2) ||Y^{(j)} − a_j b_j^T||_F^2 + α_sp ||b_j||_1 + α_sm ||φ(L b_j)||_1,   (30)
for j = 1, 2, ..., J, subject to a_j ≥ 0 and b_j ≥ 0, where α_sp > 0 and α_sm > 0 are regularization parameters controlling the level of sparsity and smoothness, respectively, L is a suitably designed matrix (the Laplace operator) which measures the smoothness (by estimating the differences between neighboring samples of b_j)†, and φ: R → R is an edge-preserving function applied componentwise. Although this edge-preserving nonlinear function may take various forms [26]:

φ(t) = |t|^α / α,   1 ≤ α ≤ 2,   (31)
φ(t) = √(α + t^2),   (32)
φ(t) = 1 + |t| − log(1 + |t|),   α > 0,   (33)

we restrict ourselves to the simple cases where φ(t) = |t|^α for α = 1 or 2, and L is the derivative operator of the first or second order. For example, the first-order derivative operator L with K points can take the form:
L = [ 1  −1
          1  −1
              ⋱   ⋱
                  1  −1 ],   (34)
and the cost function (30) becomes similar to the total-variation (TV) regularization (which is often used in signal and image recovery) but with additional sparsity constraints:

† In the special case of L = I_K and φ(t) = |t|, the smoothness regularization term becomes a sparsity term.
D_F^{(j)}(Y^{(j)} || a_j b_j^T) = (1/2) ||Y^{(j)} − a_j b_j^T||_F^2 + α_sp ||b_j||_1 + α_sm Σ_{k=1}^{K−1} |b_{kj} − b_{(k+1)j}|.   (35)
Another important case assumes that φ(t) = (1/2)|t|^2 and L is the second-order derivative operator with K points. In such a case, we obtain the Tikhonov-like regularization:
D_F^{(j)}(Y^{(j)} || a_j b_j^T) = (1/2) ||Y^{(j)} − a_j b_j^T||_F^2 + α_sp ||b_j||_1 + (1/2) α_sm ||L b_j||_2^2.   (36)
In such a case the update rule for a_j is the same as in (21), whereas the update rule for b_j is given by:

b_j ← (I + α_sm L^T L)^{−1} (Y^{(j)T} a_j − α_sp 1_K),   (37)
where 1_K ∈ R^K is a vector of all ones. This learning rule is robust to noise; however, it involves a rather high computational cost due to the calculation of the inverse of a large matrix in each iteration. To circumvent this problem and to considerably reduce the complexity of the algorithm, we present the second-order smoothing operator L in the following form:

L = [ −2   2
       1  −2   1
            1  −2   1
                ⋱   ⋱   ⋱
                     1  −2   1
                          2  −2 ]
  = −2 I + 2 S,   (38)

where the matrix 2S collects the off-diagonal entries of L (first row [0 2], interior rows [1 0 1], last row [2 0]).
However, instead of computing directly L b_j = −2I b_j + 2S b_j, in the second term we replace b_j by its estimate b̂_j obtained from the previous update. Hence, a new smoothing regularization term with φ(t) = t^2/8 takes a simplified and computationally more efficient form:

J_sm = ||φ(−2 b_j + 2 S b̂_j)||_1 = (1/2) ||b_j − S b̂_j||_2^2.   (39)
Finally, the learning rule of the regularized HALS algorithm takes the following form:

b_j ← [Y^{(j)T} a_j − α_sp 1_K + α_sm S b̂_j]_+ / (a_j^T a_j + α_sm)
    = [Y^{(j)T} a_j − α_sp 1_K + α_sm S b̂_j]_+ / (1 + α_sm).   (40)
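A minimal sketch of the regularized local update (40) is given below; it assumes a_j has unit ℓ_2-norm, b_hat is the previous estimate of b_j, and S is built from the splitting L = −2I + 2S reconstructed in (38). The helper second_order_S and all names are our own illustrative choices:

import numpy as np

def second_order_S(K):
    # S such that 2S is the off-diagonal part of the second-order operator L in (38)
    S = np.zeros((K, K))
    S[0, 1], S[-1, -2] = 1.0, 1.0
    for k in range(1, K - 1):
        S[k, k - 1] = S[k, k + 1] = 0.5
    return S

def smooth_sparse_hals_update_b(Yj, a_j, b_hat, S, alpha_sp=0.1, alpha_sm=1.0):
    # b_j <- [Y^(j)T a_j - alpha_sp 1_K + alpha_sm S b_hat]_+ / (1 + alpha_sm)
    num = Yj.T @ a_j - alpha_sp * np.ones(Yj.shape[1]) + alpha_sm * (S @ b_hat)
    return np.maximum(0.0, num) / (1.0 + alpha_sm)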
Alternatively, for a relatively small dimension of the matrix A, an efficient solution is based on a combination of a local learning rule for the vectors of B and a global one for A, based on the nonnegative ALS algorithm:

b_j ← [Y^{(j)T} a_j − α_sp 1_K + α_sm S b̂_j]_+ / (1 + α_sm),
A ← [Y_c X_c^T (X_c X_c^T)^{−1}]_+,   (41)

with the normalization (scaling) of the columns of A to unit ℓ_2-norm length.
An important open problem is the optimal choice of the regularization parameter α_sm. The selection of appropriate regularization parameters plays a key role. Similarly to the Tikhonov-like regularization approach, we selected an optimal α_sm by applying the L-curve technique [27] to estimate a corner of the L-curve. However, in the NMF setting, since both matrices A and X are unknown, the procedure is slightly different: first, we initialize α_sm = 0 and perform a preliminary update to obtain A and X; next, we set α_sm by the L-curve corner based on the preliminary estimated matrix A; then, we continue updating until convergence is achieved.
4. Fast HALS NTF Algorithm Using Squared Euclidean Distances
The above approaches can be relatively easily extended to the NTF problem. Let us consider sequential minimization of a set of local cost functions:

D_F^{(j)}(Y^{(j)} || Ŷ^{(j)}) = (1/2) ||Y^{(j)} − u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}||_F^2   (42)
  = (1/2) ||Y_{(n)}^{(j)} − u_j^{(n)} ({u_j}^{⊙−n})^T||_F^2,   (43)
for j = 1, 2, ..., J, subject to the nonnegativity constraints, where Ŷ^{(j)} = u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}, {u_j}^{⊙−n} = u_j^{(N)} ⊗ ··· ⊗ u_j^{(n+1)} ⊗ u_j^{(n−1)} ⊗ ··· ⊗ u_j^{(1)} and

Y^{(j)} = Y − Σ_{p≠j} u_p^{(1)} ∘ u_p^{(2)} ∘ ··· ∘ u_p^{(N)}   (44)
        = Y − Σ_{p=1}^{J} (u_p^{(1)} ∘ ··· ∘ u_p^{(N)}) + (u_j^{(1)} ∘ ··· ∘ u_j^{(N)})
        = Y − Ŷ + ⟦{u_j}⟧,   (45)
where ⟦{u_j}⟧ = u_j^{(1)} ∘ ··· ∘ u_j^{(N)} is a rank-one tensor. Note that (43) is the n-mode matricized (unfolded) version of (42). The gradients of (43) with respect to the elements u_j^{(n)} are given by
∂D_F^{(j)} / ∂u_j^{(n)} = −Y_{(n)}^{(j)} {u_j}^{⊙−n} + u_j^{(n)} ({u_j}^{⊙−n})^T {u_j}^{⊙−n}   (46)
                        = −Y_{(n)}^{(j)} {u_j}^{⊙−n} + u_j^{(n)} γ_j^{(n)},   (47)
where the scaling coefficients γ_j^{(n)} can be computed as follows:

γ_j^{(n)} = ({u_j}^{⊙−n})^T {u_j}^{⊙−n} = {u_j^T u_j}^{⊛−n} = {u_j^T u_j}^{⊛} / (u_j^{(n)T} u_j^{(n)}) = (u_j^{(N)T} u_j^{(N)}) / (u_j^{(n)T} u_j^{(n)})
         = { u_j^{(N)T} u_j^{(N)},   n ≠ N,
           { 1,                      n = N.   (48)
Hence, a new HALS NTF learning rule for u_j^{(n)}, (j = 1, 2, ..., J; n = 1, 2, ..., N), is obtained by equating the gradient (47) to zero:

u_j^{(n)} ← Y_{(n)}^{(j)} {u_j}^{⊙−n}.   (49)

Note that the scaling factors γ_j^{(n)} have been ignored due to the normalization after each iteration step, u_j^{(n)} = u_j^{(n)} / ||u_j^{(n)}||_2 for n = 1, 2, ..., N−1. The learning rule (49) can be written in an equivalent form expressed by the n-mode multiplication of a tensor by vectors:

u_j^{(n)} ← Y^{(j)} ×̄_1 u_j^{(1)} ··· ×̄_{n−1} u_j^{(n−1)} ×̄_{n+1} u_j^{(n+1)} ··· ×̄_N u_j^{(N)} := Y^{(j)} ×̄_{−n} {u_j},   j = 1, ..., J;  n = 1, ..., N.   (50)
For simplicity, we use here the short notation Y^{(j)} ×̄_{−n} {u_j}, introduced by Kolda and Bader [28], to indicate the multiplication of the tensor Y^{(j)} by vectors in all modes but mode n. The above updating formula is elegant and relatively simple, but it involves a rather high computational cost for large scale problems. In order to derive a more efficient (faster) algorithm, we exploit basic properties of the Khatri-Rao and Kronecker products of two vectors:
[U^{(1)} ⊙ U^{(2)}]_j = [u_1^{(1)} ⊗ u_1^{(2)}, ..., u_J^{(1)} ⊗ u_J^{(2)}]_j = u_j^{(1)} ⊗ u_j^{(2)},

or in a more general form:

{u_j}^{⊙−n} = [U^{⊙−n}]_j.   (51)
Hence, by replacing the Y_{(n)}^{(j)} terms in (49) by those in (45), and taking into account (51), the update learning rule (49) can be expressed as

u_j^{(n)} ← Y_{(n)}^{(j)} [U^{⊙−n}]_j = Y_{(n)} [U^{⊙−n}]_j − Ŷ_{(n)} [U^{⊙−n}]_j + ⟦{u_j}⟧_{(n)} {u_j}^{⊙−n}
  = [Y_{(n)} U^{⊙−n}]_j − U^{(n)} [U^{⊙−n T} U^{⊙−n}]_j + u_j^{(n)} ({u_j}^{⊙−n})^T {u_j}^{⊙−n}
  = [Y_{(n)} U^{⊙−n}]_j − U^{(n)} [{U^T U}^{⊛−n}]_j + γ_j^{(n)} u_j^{(n)}
  = [Y_{(n)} U^{⊙−n}]_j − U^{(n)} [{U^T U}^{⊛} ⊘ (U^{(n)T} U^{(n)})]_j + γ_j^{(n)} u_j^{(n)},   (52)

subject to the normalization of the vectors u_j^{(n)} for n = 1, 2, ..., N−1 to unit length. In combination with a componentwise nonlinear half-wave rectifying operator, we finally obtain a new algorithm referred to as the Fast HALS NTF algorithm:
u_j^{(n)} ← [ γ_j^{(n)} u_j^{(n)} + [Y_{(n)} U^{⊙−n}]_j − U^{(n)} [{U^T U}^{⊛} ⊘ (U^{(n)T} U^{(n)})]_j ]_+.   (53)
The detailed pseudo-code of this algorithm is given in Algorithm 3. In the special case of N = 2, the FAST-HALS NTF becomes the FAST-HALS NMF algorithm described in the previous section.

† For 3-way tensors, direct trilinear decomposition could be used as initialization.
†† In practice, the vectors u_j^{(n)} often have a fixed sign before rectifying.
Algorithm 3 FAST-HALS NTF
1: Nonnegative random or nonnegative ALS initialization of U^{(n)} †
2: Normalize all u_j^{(n)} for n = 1, ..., N−1 to unit length
3: T_1 = (U^{(1)T} U^{(1)}) ⊛ ··· ⊛ (U^{(N)T} U^{(N)})
4: repeat
5:   γ = diag(U^{(N)T} U^{(N)})
6:   for n = 1 to N do
7:     γ = 1 if n = N
8:     T_2 = Y_{(n)} U^{⊙−n}
9:     T_3 = T_1 ⊘ (U^{(n)T} U^{(n)})
10:    for j = 1 to J do
11:      u_j^{(n)} ← [γ_j u_j^{(n)} + [T_2]_j − U^{(n)} [T_3]_j]_+ ††
12:      u_j^{(n)} = u_j^{(n)} / ||u_j^{(n)}||_2 if n ≠ N
13:    end for
14:    T_1 = T_3 ⊛ (U^{(n)T} U^{(n)})
15:  end for
16: until convergence criterion is reached
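For reference, a compact NumPy sketch of Algorithm 3 for an N-th order tensor is given below; the unfolding and Khatri-Rao helpers are our own (they follow the Kolda-Bader mode ordering assumed in Table 1), and no convergence test is included:

import numpy as np
from functools import reduce

def unfold(Y, n):
    # n-mode matricization Y_(n) (mode-1 index varies fastest, Kolda-Bader style)
    return np.reshape(np.moveaxis(Y, n, 0), (Y.shape[n], -1), order='F')

def khatri_rao_except(U, n):
    # U^(N) KR ... KR U^(n+1) KR U^(n-1) KR ... KR U^(1)  (column-wise Kronecker)
    mats = [U[m] for m in reversed(range(len(U))) if m != n]
    kr = lambda A, B: (A[:, None, :] * B[None, :, :]).reshape(-1, A.shape[1])
    return reduce(kr, mats)

def fast_hals_ntf(Y, U, n_iter=50, eps=1e-16):
    N, J = len(U), U[0].shape[1]
    for n in range(N - 1):                         # unit-norm columns, modes 1..N-1
        U[n] /= np.linalg.norm(U[n], axis=0)
    T1 = reduce(lambda A, B: A * B, [U[n].T @ U[n] for n in range(N)])
    for _ in range(n_iter):
        gammaN = np.sum(U[N - 1] ** 2, axis=0)     # gamma = diag(U^(N)T U^(N))
        for n in range(N):
            gamma = np.ones(J) if n == N - 1 else gammaN
            T2 = unfold(Y, n) @ khatri_rao_except(U, n)
            T3 = T1 / (U[n].T @ U[n])              # element-wise division
            for j in range(J):
                U[n][:, j] = np.maximum(
                    eps, gamma[j] * U[n][:, j] + T2[:, j] - U[n] @ T3[:, j])
                if n != N - 1:
                    U[n][:, j] /= np.linalg.norm(U[n][:, j])
            T1 = T3 * (U[n].T @ U[n])              # refresh with the updated factor
    return U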
5. Flexible Local Algorithms Using Alpha Divergence
The algorithms derived in previous sections can be extended
to more robust algorithms by applying a family of general-
ized Alpha and Beta divergences.
For the NMF problem (1) we define the Alpha diver-
gence as follows (similar to [14],[18],[25],[29]):
D_α^{(j)}([Y^{(j)}]_+ || a_j x_j) =

Σ_{ik} ( z_{ik}^{(j)} / (α(α+1)) [ (z_{ik}^{(j)} / y_{ik}^{(j)})^α − 1 ] − (z_{ik}^{(j)} − y_{ik}^{(j)}) / (α+1) ),   α ≠ −1, 0,   (54a)
Σ_{ik} ( z_{ik}^{(j)} ln(z_{ik}^{(j)} / y_{ik}^{(j)}) − z_{ik}^{(j)} + y_{ik}^{(j)} ),   α = 0,   (54b)
Σ_{ik} ( y_{ik}^{(j)} ln(y_{ik}^{(j)} / z_{ik}^{(j)}) + z_{ik}^{(j)} − y_{ik}^{(j)} ),   α = −1,   (54c)
where y_{ik}^{(j)} = [Y]_{ik} − Σ_{p≠j} a_{ip} x_{pk} and z_{ik}^{(j)} = a_{ij} x_{jk} = a_{ij} b_{kj} for j = 1, 2, ..., J.
The choice of the parameter α ∈ R depends on the statistical distributions of the noise and the data. In the special cases of the Alpha divergence for α = {1, −0.5, −2}, we obtain respectively the Pearson's chi-squared, Hellinger's, and Neyman's chi-square distances, while for the cases α = 0 and α = −1 the divergence has to be defined by the limits of (54a) as α → 0 and α → −1, respectively. When these limits are evaluated, for α → 0 we obtain the generalized Kullback-Leibler divergence defined by Eq. (54b), whereas for α → −1 we have the dual generalized Kullback-Leibler divergence given in Eq. (54c) [1],[14],[19],[25].
The gradient of the Alpha divergence (54) for α ≠ 0, −1 with respect to a_{ij} and b_{kj} can be expressed in a compact form as:

∂D_α^{(j)} / ∂b_{kj} = (1/α) Σ_i a_{ij} [ (z_{ik}^{(j)} / y_{ik}^{(j)})^α − 1 ],   (55)
∂D_α^{(j)} / ∂a_{ij} = (1/α) Σ_k b_{kj} [ (z_{ik}^{(j)} / y_{ik}^{(j)})^α − 1 ].   (56)
By equating the gradients to zero, we obtain a new multiplicative local α-HALS algorithm:

b_j ← ( [Y^{(j)T}]_+^{.[α]} a_j / (a_j^T a_j^{.[α]}) )^{.[1/α]},   a_j ← ( [Y^{(j)}]_+^{.[α]} b_j / (b_j^T b_j^{.[α]}) )^{.[1/α]},   (57)
where the “rise to the power” operations x^{.[α]} are performed componentwise. The above algorithm can be generalized to the following form:

b_j ← Ψ^{−1}( Ψ([Y^{(j)T}]_+) a_j / (a_j^T Ψ(a_j)) ),   a_j ← Ψ^{−1}( Ψ([Y^{(j)}]_+) b_j / (b_j^T Ψ(b_j)) ),   (58)

where Ψ(x) is a suitably chosen function, for example, Ψ(x) = x^{.[α]}, applied componentwise†.
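As a small illustration, a single α-HALS column update following (57) could be written as below. This is only a sketch; the clipping constant, the names, and the use of the exponent 1/α (our reading of Ψ^{-1} for Ψ(x) = x^α) are our own assumptions:

import numpy as np

def alpha_hals_update(Yj, a_j, b_j, alpha, eps=1e-12):
    # One pass of (57) for alpha != 0, -1; Yj plays the role of [Y^(j)]_+.
    Yj = np.maximum(Yj, eps)
    b_j = ((Yj.T ** alpha) @ a_j / (a_j @ a_j ** alpha)) ** (1.0 / alpha)
    a_j = ((Yj ** alpha) @ b_j / (b_j @ b_j ** alpha)) ** (1.0 / alpha)
    return a_j / np.linalg.norm(a_j), b_j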
In a similar way, novel learning rules for the N-order NTF problem (2) can be derived. For this purpose, we consider the n-mode matricized (unfolded) version of the tensor Y:

Y_{(n)} = U^{(n)} (U^{⊙−n})^T.   (59)

Actually, this can be considered as an NMF model with A ← U^{(n)} and B ← U^{⊙−n}. From (51), we have

b_j = [U^{⊙−n}]_j = {u_j}^{⊙−n}.   (60)
Applying directly the learning rule (58) to the model (59) gives

u_j^{(n)} ← Ψ^{−1}( Ψ([Y_{(n)}^{(j)}]_+) b_j / (b_j^T Ψ(b_j)) ),   (61)
where Y_{(n)}^{(j)} is the n-mode matricized version of Y^{(j)} in (45):

Y_{(n)}^{(j)} = Y_{(n)} − Ŷ_{(n)} + u_j^{(n)} b_j^T = Y_{(n)} − Ŷ_{(n)} + u_j^{(n)} ({u_j}^{⊙−n})^T = Y_{(n)} − Ŷ_{(n)} + ⟦{u_j}⟧_{(n)}.   (62)
For a specific nonlinear function Ψ(·) (Ψ(x) = x^α),

Ψ(b_j) = Ψ({u_j}^{⊙−n}) = Ψ(u_j^{(N)}) ⊗ ··· ⊗ Ψ(u_j^{(n+1)}) ⊗ Ψ(u_j^{(n−1)}) ⊗ ··· ⊗ Ψ(u_j^{(1)}) = {Ψ(u_j)}^{⊙−n},   (63)

and the denominator in (61) can be simplified as

b_j^T Ψ(b_j) = ({u_j}^{⊙−n})^T {Ψ(u_j)}^{⊙−n} = {u_j^T Ψ(u_j)}^{⊛−n};   (64)
this completes the derivation of a flexible Alpha-HALS NTF update rule, which in tensor form is given by

u_j^{(n)} ← [ Ψ^{−1}( Ψ([Y^{(j)}]_+) ×̄_{−n} {u_j} / {u_j^T Ψ(u_j)}^{⊛−n} ) ]_+,   (65)

where all nonlinear operations are performed componentwise††.
† For α = 0, instead of Ψ(x) = x^α we used Ψ(x) = ln(x) [18].
†† In practice, instead of half-wave rectifying we often use different transformations, e.g., the real part of Ψ(x), or an adaptive nonnegative shrinkage function with a gradually decreasing threshold down to the noise variance σ²_noise.
Algorithm 4 Alpha-HALS NTF
1: ALS or random initialization for all nonnegative vectors u_j^{(n)}
2: Normalize all u_j^{(n)} for n = 1, 2, ..., N−1 to unit length
3: Compute residue tensor E = Y − ⟦{U}⟧ = Y − Ŷ
4: repeat
5:   for j = 1 to J do
6:     Compute Y^{(j)} = E + u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}
7:     for n = 1 to N do
8:       Update u_j^{(n)} as in (65)
9:       Normalize u_j^{(n)} to a unit length vector if n ≠ N
10:    end for
11:    Update E = Y^{(j)} − u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}
12:  end for
13: until convergence criterion is reached
6. Flexible HALS Algorithms Using Beta Divergence
Beta divergence can be considered as a flexible and com-
plementary cost function to the Alpha divergence. In order
to obtain local NMF algorithms we introduce the following
definition of the Beta divergence (similar to [14],[18],[30]):
D_β^{(j)}([Y^{(j)}]_+ || a_j x_j) =

Σ_{ik} ( [y_{ik}^{(j)}]_+ ([y_{ik}^{(j)}]_+^β − z_{ik}^{(j)β}) / β − ([y_{ik}^{(j)}]_+^{β+1} − z_{ik}^{(j)β+1}) / (β+1) ),   β > 0,   (66a)
Σ_{ik} ( [y_{ik}^{(j)}]_+ ln([y_{ik}^{(j)}]_+ / z_{ik}^{(j)}) − [y_{ik}^{(j)}]_+ + z_{ik}^{(j)} ),   β = 0,   (66b)
Σ_{ik} ( ln(z_{ik}^{(j)} / [y_{ik}^{(j)}]_+) + [y_{ik}^{(j)}]_+ / z_{ik}^{(j)} − 1 ),   β = −1,   (66c)
where y_{ik}^{(j)} = y_{ik} − Σ_{p≠j} a_{ip} b_{kp} and z_{ik}^{(j)} = a_{ij} x_{jk} = a_{ij} b_{kj} for j = 1, 2, ..., J. The choice of the real-valued parameter β ≥ −1 depends on the statistical distribution of the data, and the Beta divergence corresponds to the Tweedie models [14],[19],[25],[30]. For example, if we consider the Maximum Likelihood (ML) approach (with no a priori assumptions), the optimal estimation consists of the minimization of the Beta divergence measure when the noise is Gaussian, with β = 1. For the Gamma distribution β = −1, for the Poisson distribution β = 0, and for the compound Poisson β ∈ (−1, 0). However, the ML estimation is not optimal in the sense of a Bayesian approach, where a priori information about the sources and the mixing matrix (sparsity, nonnegativity) can be imposed. It is interesting to note that the Beta divergence includes as special cases the standard squared Euclidean distance (for β = 1), the Itakura-Saito distance (β = −1), and the generalized Kullback-Leibler divergence (β = 0). In order to derive a local learning algorithm, we compute the gradient of (66) with respect to the elements b_{kj}, a_{ij}:
Algorithm 5 Beta-HALS NTF
1: Initialize randomly all nonnegative factors U^{(n)}
2: Normalize all u_j^{(n)} for n = 1, ..., N−1 to unit length
3: Compute residue tensor E = Y − ⟦{U}⟧ = Y − Ŷ
4: repeat
5:   for j = 1 to J do
6:     Compute Y^{(j)} = E + u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}
7:     for n = 1 to N−1 do
8:       u_j^{(n)} ← [ Y^{(j)} ×̄_{−n} {Ψ(u_j)} ]_+
9:       Normalize u_j^{(n)} to a unit length vector
10:    end for
11:    u_j^{(N)} ← [ Y^{(j)} ×̄_{−N} {Ψ(u_j)} / {Ψ(u_j)^T u_j}^{⊛−N} ]_+
12:    Update E = Y^{(j)} − u_j^{(1)} ∘ u_j^{(2)} ∘ ··· ∘ u_j^{(N)}
13:  end for
14: until convergence criterion is reached
∂D_β^{(j)} / ∂b_{kj} = Σ_i ( z_{ik}^{(j)β} − [y_{ik}^{(j)}]_+ z_{ik}^{(j)β−1} ) a_{ij},   (67)
∂D_β^{(j)} / ∂a_{ij} = Σ_k ( z_{ik}^{(j)β} − [y_{ik}^{(j)}]_+ z_{ik}^{(j)β−1} ) b_{kj}.   (68)
By equating the gradient components to zero, we obtain a set of simple HALS updating rules referred to as the Beta-HALS algorithm:

b_{kj} ← (1 / Σ_{i=1}^{I} a_{ij}^{β+1}) Σ_{i=1}^{I} a_{ij}^{β} [y_{ik}^{(j)}]_+,   (69)
a_{ij} ← (1 / Σ_{k=1}^{K} b_{kj}^{β+1}) Σ_{k=1}^{K} b_{kj}^{β} [y_{ik}^{(j)}]_+.   (70)
The above update rules can be written in a generalized compact vector form as

b_j ← ([Y^{(j)T}]_+ Ψ(a_j)) / (Ψ(a_j)^T a_j),   a_j ← ([Y^{(j)}]_+ Ψ(b_j)) / (Ψ(b_j)^T b_j),   (71)

where Ψ(b) is a suitably chosen convex function (e.g., Ψ(b) = b^{.[β]}) and the nonlinear operations are performed element-wise.
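For completeness, the scalar rules (69)-(70), equivalently the vector form (71) with Ψ(x) = x^β, can be sketched as follows (the function name and the rectification step are our own illustrative choices):

import numpy as np

def beta_hals_update(Yj, a_j, b_j, beta):
    # One pass of (69)-(70); Yj plays the role of the rectified residue [Y^(j)]_+.
    Yj = np.maximum(Yj, 0.0)
    b_j = (Yj.T @ a_j ** beta) / np.sum(a_j ** (beta + 1))
    a_j = (Yj @ b_j ** beta) / np.sum(b_j ** (beta + 1))
    return a_j / np.linalg.norm(a_j), b_j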
The above learning rules can be generalized for the N-order NTF problem (2) (using a similar approach as for the Alpha-HALS NTF):

u_j^{(n)} ← ([Y_{(n)}^{(j)}]_+ Ψ(b_j)) / (Ψ(b_j)^T b_j),   (72)

where b_j = {u_j}^{⊙−n}, and Y_{(n)}^{(j)} is defined in (62) and (45).
By taking into account (63), the learning rule (72) can be written as follows:

u_j^{(n)} ← ([Y_{(n)}^{(j)}]_+ {Ψ(u_j)}^{⊙−n}) / (({Ψ(u_j)}^{⊙−n})^T {u_j}^{⊙−n}) = ([Y^{(j)}]_+ ×̄_{−n} {Ψ(u_j)}) / {Ψ(u_j)^T u_j}^{⊛−n}.   (73)
Actually, the update rule (73) can be simplified, to reduce the computational cost, by performing the normalization of the vectors u_j^{(n)} for n = 1, ..., N−1 to unit length after each iteration step:

u_j^{(n)} ← [ Y^{(j)} ×̄_{−n} {Ψ(u_j)} ]_+,   u_j^{(n)} ← u_j^{(n)} / ||u_j^{(n)}||_2.   (74)
The detailed pseudo-code of the Beta-HALS NTF algorithm is given in Algorithm 5. Once again, this algorithm can be rewritten in a fast form as follows:

u_j^{(n)} ← [ γ_j^{(n)} u_j^{(n)} + [Y_{(n)} {Ψ(U)}^{⊙−n}]_j − U^{(n)} [{Ψ(U)^T U}^{⊛−n}]_j ]_+,   (75)

where γ_j^{(n)} = {Ψ(u_j)^T u_j}^{⊛−n}, n = 1, ..., N. The Fast HALS NTF algorithm is a special case with Ψ(x) = x.
In order to avoid local minima, we have also developed simple heuristic hierarchical Alpha- and Beta-HALS NTF algorithms combined with multi-start initializations using the ALS as follows:
1. Perform the factorization of a tensor for any value of the α or β parameters (preferably, set the value of the parameters to unity, due to the simplicity and high speed of the algorithm for this value).
2. If the algorithm has converged but has not achieved the desired fit value (FITmax), restart the factorization, keeping the previously estimated factors as the initial matrices for the ALS initialization.
3. If the algorithm does not converge, alter the values of the α or β parameters incrementally; this may help to overstep local minima.
4. Repeat the procedure until a desired fit value is reached, or there is a negligible or no change in the fit value or in the factor matrices, or the value of the cost function becomes negligible or zero.
7. Simulation Results
Extensive simulations were performed for synthetic and real-world data on a 2.66 GHz Quad-Core Windows 64-bit PC with 8GB memory. For tensor factorization, the results were compared with some existing algorithms: the NMWF [31], the lsNTF [32], and also with two efficient implementations of the general form of the PARAFAC ALS algorithm, by Kolda and Bader [16] (denoted as ALS K) and by Andersson and Bro [33] (denoted as ALS B). To make a fair comparison we applied the same stopping criteria and conditions: maximum difference of the fit value, and we used three performance indexes: Peak Signal to Noise Ratio (PSNR) for all frontal slices, Signal to Interference Ratio (SIR)† for each column of the factors, and the explained variation ratio (i.e., how well the approximated tensor fits the input data tensor) for the whole tensor.
† The signal to interference ratio is defined as SIR(a_j, â_j) = 10 log(||a_j||_2^2 / ||a_j − â_j||_2^2) for normalized and matched vectors.
7.1 Experiments for NMF
In Example 1 we compare our HALS algorithms with the multiplicative Lee-Seung algorithm [34] and the Chih-Lin Projected Gradient (PG) algorithm [35] for the benchmark Xspectra [36] (see Fig.3(b)). Ten mixtures were randomly generated from 5 sources (Fig.3(a)). We selected α = 1.5 for α-HALS and β = 2 for β-HALS in order to show the difference in performance in comparison to the standard generalized Kullback-Leibler (K-L) divergence. A Monte Carlo analysis was also performed with 100 trials, and the average values of SIR for X and the running time for each trial are summarized in Fig.3(c). Fast HALS NMF, α-HALS and β-HALS achieved higher performance than the two other well-known NMF algorithms. The simulation results for Example 2 presented in Fig.4 were obtained for the synthetic benchmark (Fig.4(a)) with 10 sparse (non-overlapping) nonnegative components. The sources were mixed by the randomly generated full column rank matrix A ∈ R_+^{2×10}, so only two mixed signals were available. The typical mixed signals are shown in Fig.4(b). The components estimated by the new β-HALS NMF algorithm (69)-(71) with β = 0.1 are illustrated in Fig.4(c). Moreover, the performance for different values of the parameter β is illustrated in
Fig.4(d) and 4(e), with an average Signal-to-Interference Ratio (SIR) level greater than 30 [dB]. Since the proposed algorithms (alternating technique) perform a non-convex optimization, the estimated components depend on the initial conditions. To estimate the performance in a statistical sense, we performed a Monte Carlo (MC) analysis. Figures 4(d) and 4(e) present the histograms of 100 mean-SIR samples for the estimated matrices A and X. We also conducted an experiment for a similar large scale problem in which we used 100 very sparse non-overlapping source signals and mixed them by a randomly generated full column rank mixing matrix A ∈ R_+^{2×100} (i.e., only two mixtures were used). Using the same algorithm but with 25 NMF layers, we were able to recover most of the sources with high probability. The performance is evaluated through the correlation matrix R_X = X̂ X^T, which should be a diagonal matrix for a perfect estimation (given in Fig. 5(a)), whereas the distribution of the SIR performance is shown in Fig. 5(b). Detailed results are omitted due to space limits.
In Example 3 we used five noisy mixtures of three smooth sources (benchmark signals X 5smooth [36]). The mixed signals were corrupted by additive Gaussian noise with SNR = 15 [dB] (Fig.6(a)). Fig.6(c) illustrates the efficiency of the HALS NMF algorithm with smoothness constraints using the update rules (41), including the Laplace operator L of the second order. The components estimated by the smooth HALS NMF using 3 layers [14] are depicted in Fig.6(b), whereas the results of the same algorithm with the smoothness constraint, which achieved SIR_A = 29.22 [dB] and SIR_X = 15.53 [dB], are shown in Fig.6(c).
7.2 Experiments for NTF
In Example 4, we applied the NTF to a simple denoising of images. At first, a third-order tensor Y ∈ R_+^{51×51×40}, whose each layer was generated by the L-shaped membrane function (which creates the MATLAB logo), Y[:, :, k] = k·membrane(1, 25), k = 1, ..., 40, was corrupted by additive Gaussian noise with SNR 10 [dB] (Fig. 7(a)). Next, the noisy tensor data was approximated by the NTF model using our α-HALS and β-HALS algorithms, with a fit value of 96.1%. Fig.7(a), 7(b) and 7(c) are surface visualizations of the 40-th noisy slice and its reconstructed slices by α- and β-HALS NTF (α = 2, β = 2), whereas Fig.7(d), 7(e) and 7(f) are the corresponding iso-surface visualizations. In addition, the performance for different values of the parameters α and β is illustrated in Fig. 7(g) and 7(h), with PSNR on the left (blue) axis and the number of iterations on the right (red) axis.
In Example 5, we constructed a large scale tensor of size 500 × 500 × 500, corrupted by additive Gaussian noise with SNR = 0 [dB], by using three benchmarks: X spectra sparse, ACPos24sparse10 and X spectra [36] (see Fig.8(a)), and we successfully reconstructed the original sparse and smooth components using the α- and β-HALS NTF algorithms. The performance is illustrated via volume, iso-surface and factor visualizations as shown in Fig. 8(b), 8(c) and 8(f), while the running time and the distributions of the SIR and PSNR performance factors are depicted in Fig. 8(g). Slice 10 and its reconstructed slice are displayed in Fig.8(d) and 8(e). In comparison to the known NTF algorithms, the Fast HALS NTF algorithm provides a higher accuracy of factor estimation based on the SIR index, and a higher explained variation with a faster running time.
In Example 6, we tested the Fast HALS NTF algorithm on real-world data: decomposition of the amino acid fluorescence data (Fig.9(a)) from five samples containing tryptophan, phenylalanine, and tyrosine (claus.mat) [33],[37]. The data tensor was additionally corrupted by Gaussian noise with SNR = 0 dB (Fig.9(b)), and the factors were estimated with J = 3. The β-HALS NTF was run with β = 1.2, whereas for the α-HALS NTF we selected α = 0.9. All algorithms were set to process the data with the same number of iterations (100). The performances and running times are compared in Fig. 10, and also in Table 3. In this example, we applied a smoothness constraint for the Fast NTF, α- and β-HALS NTF. Based on the fit ratio and the PSNR index we see that the HALS algorithms usually exhibited better performance than the standard NTF algorithms. For example, the first recovered slice (Fig.9(c)) is almost identical to the corresponding slice of the clean original tensor (99.51% fit value). In comparison, the NMWF, lsNTF, ALS K and ALS B produced some artifacts, as illustrated in Fig.9(d), Fig.9(e) and Fig.9(f).
In Example 7 we used real EEG data: tutorial-
dataset2.zip [38] which was pre-processed by complex
Morlet wavelet. The tensor is represented by the inter-trial
phase coherence (ITPC) for 14 subjects during a proprio-
ceptive pull of left and right hand (28 files) with size 64
Fig. 3  Comparison of the Fast HALS NMF, α-HALS, β-HALS, Lee-Seung and PG algorithms in Example 1 with the data set Xspectra: (a) observed mixed signals, (b) reconstructed original spectra (sources) using the β-HALS algorithm (β = 2), (c) SIRs for the matrix X and computation time for different NMF algorithms.
Fig. 4  Illustration of the performance of the β-HALS NMF algorithm: (a) 10 sparse sources assumed to be unknown, (b) two mixtures, (c) 10 estimated sources for β = 0.1, (d) and (e) SIR values for the matrix A and the sources X, respectively.