ArticlePDF Available

Fast Local Algorithms for Large Scale Nonnegative Matrix and Tensor Factorizations

Authors:
  • Systems Research Insititute Polish Academy of Science

Abstract and Figures

Nonnegative matrix factorization (NMF) and its extensions such as Nonnegative Tensor Factorization (NTF) have become prominent techniques for blind sources separation (BSS), analysis of image databases, data mining and other information retrieval and clustering applications. In this paper we propose a family of efficient algorithms for NMF/NTF, as well as sparse nonnegative coding and representation, that has many potential applications in computational neuroscience, multi-sensory processing, compressed sensing and multidimensional data analysis. We have developed a class of optimized local algorithms which are referred to as Hierarchical Alternating Least Squares (HALS) algorithms. For these purposes, we have performed sequential constrained minimization on a set of squared Euclidean distances. We then extend this approach to robust cost functions using the alpha and beta divergences and derive flexible update rules. Our algorithms are locally stable and work well for NMF-based blind source separation (BSS) not only for the over-determined case but also for an under-determined (over-complete) case (i.e., for a system which has less sensors than sources) if data are sufficiently sparse. The NMF learning rules are extended and generalized for N-th order nonnegative tensor factorization (NTF). Moreover, these algorithms can be tuned to different noise statistics by adjusting a single parameter. Extensive experimental results confirm the accuracy and computational performance of the developed algorithms, especially, with usage of multi-layer hierarchical NMF approach [3].
Content may be subject to copyright.
IEICE TRANS. FUNDAMENTALS, VOL.Exx–??, NO.xx XXXX 200x
1
INVITED PAPER
Special Section on Signal Processing
Fast Local Algorithms for Large Scale Nonnegative Matrix and
Tensor Factorizations
Andrzej CICHOCKI
a)
, Member and Anh-Huy PHAN
††b)
, Nonmember
SUMMARY Nonnegative matrix factorization (NMF) and its exten-
sions such as Nonnegative Tensor Factorization (NTF) have become promi-
nent techniques for blind sources separation (BSS), analysis of image
databases, data mining and other information retrieval and clustering ap-
plications. In this paper we propose a family of ecient algorithms
for NMF/NTF, as well as sparse nonnegative coding and representation,
that has many potential applications in computational neuroscience, multi-
sensory processing, compressed sensing and multidimensional data anal-
ysis. We have developed a class of optimized local algorithms which are
referred to as Hierarchical Alternating Least Squares (HALS) algorithms.
For these purposes, we have performed sequential constrained minimiza-
tion on a set of squared Euclidean distances. We then extend this approach
to robust cost functions using the Alpha and Beta divergences and derive
flexible update rules. Our algorithms are locally stable and work well for
NMF-based blind source separation (BSS) not only for the over-determined
case but also for an under-determined (over-complete) case (i.e., for a sys-
tem which has less sensors than sources) if data are suciently sparse. The
NMF learning rules are extended and generalized for N-th order nonneg-
ative tensor factorization (NTF). Moreover, these algorithms can be tuned
to dierent noise statistics by adjusting a single parameter. Extensive ex-
perimental results confirm the accuracy and computational performance of
the developed algorithms, especially, with usage of multi-layer hierarchical
NMF approach [3].
key words: Nonnegative matrix factorization (NMF), nonnegative tensor
factorizations (NTF), nonnegative PARAFAC, model reduction, feature ex-
traction, compression, denoising, multiplicative local learning (adaptive)
algorithms, Alpha and Beta divergences.
1. Introduction
Recent years have seen a surge of interest in nonnegative
and sparse matrix and tensor factorization - decomposi-
tions which provide physically meaningful latent (hidden)
components or features. Nonnegative Matrix Factorization
(NMF) and its extension Nonnegative Tensor Factorization
(NTF) - multidimensional models with nonnegativity con-
straints - have been recently proposed as sparse and ecient
representations of signals, images and in general natural sig-
nals/data. From signal processing point of view and data
analysis, NMF/NTF are very attractive because they take
into account spatial and temporal correlations between vari-
ables and usually provide sparse common factors or hidden
Manuscript received July 30, 2008.
Manuscript revised November 11, 2008.
Final manuscript received December 12, 2008.
RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, 351-
0198 Saitama, Japan and Warsaw University of Technology and
Systems Research Institute, Polish Academy of Science, Poland.
††
RIKEN Brain Science Institute, 2-1 Hirosawa, Wako, 351-
0198 Saitama, Japan.
a)E-mail: cia@brain.riken.jp
b)E-mail: phan@brain.riken.jp
Table 1 Basic tensor operations and notations [16]
outer product
Khatri-Rao product
Kronecker product
Hadamard product
element-wise division
[U]
j
jth column vector of [U]
U
(n)
the n th factor
u
(n)
j
jth column vector of U
(n)
n
u
j
o
n
u
(1)
j
, u
(2)
j
, . . . , u
(N)
j
o
Y
tensor
×
n
n mode product of tensor and matrix
×
n
n mode product of tensor and vector
Y
(n)
n mode matricized version of Y
U
U
(N)
U
(N1)
··· U
(1)
U
n
U
(N)
··· U
(n+1)
U
(n1)
··· U
(1)
U
U
(N)
U
(N1)
··· U
(1)
U
n
U
(N)
··· U
(n+1)
U
(n1)
··· U
(1)
SIR(a,b)10log
10
(kak
2
/ka bk
2
)
PSNR
20log
10
(Range of Signal/RMSE)
Fit(Y
,
b
Y) 100(1 kY
b
Yk
2
F
/kY E(Y)k
2
F
)
(latent) nonnegative components with physical or physio-
logical meaning and interpretations [1]–[5].
In fact, NMF and NTF are emerging techniques for
data mining, dimensionality reduction, pattern recognition,
object detection, classification, gene clustering, sparse non-
negativerepresentation and coding, and blind source separa-
tion (BSS) [5]–[14]. For example, NMF/NTF have already
found a wide spectrum of applications in positron emission
tomography (PET), spectroscopy, chemometrics and envi-
ronmental science where the matrices have clear physical
meanings and some normalization or constraints are im-
posed on them [12],[13],[15].
This paper introduces several alternative approaches
and improved local learning rules (in the sense that vec-
tors and rows of matrices are processed sequentially one
by one) for solving nonnegative matrix and tensor factor-
izations problems. Generally, tensors (i.e., multi-way ar-
rays) are denoted by underlined capital boldface letter, e.g.,
Y R
I
1
×I
2
×···×I
N
. The order of a tensor is the number of
modes, also known as ways or dimensions. In contrast, ma-
trices are denoted by boldface capital letters, e.g., Y; vectors
are denoted by boldface lowercase letters, e.g., columns of
the matrix A by a
j
and scalars are denoted by lowercase
letters, e.g., a
ij
. The i-th entry of a vector a is denoted by
a
i
, and (i, j) element of a matrix A by a
ij
. Analogously, el-
ement (i, k, q) of a third-order tensor Y R
I×K×Q
by y
ikq
.
Indices typically range from 1 to their capital version, e.g.,
i = 1, 2, . . . , I; k = 1, 2, . . . , K; q = 1, 2, . . . , Q. Throughout
this paper,standard notations and basic tensor operationsare
used as indicated in Table 1.
2. Models and Problem Statements
In this paper, we consider at first a simple NMF model de-
scribed as
2
IEICE TRANS. FUNDAMENTALS, VOL.Exx–??, NO.xx XXXX 200x
=
Y
E
+
+
+
I
2
I
1
I
3
u
(1)
1
u
(2)
1
u
(3)
1
u
(3)
J
u
(2)
J
u
(1)
J
( )I I I
1 2 3
x x
( )I I I
1 2 3
x x
Fig.1 Illustration of a third-order tensor factorization using standard
NTF; Objective is to estimate nonnegative vectors u
(n)
j
for j = 1, 2, . . . , J
and n = 1, 2, 3.
Y = AX + E = AB
T
+ E, (1)
where Y = [y
ik
] R
I×K
is a known input data matrix,
A = [a
1
, a
2
, . . . , a
J
] R
I×J
+
is an unknown basis (mix-
ing) matrix with nonnegative vectors a
j
R
I
+
, X = B
T
=
[x
T
1
, x
T
2
, . . . , x
T
J
]
T
R
J×K
+
is a matrix representing unknown
nonnegative components x
j
and E = [e
ik
] R
I×K
repre-
sents errors or noise. For simplicity, we use also matrix
B = X
T
= [b
1
, b
2
, . . . , b
J
] R
K×J
+
which allows us to use
only column vectors. Our primary objective is to estimate
the vectors a
j
of the mixing (basis) matrix A and the sources
x
j
= b
T
j
(rows of the matrix X or columns of B), subject to
nonnegativity constraints
.
The simple NMF model (1) can be naturally extended
to the NTF (or nonnegative PARAFAC) as follows: “For
a given N-th order tensor Y R
I
1
×I
2
···×I
N
perform a non-
negative factorization (decomposition) into a set of N un-
known matrices: U
(n)
= [u
(n)
1
, u
(n)
2
, . . . , u
(n)
J
] R
I
n
×J
+
, (n =
1, 2, . . . , N) representing the common (loading) factors”,
i.e., [11],[16]
Y =
J
X
j=1
u
(1)
j
u
(2)
j
··· u
(N)
j
+ E (2)
where means outer product of vectors
††
and
b
Y :=
J
P
j=1
u
(1)
j
u
(2)
j
··· u
(N)
j
is an estimated or approximated
(actual) tensor (see Fig. 1). For simplicity, we use the fol-
lowing notations for the parameters of the estimated tensor
b
Y :=
J
P
j=1
~u
(1)
j
, u
(2)
j
, . . . , u
(N)
j
= ~{U} [16]. A residuum ten-
sor defined as E = Y
b
Y represents noise or errors depend-
ing on applications. This model can be referred to as non-
negative version of CANDECOMP proposed by Carroll and
Chang or equivalently nonnegative PARAFAC proposed in-
dependently by Harshman and Kruskal. In practice, we usu-
ally need to normalize vectors u
(n)
j
R
J
to unit length, i.e.,
Usually, a sparsity constraint is naturally and intrinsically pro-
vided due to nonlinear projected approach (e.g., half-wave recti-
fier or adaptive nonnegative shrinkage with gradually decreasing
threshold [17]).
††
For example, the outer product of two vectors a R
I
, b R
J
builds up a rank-one matrix A = a b = ab
T
R
I×J
and the outer
product of three vectors: a R
I
, b R
K
, c R
Q
builds up third-
order rank-one tensor: Y = a b c R
I×K×Q
, with entries defined
as y
ikq
= a
i
b
k
c
q
.
with ku
(n)
j
k
2
= 1 for n = 1, 2, . . . , N 1, j = 1, 2, . . . , J, or
alternatively apply a Kruskal model:
Y =
b
Y + E =
J
X
j=1
λ
j
(u
(1)
j
u
(2)
j
··· u
(N)
j
) + E, (3)
where λ = [λ
1
, λ
2
, . . . , λ
J
]
T
R
J
+
are scaling factors and the
factors matrices U
(n)
= [u
(n)
1
, u
(n)
2
, . . . , u
(n)
J
] have all vectors
u
(n)
j
normalized to unit length columns in the sense ku
(n)
j
k
2
2
=
u
(n)T
j
u
(n)
j
= 1, j, n. Generally, the scaling vector λ could be
derived as λ
j
= ku
(N)
j
k
2
. However, we often assume that the
weight vector λ can be absorbed the (non-normalized)factor
matrix U
(N)
, and therefore the model can be expressed in the
simplified form (2). The objectiveis to estimatenonnegative
component matrices: U
(n)
or equivalently the set of vectors
u
(n)
j
, (n = 1, 2, . . . , N, j = 1, 2, . . . , J), assuming that the
number of factors J is known or can be estimated.
It is easy to check that for N = 2 and for U
(1)
= A and
U
(2)
= B = X
T
the NTF simplifies to the standard NMF.
However, in order to avoid tedious and quite complex nota-
tions, we will derive most algorithms first for NMF problem
and next attempt to generalize them to the NTF problem,
and present basic concepts in clear and easy understandable
forms.
Most of known algorithms for the NTF/NMF model
are based on alternating least squares (ALS) minimization
of the squared Euclidean distance (Frobenius norm) [13],
[16],[18]. Especially, for NMF we minimize the following
cost function:
D
F
(Y ||
b
Y) =
1
2
kY AXk
2
F
,
b
Y = AX, (4)
and for the NTF model (2)
D
F
(Y ||
b
Y) =
1
2
Y
J
P
j=1
(u
(1)
j
u
(2)
j
··· u
(N)
j
)
2
F
, (5)
subject to nonnegativity constraints and often additional
constraints such as sparsity or smoothness [10]. Such for-
mulated problems can be considered as a natural exten-
sion of the extensively studied NNLS (Nonnegative Least
Squares) formulated as the following optimization problem:
“Given a matrix A R
I×J
and a set of the observed values
given by a vector y R
I
, find a nonnegative vector x R
J
to minimize the cost function J(x) =
1
2
||y Ax||
2
2
, i.e.,
min
x
J(x) =
1
2
||y Ax||
2
2
, (6)
subject to x 0” [13].
A basic approach to the above formulated optimization
problems (4-5) is alternating minimization or alternating
projection: The specified cost function is alternately min-
imized with respect to sets of parameters, each time opti-
mizing one set of arguments while keeping the others fixed.
It should be noted that the cost function (4) is convex with
respect to entries of A or X, but not both. Alternating mini-
mization of the cost function (4) leads to a nonnegativefixed
CICHOCKI and PHAN: ALGORITHMS FOR NONNEGATIVE MATRIX AND TENSOR FACTORIZATIONS
3
point ALS algorithm which can be described briefly as fol-
lows:
1. Initialize A randomly or by using the recursive appli-
cation of Perron-Frobenius theory to SVD [13].
2. Estimate X from the matrix equation A
T
AX = A
T
Y
by solving
min
X
D
F
(Y||AX) =
1
2
||Y AX||
2
F
, with fixed A.
3. Set all negative elements of X to zero or a small posi-
tive value.
4. Estimate A from the matrix equation XX
T
A
T
= XY
T
by solving
min
A
D
F
(Y||AX) =
1
2
||Y
T
X
T
A
T
||
2
F
, with fixed X.
5. Set all negative elements of A to zero or a small posi-
tive value ε.
The above ALS algorithm can be written in the following
explicit form
X max{ε, (A
T
A)
1
A
T
Y} := [A
Y]
+
, (7)
A max{ε, YX
T
(XX
T
)
1
} := [Y X
]
+
, (8)
where A
is the Moore-Penrose inverse of A, ε is a small
constant (typically, 10
16
) to enforce positive entries. Note
that max operator is performed component-wise for entries
of matrices. Various additional constraints on A and X can
be imposed [19].
For large scale NMF problem for J << I and J << K
the data matrix Y is usually low rank and in such cases we
do not need to process all elements of the matrix in order to
estimate factor matrices A and X (see Fig. 2). In fact, in-
stead of performing large scale factorization of (1), we can
consider alternating factorization of much smaller dimen-
sion problems:
Y
r
= A
r
X + E
r
, for fixed (known) A
r
, (9)
Y
c
= AX
c
+ E
c
, for fixed (known) X
c
, (10)
where Y
r
R
R×K
+
and Y
c
R
I×C
+
are matrices constructed
form preselected rows and columns of the data matrix Y, re-
spectively. Analogously, we construct reduced dimensions
matrices: A
r
R
R×J
and X
c
R
J×C
by using the same in-
dexes for columns and rows which were used for construc-
tion of the matrices Y
c
and Y
r
, respectively. There are sev-
eral strategies to chose columns and rows of the input matrix
data. The simplest scenario is to chose the first R rows and
the first C columns of data matrix Y. Alternatively, we can
select randomly, e.g., uniformly distributed, i.e. every N
row and column. Another option is to chose such rows and
columns that provide the largest
p
-norm values. For noisy
data with uncorrelated noise, we can construct new columns
and rows as local average (mean values) of some specific
numbers of columns and rows of the raw data. For exam-
ple, the first selected column is created as average of the
first M columns, the second columns is average of the next
@
I
R
C
Y
r
Y
c
X
c
A
r
J
J
(I T)´ (I J)´
(J T)´
C
R
T
T
Y
X
A
Fig.2 Conceptual illustration of processing of data for a large scale
NMF. Instead of processing the whole matrix Y R
I×K
, we process much
smaller dimensional block matrices Y
c
R
I×C
and Y
r
R
R×K
and cor-
responding factor matrices X
c
R
J×C
and A
r
R
R×J
with C << K and
R << I. For simplicity, we have assumed that the first R rows and the first
C columns of the matrices Y, A, X are chosen, respectively.
M columns, and so on. The same procedure is applied for
rows. Another approach is to cluster all columns and rows
in C and R cluster and select one column and one row form
each cluster, respectively. In practice, it is sucient to chose
J < R 4J and J < C 4J. In the special case, for squared
Euclidean distance (Frobenius norm) instead of alternating
minimizing the cost function:
D
F
(Y || AX) =
1
2
kY AXk
2
F
,
we can minimize sequentially two cost functions:
D
F
(Y
r
|| A
r
X) =
1
2
kY
r
A
r
Xk
2
F
, for fixed A
r
,
D
F
(Y
c
|| AX
c
) =
1
2
kY
c
AX
c
k
2
F
, for fixed X
c
.
Minimization of these cost functions with respect to X and
A, subject to nonnegativity constraints leads to simple ALS
update formulas for the large scale NMF:
A
h
Y
c
X
c
i
+
=
h
Y
c
X
T
c
(X
c
X
T
c
)
1
i
+
, (11)
X
h
A
r
Y
r
i
+
=
h
(A
T
r
A
r
)
1
A
T
r
Y
r
i
+
. (12)
The nonnegativeALS algorithm can be generalized for
the NTF problem (2) [16]:
U
(n)
Y
(n)
U
n
U
n
T
U
n
1
+
,
=
"
Y
(n)
U
n
n
U
T
U
o
n
1
#
+
, n = 1, . . . , N. (13)
where Y
(n)
R
I
n
×I
1
···I
n1
I
n+1
···I
N
+
is n-mode unfolded matrix of
the tensor Y R
I
1
×I
2
×···×I
N
+
and
n
U
T
U
o
n
= (U
(N)T
U
(N)
)
··· (U
(n+1)T
U
(n+1)
) (U
(n1)T
U
(n1)
) ··· (U
(1)T
U
(1)
).
At present, ALS algorithms for NMF and NTF are con-
sidered as “workhorse” approaches, however they may take
many iterations to converge. Moreover, they are also not
guaranteed to converge to a global minimum or even a sta-
tionary point, but only to a solution where the cost functions
cease to decrease [13],[16]. However, the ALS method can
be considerably improved and the computational complex-
ity reduced as will be shown in this paper.
4
IEICE TRANS. FUNDAMENTALS, VOL.Exx–??, NO.xx XXXX 200x
In fact, in this paper, we use a dierent and more so-
phisticated approach. Instead of minimizing one or two cost
functions, we minimize a set of local cost functions with the
same global minima (e.g., squared Euclidean distances and
Alpha or Beta divergences with a single parameter alpha or
beta). The majority of known algorithms for NMF work
only if the following assumption K >> I J is satisfied,
where J is the number of the nonnegative components. The
NMF algorithms developed in this paper are suitable also
for the under-determined case, i.e., for K > J > I, if sources
are sparse enough. Moreover, the proposed algorithms are
robust with respect to noise and suitable for large scale prob-
lems. Furthermore, in this paper we consider the extension
of our approach to NMF/NTF models with optional sparsity
and smoothness constraints.
3. Derivation of Fast HALS NMF Algorithms
Denoting the columns by A = [a
1
, a
2
, . . . , a
J
] and B =
[b
1
, b
2
, . . . , b
J
], we can express the squared Euclidean cost
function as
J(a
1
, . . . , a
J
, b
1
, . . . , b
J
) =
1
2
||Y AB
T
||
2
F
=
1
2
||Y
J
X
j=1
a
j
b
T
j
||
2
F
. (14)
The basic idea is to define residues:
Y
(j)
= Y
X
p, j
a
p
b
T
p
= Y AB
T
+ a
j
b
T
j
,
= Y AB
T
+ a
j1
b
T
j1
a
j1
b
T
j1
+ a
j
b
T
j
(15)
for j = 1, 2, . . . , J and minimize alternatively the set of cost
functions (with respect to set of parameters {a
j
} and {b
j
}):
D
(j)
A
(a
j
) =
1
2
||Y
(j)
a
j
b
T
j
||
2
F
, for a fixed b
j
, (16)
D
(j)
B
(b
j
) =
1
2
||Y
(j)
a
j
b
T
j
||
2
F
, for a fixed a
j
, (17)
for j = 1, 2, . . . , J subject to a
j
0 and b
j
0, respectively.
In other words, we minimize alternatively the set of
cost functions
D
(j)
F
(Y
(j)
||a
j
b
T
j
) =
1
2
||Y
(j)
a
j
b
T
j
||
2
F
, (18)
for j = 1, 2, . . . , J subject to a
j
0 and b
j
0, respectively.
The gradients of the local cost functions (18) with re-
spect to the unknown vectors a
j
and b
j
(assuming that other
vectors are fixed) are expressed by
D
(j)
F
(Y
(j)
||a
j
b
T
j
)
a
j
= a
j
b
T
j
b
j
Y
(j)
b
j
, (19)
D
(j)
F
(Y
(j)
||a
j
b
T
j
)
b
j
= b
j
a
T
j
a
j
Y
(j)T
a
j
. (20)
By equating the gradient components to zero and assuming
Algorithm 1 HALS for NMF: Given Y R
I×K
+
estimate
A R
I×J
+
and X = B
T
R
J×K
+
1: Initialize nonnegative matrix A and/or X = B
T
using ALS
2: Normalize the vectors a
j
(or b
j
) to unit
2
-norm length,
3: E = Y AB
T
;
4: repeat
5: for j = 1 to J do
6: Y
(j)
E + a
j
b
T
j
;
7: b
j
h
Y
(j)T
a
j
i
+
8: a
j
h
Y
(j)
b
j
i
+
9: a
j
a
j
/ka
j
k
2
;
10: E Y
(j)
a
j
b
T
j
;
11: end for
12: until convergence criterion is reached
that we enforce the nonnegativity constraints with a sim-
ple “half-wave rectifying” nonlinear projection, we obtain a
simple set of sequential learning rules:
b
j
1
a
T
j
a
j
h
Y
(j)T
a
j
i
+
, a
j
1
b
T
j
b
j
h
Y
(j)
b
j
i
+
, (21)
for j = 1, 2, . . . , J. We refer to these update rules as the
HALS algorithm which we first introduced in [3]. The same
or similar update rules for the NMF have been proposed
or rediscovered independently in [20]–[23]. However, our
practical implementations of the HALS algorithm are quite
dierent and allow various extensions to sparse and smooth
NMF, and also for the N-order NTF.
First of all, from the formula (15) it follows that we
do not need to compute explicitly the residue matrix Y
(j)
in
each iteration step but just smartly update it [24].
It is interesting to note that such nonlinear projections
can be imposed individuallyfor each source x
j
and/or vector
a
j
, so the algorithm can be directly extended to a semi-NMF
or a semi-NTF model in which some parameters are relaxed
to be bipolar (by removing the half-wave rectifying opera-
tor [·]
+
, if necessary). Furthermore, in practice, it is neces-
sary to normalize in each iteration step the column vectors
a
j
and/or b
j
to unit length vectors (in the sense of
p
-norm
(p = 1, 2, ..., )). In the special case of
2
-norm, the above
algorithm can be further simplified by ignoring denomina-
tors in (21) and imposing normalizationof vectors after each
iteration steps. The standard HALS local updating rules can
be written in a simplified scalar form:
b
kj
I
X
i=1
a
ij
y
(j)
ik
+
, a
ij
K
X
k=1
b
kj
y
(j)
ik
+
, (22)
with a
ij
a
ij
/||a
j
||
2
, where y
(j)
ik
= [Y
(j)
]
ik
= y
ik
P
p, j
a
ip
b
kp
. Ecient implementation of the HALS algo-
rithm (22) is illustrated by detailed pseudo-code given in
Algorithm 1.
3.1 Extensions and Practical Implementations of Fast
HALS
The above simple algorithm can be further extended or im-
proved (in respect of convergence rate and performance
CICHOCKI and PHAN: ALGORITHMS FOR NONNEGATIVE MATRIX AND TENSOR FACTORIZATIONS
5
and by imposing additional constraints such as sparsity and
smoothness). First of all, dierent cost functions can be
used for estimation of the rows of the matrix X = B
T
and the
columns of the matrix A (possibly with various additional
regularization terms [19],[25]). Furthermore, the columns
of A can be estimated simultaneously, instead of oneby one.
For example, by minimizing the set of cost functions in (4)
with respect to b
j
, and simultaneously the cost function (18)
with normalization of the columns a
j
to unit
2
-norm, we
obtain a very ecient NMF learning algorithm in which the
individual vectors of B are updated locally (column by col-
umn) and the matrix A is updated globally using nonnega-
tive ALS (all columns a
j
simultaneously) (see also [19]):
b
j
h
Y
(j) T
r
˜a
j
i
+
/(˜a
T
j
˜a
j
), A
h
Y
c
X
T
c
(X
c
X
T
c
)
1
i
+
, (23)
where ˜a
j
is an j-th vector of a reduced matrix A
r
R
R×J
+
.
Matrix A needs to be normalized to the unit length column
vectors in the
2
-norm sense after each iteration.
Alternatively, even more ecient approach is to per-
form factor by factor procedure, instead of updating
column-by column vectors [24]. From (21), we obtain the
following update rule for b
j
= x
T
j
b
j
Y
(j)T
a
j
/(a
T
j
a
j
) =
Y AB
T
+ a
j
b
T
j
T
a
j
/(a
T
j
a
j
)
= (Y
T
a
j
BA
T
a
j
+ b
j
a
T
j
a
j
)/(a
T
j
a
j
),
=
h
Y
T
A
i
j
B
h
A
T
A
i
j
+ b
j
a
T
j
a
j
/(a
T
j
a
j
), (24)
with b
j
h
b
j
i
+
. Due to ka
j
k
2
2
= 1, the learning rule for b
j
has a simplified form
b
j
b
j
+
h
Y
T
A
i
j
B
h
A
T
A
i
j
+
. (25)
Analogously to equation (24), the learning rule for a
j
is
given by
a
j
a
j
b
T
j
b
j
+ [YB]
j
A
h
B
T
B
i
j
+
, (26)
a
j
a
j
/ka
j
k
2
. (27)
Based on these expressions, we have designed and imple-
mented the improved and modified HALS algorithm given
below in the pseudo-code as Algorithm 2. For large scale
data and block-wise strategy, the fast HALS learning rule
for b
j
is rewritten from (24) as follows
b
j
b
j
+
h
Y
T
r
A
r
i
j
/k˜a
j
k
2
2
B
h
A
T
r
A
r
i
j
/k˜a
j
k
2
2
+
=
b
j
+
h
Y
T
r
A
r
D
A
r
i
j
B
h
A
T
r
A
r
D
A
r
i
j
+
(28)
where D
A
r
= diag(k˜a
1
k
2
2
, k˜a
2
k
2
2
, . . . , k˜a
J
k
2
2
) is a diagonal
matrix. The learning rule for a
j
has a similar form
a
j
a
j
+
Y
c
B
c
D
B
c
j
A
h
B
T
c
B
c
D
B
c
i
j
+
(29)
where D
B
c
= diag(k
˜
b
1
k
2
2
, k
˜
b
2
k
2
2
, . . . , k
˜
b
J
k
2
2
) and
˜
b
j
is the
j-th vector of the reduced matrix B
c
= X
T
c
R
C×J
+
.
Algorithm 2 FAST HALS for NMF: Y AB
T
1: Initialize nonnegative matrix A and/or B using ALS
2: Normalize the vectors a
j
(or b
j
) to unit
2
-norm length
3: repeat
4: % Update B;
5: W = Y
T
A;
6: V = A
T
A;
7: for j = 1 to J do
8: b
j
h
b
j
+ w
j
B v
j
i
+
9: end for
10: % Update A;
11: P = YB;
12: Q = B
T
B;
13: for j = 1 to J do
14: a
j
h
a
j
q
jj
+ p
j
A q
j
i
+
15: a
j
a
j
/ka
j
k
2
;
16: end for
17: until convergence criterion is reached
3.2 HALS NMF Algorithm with Sparsity and Smoothness
Constraints
In order to impose sparseness and smoothness constraints
for vectors b
j
(source signals), we can minimize the follow-
ing set of cost functions:
D
(j)
F
(Y
(j)
ka
j
b
T
j
) =
1
2
kY
(j)
a
j
b
T
j
k
2
F
+
+α
sp
kb
j
k
1
+ α
sm
kϕ(L b
j
)k
1
, (30)
for j = 1, 2, . . . , J subject to a
j
0 and b
j
0, where
α
sp
> 0, α
sm
> 0 are regularization parameters control-
ling level of sparsity and smoothness, respectively, L is a
suitably designed matrix (the Laplace operator) which mea-
sures the smoothness (by estimating the dierences between
neighboring samples of b
j
)
and ϕ : R R is an edge-
preserving function applied componentwise. Although this
edge-preserving nonlinear function may take various forms
[26]:
ϕ(t) = |t|
α
/α, 1 α 2, (31)
ϕ(t) =
α + t
2
, (32)
ϕ(t) = 1 + |t| log(1 + |t|), α > 0, (33)
we restrict ourself to simple cases, where ϕ(t) = |t|
α
for
α = 1 or 2, and L is the derivative operator of the first or
second order. For example, the first order derivativeoperator
L with K points can take the form:
L =
1 1
1 1
.
.
.
.
.
.
1 1
(34)
and the cost function (30) becomes similar to the total-
variation (TV) regularization (which is often used in sig-
nal and image recovery ) but with additional sparsity con-
straints:
In the special case for L = I
K
and ϕ(t) = |t|, the smoothness
regularization term becomes sparsity term.
6
IEICE TRANS. FUNDAMENTALS, VOL.Exx–??, NO.xx XXXX 200x
D
(j)
F
(Y
(j)
ka
j
b
T
j
) =
1
2
Y
(j)
a
j
b
T
j
2
F
+ α
sp
kb
j
k
1
+
+ α
sm
K1
X
k=1
|b
k j
b
(k+1) j
|. (35)
Another important case assumes that ϕ(t) =
1
2
|t|
2
and L is
the second order derivative operator with K points. In such
a case, we obtain the Tikhonov-like regularization:
D
(j)
F
(Y
(j)
ka
j
b
T
j
) =
1
2
kY
(j)
a
j
b
T
j
k
2
F
+ α
sp
kb
j
k
1
+
+
1
2
α
sm
kLb
j
k
2
2
. (36)
In the such case the update rule for a
j
is the same as in (21),
whereas the update rule for b
j
is given by:
b
j
(I + α
sm
L
T
L)
1
(Y
(j)T
a
j
α
sp
1
K
). (37)
where 1
K
R
K
is a vector with all one. This learning rule
is robust to noise, however, it involves a rather high compu-
tational cost due to the calculation of an inverse of a large
matrix in each iteration. To circumvent this problem and
to considerably reduce the complexity of the algorithm we
present a second-order smoothing operator L in the follow-
ing form:
L =
2 2
1 2 1
1 2 1
.
.
.
.
.
.
1 2 1
2 2
=
2
2
2
.
.
.
2
2
+
0 2
1 0 1
1 0 1
.
.
.
.
.
.
.
.
.
1 0 1
2 0
= 2I + 2S. (38)
However, instead of computingdirectly Lb
j
= 2Ib
j
+2Sb
j
,
in the second term we replace b
j
by its estimation
ˆ
b
j
ob-
tained from the previous update. Hence, a new smoothing
regularization term with ϕ(t) = t
2
/8 takes a simplified and
computationally more ecient form
J
sm
= kϕ( 2b
j
+ 2S
ˆ
b
j
)k
1
=
1
2
kb
j
S
ˆ
b
j
k
2
2
. (39)
Finally, the learning rule of the regularized HALS algorithm
takes the following form:
b
j
h
Y
(j)T
a
j
α
sp
1
K
+ α
sm
S
ˆ
b
j
i
+
/(a
T
j
a
j
+ α
sm
)
=
h
Y
(j)T
a
j
α
sp
1
K
+ α
sm
S
ˆ
b
j
i
+
/(1 + α
sm
) . (40)
Alternatively, for a relatively small dimension of matrix A,
an ecient solution is based on a combination of a local
learning rule for the vectors of B and a global one for A,
based on the nonnegative ALS algorithm:
b
j
h
Y
(j)T
a
j
α
sp
1
K
+ α
sm
S
ˆ
b
j
i
+
/(1 + α
sm
),
A
h
Y
c
X
T
c
(X
c
X
T
c
)
1
i
+
, (41)
with the normalization (scaling) of the columns of A to the
unit length
2
-norm.
An importantopenproblemis an optimal choice of reg-
ularization parameters α
sm
. Selection of appropriateregular-
ization parameters plays a key role. Similar to the Tikhonov-
like regularization approach we selected an optimal α
sm
by
applying the L-curve technique [27] to estimate a corner of
the L-curve. However, in the NMF, since both matrices A
and X are unknown, the procedure is slightly dierent: first,
we initiate α
sm
= 0 and perform a preliminary update to ob-
tain A and X; next we set α
sm
by the L-curve corner based
on the preliminary estimated matrix A; then, we continue
updating until convergence is achieved.
4. Fast HALS NTF Algorithm Using Squared Eu-
clidean Distances
The above approaches can be relatively easily extended to
the NTF problem. Let us consider sequential minimization
of a set of local cost functions:
D
(j)
F
(Y
(j)
||
b
Y
(j)
)
=
1
2
Y
(j)
u
(1)
j
u
(2)
j
··· u
(N)
j
2
F
(42)
=
1
2
Y
(j)
(n)
u
(n)
j
n
u
j
o
n
T
2
F
, (43)
for j = 1, 2, . . . , J, subject to the nonnegativity constraints,
where
b
Y
(j)
= u
(1)
j
u
(2)
j
···u
(N)
j
,
n
u
j
o
n
T
= [u
(N)
j
]
T
···
[u
(n+1)
j
]
T
[u
(n1)
j
]
T
··· [u
(1)
j
]
T
and
Y
(j)
= Y
X
p, j
u
(1)
p
u
(2)
p
··· u
(N)
p
(44)
= Y
J
X
p=1
(u
(1)
p
··· u
(N)
p
) + (u
(1)
j
··· u
(N)
j
)
= Y
b
Y + ~{u
j
}. (45)
where ~{u
j
} = u
(1)
j
··· u
(N)
j
is a rank-one tensor. Note
that (43) is the nmode matricized (unfolded) version of
(42). The gradients of (43) with respect to elements u
(n)
j
are given by
D
(j)
F
u
(n)
j
= Y
(j)
(n)
n
u
j
o
n
+ u
(n)
j
n
u
j
o
n
T
n
u
j
o
n
(46)
= Y
(j)
(n)
n
u
j
o
n
+ u
(n)
j
γ
(n)
j
, (47)
where scaling coecients γ
(n)
j
can be computed as follows:
γ
(n)
j
=
n
u
j
o
n
T
n
u
j
o
n
=
n
u
T
j
u
j
o
n
=
n
u
T
j
u
j
o
/
u
(n)T
j
u
(n)
j
=
u
(N)T
j
u
(N)
j
/
u
(n)T
j
u
(n)
j
=
u
(N)T
j
u
(N)
j
, n , N
1, n = N.
(48)
Hence, a new HALS NTF learning rule for u
(n)
j
, (j =
CICHOCKI and PHAN: ALGORITHMS FOR NONNEGATIVE MATRIX AND TENSOR FACTORIZATIONS
7
1, 2, . . . , N; n = 1, 2, . . . , N) is obtained by equating the
gradient (47) to zero:
u
(n)
j
Y
(j)
(n)
n
u
j
o
n
. (49)
Note that the scaling factors γ
(n)
j
have been ignored due to
normalization after each iteration step u
(n)
j
= u
(n)
j
/ku
(n)
j
k
2
for n = 1, 2, . . . N 1. The learning rule (49) can be written
in an equivalent form expressed by n mode multiplication of
tensor by vectors:
u
(n)
j
Y
(j)
×
1
u
(1)
j
···×
n1
u
(n1)
j
×
n+1
u
(n+1)
j
···×
N
u
(N)
j
:= Y
(j)
×
n
{u
j
}, j = 1, . . . , J; n = 1, . . . , N. (50)
For simplicity, we use here a short notation Y
(j)
×
n
{u
T
j
} in-
troduced by Kolda and Bader [28] to indicate multiplication
of the tensor Y by vectors in all modes, but n-mode. The
above updating formula is elegant and relatively simple but
involvesrather high computational cost for large scale prob-
lems. In order to derive a more ecient (faster) algorithm
we exploit basic properties the Khatri-Rao and Kronecker
products of two vectors:
h
U
(1)
U
(2)
i
j
=
h
u
(1)
1
u
(2)
1
. . . u
(1)
J
u
(2)
J
i
j
= u
(1)
j
u
(2)
j
or in more general form:
n
u
j
o
n
=
h
U
n
i
j
. (51)
Hence, by replacing Y
(j)
(n)
terms in (49) by those in (45), and
taking into account (51), the update learning rule (49) can
be expressed as
u
(n)
j
Y
(n)
h
U
n
i
j
b
Y
(n)
h
U
n
i
j
+ ~{u
j
}
(n)
n
u
j
o
n
=
Y
(n)
U
n
j
U
(n)
U
n
T
U
n
j
+ u
(n)
j
n
u
j
o
n
T
n
u
j
o
n
=
h
Y
(n)
U
n
i
j
U
(n)
h
U
n
T
U
n
i
j
+ γ
(n)
j
u
(n)
j
=
h
Y
(n)
U
n
i
j
U
(n)
n
U
T
U
o
n
j
+ γ
(n)
j
u
(n)
j
=
Y
(n)
U
n
j
U
(n)
n
U
T
U
o
U
(n)T
U
(n)
j
+ γ
(n)
j
u
(n)
j
, (52)
subject to the normalization of vectors u
(n)
j
for n =
1, 2, . . . , N 1 to unit length. In combination with a compo-
nentwise nonlinear half-wave rectifying operator, we finally
have a new algorithm referred as the Fast HALS NTF algo-
rithm:
u
(n)
j
"
γ
(n)
j
u
(n)
j
+
Y
(n)
U
n
j
U
(n)
n
U
T
U
o
U
(n)T
U
(n)
j
#
+
. (53)
The detailed pseudo-code of this algorithm is given in Al-
gorithm 3. In a special case of N = 2, FAST-HALS NTF
becomes FAST-HALS NMF algorithm described in the pre-
vious section.
For 3-way tensor, direct trilinear decomposition could be used
as initialization.
††
In practice, vectors u
(n)
j
have often fixed sign before rectifying.
Algorithm 3 FAST-HALS NTF
1: Nonnegative random or nonnegative ALS initialization U
(n)
2: Normalize all u
(n)
j
for n = 1, . . . , N 1 to unit length
3: T
1
= (U
(1)T
U
(1)
) . . . (U
(N)T
U
(N)
)
4: repeat
5: γ = diag(U
(N)T
U
(N)
)
6: for n = 1 to N do
7: γ = 1 if n = N
8: T
2
= Y
(n)
{U
n
}
9: T
3
= T
1
(U
(n)T
U
(n)
)
10: for j = 1 to J do
11: u
(n)
j
h
γ
j
u
(n)
j
+ [T
2
]
j
U
(n)
[T
3
]
j
i
+
††
12: u
(n)
j
= u
(n)
j
/ku
(n)
j
k
2
if n , N
13: end for
14: T
1
= T
3
U
(n)T
U
(n)
15: end for
16: until convergence criterion is reached
5. Flexible Local Algorithms Using Alpha Divergence
The algorithms derived in previous sections can be extended
to more robust algorithms by applying a family of general-
ized Alpha and Beta divergences.
For the NMF problem (1) we define the Alpha diver-
gence as follows (similar to [14],[18],[25],[29]):
D
(j)
α
([Y
(j)
]
+
) || a
j
x
j
=
X
ik
z
(j)
ik
α(α + 1)
z
(j)
ik
y
(j)
ik
α
1
z
(j)
ik
y
(j)
ik
α + 1
, α , 1, 0, (54a)
X
ik
(z
(j)
ik
) ln
z
(j)
ik
y
(j)
ik
z
(j)
ik
+ y
(j)
ik
, α=0, (54b)
X
ik
y
(j)
ik
ln
y
(j)
ik
z
(j)
ik
+ z
(j)
ik
y
(j)
ik
, α=-1, (54c)
where y
(j)
ik
= [Y]
ik
P
p, j
a
ip
x
pk
and z
(j)
ik
= a
ij
x
jk
= a
ij
b
kj
for
j = 1, 2, . . . , J.
The choice of parameter α R depends on statistical
distributionsof noise and data. In the special cases of the Al-
pha divergence for α = {1, 0.5, 2}, we obtain respectively
the Pearson’s chi squared, Hellinger’s, and Neyman’s chi-
square distances while for the cases α = 0 and α = 1, the
divergence has to be defined by the limits of (54a) as α 0
and α 1, respectively. When these limits are evaluated
for α 0 we obtain the generalized Kullback-Leibler di-
vergence defined by Eq. (54b) whereas for α 1 we have
the dual generalized Kullback-Leibler divergence given in
Eq. (54c) [1],[14],[19],[25].
The gradient of the Alpha divergence (54) for α , 1
with respect to a
ij
and b
kj
can be expressed in a compact
form as:
D
(j)
α
b
kj
=
1
α
X
i
a
ij
z
(j)
ik
y
(j)
ik
α
1
, (55)
D
(j)
α
a
ij
=
1
α
X
k
b
kj
z
(j)
ik
y
(j)
it
α
1
. (56)
8
IEICE TRANS. FUNDAMENTALS, VOL.Exx–??, NO.xx XXXX 200x
By equating the gradients to zero, we obtain a new multi-
plicative local α-HALS algorithm:
b
j
h
Y
(j) T
i
.[α]
+
a
j
a
T
j
a
.[α]
j
.[1]
, a
j
h
Y
(j)
i
.[α]
+
b
j
b
T
j
b
.[α]
j
.[1]
, (57)
where the “rise to the power” operations x
.[α]
are performed
componentwise. The above algorithm can be generalized to
the following form
b
j
Ψ
1
Ψ
h
Y
(j)T
i
+
a
j
a
T
j
Ψ(a
j
)
, a
j
Ψ
1
Ψ
h
Y
(j)
i
+
b
j
b
T
j
Ψ(b
j
)
, (58)
where Ψ(x) is suitable chosen function, for example, Ψ(x) =
x
.[α]
, componentwise
.
In a similar way, novel learning rules for the N-order
NTF problem (2) can be derived. For this purpose, we con-
sider the n-mode matricized (unfolded) version of the tensor
Y
Y
(n)
= U
(n)
(U
n
)
T
. (59)
Actually, this can be considered as an NMF model with A
U
(n)
and B U
n
. From (51), we have
b
j
=
h
U
n
i
j
=
n
u
j
o
n
. (60)
Applying directly the learning rule (58) to the model (59)
gives
u
(n)
j
Ψ
1
Ψ
h
Y
(j)
(n)
i
+
b
j
b
T
j
Ψ(b
j
)
, (61)
where Y
(j)
(n)
is an n-mode matricized version of Y
(j)
in (45)
Y
(j)
(n)
= Y
(n)
b
Y
(n)
+ u
(n)
j
b
T
j
= Y
(n)
b
Y
(n)
+ u
(n)
j
n
u
j
o
n
T
= Y
(n)
b
Y
(n)
+ ~{u
j
}
(n)
. (62)
For a specific nonlinear function Ψ(·) (Ψ(x) = x
α
)
Ψ(b
j
) = Ψ({u
j
}
n
)
= Ψ(u
(N)
j
)··· Ψ(u
(n+1)
j
) Ψ(u
(n1)
j
)··· Ψ(u
(1)
j
)
= {Ψ(u
j
)}
n
, (63)
and the denominator in (61) can be simplified as
b
T
j
Ψ(b
j
) = {u
j
}
n
T
{Ψ(u
j
)}
n
= {u
T
j
Ψ(u
j
)}
n
, (64)
this completesthe derivationof a flexible Alpha-HALS NTF
update rule, which in the tensor form is given by
u
(n)
j
Ψ
1
Ψ
[Y
(j)
]
+
×
n
{u
j
}
n
u
T
j
Ψ(u
j
)
o
n
+
, (65)
where all nonlinear operations are componentwise
††
.
For α = 0 instead of Φ(x) = x
α
we used Φ(x) = ln(x) [18].
††
In practice, instead of half-wave rectifying we often use dif-
ferent transformations, e.g., real part of Ψ(x) or adaptive nonneg-
ative shrinkage function with gradually decreasing threshold till
variance of noise σ
2
noise
.
Algorithm 4 Alpha-HALS NTF
1: ALS or random initialization for all nonnegative vectors u
(n)
j
2: Normalize all u
(n)
j
for n = 1, 2, ..., N 1 to unit length,
3: Compute residue tensor E = Y ~{U} = Y
b
Y
4: repeat
5: for j = 1 to J do
6: Compute Y
(j)
= E + u
(1)
j
u
(2)
j
. . . u
(N)
j
7: for n = 1 to N do
8: u
(n)
j
as in (65)
9: Normalize u
(n)
j
to unit length vector if n , N
10: end for
11: Update E = Y
(j)
u
(1)
j
u
(2)
j
. . . u
(N)
j
12: end for
13: until convergence criterion is reached
6. Flexible HALS Algorithms Using Beta Divergence
Beta divergence can be considered as a flexible and com-
plementary cost function to the Alpha divergence. In order
to obtain local NMF algorithms we introduce the following
definition of the Beta divergence (similar to [14],[18],[30]):
D
(j)
β
([Y
(j)
]
+
|| a
j
x
j
) =
X
ik
([y
(j)
ik
]
+
)
[y
(j)
ik
]
β
+
z
(j)β
ik
β
[y
(j)
ik
]
β+1
+
z
(j) β+1
ik
β + 1
, β > 0, (66a)
X
ik
([y
(j)
ik
]
+
) ln
[y
(j)
ik
]
+
z
(j)
ik
[y
(j)
ik
]
+
+ z
(j)
ik
, β=0, (66b)
X
ik
ln
z
(j)
ik
[y
(j)
ik
]
+
+
[y
(j)
ik
]
+
z
(j)
ik
1
, β=-1, (66c)
where y
(j)
ik
= y
ik
P
p, j
a
ip
b
kp
and z
(j)
ik
= a
ij
x
jk
= a
ij
b
kj
for j = 1, 2, . . . , J. The choice of the real-valued parameter
β 1 depends on the statistical distribution of data and
the Beta divergence corresponds to Tweedie models [14],
[19],[25],[30]. For example, if we consider the Maximum
Likelihood (ML) approach (with no a priori assumptions)
the optimal estimation consists of minimization of the Beta
Divergence measure when noise is Gaussian with β = 1.
For the Gamma distribution β = 1, for the Poisson distri-
bution β = 0, and for the compound Poisson β (1, 0).
However, the ML estimation is not optimal in the sense of
a Bayesian approach where a priori information of sources
and mixing matrix (sparsity, nonnegativity)can be imposed.
It is interesting to note that the Beta divergence as special
cases includes the standard squared Euclidean distance (for
β = 1), the Itakura-Saito distance (β = 1), and the general-
ized Kullback-Leibler divergence ( β = 0).
In order to derive a local learning algorithm, we com-
pute the gradient of (66), with respect to elements to b
kj
, a
ij
:
CICHOCKI and PHAN: ALGORITHMS FOR NONNEGATIVE MATRIX AND TENSOR FACTORIZATIONS
9
Algorithm 5 Beta-HALS NTF
1: Initialize randomly all nonnegative factors U
(n)
2: Normalize all u
l, j
for l = 1...N 1 to unit length,
3: Compute residue tensor E = Y ~{U} = Y
b
Y
4: repeat
5: for j = 1 to J do
6: Compute Y
(j)
= E + u
(1)
j
u
(2)
j
. . . u
(N)
j
7: for n = 1 to N 1 do
8: u
(n)
j
h
Y
(j)
×
n
{Ψ(u
j
)}
i
+
9: Normalize u
(n)
j
to unit length vector
10: end for
11: u
(N)
j
"
Y
(j)
×
N
{Ψ(u
j
)}
{Ψ(u
j
)
T
u
j
}
n
#
+
12: Update E = Y
(j)
u
(1)
j
u
(2)
j
. . . u
(N)
j
13: end for
14: until convergence criterion is reached
D
(j)
β
b
kj
=
X
i
z
(j) β
ik
([y
(j)
ik
]
+
) z
(j) β1
ik
a
ij
, (67)
D
(j)
β
a
ij
=
X
k
z
(j) β
ik
([y
(j)
ik
]
+
) z
(j) β1
ik
b
kj
. (68)
By equating the gradient components to zero, we obtain a
set of simple HALS updating rules referred to as the Beta-
HALS algorithm:
b
kj
1
P
I
i=1
a
β+1
ij
I
X
i=1
a
β
ij
([y
(j)
ik
]
+
) , (69)
a
ij
1
P
K
k=1
b
β+1
kj
K
X
k=1
b
β
kj
([y
(j)
ik
]
+
). (70)
The above update rules can be written in a generalized com-
pact vector form as
b
j
([Y
(j) T
]
+
)Ψ(a
j
)
Ψ(a
T
j
) a
j
, a
j
([Y
(j)
]
+
) Ψ(b
j
)
Ψ(b
T
j
) b
j
, (71)
where Ψ(b) is a suitably chosen convex function (e.g.,
Ψ(b) = b
.[β]
) and the nonlinear operations are performed
element-wise.
The above learning rules could be generalized for the
N-order NTF problem (2) (using the similar approach as for
the Alpha-HALS NTF):
u
(n)
j
([Y
(j)
(n)
]
+
) Ψ(b
j
)
Ψ(b
T
j
) b
j
, (72)
where b
j
= {u
j
}
n
, and Y
(j)
(n)
are defined in (62) and (45).
By taking into account (63), the learning rule (72) can
be written as follows
u
(n)
j
([Y
(j)
(n)
]
+
) {Ψ(u
j
)}
n
{Ψ(u
j
)}
n
T
{u
j
}
n
=
[Y
(j)
]
+
×
n
{Ψ(u
j
)}
{Ψ(u
j
)
T
u
j
}
n
.(73)
Actually, the update rule (73) can be simplified to reduce
computational cost by performing normalization of vectors
u
(n)
j
for n = 1, . . . , N 1 to unit length vectors after each
iteration step:
u
(n)
j
h
Y
(j)
×
n
{Ψ(u
j
)}
i
+
, u
(n)
j
u
(n)
j
/ku
(n)
j
k
2
.(74)
The detailed pseudo-codeof the Beta-HALS NTF algorithm
is given in Algorithm 5. Once again, this algorithm can be
rewritten in the fast form as follows
u
(n)
j
"
γ
(n)
j
u
(n)
j
+
Y
(n)
{Ψ(U)}
n
j
U
(n)
n
Ψ(U)
T
U
o
n
j
#
+
(75)
where γ
(n)
j
= {Ψ(u
T
j
)u
j
}
n
, n = 1, . . . , N. The Fast HALS
NTF algorithm is a special case with Ψ(x) = x.
In order to avoid local minima we have also developed
a simple heuristic hierarchical Alpha- and Beta- HALS NTF
algorithms combined with multi-start initializations using
the ALS as follows:
1. Perform factorization of a tensor for any value of α or
β parameters (preferably, set the value of the param-
eters to unity due to simplicity and high speed of the
algorithm for this value).
2. If the algorithm has convergedbut has not achieved the
desirable fit value (FIT max), restart the factorization
by keeping the previously estimated factors as the ini-
tial matrices for the ALS initialization.
3. If the algorithm does not converge, alter the values of
α or β parameters incrementally; this may help to over-
step local minima.
4. Repeat the procedure until a desirable fit value is
reached or there is a negligible or no change in the fit
value or a negligible or no change in the factor matri-
ces, or the value of the cost function in negligible or
zero.
7. Simulation Results
Extensive simulations were performed for synthetic and
real-world data on a 2.66 GHz Quad-Core Windows 64-bit
PC with 8GB memory. For tensor factorization, the results
were compared with some existing algorithms: the NMWF
[31], the lsNTF [32] and also with two ecient implementa-
tions of general form of PARAFAC ALS algorithm by Kolda
and Bader [16] (denoted as ALS K) and by Andersson and
Bro [33] (denoted as ALS B). To make a fair comparison
we apply the same stopping criteria and conditions: maxi-
mum dierence of fit value, and we used three performance
indexes: Peak Signal to Noise Ratio (PSNR) for all frontal
slices, Signal to Interference Ratio (SIR)
for each columns
of factors, and the explained variation ratio (i.e., how well
the approximated tensor fit the input data tensor) for a whole
tensor.
The signal to interference ratio is defined as S IR(a
j
, ˆa
j
) =
10log(||a
j
||
2
2
/(||a
j
ˆa
j
||
2
2
)) for normalized and matched vectors.
10
IEICE TRANS. FUNDAMENTALS, VOL.Exx–??, NO.xx XXXX 200x
7.1 Experiments for NMF
In Example 1 we compare our HALS algorithms with
the multiplicative Lee-Seung algorithm [34] and Chih-Lin
Projected Gradient (PG) algorithm [35] for the benchmark
Xspectra [36] (see Fig.3(b)). Ten mixtures were randomly
generated from 5 sources (Fig.3(a)). We selected α = 1.5
for α-HALS and β = 2 for β-HALS in order to show the dif-
ference in performance in comparison to the standard gen-
eralized Kullback-Leibler (K-L) divergence. Monte Carlo
analysis was also performed with 100 trials and the average
values of SIR for X and running time for each trial were
summarized on Fig.3(c). Fast HALS NMF, α-HALS and β-
HALS achievedhigher performancethan the twoother well-
known NMF algorithms. The simulation results for Exam-
ple 2 presented in Fig.4 were performed for the synthetic
benchmark(Fig.4(a)) with 10 sparse (non-overlapping)non-
negative components. The sources were mixed by the ran-
domly generated full column rank matrix A R
2×10
+
, so
only two mixed signals were available. The typical mixed
signals are shown in Fig.4(b). The estimated components
by the new β-HALS NMF algorithm (69)-(71) with β =
0.1 are illustrated in Fig.4(c). Moreover, the performance
for dierent values of the parameter β are illustrated in
Fig.4(d) and 4(e) with average Signal-to-Interference (SIR)
level greater than 30 [dB]. Since the proposed algorithms
(alternating technique) perform a non-convex optimization,
the estimated components depend on the initial conditions.
To estimate the performance in a statistical sense, we per-
formed a Monte Carlo (MC) analysis. Figures 4(d) and 4(e)
present the histograms of 100 mean-S IR samples for esti-
mations matrices A and X. We also conducted an experi-
ment for the large scale similar problem in which we used
100 very sparse non-overlapped source signals and we mix
them by random generated full column rank mixing ma-
trix A R
2×100
+
(i.e., only two mixtures were used). Us-
ing the same algorithm but with 25 NMF layers, we were
able to recover most of the sources in high probability.
The performance is evaluated through the correlation matrix
R
X
=
ˆ
X X
T
which should be a diagonal matrix for a perfect
estimation (given in Fig. 5(a)). Whereas distribution of the
SIR performance is shown in Fig. 5(b). Detailed results are
omitted due to space limits.
In Example 3 we used five noisy mixtures of three
smooth sources (benchmark signals X 5smooth [36]).
Mixed signals were corrupted by additive Gaussian noise
with SNR = 15 [dB] (Fig.6(a)). Fig.6 (c) illustrates e-
ciency of the HALS NMF algorithm with smoothness con-
straints using updates rules (41), including the Laplace op-
erator L of the second order. The estimated components
by the smooth HALS NMF using 3 layers [14] are depicted
in Fig.6(b), whereas the results of the same algorithm with
the smoothness constraint achievedS IR A = 29.22 [dB] and
S IR X = 15.53 [dB] are shown in Fig.6(c).
7.2 Experiments for NTF
In Example 4, we applied the NTF to a simple denois-
ing of images. At first, a third-order tensor Y R
51×51×40
+
whose each layer was generated by the L-shaped membrane
function (which creates the MATLAB logo) Y[:, :, k] =
kmembrane(1, 25), k = 1, . . . , 40 has been corrupted by ad-
ditive Gaussian noise with SNR 10 [dB] (Fig. 7(a)). Next,
the noisy tensor data has been approximated by NTF model
using our α-HALS and β-HALS algorithms with fit value
96.1%. Fig.7(a), 7(b) and 7(c) are surface visualizations of
the 40-th noisy slice, and its reconstructed slices by α and
β-HALS NTF (α = 2, β = 2), whereas Fig.7(d), 7(e) and 7(f)
are their iso-surface visualizations, respectively. In addition,
the performance for dierent values of parameters α and β
are illustrated in Fig. 7(g) and 7(h) with PSNR in the left
(blue) axis and number of iterations in the right (red) axis.
In Example 5, we constructed a large scale tensor
with size of 500 × 500 ×500 corrupted by additive Gaus-
sian noise with SNR = 0 [dB] by using three benchmarks
X spectra sparse, ACPos24sparse10 and X spectra
[36] (see Fig.8(a)) and successfully reconstructed original
sparse and smooth components using α- and β-HALS NTF
algorithms. The performance is illustrated via volume, iso-
surface and factor visualizations as shown in Fig. 8(b), 8(c)
and 8(f); while running time and distributions of SIR and
PSNR performance factors are depicted in Fig. 8(g). Slice
10 and its reconstructed slice are displayed in Fig.8(d) and
8(e). In comparison to the known NTF algorithms the Fast
HALS NTF algorithm provides a higher accuracy for fac-
tor estimation based on SIR index, and the higher explained
variation with the faster running time.
In Example 6, we tested the Fast HALS NTF algo-
rithm for real-world data: Decomposition of amino acids
fluorescence data (Fig.9(a)) from five samples containing
tryptophan, phenylalanine, and tyrosine (claus.mat) [33],
[37]. The data tensor was additionally corrupted by Gaus-
sian noise with SNR = 0 dB (Fig.9(b)) , and the factors were
estimated with J = 3. The β-HALS NTF was selected with
β = 1.2, where for α-HALS NTF we select α = 0.9. All
algorithms were set to process the data with the same num-
ber of iterations (100 times). The performances and running
times are compared in Fig. 10, and also in Table 3. In this
example, we applied a smoothness constraint for Fast NTF,
α- and β- HALS NTF. Based on fit ratio and PSNR index
we see that, HALS algorithms usually exhibited better per-
formance than standard NTF algorithms. For example, the
first recovered slice (Fig.9(c)) is almost identical to the slice
of the clean original tensor (99.51% Fit value). In compar-
ison, the NMWF, lsNTF, ALS K, ALS B produced some
artifacts as illustrated in Fig.9(d). Fig.9(e) and Fig.9(f).
In Example 7 we used real EEG data: tutorial-
dataset2.zip [38] which was pre-processed by complex
Morlet wavelet. The tensor is represented by the inter-trial
phase coherence (ITPC) for 14 subjects during a proprio-
ceptive pull of left and right hand (28 files) with size 64
CICHOCKI and PHAN: ALGORITHMS FOR NONNEGATIVE MATRIX AND TENSOR FACTORIZATIONS
11
0
2
4
y1
0
5
10
0
5
y3
0
5
10
0
5
10
0
5
y6
0
2
4
y7
0
5
10
0
5
10
100 200 300 400 500 600 700 800 900 1000
0
5
10
(a) 10 mixtures of dataset Xspectra
0
10
20
0
20
40
0
10
20
0
10
20
100 200 300 400 500 600 700 800 900 1000
0
20
40
(b) β-HALS (β = 2)
0
0.5
1
1.5
2
2.5
3
Time in second
FastHALS
α-HALS
β-HALS
Lee-Seung
PG
(c) SIR for X and running time
Fig.3 Comparison of the Fast HALS NMF, α-HALS, β-HALS, Lee-
Seung and PG algorithms in Example 1 with the data set Xspectra. (a)
observed mixed signals, (b) reconstructed original spectra (sources) using
the β-HALS algorithm, (c) SIRs for the matrix X and computation time for
dierent NMF algorithms.
0
200
400
0
200
400
0
200
400
0
200
400
0
200
400
0
200
400
0
500
1000
0
200
400
0
200
400
100 200 300 400
0
200
400
(a) 10 sources
0
2
4
6
8
10
12
100 200 300 400
0
5
10
15
20
(b) 2 mixtures
0
10
20
0
10
20
0
20
40
0
20
40
0
20
40
0
20
40
0
10
20
0
20
40
0
10
20
100 200 300 400
0
10
20
(c) β-HALS, β = 0.1
0.1 0.5 0.8 1 1.3
40
60
80
100
120
140
SIR [dB]
beta
Mean SIR for A
(d) SIR for A
0.1 0.5 0.8 1 1.3
50
100
150
200
250
SIR [dB]
Mean SIR for X
beta
(e) SIR for X
Fig.4 Illustration of performance of the β-HALS NMF algorithm (a) 10
sparse sources assumed to be unknown, (b) two mixtures, (c) 10 estimated
sources for β = 0.1. (d) & (e) SIR values for matrix A and sources X
(respectively) obtained by the β-HALS NMF for β = 0.1, 0.5, 0.8, 1, 1.3 in
the MC analysis of 100 trials.
× 4392 × 28. Exemplary results are shown in Fig.11 with
scalp topographic maps and their corresponding IPTC time-
frequency measurements and performance comparisons are
10 20 30 40 50 60 70 80 90 100
10
20
30
40
50
60
70
80
90
100
X
ˆ
X
(a) correlation matrix R =
ˆ
X X
T
0 50 100 150 200
0
10
20
30
40
50
60
SIR (dB)
No. sources
(b) SIR distribution
Fig.5 Visualization of performance of extraction 100 sparse sources
from only two linear mixtures for Example 2.
−5
0
5
y1
−5
0
5
y2
−5
0
5
y3
−5
0
5
y4
200 400 600 800 1000
−5
0
5
y5
(a) Noisy mixtures, 10
dB Gaussian noise
0
5
x1
0
5
x2
200 400 600 800 1000
0
5
x3
(b) Noisyestimated com-
ponents
0
5
x1
0
2
4
x2
200 400 600 800 1000
0
2
4
x3
(c) Smoothed compo-
nents
Fig.6 Illustration of performance of the regularized HALS NMF algo-
rithm for Example 3.
given in Table 3. The components of the first factor U
(1)
are
relative to location of electrodes, and they are used to illus-
trate the scalp topographic maps (the first row in Fig.11);
whereas the 2-nd factor U
(2)
represents the frequency-time
spectral maps which were vectorized, presented in the sec-
ond row. Each component of these factors corresponds to a
specific stimulus (left, right and both hands actions).
In Example 8 we performed feature extraction for the
CBCL face data set. The tensor was formed using the first
100 images of dimension 19 × 19 and then factorized by
using 49 components and 100 components. The β-HALS
NTF was selected with β = 1 to compare the HALS NTF
algorithms with the NMWF and the lsNTF algorithm. For
the case of 100 components, the reconstruction tensors ex-
plained 98.24 %, 97.83 % and 74.47% of the variation of the
original tensor, for the β-HALS NTF, NMWF and lsNTF,
respectively (Table 3). Note that the estimated components
by using β-HALS NTF (Fig.12(b)) are relatively sparse and
their reconstruction images are very similar to the original
sources (Fig.12(a)).
Computer simulation for the above illustrated exam-
ples confirmed that the proposed algorithms give consistent
and similar results to that obtained using the known “state of
the arts” NMF/NTF algorithms, but our algorithms seem to
be faster and more ecient. In other words, through exten-
sive simulations we have confirmed that the FAST HALS
NTF, α-HALS NTF and β-HALS NTF algorithms are ro-
bust to noise and produce generally better performance and
provide faster convergence speed than existing recently de-
12
IEICE TRANS. FUNDAMENTALS, VOL.Exx–??, NO.xx XXXX 200x
(a) The40thnoisy slice (b) α = 2,
PSNR = 24.17[dB]
(c) β = 2,
PSNR = 27.19[dB]
(d) Noisy tensor (e) α = 2 (f) β = 2
10
15
20
25
30
35
PSNR (dB)
0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
10
15
20
25
30
35
Iteration (times)
PSNR Number of iterations
(g) α-NTF
10
15
20
25
30
35
PSNR (dB)
0.6 0.8 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0
0
5
10
15
20
25
Iteration (times)
PSNR Number of iterations
(h) β-NTF
Fig.7 Illustration of data reconstruction for noisy tensor Y R
51×51×40
+
for Example 4: (a), (b) & (c) surface visualizations of the 40th noisy slice
and its reconstructed slices by α- and β-HALS NTF algorithms (α = 2,
β = 2), respectively; (d)-(f) iso-surface visualizations of noisy tensor and
its reconstructed tensors by α- and β-HALS-NTF algorithms; (g) & (h)
Performance of the HALS NTF algorithms for dierent values of α and β
but for the same desired fit value 96.1%.
(a) Noisy data (b) Volume 99.9% (c)
Iso-surface 99.99%
100 200 300 400 500
100
200
300
400
500
−0.1
0
0.1
0.2
(d) Slice 10
100 200 300 400 500
100
200
300
400
500
0
0.05
0.1
(e) Reconstructed slice
U
(1 )
U
(2 )
U
(3 )
(f) Factors
10
0
10
1
10
2
10
3
10
4
Time in second
FastNTF NMWF lsNTF ALS_B ALS_K
32
40
50
60
70
PSNR in dB
(g) Performance comparison
Fig.8 Illustration of tensor reconstruction by Fast HALS NTF for Ex-
ample 5 with tensor Y R
500×500×500
+
degraded by Gaussian noise with
SNR = 0[dB].
(a) Slice ofamino
acid tensor
(b) Gaussian noise
SNR = 0 [dB]
(c) HALS NTF
99.51%
(d) NMWF
98.76%
U
(1)
U
(2)
U
(3)
(e) Smoothed factors by HALS NTF
U
(1)
U
(2)
U
(3)
(f) Factors by NMWF
Fig.9 Illustration of estimated factors by the FAST-HALS NTF in com-
parison to the NMWF algorithm for three-way decomposition of amino
acid data in Example 6. (a) The first slice of original tensor, (b) The same
slice with hugeGaussian noise, (c)-(d) the reconstructed slices using HALS
NTF and NMWF, (e)-(f) three estimated factors using HALS and NMWF
algorithms (The estimated factors should be as smooth as possible).
0
1
2
3
4
5
6
7
Time in second
98
98.2
98.4
98.6
98.8
99
99.2
99.4
99.6
Variance Explained (%)
Runnning time Variance Explained
Fig.10 Comparison of performance and running time for amino acid
data with tensor Y R
5×201×61
+
corrupted by Gaussian noise with SNR =
0[dB].
ms
Hz
0 50 100 150 200 250 300
20
30
40
50
60
70
(a) Left hand stimuli
ms
Hz
0 50 100 150 200 250 300
20
30
40
50
60
70
(b) Gamma activity of
both stimuli
ms
Hz
0 50 100 150 200 250 300
20
30
40
50
60
70
(c) Right hand stimuli
Fig.11 EEG analysis using the FAST HALS NTF for Example 7 with
factor matrices for U
(1)
for a scalp topographic map (first row), factor U
(2)
for spectral (time-frequency) map (second row) (see [38] for details). Re-
sults are consistent with previous analysis [38] but run time is almost 8
times shorter and fit is slightly better.
CICHOCKI and PHAN: ALGORITHMS FOR NONNEGATIVE MATRIX AND TENSOR FACTORIZATIONS
13
(a) 6 original CBCL images (top) and their reconstructions by 49 compo-
nents (94.81%) (center) and 100 components (98.24%) (bottom).
(b) 49 basis components estimated by β-HALS NTF, 94.95 % (Fit).
Fig.12 Illustration of factorization of 100 CBCL face images into 49
and 100 basis components by using the β-HALS NTF algorithm.
veloped NMF/NTF algorithms.
8. Conclusions and Discussion

The main objective and motivation of this paper is to derive fast and efficient algorithms for NMF/NTF problems. The extended algorithms have been verified on many different benchmarks. The developed algorithms are robust to noisy data and have many potential applications. They are also suitable for large-scale datasets owing to their local learning rules and fast processing speed. The algorithms can be extended to semi-NTF and to sparse PARAFAC using suitable nonlinear projections and regularization terms [17]. These are unique extensions of the standard NMF HALS algorithm and, to the authors' best knowledge, this is the first time such algorithms have been applied and practically implemented for multi-way NTF models. We have implemented the proposed algorithms in MATLAB in our toolboxes NMFLAB/NTFLAB, which will soon be freely available to researchers [5]. The performance of the developed algorithms has been compared with some of the existing NMF and NTF algorithms; the proposed algorithms are shown to be superior in terms of performance, speed and convergence properties.
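As a concrete illustration of the local (column-wise) learning rules mentioned above, the MATLAB sketch below shows a generic HALS-type update for the plain NMF model Y ≈ AX under the squared Euclidean cost. It is a schematic variant with our own names and a fixed number of sweeps, not a verbatim excerpt of the HALS, α-HALS or β-HALS rules derived in the paper, and it omits the multi-layer, sparsity and tensor (NTF) extensions.

```matlab
% Schematic HALS-style NMF sketch: Y (I x T, nonnegative) ~ A (I x J) * X (J x T).
% Columns of A and rows of X are updated sequentially with a nonnegativity
% projection; small lower bounds guard against division by zero.
function [A, X] = hals_nmf_sketch(Y, J, maxiter)
    [I, T] = size(Y);
    A = rand(I, J);  X = rand(J, T);
    for it = 1:maxiter
        W = Y*X';  V = X*X';                 % fixed for this sweep over A
        for j = 1:J
            A(:, j) = max(eps, A(:, j) + (W(:, j) - A*V(:, j)) / max(V(j, j), eps));
        end
        P = A'*Y;  Q = A'*A;                 % fixed for this sweep over X
        for j = 1:J
            X(j, :) = max(eps, X(j, :) + (P(j, :) - Q(j, :)*X) / max(Q(j, j), eps));
        end
    end
end
```

Each column (or row) update solves a one-dimensional nonnegative least-squares subproblem in closed form, which is what makes the rules local and cheap per iteration.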
Table 2 Description of data sets and notation of Examples

No. | Data set | Size | J
4 | L-shaped membrane function, MATLAB logo | 51 × 51 × 40 | 4
5 | X spectra sparse, ACPos24sparse10 and X spectra [36] | 500 × 500 × 500 | 4
6 | Amino acids fluorescence data, claus.mat [37] | 5 × 201 × 61 | 5
7 | ITPC of 14 subjects during a proprioceptive pull of left and right hand (28 data sets), 64 channels × (61 frequencies × 72 time points) × 28 subjects, tutorialdataset2.set [38] | 64 × 4392 × 28 | 3
8 | MIT CBCL face images | 19 × 19 × 100 | 49, 100
Table 3 Comparison of performance of NTF algorithms for Examples 5–8

Algorithm | Fit (%) Ex.5 | Fit (%) Ex.6 | Fit (%) Ex.7 | Fit (%) Ex.8 | Time (s) Ex.5 | Time (s) Ex.6 | Time (s) Ex.7
FastNTF | 99.9955 | 99.51 | 52.41 | – | 51.73 | 0.93 | 7.08
α-NTF | – | 98.77 | – | – | – | 6.33 | –
β-NTF | 99.9947 | 99.39 | – | 98.24 | 470.53 | 1.85 | –
NMWF† | 99.9918 | 98.76 | 52.38 | 97.83 | 513.37 | 3.16 | 58.19
lsNTF†† | – | 98.06 | 51.33 | 74.47 | – | 3.30 | 4029.84
ALS_B | 99.9953 | 98.53 | 53.17 | – | 145.73 | 2.52 | 67.24
ALS_K | 99.9953 | 98.53 | 53.13 | – | 965.76 | 1.78 | 66.39

† NMWF failed for very noisy data due to large negative entries; we enforced the estimated components to have nonnegative values by half-wave rectifying.
†† lsNTF failed for the large-scale example with a tensor of size 500 × 500 × 500. For the same problem with a reduced tensor dimension of 300 × 300 × 300, lsNTF needed 2829.98 seconds and achieved a fit of 99.9866%, so our algorithm was at least 50 times faster.
Of course, there are still many open theoretical problems, such as the global convergence of the algorithms and the optimal choice of the α and β parameters.
Acknowledgment

The authors would like to thank the associate editor Professor Kazushi Ikeda and the anonymous reviewers for their valuable comments and helpful suggestions, which have greatly improved the quality of this paper.
References
[1] S. Amari, Differential-Geometrical Methods in Statistics, Springer Verlag, 1985.
[2] D.D. Lee and H.S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol.401, pp.788–791, 1999.
[3] A. Cichocki, R. Zdunek, and S.I. Amari, "Hierarchical ALS algorithms for nonnegative matrix and 3D tensor factorization," Springer LNCS, vol.4666, pp.169–176, 2007.
[4] A. Cichocki, R. Zdunek, and S. Amari, "Csiszar's divergences for non-negative matrix factorization: Family of new algorithms," Springer LNCS, vol.3889, pp.32–39, 2006.
[5] A. Cichocki, R. Zdunek, A.H. Phan, and S. Amari, Nonnegative Matrix and Tensor Factorizations and Beyond, Wiley, Chichester, 2009.
[6] M. Mørup, L.K. Hansen, C.S. Herrmann, J. Parnas, and S.M. Arnfred, "Parallel factor analysis as an exploratory tool for wavelet transformed event-related EEG," NeuroImage, vol.29, no.3, pp.938–947, 2006.
[7] F. Miwakeichi, E. Martínez-Montes, P. Valdés-Sosa, N. Nishiyama, H. Mizuhara, and Y. Yamaguchi, "Decomposing EEG data into space-time-frequency components using parallel factor analysis," NeuroImage, vol.22, no.3, pp.1035–1045, 2004.
[8] A. Shashua, R. Zass, and T. Hazan, "Multi-way clustering using super-symmetric non-negative tensor factorization," European Conference on Computer Vision (ECCV), Graz, Austria, May 2006.
[9] J. Sun, D. Tao, and C. Faloutsos, "Beyond streams and graphs: dynamic tensor analysis," Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.374–383, 2006.
[10] M. Heiler and C. Schnoerr, "Controlling sparseness in non-negative tensor factorization," Springer LNCS, vol.3951, pp.56–67, 2006.
[11] T. Hazan, S. Polak, and A. Shashua, "Sparse image coding using a 3D non-negative tensor factorization," International Conference of Computer Vision (ICCV), pp.50–57, 2005.
[12] A. Smilde, R. Bro, and P. Geladi, Multi-way Analysis: Applications in the Chemical Sciences, John Wiley and Sons, New York, 2004.
[13] M. Berry, M. Browne, A. Langville, P. Pauca, and R. Plemmons, "Algorithms and applications for approximate nonnegative matrix factorization," Computational Statistics and Data Analysis, vol.52, no.1, pp.155–173, 2007.
[14] A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S. Amari, "Nonnegative tensor factorization using Alpha and Beta divergences," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP07), Honolulu, Hawaii, USA, pp.1393–1396, April 15–20, 2007.
[15] P. Sajda, S. Du, T. Brown, L. Parra, and R. Stoyanova, "Recovery of constituent spectra in 3D chemical shift imaging using nonnegative matrix factorization," 4th International Symposium on Independent Component Analysis and Blind Signal Separation, Nara, Japan, pp.71–76, April 2003.
[16] T.G. Kolda and B. Bader, "Tensor decompositions and applications," SIAM Review, June 2008.
[17] A. Cichocki, A.H. Phan, R. Zdunek, and L.Q. Zhang, "Flexible component analysis for sparse, smooth, nonnegative coding or representation," Lecture Notes in Computer Science, pp.811–820, Springer, 2008.
[18] A. Cichocki, S. Amari, R. Zdunek, R. Kompass, G. Hori, and Z. He, "Extended SMART algorithms for non-negative matrix factorization," Springer LNAI, vol.4029, pp.548–562, 2006.
[19] A. Cichocki and R. Zdunek, "Regularized alternating least squares algorithms for non-negative matrix/tensor factorizations," Springer LNCS, vol.4493, pp.793–802, June 3–7, 2007.
[20] N.D. Ho, Nonnegative Matrix Factorization - Algorithms and Applications, thèse/dissertation, FSA/INMA - Département d'ingénierie mathématique, 2008.
[21] N.D. Ho, P.V. Dooren, and V. Blondel, "Descent algorithms for nonnegative matrix factorization," Numerical Linear Algebra in Signals, Systems and Control, 2008, to appear.
[22] M. Biggs, A. Ghodsi, and S. Vavasis, "Nonnegative matrix factorization via rank-one downdate," ICML-2008, Helsinki, July 2008.
[23] N. Gillis and F. Glineur, "Nonnegative matrix factorization and underapproximation," SIAM Conference on Optimization, Boston, May 2008. Preprint.
[24] A.H. Phan and A. Cichocki, "Multi-way Nonnegative Tensor Factorization Using Fast Hierarchical Alternating Least Squares Algorithm (HALS)," Proc. of the 2008 International Symposium on Nonlinear Theory and its Applications, Budapest, Hungary, 2008.
[25] A. Cichocki, R. Zdunek, S. Choi, R. Plemmons, and S.I. Amari, "Novel multi-layer nonnegative tensor factorization with sparsity constraints," Springer LNCS, vol.4432, pp.271–280, April 11–14, 2007.
[26] M. Nikolova, "Minimizers of cost-functions involving nonsmooth data-fidelity terms. Application to the processing of outliers," SIAM J. Numer. Anal., vol.40, no.3, pp.965–994, 2002.
[27] P.C. Hansen, "Regularization tools version 3.0 for Matlab 5.2," Numerical Algorithms, vol.20, pp.195–196, 1999.
[28] B.W. Bader and T.G. Kolda, "Algorithm 862: MATLAB tensor classes for fast algorithm prototyping," ACM Trans. Math. Softw., vol.32, no.4, pp.635–653, 2006.
[29] A. Cichocki, A.H. Phan, and C. Caiafa, "Flexible HALS algorithms for sparse non-negative matrix/tensor factorization," Proc. of the 18th IEEE Workshop on Machine Learning for Signal Processing, Cancun, Mexico, October 16–19, 2008.
[30] M. Minami and S. Eguchi, "Robust blind source separation by Beta-divergence," Neural Computation, vol.14, pp.1859–1886, 2002.
[31] M. Mørup, L.K. Hansen, J. Parnas, and S.M. Arnfred, "Decomposing the time-frequency representation of EEG using non-negative matrix and multi-way factorization," tech. rep., 2006.
[32] M.P. Friedlander and K. Hatz, "Computing nonnegative tensor factorizations," Tech. Rep. TR-200621, Dept. Computer Science, University of British Columbia, Vancouver, December 2007. To appear in Optimization Methods and Software.
[33] C.A. Andersson and R. Bro, "The N-way Toolbox for MATLAB," Chemometrics Intell. Lab. Systems, vol.52, pp.1–4, 2000.
[34] D.D. Lee and H.S. Seung, "Algorithms for nonnegative matrix factorization," NIPS, MIT Press, 2001.
[35] C.J. Lin, "Projected gradient methods for non-negative matrix factorization," Neural Computation, vol.19, no.10, pp.2756–2779, October 2007.
[36] A. Cichocki and R. Zdunek, "NMFLAB for Signal and Image Processing," tech. rep., Laboratory for Advanced Brain Signal Processing, BSI, RIKEN, Saitama, Japan, 2006.
[37] R. Bro, "PARAFAC. Tutorial and applications," Special Issue 2nd Internet Conf. in Chemometrics (INCINC'96), Chemom. Intell. Lab. Syst., pp.149–171, 1997.
[38] M. Mørup, L.K. Hansen, and S.M. Arnfred, "ERPWAVELAB: a toolbox for multi-channel analysis of time-frequency transformed event related potentials," Journal of Neuroscience Methods, vol.161, pp.361–368, 2007.
Andrzej Cichocki was born in Poland. He received his M.Sc. (with honors), Ph.D. and Habilitate Doctorate (Dr.Sc.) degrees, all in electrical engineering, from the Warsaw University of Technology (Poland). He is the co-author of four international books and monographs (two of them translated into Chinese): Nonnegative Matrix and Tensor Factorizations and Beyond (J. Wiley, 2009), Adaptive Blind Signal and Image Processing (J. Wiley, 2002), MOS Switched-Capacitor and Continuous-Time Integrated Circuits and Systems (Springer-Verlag, 1989) and Neural Networks for Optimization and Signal Processing (J. Wiley and Teubner Verlag, 1993/94), and the author or co-author of more than two hundred papers. He is Editor-in-Chief of the journal Computational Intelligence and Neuroscience. Currently, he is the head of the Laboratory for Advanced Brain Signal Processing at the RIKEN Brain Science Institute, Japan.
Anh Huy Phan received his B.E. and M.Sc. degrees from the Ho-Chi-Minh City University of Technologies in the area of Electronic Engineering. He worked as Deputy Head of the Research and Development Department, Broadcast Research and Application Center, Vietnam Television, and also taught part-time as a lecturer at Van Hien University, Hong Bang University, and the Electronic and Computer Center of the University of Natural Sciences in Ho-Chi-Minh City, Vietnam, in the areas of Probability and Statistics, Numerical Algorithms, and MATLAB Programming. Currently he is working as technical staff in the Laboratory for Advanced Brain Signal Processing and is doing research towards his Ph.D. degree under the supervision of Professor Cichocki.