On Identifiability of Nonnegative Matrix Factorization

Xiao Fu, Kejun Huang, and Nicholas D. Sidiropoulos
(The first two authors contributed equally.)
Department of Electrical and Computer Engineering, University of Minnesota,
Minneapolis, MN 55455, United States
Email: {xfu, huang663, nikos}@umn.edu

September 5, 2017
Abstract
In this letter, we propose a new identification criterion that guarantees the recovery of
the low-rank latent factors in the nonnegative matrix factorization (NMF) model, under mild
conditions. Specifically, under the proposed criterion, the latent factors are identifiable if
the rows of one factor are sufficiently scattered over the nonnegative orthant, while no structural
assumption is imposed on the other factor except being full-rank. This is by far the mildest
condition under which the latent factors are provably identifiable from the NMF model.
1 Introduction
Nonnegative matrix factorization (NMF) [1, 2] aims to decompose a data matrix into low-rank latent
factor matrices with nonnegativity constraints on (one or both of) the latent matrices. In other
words, given a data matrix $X \in \mathbb{R}^{M\times N}$ and a target rank $r$, NMF tries to find a factorization
model $X = WH^\top$, where $W \in \mathbb{R}^{M\times r}$ and/or $H \in \mathbb{R}^{N\times r}$ take only nonnegative values and
$r \ll \min\{M, N\}$.
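As a quick numerical illustration of the model and of the "trivial ambiguities" discussed below, the following minimal NumPy sketch (with assumed sizes $M = N = 200$ and $r = 5$, not taken from the paper) builds a synthetic instance of $X = WH^\top$ and verifies that permuting and scaling the columns of $W$, with the inverse operation applied to $H$, leaves $X$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, r = 200, 200, 5          # assumed sizes for illustration

# Ground-truth nonnegative factors and the data matrix X = W H^T.
W = rng.random((M, r))
H = rng.random((N, r))
X = W @ H.T

# The "trivial ambiguities": permuting and scaling the columns of W,
# with the inverse scaling/permutation applied to H, leaves X unchanged.
perm = rng.permutation(r)
Pi = np.eye(r)[:, perm]                    # permutation matrix
D = np.diag(rng.uniform(0.5, 2.0, r))      # full-rank diagonal scaling
W2 = W @ Pi @ D
H2 = H @ Pi @ np.linalg.inv(D)
print(np.allclose(X, W2 @ H2.T))           # True
```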
One notable trait of NMF is model identifiability – the latent factors are uniquely identifiable
under some conditions (up to some trivial ambiguities). Identifiability is critical in parameter
estimation and model recovery. In signal processing, many NMF-based approaches have therefore
been proposed to handle problems such as blind source separation [3], spectrum sensing [4],
and hyperspectral unmixing [5, 6], where model identifiability plays an essential role. In machine
learning, identifiability of NMF is also considered essential for applications such as latent mixture
model recovery [7], topic mining [8], and social network clustering [9], where model identifiability
is entangled with interpretability of the results.
Despite the importance of identifiability in NMF, the analytical understanding of this aspect
is still quite limited and many existing identifiability conditions for NMF are not satisfactory in
some sense. Donoho et al. [10], Laurberg et al. [11], and Huang et al. [12] have proven different
sufficient conditions for identifiability of NMF, but these conditions all require that both of the
generative factors $W$ and $H$ exhibit certain sparsity patterns or properties. The machine learning
and remote sensing communities have proposed several factorization criteria and algorithms that
have identifiability guarantees, but these methods heavily rely on the so-called separability condition
[8, 13–18]. The separability condition essentially assumes that one of the two latent factors contains
a (scaled) permutation matrix as a submatrix, which is clearly restrictive in practice. Recently, Fu
et al. [3] and Lin et al. [19] proved that the so-called volume minimization (VolMin) criterion can
identify $W$ and $H$ without any assumption on one factor (say, $W$) except being full-rank, when
the other ($H$) satisfies a condition which is much milder than separability. However, the caveat is
that VolMin also requires that each row of the nonnegative factor sums up to one. This assumption
implies loss of generality, and is not satisfied in many applications.
In this letter, we reveal a new identifiability result for NMF, which is obtained from a delicate
tweak of the VolMin identification criterion. Specifically, we 'shift' the sum-to-one constraint on $H$
from its rows to its columns. As a result, we show that this 'constraint-altered VolMin criterion'
identifies $W$ and $H$ with provable guarantees under conditions that are much more easily satisfied
relative to VolMin. This tweak is seemingly slight, yet the result is significant: putting
sum-to-one constraints on the columns (instead of the rows) of $H$ is without loss of generality, since
the bilinear model $X = WH^\top$ can always be re-written as $X = (WD^{-1})(HD)^\top$, where $D$ is a
full-rank diagonal matrix satisfying $D_{r,r} = 1/\|H_{:,r}\|_1$. Our new result is the only identifiability
condition that does not assume any structure beyond the target rank on $W$ (e.g., a zero pattern
or nonnegativity) and has natural assumptions on $H$ (relative to the restrictive row sum-to-one
assumption in VolMin).
2 Background
To facilitate our discussion, let us formally define identifiability of constrained matrix factorization.
Definition 1 (Identifiability). Consider a data matrix generated from the model $X = W_\natural H_\natural^\top$,
where $W_\natural$ and $H_\natural$ are the ground-truth factors. Let $(W_\star, H_\star)$ be an optimal solution of an
identification criterion,
$$(W_\star, H_\star) = \arg\min_{X = WH^\top}\ g(W, H).$$
If $W_\natural$ and/or $H_\natural$ satisfy some condition such that for all $(W_\star, H_\star)$ we have $W_\star = W_\natural \Pi D$
and $H_\star = H_\natural \Pi D^{-1}$, where $\Pi$ is a permutation matrix and $D$ is a full-rank diagonal matrix, then
we say that the matrix factorization model is identifiable under that condition. (Whereas identifiability
is usually understood as a property of a given model that is independent of the identification
criterion, NMF can be identifiable under a suitable identification criterion, but not under another,
as we will soon see.)
For the 'plain NMF' model [1, 10, 12, 20], the identification criterion $g(W, H)$ is 1 (or $\infty$) if $W$
or $H$ has a negative element, and 0 otherwise. Assuming that $X$ can be perfectly factored under
the postulated model, the above is equivalent to the popular least-squares NMF formulation:
$$\min_{W \geq 0,\ H \geq 0}\ \left\|X - WH^\top\right\|_F^2. \tag{1}$$
Several sufficient conditions for identifiability of (1) have been proposed. Early results in [10, 11]
require that one factor (say, $H$) satisfies the so-called separability condition:

Definition 2 (Separability). A matrix $H \in \mathbb{R}^{N\times r}_{+}$ is separable if for every $k = 1, \ldots, r$, there exists
a row index $n_k$ such that $H_{n_k,:} = \alpha_k e_k^\top$, where $\alpha_k > 0$ is a scalar and $e_k$ is the $k$th coordinate
vector in $\mathbb{R}^r$.
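For concreteness, the sketch below (an illustration with assumed sizes, not taken from the paper) constructs a separable $H$ by planting one scaled coordinate-vector row per column, and then checks the condition of Definition 2 numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
N, r = 200, 5

# A separable H (Definition 2): besides arbitrary nonnegative rows,
# H contains, for every k, a row equal to alpha_k * e_k^T.
H = rng.random((N, r))
anchor_rows = rng.choice(N, size=r, replace=False)
alphas = rng.uniform(0.5, 2.0, size=r)
for k, n_k in enumerate(anchor_rows):
    H[n_k] = 0.0
    H[n_k, k] = alphas[k]

def is_separable(H, tol=1e-12):
    """Check Definition 2: for every k, some positive row is supported only on entry k."""
    found = []
    for k in range(H.shape[1]):
        mask = (H[:, k] > tol) & (np.count_nonzero(H > tol, axis=1) == 1)
        found.append(mask.any())
    return all(found)

print(is_separable(H))  # True
```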
With the separability assumption, the works in [10, 11] first revealed the reason behind the
success of NMF in many applications – NMF is unique under some conditions. The downside is that
separability is easily violated in practice – see discussions in [5]. In addition, the conditions in [10, 11]
also require $W$ to exhibit a certain zero pattern, on top of $H$ satisfying separability. This is also
considered restrictive in practice – e.g., in hyperspectral unmixing, the columns $W_{:,r}$ are spectral signatures,
which are always dense. The remote sensing and machine learning communities have come up
with many different separability-based identification methods without assuming zero patterns on
W, e.g., the volume maximization (VolMax) criterion [8, 13] and self-dictionary sparse regression
[8, 15, 16,21, 22], respectively. However, the separability condition was not relaxed in those works.
The stringent separability condition was considerably relaxed by Huang et al. [12] based on a
so-called sufficiently scattered condition from a geometric interpretation of NMF.
Definition 3 (Sufficiently Scattered). A matrix $H \in \mathbb{R}^{N\times r}_{+}$ is sufficiently scattered if
1) $\operatorname{cone}\{H^\top\} \supseteq \mathcal{C}$, and 2) $\operatorname{cone}\{H^\top\}^{*} \cap \operatorname{bd}\,\mathcal{C}^{*} = \{\lambda e_k \mid \lambda \geq 0,\ k = 1, \ldots, r\}$, where
$\mathcal{C} = \{x \mid x^\top\mathbf{1} \geq \sqrt{r-1}\,\|x\|_2\}$, $\mathcal{C}^{*} = \{x \mid x^\top\mathbf{1} \geq \|x\|_2\}$,
$\operatorname{cone}\{H^\top\} = \{x \mid x = H^\top\theta,\ \theta \geq 0\}$ and $\operatorname{cone}\{H^\top\}^{*} = \{y \mid x^\top y \geq 0,\ \forall x \in \operatorname{cone}\{H^\top\}\}$
are the conic hull of $H^\top$ and its dual cone, respectively, and $\operatorname{bd}$ denotes the boundary
of a set.
The main result in [12] is that if both $W$ and $H$ satisfy the sufficiently scattered condition, then
the criterion in (1) has identifiability. This is a notable result since it was the first provable result in
which separability was relaxed for both $W$ and $H$. The sufficiently scattered condition essentially
means that $\operatorname{cone}\{H^\top\}$ contains $\mathcal{C}$ as a subset, which is much more relaxed than separability,
which requires $\operatorname{cone}\{H^\top\}$ to contain the entire nonnegative orthant; see Fig. 1.
On the other hand, the zero-pattern assumptions on $W$ and $H$ are still needed in [12]. Another
line of work removed the zero-pattern assumption from one factor (say, $W$) by using a different
identification criterion [3, 19]:
$$\min_{W\in\mathbb{R}^{M\times r},\ H\in\mathbb{R}^{N\times r}}\ \det\left(W^\top W\right) \tag{2a}$$
$$\text{s.t.}\quad X = WH^\top, \tag{2b}$$
$$\quad\qquad H\mathbf{1} = \mathbf{1},\quad H \geq 0, \tag{2c}$$
where $\mathbf{1}$ is an all-one vector of proper length. Criterion (2) aims at finding the minimum-volume
(measured by determinant) data-enclosing convex hull (or simplex). The main result in [3] is that
if the ground-truth $H \in \{Y \in \mathbb{R}^{N\times r} \mid Y\mathbf{1} = \mathbf{1},\ Y \geq 0\}$ and $H$ is sufficiently scattered, then
the volume minimization (VolMin) criterion identifies the ground-truth $W$ and $H$. This very
intuitive result is illustrated in Fig. 2: if $H$ is sufficiently scattered in the nonnegative orthant,
the $X_{:,n}$'s are sufficiently spread in the convex hull spanned by the columns of $W$, where the convex
hull of $W$ is defined as $\operatorname{conv}\{W_{:,1},\ldots,W_{:,r}\} = \{x \mid x = W\theta,\ \theta \geq 0,\ \mathbf{1}^\top\theta = 1\}$. Then,
finding the minimum-volume data-enclosing convex hull recovers the ground-truth $W$. This result
resolves the long-standing Craig's conjecture in remote sensing [23], proposed in the 1990s.
The VolMin identifiability condition is intriguing since it completely sets $W$ free – there is
no assumption on the ground-truth $W$ except for being full-column rank, and it has a very mild
assumption on $H$. There is a caveat, however: the VolMin criterion needs an extra condition on
the ground-truth $H$, namely $H\mathbf{1} = \mathbf{1}$, so that the columns of $X$ all live in the convex hull (not the
conic hull, as in the general NMF case) spanned by the columns of $W$ – otherwise, the geometric intuition
of VolMin in Fig. 2 does not make sense. Many NMF problem instances stemming from applications
do not naturally satisfy this assumption. The common trick is to normalize the columns of $X$ using
their $\ell_1$-norms [14], so that an equivalent model in which this sum-to-one assumption holds is enforced
– but the normalization only works when the ground-truth $W$ is also nonnegative. This raises a natural
question: can we keep the advantages of VolMin identifiability (namely, no structural
assumption on $W$ other than low rank, and no separability requirement on $H$) without assuming
sum-to-one on the rows of the ground-truth $H$?
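The following sketch (sizes assumed for illustration) demonstrates the $\ell_1$-normalization trick numerically under the assumption $W \geq 0$, $H \geq 0$: after normalizing the columns of $X$, the data follow an equivalent model whose $H$-factor has rows summing to one. When $W$ contains negative entries, $\|X_{:,n}\|_1 \neq \mathbf{1}^\top X_{:,n}$ in general, and this construction breaks down.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, r = 50, 100, 4   # assumed sizes for illustration

# Nonnegative ground truth; rows of H need NOT sum to one.
W = rng.random((M, r))
H = rng.random((N, r))
X = W @ H.T

# l1-normalize the columns of X; since W >= 0 and H >= 0, this yields an
# equivalent model X_bar = W_bar @ H_bar.T in which every row of H_bar
# sums to one (the VolMin model).
col_l1 = np.abs(X).sum(axis=0)
X_bar = X / col_l1

W_bar = W / W.sum(axis=0)                       # l1-normalized columns of W
H_bar = (H * W.sum(axis=0)) / col_l1[:, None]   # re-scaled coefficients

print(np.allclose(X_bar, W_bar @ H_bar.T))      # True
print(np.allclose(H_bar.sum(axis=1), 1.0))      # True: rows sum to one
```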
3 Main Result

Our main result in this letter fixes the issues with the VolMin identifiability. Specifically, we
show that, with a careful and delicate tweak to the VolMin criterion, one can identify the model
$X = WH^\top$ without assuming the sum-to-one condition on the rows of $H$:

Theorem 1. Assume that $X = W_\natural H_\natural^\top$, where $W_\natural \in \mathbb{R}^{M\times r}$ and $H_\natural \in \mathbb{R}^{N\times r}$, and that
$\operatorname{rank}(X) = \operatorname{rank}(W_\natural) = r$. Also, assume that $H_\natural$ is sufficiently scattered. Let $(W_\star, H_\star)$ be an
optimal solution of the following identification criterion:
$$\min_{W\in\mathbb{R}^{M\times r},\ H\in\mathbb{R}^{N\times r}}\ \det\left(W^\top W\right) \tag{3a}$$
$$\text{s.t.}\quad X = WH^\top, \tag{3b}$$
$$\quad\qquad H^\top\mathbf{1} = \mathbf{1},\quad H \geq 0. \tag{3c}$$
Then, $W_\star = W_\natural\Pi D$ and $H_\star = H_\natural\Pi D^{-1}$ must hold, where $\Pi$ and $D$ denote a permutation
matrix and a full-rank diagonal matrix, respectively.
At first glance, the identification criterion in (3) looks similar to VolMin in (2). The difference
lies between (2c) and (3c): in (3c), we 'shift' the sum-to-one condition to the columns of $H$, rather
than enforcing it on the rows of $H$. This simple modification makes a big difference in terms of
generality: enforcing the columns of $H$ to sum to one entails no loss of generality, since in bilinear
factorization models like $X = WH^\top$ there is always an intrinsic scaling ambiguity of the columns.
In other words, one can always assume that the columns of $H$ are scaled by a diagonal matrix and then
counter-scale the corresponding columns of $W$, which will not affect the factorization model; i.e.,
$X = (WD^{-1})(HD)^\top$ still holds. Therefore, there is no need for data normalization to enforce this
constraint, as opposed to the VolMin case. In fact, the identifiability of (3) holds for $H^\top\mathbf{1} = \rho\mathbf{1}$ for
any $\rho > 0$ – we use $\rho = 1$ only for notational simplicity.
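In code, enforcing (3c) amounts to a one-line rescaling of the latent factors and never touches the data. A minimal sketch (with assumed sizes, and a Gaussian $W$ to emphasize that no nonnegativity of $W$ is needed):

```python
import numpy as np

rng = np.random.default_rng(4)
M, N, r = 50, 100, 4

W = rng.standard_normal((M, r))   # W may contain negative entries here
H = rng.random((N, r))
X = W @ H.T

# Enforce (3c) without touching the data: rescale the columns of H to sum
# to one and counter-scale the columns of W.
d = H.sum(axis=0)                 # column sums of H (these are the 1/D_{k,k} values)
H3 = H / d
W3 = W * d

print(np.allclose(X, W3 @ H3.T))          # True: the model is unchanged
print(np.allclose(H3.sum(axis=0), 1.0))   # True: H^T 1 = 1 now holds
```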
We should mention that avoiding normalization is a significant advantage in practice even when
$W \geq 0$ holds, especially when there is noise – since normalization may amplify noise. It was
also reported in the literature that normalization degrades the performance of text mining significantly,
since it usually worsens the conditioning of the data matrix [24]. In addition, as mentioned, in
applications where $W$ naturally contains negative elements (e.g., channel identification in MIMO
communications), even normalization cannot enforce the VolMin model.
It is worth noting that the criterion in Theorem 1 has by far the most relaxed identifiability
conditions for nonnegative matrix factorization. A detailed comparison of different NMF conditions
is given in Table 1, where one can see that Criterion (3) works under the mildest conditions on
both $H$ and $W$. Specifically, compared to plain NMF, the new criterion does not assume any
structure on $W$; compared to VolMin, it does not need the sum-to-one assumption on the rows of
$H$ or nonnegativity of $W$; and it does not need separability, an advantage inherited from VolMin.
Figure 1: Illustration of the separability (left) and sufficiently scattered (right) conditions, assuming
that the viewer stands in the nonnegative orthant and faces the origin. The dots are rows of
$H$; the triangle is the nonnegative orthant; the circle is $\mathcal{C}$; the shaded region is $\operatorname{cone}\{H^\top\}$. Clearly,
separability is a special case of the sufficiently scattered condition.
Table 1: Different assumptions on $W$ and $H$ for identifiability of NMF.

  Method                   Assumption on W                Assumption on H
  plain NMF [12]           NN, Suff.                      NN, Suff.
  Self-dict [15, 16, 22]   NN, Full-rank (Full-rank)      NN, Sep. (NN, Sep., row sto)
  VolMax [8, 13]           NN, Full-rank (Full-rank)      NN, Sep. (NN, Sep., row sto)
  VolMin [3, 19]           NN, Full-rank (Full-rank)      NN, Suff. (NN, Suff., row sto)
  Proposed                 Full-rank                      NN, Suff.

Note: 'NN' means nonnegativity, 'Sep.' means separability, 'Suff.' denotes the sufficiently scattered
condition, and 'sto' denotes sum-to-one. The conditions in parentheses give an alternative set of
conditions for the corresponding approach.
In the next section, we will show the proof of Theorem 1. We should remark that although
shifting the sum-to-one constraint to the columns of $H$ seems to be a 'small' modification to
VolMin, the result in Theorem 1 was not obvious at all before we proved it: with this modification, the
clear geometric intuition of VolMin no longer holds – the objective in (3) no longer corresponds to
the volume of a data-enclosing convex hull and has no geometric interpretation any more. Indeed,
our proof for the new criterion is purely algebraic rather than geometric.
4 Proof of Theorem 1

The major insights of the proof evolved from the VolMin work of the authors and its variants
[3, 25, 26], with proper modifications to show Theorem 1. To proceed, let us first introduce the
following classic lemma in convex analysis:

Lemma 1 ([27]). If $\mathcal{K}_1$ and $\mathcal{K}_2$ are convex cones and $\mathcal{K}_1 \subseteq \mathcal{K}_2$, then $\mathcal{K}_2^{*} \subseteq \mathcal{K}_1^{*}$, where $\mathcal{K}_1^{*}$ and $\mathcal{K}_2^{*}$
denote the dual cones of $\mathcal{K}_1$ and $\mathcal{K}_2$, respectively.
Our purpose is to show that the optimization criterion in (3) outputs $W_\star$ and $H_\star$ that are
column-scaled and permuted versions of the ground-truth $W_\natural$ and $H_\natural$. To this end, let us denote by
$(\widehat{W} \in \mathbb{R}^{M\times r}, \widehat{H} \in \mathbb{R}^{N\times r})$ a feasible solution of Problem (3), i.e.,
$$X = \widehat{W}\widehat{H}^\top,\quad \widehat{H}^\top\mathbf{1} = \mathbf{1},\quad \widehat{H} \geq 0. \tag{4}$$
Figure 2: The intuition of VolMin. The shaded region is $\operatorname{conv}\{W_{:,1},\ldots,W_{:,r}\}$; the dots are the
$X_{:,n}$'s; the dashed lines are enclosing convex hulls; the bold dashed lines comprise the minimum-volume
data-enclosing convex hull.
Note that $X = W_\natural H_\natural^\top$ and that $W_\natural$ has full column rank. In addition, since $H_\natural$ is sufficiently
scattered, $\operatorname{rank}(H_\natural) = r$ also holds [26, Lemma 1]. Consequently, there exists an invertible $A \in \mathbb{R}^{r\times r}$
such that
$$\widehat{H} = H_\natural A,\qquad \widehat{W} = W_\natural A^{-\top}. \tag{5}$$
This is because $\widehat{W}$ and $\widehat{H}$ have to have full column rank, and thus $H_\natural$ and $\widehat{H}$ span the same subspace;
otherwise, $\operatorname{rank}(X) = r$ cannot hold. Since (4) holds, one can see that
$$\widehat{H}^\top\mathbf{1} = A^\top H_\natural^\top\mathbf{1} = A^\top\mathbf{1} = \mathbf{1}. \tag{6}$$
By (4), we also have $H_\natural A \geq 0$. By the definition of a dual cone, $H_\natural A \geq 0$ means that $a_i \in
\operatorname{cone}\{H_\natural^\top\}^{*}$, where $a_i$ is the $i$th column of $A$, for all $i = 1, \ldots, r$. Because $H_\natural$ is sufficiently scattered,
we have $\mathcal{C} \subseteq \operatorname{cone}\{H_\natural^\top\}$ which, together with Lemma 1, leads to $\operatorname{cone}\{H_\natural^\top\}^{*} \subseteq \mathcal{C}^{*}$. This further
implies that $a_i \in \mathcal{C}^{*}$, which means $\|a_i\|_2 \leq \mathbf{1}^\top a_i$, by the definition of $\mathcal{C}^{*}$. Then we have the following
chain:
$$|\det(A)| \leq \prod_{i=1}^{r}\|a_i\|_2 \tag{7a}$$
$$\leq \prod_{i=1}^{r}\mathbf{1}^\top a_i \tag{7b}$$
$$= 1, \tag{7c}$$
where (7a) is Hadamard's inequality, (7b) is due to $a_i \in \mathcal{C}^{*}$, and (7c) follows from (6), i.e.,
$\mathbf{1}^\top a_i = 1$ for all $i$.
Now, suppose that equality is attained, i.e., $|\det(A)| = 1$; then all the inequalities in (7) hold
with equality, and, specifically, (7b) holding with equality means that the columns of $A$ lie on the
boundary of $\mathcal{C}^{*}$. Recall that $a_i \in \operatorname{cone}\{H_\natural^\top\}^{*}$; since $H_\natural$ is sufficiently scattered, the second
requirement in Definition 3 gives $\operatorname{cone}\{H_\natural^\top\}^{*} \cap \operatorname{bd}\,\mathcal{C}^{*} = \{\lambda e_k \mid \lambda \geq 0,\ k = 1, \ldots, r\}$; therefore
the $a_i$'s can only be (scaled) $e_k$'s, and by (6) the scaling must equal one. In other words, $A$ can only
be a permutation matrix.
Suppose now that an optimal solution $H_\star$ of (3) is not a column permutation of $H_\natural$. Since $W_\natural$ and
$H_\natural$ are clearly feasible for (3), this means that $\det(W_\star^\top W_\star) \leq \det(W_\natural^\top W_\natural)$. We also know that for
every feasible solution, including $(W_\star, H_\star)$, Eq. (5) holds, which means $H_\star = H_\natural A$ and
$W_\star = W_\natural A^{-\top}$ hold for a certain invertible $A \in \mathbb{R}^{r\times r}$. Since $H_\natural$ is sufficiently scattered, by (7) and
the argument above, together with our assumption that $A$ is not a permutation matrix, we have $|\det(A)| < 1$. However,
the optimal objective of (3) is then
$$\det(W_\star^\top W_\star) = \det(A^{-1}W_\natural^\top W_\natural A^{-\top}) = \det(A^{-1})\det(W_\natural^\top W_\natural)\det(A^{-\top}) = |\det(A)|^{-2}\det(W_\natural^\top W_\natural) > \det(W_\natural^\top W_\natural),$$
which contradicts our first assumption that $(W_\star, H_\star)$ is an optimal solution of (3). Therefore, $H_\star$
must be a column permutation of $H_\natural$. Q.E.D.
As a remark, the proof of Theorem 1 follows the same rationale as that of the VolMin identifiability
in [3]. The critical change is that we have made use of the relationship between a sufficiently
scattered $H$ and the inequality chain in (7). This inequality appeared in [25, 26] but was not related
there to the bilinear matrix factorization criterion in (3) – which might be by far the most important
application of this inequality. The interesting and surprising point is that, by this simple yet delicate
tweak, the identifiability criterion can cover a substantially wider range of applications, which
naturally involve $W$'s that are not nonnegative.
5 Validation and Discussion

The identification criterion in (3) is a nonconvex optimization problem. In particular, the bilinear
constraint $X = WH^\top$ is not easy to handle. However, the existing work-arounds for handling
VolMin can all be employed to deal with Problem (3). One popular method for VolMin is to first
take the singular value decomposition (SVD) of the data, $X = U\Sigma V^\top$, where $U \in \mathbb{R}^{M\times r}$, $\Sigma \in \mathbb{R}^{r\times r}$,
and $V \in \mathbb{R}^{N\times r}$. Then, $V^\top = \widetilde{W}H^\top$ holds, where $\widetilde{W} \in \mathbb{R}^{r\times r}$ is invertible, because $V$ and $H$ span
the same range space. One can then use (3) to identify $H$ from the data model $\widetilde{X} = V^\top = \widetilde{W}H^\top$.
Since $\widetilde{W}$ is square and nonsingular, it has an inverse $Q = \widetilde{W}^{-1}$. The identification criterion in
(3) can then be recast as
$$\max_{Q\in\mathbb{R}^{r\times r}}\ |\det(Q)| \quad \text{s.t.}\quad Q\widetilde{X}\mathbf{1} = \mathbf{1},\quad Q\widetilde{X} \geq 0.$$
This reformulated problem is much more handy from an optimization point of view. To be specific,
one can fix all the columns of $Q$ except one, e.g., $q_i$. Then the objective w.r.t. $q_i$ is a linear function, i.e.,
$\det(Q) = \sum_{k=1}^{r}(-1)^{i+k}Q_{k,i}\det(\bar{Q}_{k,i}) = p^\top q_i$, where $p = [p_1, \ldots, p_r]^\top$, $p_k = (-1)^{i+k}\det(\bar{Q}_{k,i})$
for $k = 1, \ldots, r$, and $\bar{Q}_{k,i}$ is the submatrix of $Q$ with the $k$th row and $i$th column removed. Maximizing $|p^\top q_i|$ subject
to linear constraints can be solved via maximizing both $p^\top q_i$ and $-p^\top q_i$, followed by picking the
solution that gives the larger absolute objective. Then, cyclically updating the columns of $Q$ results
in an alternating optimization (AO) algorithm. Similar SVD- and AO-based solvers were proposed
to handle VolMin and its variants in [25, 26, 28], and empirically good results have been observed.
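Below is a minimal sketch of this SVD-plus-alternating-LP procedure, using NumPy and SciPy's `linprog`. It is an illustration rather than the authors' implementation: for simplicity it updates $Q$ one row at a time (with the constraints written as $Q\widetilde{X} \geq 0$ and $Q\widetilde{X}\mathbf{1} = \mathbf{1}$, each row of $Q$ only enters its own constraints, so every subproblem is a small linear program), and the function name, initialization, and iteration count are our own assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def det_maximization_ao(X, r, n_iters=50, seed=0):
    """Sketch of the SVD + alternating-LP procedure described above.

    After a rank-r truncated SVD X ~ U S V^T, criterion (3) is recast as
    maximizing |det(Q)| subject to Q Xt >= 0 and Q Xt 1 = 1, with Xt = V^T.
    Q is updated one row at a time; each update is a linear program.
    """
    rng = np.random.default_rng(seed)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Xt = Vt[:r, :]                  # r x N
    N = Xt.shape[1]
    x_sum = Xt.sum(axis=1)          # Xt @ 1, used in the equality constraint

    Q = rng.standard_normal((r, r))
    for _ in range(n_iters):
        for i in range(r):
            # Cofactor expansion of det(Q) along row i: det(Q) = p^T q_i.
            p = np.array([
                (-1) ** (i + k) * np.linalg.det(np.delete(np.delete(Q, i, 0), k, 1))
                for k in range(r)
            ])
            best = None
            for sign in (+1.0, -1.0):
                # linprog minimizes, so minimize -sign * p^T q.
                res = linprog(
                    c=-sign * p,
                    A_ub=-Xt.T, b_ub=np.zeros(N),               # q^T Xt >= 0
                    A_eq=x_sum[None, :], b_eq=np.array([1.0]),  # q^T (Xt 1) = 1
                    bounds=[(None, None)] * r,
                    method="highs",
                )
                if res.success and (best is None or abs(p @ res.x) > abs(p @ best)):
                    best = res.x
            if best is not None:
                Q[i] = best
    H_hat = (Q @ Xt).T                       # estimated H (N x r), columns sum to one
    W_hat = X @ np.linalg.pinv(H_hat).T      # least-squares fit of W given H
    return W_hat, H_hat
```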
Note that the AO procedure is not the only possible solver here. When the data is very noisy, one
can reformulate the problem in (3) as
$$\min_{W,\ H^\top\mathbf{1} = \mathbf{1},\ H \geq 0}\ \left\|X - WH^\top\right\|_F^2 + \lambda\det\left(W^\top W\right),$$
where $\lambda > 0$ balances the determinant term and the data fidelity. Many algorithms for regularized NMF
can be employed and modified to handle the above.
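For completeness, here is a rough sketch of one way to attack the regularized formulation above with alternating projected-gradient steps. The fixed step size, the initialization, and the standard simplex-projection routine are assumptions made purely for illustration; this is not the algorithm used in the paper, and it carries no convergence guarantee as written.

```python
import numpy as np

def project_columns_to_simplex(H):
    """Euclidean projection of each column of H onto the probability simplex."""
    N, r = H.shape
    out = np.empty_like(H)
    for k in range(r):
        v = H[:, k]
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        rho = np.nonzero(u - css / np.arange(1, N + 1) > 0)[0][-1]
        theta = css[rho] / (rho + 1.0)
        out[:, k] = np.maximum(v - theta, 0.0)
    return out

def regularized_fit(X, r, lam=1e-3, step=1e-3, n_iters=2000, seed=0):
    """Sketch: minimize ||X - W H^T||_F^2 + lam * det(W^T W)
    s.t. H^T 1 = 1, H >= 0, via alternating projected-gradient steps."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    W = rng.standard_normal((M, r))
    H = project_columns_to_simplex(rng.random((N, r)))
    for _ in range(n_iters):
        R = W @ H.T - X
        # Gradient of det(W^T W) w.r.t. W is 2 * det(W^T W) * W * (W^T W)^{-1}.
        G = W @ np.linalg.inv(W.T @ W)
        grad_W = 2 * R @ H + lam * 2 * np.linalg.det(W.T @ W) * G
        W = W - step * grad_W
        grad_H = 2 * R.T @ W
        H = project_columns_to_simplex(H - step * grad_H)
    return W, H
```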
An illustrative simulation is shown in Table 2 to showcase the soundness of the theorem. In this
simulation, we generate $X = W_\natural H_\natural^\top$ with $r = 5, 10$ and $M = N = 200$. We tested several cases:
1) $W_\natural \geq 0$, $H_\natural \geq 0$, and both $W_\natural$ and $H_\natural$ are sufficiently scattered; 2) $W_\natural \geq 0$, $H_\natural \geq 0$, and $H_\natural$
is sufficiently scattered but $W_\natural$ is completely dense; 3) $W_\natural$ follows the i.i.d. normal distribution,
and $H_\natural \geq 0$ is sufficiently scattered. We generate sufficiently scattered factors following [29] – i.e.,
we generate the elements of a factor following the uniform distribution between zero and one and
randomly zero out 35% of its elements. This way, the obtained factor is empirically sufficiently
scattered with overwhelming probability. We employ the algorithm for fitting-based NMF in [20],
the VolMin algorithm in [30], and the algorithm described above to handle the new criterion, respectively.
We measure the performance of the different approaches via the mean-squared error (MSE)
of the estimated $\widehat{H}$, defined as
$$\mathrm{MSE} = \min_{\pi\in\Pi}\ \frac{1}{r}\sum_{k=1}^{r}\left\|\frac{(H_\natural)_{:,k}}{\|(H_\natural)_{:,k}\|_2} - \frac{\widehat{H}_{:,\pi(k)}}{\|\widehat{H}_{:,\pi(k)}\|_2}\right\|_2^2,$$
where $\Pi$ is the set of all permutations of $\{1, 2, \ldots, r\}$. The results are obtained by averaging 50
random trials.

Table 2: MSEs of the estimated $\widehat{H}$.

  Method               case 1 (sp. W)   case 2 (den. W)   case 3 (Gauss. W)
  Plain (r = 5)        5.49E-05         0.0147            0.7468
  VolMin (r = 5)       1.36E-08         7.31E-10          1.0406
  Proposed (r = 5)     7.32E-18         7.78E-18          8.44E-18
  Plain (r = 10)       4.82E-04         0.0403            0.8003
  VolMin (r = 10)      8.64E-09         8.66E-09          1.2017
  Proposed (r = 10)    6.54E-18         5.02E-18          6.38E-18
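The data-generation recipe and the permutation-invariant MSE metric described above can be sketched as follows. The brute-force search over permutations is only practical for small $r$ (e.g., $r = 5$), and `det_maximization_ao` refers to the hypothetical solver sketched earlier, not to the authors' code.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(3)
M = N = 200
r = 5

def sufficiently_scattered_factor(n, r, zero_frac=0.35, rng=rng):
    """Empirically sufficiently scattered factor, following the recipe above:
    uniform [0, 1) entries with roughly 35% of them zeroed out at random."""
    F = rng.random((n, r))
    F[rng.random((n, r)) < zero_frac] = 0.0
    return F

def mse(H_true, H_est):
    """Permutation-invariant MSE between column-normalized factors."""
    Ht = H_true / np.linalg.norm(H_true, axis=0, keepdims=True)
    He = H_est / np.linalg.norm(H_est, axis=0, keepdims=True)
    r = Ht.shape[1]
    best = np.inf
    for perm in permutations(range(r)):
        err = np.mean(np.sum((Ht - He[:, list(perm)]) ** 2, axis=0))
        best = min(best, err)
    return best

# Case 3 of the simulation: Gaussian W, sufficiently scattered H >= 0.
H = sufficiently_scattered_factor(N, r)
W = rng.standard_normal((M, r))
X = W @ H.T

# Hypothetical usage, assuming the det_maximization_ao sketch given earlier:
# W_hat, H_hat = det_maximization_ao(X, r)
# print(mse(H, H_hat))
```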
Table 2 matches our theoretical analysis. All the algorithms work very well in case 1, where
both $W_\natural$ and $H_\natural$ are sparse (sp.) and sufficiently scattered. In case 2, since $W$ is nonnegative yet
dense (den.), plain NMF fails as expected, but VolMin still works, since normalization can help
enforce its model when $W \geq 0$. In case 3, where $W$ follows the i.i.d. normal distribution, VolMin
fails since normalization does not help – while the proposed method still works perfectly.
To conclude, in this letter we discussed the identifiability issues with current NMF approaches.
We proposed a new NMF identification criterion that is a simple yet careful tweak of
the existing volume minimization criterion. We showed that, by slightly modifying the constraints
of VolMin, the identifiability of the proposed criterion holds under the same sufficiently scattered
condition as in VolMin, but the modified criterion covers a much wider range of applications, including
cases where one factor is not nonnegative. This new criterion offers identifiability for the largest
variety of cases among the known results.
References
[1] D. Lee and H. Seung, “Learning the parts of objects by non-negative matrix factorization,”
Nature, vol. 401, no. 6755, pp. 788–791, 1999.
[2] N. Gillis, “The why and how of nonnegative matrix factorization,” Regularization, Optimiza-
tion, Kernels, and Support Vector Machines, vol. 12, p. 257, 2014.
[3] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos, “Blind separation of quasi-stationary
sources: Exploiting convex geometry in covariance domain,” IEEE Trans. Signal Process.,
vol. 63, no. 9, pp. 2306–2320, May 2015.
[4] X. Fu, W.-K. Ma, and N. Sidiropoulos, “Power spectra separation via structured matrix fac-
torization,” IEEE Trans. Signal Process., vol. 64, no. 17, pp. 4592–4605, 2016.
[5] W.-K. Ma, J. Bioucas-Dias, T.-H. Chan, N. Gillis, P. Gader, A. Plaza, A. Ambikapathi, and
C.-Y. Chi, “A signal processing perspective on hyperspectral unmixing,” IEEE Signal Process.
Mag., vol. 31, no. 1, pp. 67–81, Jan 2014.
[6] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N. Sidiropoulos, “Robust volume-minimization
based matrix factorization for remote sensing and document clustering,” IEEE Trans. Signal
Process., vol. 64, no. 23, pp. 6254–6268, 2016.
[7] A. Anandkumar, Y.-K. Liu, D. J. Hsu, D. P. Foster, and S. M. Kakade, “A spectral algorithm
for latent Dirichlet allocation,” in Advances in Neural Information Processing Systems, 2012,
pp. 917–925.
[8] S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu, “A
practical algorithm for topic modeling with provable guarantees,” in International Conference
on Machine Learning (ICML), 2013.
[9] X. Mao, P. Sarkar, and D. Chakrabarti, “On mixed memberships and symmetric nonnegative
matrix factorizations,” in International Conference on Machine Learning, 2017, pp. 2324–2333.
[10] D. Donoho and V. Stodden, “When does non-negative matrix factorization give a correct
decomposition into parts?” in NIPS, vol. 16, 2003.
[11] H. Laurberg, M. G. Christensen, M. D. Plumbley, L. K. Hansen, and S. Jensen, “Theorems
on positive data: On the uniqueness of NMF,” Computational Intelligence and Neuroscience,
vol. 2008, 2008.
[12] K. Huang, N. Sidiropoulos, and A. Swami, “Non-negative matrix factorization revisited:
Uniqueness and algorithm for symmetric decomposition,” IEEE Trans. Signal Process., vol. 62,
no. 1, pp. 211–224, 2014.
[13] T.-H. Chan, W.-K. Ma, A. Ambikapathi, and C.-Y. Chi, “A simplex volume maximization
framework for hyperspectral endmember extraction,” IEEE Trans. Geosci. Remote Sens.,
vol. 49, no. 11, pp. 4177 –4193, Nov. 2011.
[14] N. Gillis and S. Vavasis, “Fast and robust recursive algorithms for separable nonnegative
matrix factorization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 4, pp. 698–714,
April 2014.
[15] X. Fu, W.-K. Ma, T.-H. Chan, and J. M. Bioucas-Dias, “Self-dictionary sparse regression
for hyperspectral unmixing: Greedy pursuit and pure pixel search are related,” IEEE J. Sel.
Topics Signal Process., vol. 9, no. 6, pp. 1128–1141, Sep. 2015.
[16] B. Recht, C. Re, J. Tropp, and V. Bittorf, “Factoring nonnegative matrices with linear pro-
grams,” in Advances in Neural Information Processing Systems, 2012, pp. 1214–1222.
[17] E. Elhamifar and R. Vidal, “Sparse subspace clustering: Algorithm, theory, and applications,”
IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 11, pp. 2765–2781, 2013.
[18] E. Esser, M. Moller, S. Osher, G. Sapiro, and J. Xin, “A convex model for nonnegative matrix
factorization and dimensionality reduction on physical space,” IEEE Trans. Image Process.,
vol. 21, no. 7, pp. 3239 –3252, July 2012.
[19] C.-H. Lin, W.-K. Ma, W.-C. Li, C.-Y. Chi, and A. Ambikapathi, “Identifiability of the simplex
volume minimization criterion for blind hyperspectral unmixing: The no-pure-pixel case,”
IEEE Trans. Geosci. Remote Sens., vol. 53, no. 10, pp. 5530–5546, Oct 2015.
[20] K. Huang and N. Sidiropoulos, “Putting nonnegative matrix factorization to the test: a tutorial
derivation of pertinent Cramer-Rao bounds and performance benchmarking,” IEEE Signal
Process. Mag., vol. 31, no. 3, pp. 76–86, 2014.
[21] X. Fu and W.-K. Ma, “Robustness analysis of structured matrix factorization via self-
dictionary mixed-norm optimization,” IEEE Signal Process. Lett., vol. 23, no. 1, pp. 60–64,
2016.
[22] N. Gillis, “Robustness analysis of Hottopixx, a linear programming model for factoring nonnegative
matrices,” SIAM Journal on Matrix Analysis and Applications, vol. 34, no. 3, pp. 1189–1212, 2013.
[23] M. D. Craig, “Minimum-volume transforms for remotely sensed data,” IEEE Trans. Geosci.
Remote Sens., vol. 32, no. 3, pp. 542–552, 1994.
[24] A. Kumar, V. Sindhwani, and P. Kambadur, “Fast conical hull algorithms for near-separable
non-negative matrix factorization,” in International Conference on Machine Learning (ICML), 2013, pp. 231–239.
[25] K. Huang, N. Sidiropoulos, E. Papalexakis, C. Faloutsos, P. Talukdar, and T. Mitchell, “Principled
neuro-functional connectivity discovery,” in Proc. SIAM SDM 2015, 2015.
[26] K. Huang, X. Fu, and N. D. Sidiropoulos, “Anchor-free correlated topic modeling: Identifia-
bility and algorithm,” in Advances in Neural Information Processing Systems, 2016.
[27] R. Rockafellar, Convex Analysis. Princeton University Press, 1997, vol. 28.
[28] T.-H. Chan, C.-Y. Chi, Y.-M. Huang, and W.-K. Ma, “A convex analysis-based minimum-
volume enclosing simplex algorithm for hyperspectral unmixing,” IEEE Trans. Signal Process.,
vol. 57, no. 11, pp. 4418 –4432, Nov. 2009.
[29] H. Kim and H. Park, “Nonnegative matrix factorization based on alternating nonnegativity
constrained least squares and active set method,” SIAM Journal on Matrix Analysis and
Applications, vol. 30, no. 2, pp. 713–730, 2008.
[30] J. M. Bioucas-Dias, “A variable splitting augmented Lagrangian approach to linear spectral
unmixing,” in Proc. IEEE WHISPERS’09, 2009, pp. 1–4.