arXiv:1310.7529v2 [stat.ML] 26 Nov 2013
Successive Nonnegative Projection Algorithm
for Robust Nonnegative Blind Source Separation
Nicolas Gillis
Department of Mathematics and Operational Research
Faculté Polytechnique, Université de Mons
Rue de Houdain 9, 7000 Mons, Belgium
nicolas.gillis@umons.ac.be
Abstract
In this paper, we propose a new fast and robust recursive algorithm for near-separable nonnega-
tive matrix factorization, a particular nonnegative blind source separation problem. This algorithm,
which we refer to as the successive nonnegative projection algorithm (SNPA), is closely related to
the popular successive projection algorithm (SPA), but takes advantage of the nonnegativity con-
straint in the decomposition. We prove that SNPA is more robust than SPA and can be applied to
a broader class of nonnegative matrices. This is illustrated on some synthetic data sets, and on a
real-world hyperspectral image.
Keywords. Nonnegative matrix factorization, nonnegative blind source separation, separability,
robustness to noise, hyperspectral unmixing, pure-pixel assumption.
1 Introduction
Nonnegative matrix factorization (NMF) has become a widely used tool for analysis of high-dimensional
data. NMF approximately decomposes a nonnegative input data matrix $M \in \mathbb{R}^{m \times n}_+$ into the product
of two nonnegative matrices $W \in \mathbb{R}^{m \times r}_+$ and $H \in \mathbb{R}^{r \times n}_+$ so that $M \approx WH$. Although NMF is NP-hard
in general [30] and ill-posed (see [14] and the references therein), it has been used in many different
areas such as image processing [23], document classification [29], hyperspectral unmixing [27], com-
munity detection [31], and computational biology [9]. Recently, Arora et al. [3] introduced a subclass
of nonnegative matrices, referred to as separable, for which NMF can be solved efficiently (that is,
in polynomial time), even in the presence of noise. This subclass of NMF problems is referred to
as near-separable NMF, and has been shown to be useful in several applications such as document
classification [4, 2, 22, 10], blind source separation [8], video summarization and image classification
[11], and hyperspectral unmixing (see Section 1.1 below).
1.1 Near-Separable NMF
A matrix $M$ is $r$-separable if there exists an index set $\mathcal{K}$ of cardinality $r$ and a nonnegative matrix
$H \in \mathbb{R}^{r \times n}_+$ with $M = M(:,\mathcal{K})\,H$. Equivalently, $M$ is $r$-separable if
$$M = W\,[I_r, H']\,\Pi,$$
where $I_r$ is the $r$-by-$r$ identity matrix, $H'$ is a nonnegative matrix and $\Pi$ is a permutation. Given
a separable matrix, the goal is to identify the $r$ columns of $M$ allowing to reconstruct it perfectly,
that is, to identify the columns of $M$ corresponding to the columns of $W$. In the presence of noise, the
problem is referred to as near-separable NMF and can be stated as follows.

Near-Separable NMF: Given the noisy $r$-separable matrix $\tilde M = WH + N \in \mathbb{R}^{m \times n}$ where $N$ is the
noise, $W \in \mathbb{R}^{m \times r}_+$, $H = [I_r, H']\,\Pi$ with $H' \ge 0$ and $\Pi$ a permutation, recover approximately the
columns of $W$ among the columns of $\tilde M$.
An important application of near-separable NMF is blind hyperspectral unmixing in the presence of
pure pixels [18, 24]: A hyperspectral image is a set of images taken at different wavelengths. It can
be associated with a nonnegative matrix $M \in \mathbb{R}^{m \times n}_+$ where $m$ is the number of wavelengths and $n$
the number of pixels. Each column of $M$ is equal to the spectral signature of a given pixel, that is,
M(i, j) is the fraction of incident light reflected by the jth pixel at the ith wavelength. Under the
linear mixing model, the spectral signature of a pixel is equal to a linear combination of the spectral
signatures of the constitutive materials present in the image, referred to as endmembers. The weights
in that linear combination are nonnegative and sum to one, and correspond to the abundances of the
endmembers in that pixel. If for each endmember, there exists a pixel in the image containing only
that endmember, then the pure-pixel assumption is satisfied. This assumption is equivalent to the
separability assumption: each column of $W$ is the spectral signature of an endmember and is equal to
a column of $M$ corresponding to a pure pixel; see the survey [5] for more details.
Several provably robust algorithms have been proposed to solve the near-separable NMF problem
using, e.g., geometric constructions [3, 2], linear programming [12, 6, 15, 17], or semidefinite program-
ming [25, 19]. In the next section, we briefly describe the successive projection algorithm (SPA) which
is closely related to the algorithm we propose in this paper.
1.2 Successive Projection Algorithm
The successive projection algorithm is a simple but fast and robust recursive algorithm for solving
near-separable NMF; see Algorithm SPA. At each step of the algorithm, the column of the input
matrix $\tilde M$ with maximum $\ell_2$ norm is selected, and then $\tilde M$ is updated by projecting each column onto
the orthogonal complement of the columns selected so far. It was first introduced in [1], and later
proved to be robust in [18].
Algorithm SPA Successive Projection Algorithm [1, 18]

Input: Near-separable matrix $\tilde M = WH + N \in \mathbb{R}^{m \times n}$ satisfying Assumption 1, the number $r$ of
columns to be extracted.
Output: Set of indices $\mathcal{K}$ such that $M(:,\mathcal{K}) \approx W$ (up to permutation).

1: Let $R = \tilde M$, $\mathcal{K} = \{\}$.
2: for $k = 1 : r$ do
3:   $p = \operatorname{argmax}_j \|R_{:j}\|_2$.
4:   $R = \left(I - \frac{R_{:p}R_{:p}^T}{\|R_{:p}\|_2^2}\right) R$.
5:   $\mathcal{K} = \mathcal{K} \cup \{p\}$.
6: end for
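For reference, the following is a minimal numerical sketch of SPA using the $\ell_2$ norm as above; the function name `spa` and the NumPy-based implementation are illustrative and are not the code distributed with this paper.

```python
import numpy as np

def spa(M, r):
    """Successive projection algorithm (sketch).

    M : (m, n) near-separable data matrix.
    r : number of columns to extract.
    Returns the list of extracted column indices.
    """
    R = M.astype(float).copy()   # residual matrix
    K = []
    for _ in range(r):
        # Select the column of the residual with maximum l2 norm.
        p = int(np.argmax(np.sum(R ** 2, axis=0)))
        K.append(p)
        # Project every column onto the orthogonal complement of R[:, p].
        u = R[:, p] / np.linalg.norm(R[:, p])
        R = R - np.outer(u, u @ R)
    return K
```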
Theorem 1 ([18], Th. 3). Let $\tilde M = WH + N$ be a near-separable matrix (see Assumption 1) where $W$
is full rank and $\max_i \|N(:,i)\|_2 \le \epsilon$. If $\epsilon \le \mathcal{O}\left(\frac{\sigma_{\min}(W)}{\sqrt{r}\,\kappa^2(W)}\right)$, then SPA identifies the columns of $W$ up
to error $\mathcal{O}\left(\epsilon\,\kappa^2(W)\right)$, that is, the index set $\mathcal{K}$ identified by SPA satisfies
$$\max_{1 \le j \le r}\ \min_{k \in \mathcal{K}} \left\|W(:,j) - \tilde M(:,k)\right\|_2 \le \mathcal{O}\left(\epsilon\,\kappa^2(W)\right),$$
where $\kappa(W) = \frac{\sigma_{\max}(W)}{\sigma_{\min}(W)}$ is the condition number of $W$.
Moreover, SPA can be generalized by replacing the $\ell_2$ norm (in the selection step of Algorithm SPA) with any strongly
convex function with Lipschitz continuous gradient [18].
SPA is closely related to several hyperspectral unmixing algorithms such as the automatic target
generation process (ATGP) [28] and the successive volume maximization algorithm (SVMAX) [7];
see [18, 24, 13] and the references therein. Although SPA has many advantages (in particular, it is
very fast and rather effective in practice), a drawback is that it requires the matrix $W$ to be full rank.
If the matrix $W$ is ill-conditioned (or, worse, rank deficient), it will most likely fail, even for very small
noise levels.
1.3 Contribution and Outline of the Paper
The main contributions of this paper are the following.

• The introduction of a new fast and robust recursive algorithm for near-separable NMF, referred
to as the successive nonnegative projection algorithm (SNPA), which overcomes the drawback
of SPA that the matrix $W$ has to be full rank (Section 2).

• The robustness analysis of SNPA (Section 3). First, we will show that Theorem 1 applies to
SNPA as well, that is, we show that SNPA is robust to noise when $W$ is full rank. Second, given
a matrix $W$, we define a new parameter $\beta(W) \ge \sigma_{\min}(W)$ which is in general positive even if
$W$ is rank deficient. We also define $\kappa_\beta(W) = \frac{\max_i \|W(:,i)\|_2}{\beta(W)}$ and show that

  Theorem 7. Let $\tilde M$ be a near-separable matrix satisfying Assumption 1 with $\beta(W) > 0$.
  If $\epsilon \le \mathcal{O}\left(\frac{\beta(W)}{\kappa_\beta^3(W)}\right)$, then SNPA with $f(\cdot) = \|\cdot\|_2^2$ identifies the columns of $W$ up to
  error $\mathcal{O}\left(\epsilon\,\kappa_\beta^3(W)\right)$.

  This proves that SNPA applies to a broader class of matrices ($W$ does not need to be full rank).

Finally, we illustrate the effectiveness of SNPA on several synthetic data sets and a real-world
hyperspectral image in Section 4.
1.4 Notations
The unit simplex is defined as $\Delta^m = \left\{x \in \mathbb{R}^m \mid x \ge 0,\ \sum_{i=1}^m x_i \le 1\right\}$, and the dimension $m$ will be
dropped when it is clear from the context. Given a matrix $W \in \mathbb{R}^{m \times r}$, $W(:,j)$, $W_{:j}$ or $w_j$ denotes
its $j$th column. The zero vector is denoted $0$; its dimension will be clear from the context. We also
denote $\|W\|_{1,2} = \max_{\|x\|_1 \le 1} \|Wx\|_2 = \max_i \|W(:,i)\|_2$.
2 Successive Nonnegative Projection Algorithm
In this paper, we propose a new family of fast and robust recursive algorithms to solve near-separable
NMF problems; see Algorithm SNPA. At each step of the algorithm, the column of the input matrix
$\tilde M$ maximizing the function $f$ is selected, and then each column of $\tilde M$ is projected onto the convex hull
of the columns extracted so far using the semi-metric induced by $f$. (A natural choice for the function
$f$ in SNPA is $f(x) = \|x\|_2^2$.) Hence the difference with SPA is the way the projection is performed.
In this work, we perform the projections at step 5 of SNPA (which are convex optimization problems)
using a fast gradient method, which is an optimal first-order method for minimizing convex functions
with Lipschitz continuous gradient [26]; see Appendix A for the implementation details.
Algorithm SNPA Successive Nonnegative Projection Algorithm

Input: Near-separable matrix $\tilde M = WH + N \in \mathbb{R}^{m \times n}$ satisfying Assumption 1, the number $r$ of
columns to be extracted, and a strongly convex function $f$ satisfying Assumption 2.
Output: Set of indices $\mathcal{K}$ such that $\tilde M(:,\mathcal{K}) \approx W$ up to permutation.

1: Let $R = \tilde M$, $\mathcal{K} = \{\}$.
2: for $k = 1 : r$ do
3:   $p = \operatorname{argmax}_j f(R_{:j})$.
4:   $\mathcal{K} = \mathcal{K} \cup \{p\}$.
5:   $R(:,j) = \tilde M(:,j) - \tilde M(:,\mathcal{K})\,H^*(:,j)$ for all $j$, where $H^*(:,j) = \operatorname{argmin}_{x \in \Delta} f\big(\tilde M(:,j) - \tilde M(:,\mathcal{K})\,x\big)$; see Appendix A.
6: end for
Although SNPA is computationally more expensive than SPA, it has the same asymptotic complexity, requiring
a total of $O(mnr)$ operations.
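To make the procedure concrete, here is a minimal sketch of SNPA for $f(x) = \|x\|_2^2$. The inner simplex-constrained least squares problems are solved here with SciPy's general-purpose SLSQP solver purely for brevity (the implementation accompanying this paper uses the fast gradient method of Appendix A), and the helper names are illustrative, not part of the released code.

```python
import numpy as np
from scipy.optimize import minimize

def project_onto_simplex_hull(m_j, B):
    """Return argmin_{x in Delta} ||m_j - B x||_2^2 (stand-in inner solver)."""
    k = B.shape[1]
    obj = lambda x: 0.5 * np.sum((m_j - B @ x) ** 2)
    grad = lambda x: B.T @ (B @ x - m_j)
    cons = [{'type': 'ineq', 'fun': lambda x: 1.0 - np.sum(x)}]  # sum(x) <= 1
    res = minimize(obj, np.zeros(k), jac=grad, bounds=[(0.0, None)] * k,
                   constraints=cons, method='SLSQP')
    return res.x

def snpa(M, r):
    """Successive nonnegative projection algorithm (sketch, f = ||.||_2^2)."""
    M = M.astype(float)
    R = M.copy()              # residual matrix R = M - M[:, K] H
    K = []
    for _ in range(r):
        p = int(np.argmax(np.sum(R ** 2, axis=0)))   # column maximizing f
        K.append(p)
        B = M[:, K]
        # Project every column of M onto the convex hull of the extracted
        # columns and the origin (weights in the unit simplex Delta).
        H = np.column_stack([project_onto_simplex_hull(M[:, j], B)
                             for j in range(M.shape[1])])
        R = M - B @ H
    return K
```

The $n$ inner problems in at most $r$ variables dominate the cost of each step, consistent with the discussion above; the sketch favors readability over efficiency.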
SNPA is also closely related to the fast canonical hull algorithm, referred to as XRAY, from [22].
XRAY is a recursive algorithm for near-separable NMF and projects, at each step, the data points
onto the convex cone of the columns extracted so far. The main differences between XRAY and SNPA
are that (i) XRAY uses another criterion to identify a column of $W$ at each step, (ii) XRAY projects
the data matrix onto the convex cone of the columns extracted so far while SNPA projects onto their
convex hull, and (iii) XRAY performs the projection step with respect to the $\ell_2$ norm, while SNPA
performs the projection with respect to the function $f$.
3 Robustness of SNPA
In this section, we prove robustness of SNPA for any sufficiently small noise. The proofs are closely
related to the robustness analysis of SPA developed in [18].
In Section 3.1, we give the assumptions and definitions needed throughout the paper. In Section 3.2,
we prove that SNPA identifies the columns of $W$ among the columns of $M$ exactly in the noiseless
case, which explains the intuition behind SNPA. In Section 3.3, we derive our key lemmas, which allow
us to show in Section 3.4 that the robustness analysis of SPA from Theorem 1 (which requires $W$ to be full rank)
also applies to SNPA; see Theorems 4 and 5. In Section 3.5, we generalize the analysis to a broader
class of matrices for which $W$ is not required to be full rank; see Theorems 6 and 7.
3.1 Assumptions and Definitions
In this section, we describe the assumptions and definitions useful to prove robustness of SNPA.
Without loss of generality, we will assume throughout the paper that the input matrix has the
following form:
Assumption 1 (Near-Separable Matrix). The separable matrix $M \in \mathbb{R}^{m \times n}$ can be written as
$$M = WH = W\,[I_r, H'],$$
where $W \in \mathbb{R}^{m \times r}$, $H = [I_r, H'] \in \mathbb{R}^{r \times n}_+$, and $H(:,j) \in \Delta$ for all $j$. The near-separable matrix is $\tilde M = M + N$,
where $N$ is the noise with $\|N\|_{1,2} \le \epsilon$.

In fact, any nonnegative near-separable matrix can be put in this form by proper permutation and
normalization (dividing each column of $M$ by its $\ell_1$ norm); see the discussion in [18]. Note that
Assumption 1 does not require $W$ to be nonnegative, hence our results will apply to a broader class than
the nonnegative matrices.

We will also assume that, in SNPA,

Assumption 2. The function $f : \mathbb{R}^m \to \mathbb{R}_+$ is strongly convex with parameter $\mu > 0$, its gradient is
Lipschitz continuous with constant $L$, and its global minimizer is the all-zero vector with $f(0) = 0$.

A function $f$ is strongly convex with parameter $\mu$ if and only if it is convex and, for any $x, y \in \operatorname{dom}(f)$
and for all $\delta \in [0,1]$,
$$f(\delta x + (1-\delta)y) \le \delta f(x) + (1-\delta)f(y) - \frac{\mu}{2}\delta(1-\delta)\|x - y\|_2^2. \qquad (1)$$
Moreover, its gradient is Lipschitz continuous with constant $L$ if and only if, for any $x, y \in \operatorname{dom}(f)$, we
have $\|\nabla f(x) - \nabla f(y)\|_2 \le L\|x - y\|_2$. Convex analysis also tells us that if $f$ satisfies Assumption 2
then, for any $x, y$,
$$f(x) + \nabla f(x)^T(y - x) + \frac{\mu}{2}\|x - y\|_2^2 \ \le\ f(y) \ \le\ f(x) + \nabla f(x)^T(y - x) + \frac{L}{2}\|x - y\|_2^2.$$
In particular, taking $x = 0$, we have, for any $y \in \mathbb{R}^m$,
$$\frac{\mu}{2}\|y\|_2^2 \ \le\ f(y) \ \le\ \frac{L}{2}\|y\|_2^2, \qquad (2)$$
since $f(0) = 0$ and $\nabla f(0) = 0$ (because zero is the global minimizer of $f$); see, e.g., [20]. Note that
this implies $f(x) > 0$ for any $x \ne 0$, hence $f$ induces a semi-metric; the distance between two points $x$
and $y$ being defined by $f(x - y)$.
We will use the following notation for the residual computed at step 5 of Algorithm SNPA.

Definition (Projection and Residual). Given $B \in \mathbb{R}^{m \times s}$ and a function $f$ satisfying Assumption 2,
we define the projection $\mathcal{P}^f_B(x)$ of $x$ onto the convex hull of the columns of $B$ with respect to the semi-metric
induced by $f(\cdot)$ as follows:
$$\mathcal{P}^f_B : \mathbb{R}^m \to \mathbb{R}^m : x \mapsto \mathcal{P}^f_B(x) = By^*, \quad \text{where } y^* = \operatorname{argmin}_{y \in \Delta} f(x - By).$$
We also define the residual $\mathcal{R}^f_B$ of the projection $\mathcal{P}^f_B$ as follows:
$$\mathcal{R}^f_B : \mathbb{R}^m \to \mathbb{R}^m : x \mapsto \mathcal{R}^f_B(x) = x - \mathcal{P}^f_B(x).$$
For a matrix $A \in \mathbb{R}^{m \times r}$, we will denote $\mathcal{P}^f_B(A)$ the matrix whose columns are the projections of the
columns of $A$, that is, $\mathcal{P}^f_B(A)_{:i} = \mathcal{P}^f_B(A_{:i})$ for all $i$, and $\mathcal{R}^f_B(A) = A - \mathcal{P}^f_B(A)$.
Given a matrix $W \in \mathbb{R}^{m \times r}$, we introduce the following notations:
$$\begin{aligned}
\alpha(W) &= \min_{1 \le j \le r,\ x \in \Delta} \|W(:,j) - W(:,\mathcal{J})x\|_2, \quad \text{where } \mathcal{J} = \{1,2,\dots,r\}\setminus\{j\},\\
\nu(W) &= \min_i \|w_i\|_2,\\
\gamma(W) &= \min_{i \ne j} \|w_i - w_j\|_2,\\
\omega(W) &= \min\left(\nu(W),\ \tfrac{1}{2}\gamma(W)\right),\\
K(W) &= \|W\|_{1,2} = \max_i \|w_i\|_2, \quad \text{and}\\
\sigma(W) &= \sigma_r(W) = \sigma_{\min}(W), \text{ the smallest singular value of } W.
\end{aligned}$$
The parameter $\alpha(W)$ is the minimum distance between a column of $W$ and the convex hull of the
other columns of $W$ and the origin. It is interesting to notice that, under Assumption 1, $\alpha(W) > 0$ is
a necessary condition to be able to identify the columns of $W$ among the columns of $M$ (in fact,
$\alpha(W) = 0$ means that a column of $W$ belongs to the convex hull of the other columns of $W$ and the
origin, hence cannot be distinguished from the other data points). It is also a sufficient condition, as
some algorithms are guaranteed to identify the columns of $W$ when $\alpha(W) > 0$, even in the presence of
noise [3, 14, 17].
3.2 Recovery in the Noiseless Case
In this section, we show that, in the noiseless case, SNPA is able to perfectly identify the columns of
$W$ among the columns of $M$. Although this result is implied by our analysis in the noisy case (see
Section 3.4), it gives the intuition behind the working of SNPA.
Lemma 1. Let $B \in \mathbb{R}^{m \times s}$, $A \in \mathbb{R}^{m \times k}$, $z \in \Delta^k$, and $f$ satisfy Assumption 2. Then
$$f\left(\mathcal{R}^f_B(Az)\right) \le f\left(\mathcal{R}^f_B(A)z\right).$$
Proof. Let us denote $Y(:,j) = \operatorname{argmin}_{y \in \Delta} f(A(:,j) - By)$ for all $j$, that is, $\mathcal{R}^f_B(A) = A - BY$. We
have
$$f\left(\mathcal{R}^f_B(Az)\right) = \min_{y \in \Delta} f(Az - By) \le f(Az - BYz) = f\left(\mathcal{R}^f_B(A)z\right).$$
The inequality follows from $Yz \in \Delta$, since $Y(:,j) \in \Delta$ for all $j$ and $z \in \Delta$.
Theorem 2. Let $M = W\,[I_r, H']\,\Pi$ be a separable matrix satisfying Assumption 1 where $W$ is full
rank and $N = 0$, and let $f$ satisfy Assumption 2. Then SNPA applied to the matrix $M$ identifies a set of
indices $\mathcal{K}$ such that, up to permutation, $M(:,\mathcal{K}) = W$.
Proof. We prove the result by induction.

First step. Since the columns of $M$ belong to the convex hull of the columns of $W$ and the origin
(the entries of each column of $H$ are nonnegative and sum to at most one), and since a strongly convex
function is always maximized at a vertex of a polytope, a column of $W$ will be identified at the first
step of SNPA (the origin cannot be extracted since, by assumption, it minimizes $f$). More formally,
for $h \in \Delta^r$, we have
$$f(Wh) = f\left(\sum_{k=1}^r W(:,k)\,h(k) + \left(1 - \sum_{k=1}^r h(k)\right)0\right) \le \sum_{k=1}^r h(k)\, f(W(:,k)) \le \max_k f(W(:,k)).$$
The first inequality follows from convexity of $f$ and the fact that $f(0) = 0$. By strong convexity, see
Equation (1), the first inequality is always strict unless $h = e_j$ for some $j$ (where $e_j$ is the $j$th
column of the identity matrix). The second inequality follows from $h \in \Delta$ and the fact that $f(x) > 0$
for any $x \ne 0$. Since all columns of $M$ can be written as $Wh$ for some $h \in \Delta^r$, this implies that, at
the first step, SNPA extracts the index corresponding to the column of $W$ maximizing $f$.

Induction step. Assume SNPA has extracted some indices $\mathcal{K}$ corresponding to columns of $W$, that is,
$M(:,\mathcal{K}) = W(:,\mathcal{I})$ for some $\mathcal{I}$. We have for any $h \in \Delta^r$ that
$$f\left(\mathcal{R}^f_{W(:,\mathcal{I})}(Wh)\right) \overset{\text{(Lemma 1)}}{\le} f\left(\mathcal{R}^f_{W(:,\mathcal{I})}(W)h\right) \overset{\text{(Ass. 2)}}{\le} \sum_{k=1}^r h(k)\, f\left(\mathcal{R}^f_{W(:,\mathcal{I})}(W(:,k))\right) \overset{(h \in \Delta^r,\ f(x) > 0\ \forall x \ne 0)}{\le} \max_k f\left(\mathcal{R}^f_{W(:,\mathcal{I})}(W(:,k))\right).$$
Finally, since
• $\mathcal{R}^f_{W(:,\mathcal{I})}(W(:,k)) = 0$ for all $k \in \mathcal{I}$ because $0$ is the global minimizer of $f$,
• $\mathcal{R}^f_{W(:,\mathcal{I})}(W(:,k)) \ne 0$ for all $k \notin \mathcal{I}$ because $W$ is full rank,
• $0$ is the global minimum of $f$, and
• the second inequality is strict unless $h = e_j$ for some $j$ by strong convexity of $f$,
SNPA identifies a column of $W$ not extracted yet.

Note that the proof does not need $W$ to be full rank, but only that $\mathcal{R}^f_{W(:,\mathcal{I})}(W(:,k)) \ne 0$ for all
$k \notin \mathcal{I}$, for any subset $\mathcal{I}$ of $\{1,2,\dots,r\}$. This observation will be exploited in Section 3.5 to show
robustness of SNPA when $W$ is not full rank.
3.3 Key Lemmas
In the following, we derive the key lemmas to prove robustness of SNPA.
More precisely, we subdivide the columns of $W$ into two subsets as follows: $W = [A, B]$. The
columns of the matrix $B$ represent the columns of $W$ which have already been approximately identified
by SNPA, while the columns of $A$ are the columns of $W$ yet to be identified. We denote by $\tilde B$ the matrix
whose columns are the columns of $\tilde M$ extracted by SNPA, with $\|B - \tilde B\|_{1,2} \le \bar\epsilon$. Lemmas 2 to 9
lead to a lower bound for $\omega\left(\mathcal{R}^f_{\tilde B}(A)\right)$ using $\sigma([A, B])$; see Corollary 2. Combined with Lemmas 10
to 12, these lemmas will imply that if $W$ is full rank then a column of $W$ not extracted yet (that is,
a column of $A$) is identified approximately at the next step of SNPA; see Theorem 3. Finally, using
that result inductively leads to the robustness of SNPA; see Theorem 4.
Lemma 2. For any $B \in \mathbb{R}^{m \times s}$, $x \in \mathbb{R}^m$, and $f$ satisfying Assumption 2, we have
$$\left\|\mathcal{R}^f_B(x)\right\|_2 \le \sqrt{\frac{L}{\mu}}\,\|x\|_2.$$
Proof. Using Equation (2), we have
$$\left\|\mathcal{R}^f_B(x)\right\|_2^2 \le \frac{2}{\mu}\, f\left(\mathcal{R}^f_B(x)\right) = \frac{2}{\mu}\min_{y \in \Delta} f(x - By) \le \frac{L}{\mu}\min_{y \in \Delta} \|x - By\|_2^2 \le \frac{L}{\mu}\|x\|_2^2,$$
since $0 \in \Delta$.
Lemma 3. Let $B \in \mathbb{R}^{m \times s}$ and $B = \tilde B + N$ with $\|N\|_{1,2} \le \bar\epsilon$, and let $f$ satisfy Assumption 2. Then,
$$\max_j \left\|\mathcal{R}^f_{\tilde B}(b_j)\right\|_2 \le \sqrt{\frac{L}{\mu}}\,\bar\epsilon.$$
Proof. Using Equation (2), we have, for all $j$,
$$\left\|\mathcal{R}^f_{\tilde B}(b_j)\right\|_2^2 \le \frac{2}{\mu}\, f\left(\mathcal{R}^f_{\tilde B}(b_j)\right) = \frac{2}{\mu}\min_{x \in \Delta} f(b_j - \tilde B x) \le \frac{2}{\mu} f(b_j - \tilde b_j) = \frac{2}{\mu} f(n_j) \le \frac{L}{\mu}\|n_j\|_2^2 \le \frac{L}{\mu}\bar\epsilon^2.$$
Lemma 4. Let $A \in \mathbb{R}^{m \times k}$, $B \in \mathbb{R}^{m \times s}$, and $f$ satisfy Assumption 2. Then,
$$\nu\left(\mathcal{R}^f_B(A)\right) \ge \alpha([A, B]).$$
Proof. This follows directly from the definitions of $\alpha$ and $\mathcal{R}^f_B$: in fact,
$$\nu\left(\mathcal{R}^f_B(A)\right) \ge \min_j \min_{y \in \Delta} \|A(:,j) - By\|_2 \ge \alpha([A, B]).$$
Lemma 5. Let $Z$ and $\tilde Z \in \mathbb{R}^{m \times r}$ satisfy $\|Z - \tilde Z\|_{1,2} \le \bar\epsilon$. Then,
$$\alpha(\tilde Z) \ge \alpha(Z) - 2\bar\epsilon.$$
Proof. Denoting $N = Z - \tilde Z$ and $\mathcal{J} = \{1,2,\dots,r\}\setminus\{j\}$, we have
$$\begin{aligned}
\alpha(\tilde Z) &= \min_{1 \le j \le r,\ x \in \Delta} \|\tilde z_j - \tilde Z(:,\mathcal{J})x\|_2\\
&= \min_{1 \le j \le r,\ x \in \Delta} \|z_j - n_j - Z(:,\mathcal{J})x + N(:,\mathcal{J})x\|_2\\
&\ge \min_{1 \le j \le r,\ x \in \Delta} \Big( \|z_j - Z(:,\mathcal{J})x\|_2 - \|n_j\|_2 - \|N(:,\mathcal{J})x\|_2 \Big)\\
&\ge \min_{1 \le j \le r,\ x \in \Delta} \|z_j - Z(:,\mathcal{J})x\|_2 - 2\bar\epsilon\\
&= \alpha(Z) - 2\bar\epsilon,
\end{aligned}$$
since $\max_{x \in \Delta} \|N(:,\mathcal{J})x\|_2 \le \max_{\|x\|_1 \le 1} \|N(:,\mathcal{J})x\|_2 = \|N(:,\mathcal{J})\|_{1,2} \le \bar\epsilon$.
Corollary 1. Let $A \in \mathbb{R}^{m \times k}$, $B$ and $\tilde B \in \mathbb{R}^{m \times s}$ satisfy $\|B - \tilde B\|_{1,2} \le \bar\epsilon$, and $f$ satisfy Assumption 2.
Then,
$$\nu\left(\mathcal{R}^f_{\tilde B}(A)\right) \ge \alpha([A, B]) - \min(s, 2)\,\bar\epsilon.$$
Proof. If $s = 0$, the result follows from Lemma 4 ($B$ is an empty matrix). If $s = 1$, then it is
easily derived using the same steps as in the proof of Lemma 5 ($B$ only has one column). Otherwise,
Lemmas 4 and 5 imply that
$$\nu\left(\mathcal{R}^f_{\tilde B}(A)\right) \ge \alpha\big([A, \tilde B]\big) \ge \alpha([A, B]) - 2\bar\epsilon.$$
Lemma 6. For any $W \in \mathbb{R}^{m \times r}$, $\alpha(W) \ge \sigma(W)$.
Proof. We have
$$\alpha(W) = \min_{1 \le j \le r} \min_{x \in \Delta} \|W(:,j) - W(:,\mathcal{J})x\|_2 \ge \min_{1 \le j \le r}\ \min_{z \in \mathbb{R}^r,\ z(j) = 1} \|Wz\|_2 \ge \min_{z \in \mathbb{R}^r,\ \|z\|_2 \ge 1} \|Wz\|_2 = \sigma(W).$$
Lemma 7. Let $x, y \in \mathbb{R}^m$, $B \in \mathbb{R}^{m \times s}$, and $f$ satisfy Assumption 2. Then
$$\left\|\mathcal{R}^f_B(x)\right\|_2 \ge \sigma([B, x]) \quad \text{and} \quad \left\|\mathcal{R}^f_B(x) - \mathcal{R}^f_B(y)\right\|_2 \ge \sqrt{2}\,\sigma([B, x, y]).$$
Proof. Let us denote $z_x = \operatorname{argmin}_{z \in \Delta} f(x - Bz)$ and $z_y = \operatorname{argmin}_{z \in \Delta} f(y - Bz)$; we have
$$\left\|\mathcal{R}^f_B(x)\right\|_2 = \|x - Bz_x\|_2 \ge \min_{z \in \mathbb{R}^s} \|x + Bz\|_2 = \min_{z \in \mathbb{R}^{s+1},\ z(1) = 1} \|[x, B]z\|_2 \ge \min_{\|z\|_2 \ge 1} \|[x, B]z\|_2 \ge \sigma([x, B]),$$
and
$$\begin{aligned}
\left\|\mathcal{R}^f_B(x) - \mathcal{R}^f_B(y)\right\|_2 &= \|(x - Bz_x) - (y - Bz_y)\|_2 \ge \min_{z \in \mathbb{R}^s} \|x - y + Bz\|_2\\
&= \min_{z \in \mathbb{R}^{s+2},\ z(1)=1,\ z(2)=-1} \|[x, y, B]z\|_2 \ge \min_{\|z\|_2 \ge \sqrt{2}} \|[x, y, B]z\|_2 = \sqrt{2}\,\sigma([x, y, B]).
\end{aligned}$$
Lemma 8 (Singular Value Perturbation, Weyl). Let $\tilde B = B + N \in \mathbb{R}^{m \times s}$ with $s \le m$. Then, for all
$1 \le i \le s$, $\sigma_i(B) - \sigma_i(\tilde B) \le \sigma_{\max}(N) = \|N\|_2 \le \sqrt{s}\,\|N\|_{1,2}$.
Lemma 9. Let $x, y \in \mathbb{R}^m$, $B$ and $\tilde B \in \mathbb{R}^{m \times s}$ be such that $\|\tilde B - B\|_{1,2} \le \bar\epsilon$, and $f$ satisfy Assumption 2.
Then
$$\left\|\mathcal{R}^f_{\tilde B}(x) - \mathcal{R}^f_{\tilde B}(y)\right\|_2 \ge \sqrt{2}\left(\sigma([B, x, y]) - \sqrt{s}\,\bar\epsilon\right).$$
Proof. This follows from Lemmas 7 and 8.
Corollary 2. Let $A \in \mathbb{R}^{m \times k}$, $B$ and $\tilde B \in \mathbb{R}^{m \times s}$ satisfy $\|\tilde B - B\|_{1,2} \le \bar\epsilon$, and $f$ satisfy Assumption 2.
Then,
$$\omega\left(\mathcal{R}^f_{\tilde B}(A)\right) \ge \sigma([A, B]) - \sqrt{2s}\,\bar\epsilon.$$
Proof. Using Lemma 9, we have
$$\frac{1}{2}\gamma\left(\mathcal{R}^f_{\tilde B}(A)\right) = \frac{1}{2}\min_{i \ne j}\left\|\mathcal{R}^f_{\tilde B}(a_i) - \mathcal{R}^f_{\tilde B}(a_j)\right\|_2 \ge \sigma([B, a_i, a_j]) - \sqrt{s}\,\bar\epsilon \ge \sigma([A, B]) - \sqrt{s}\,\bar\epsilon.$$
Using Corollary 1 and Lemma 6, we have
$$\nu\left(\mathcal{R}^f_{\tilde B}(A)\right) = \min_i \left\|\mathcal{R}^f_{\tilde B}(a_i)\right\|_2 \ge \alpha([A, B]) - \min(s, 2)\,\bar\epsilon \ge \sigma([A, B]) - \min(s, 2)\,\bar\epsilon.$$
Since $\sqrt{2s} \ge \min(s, 2)$ for any $s \ge 0$, the proof is complete.
Lemma 10. Let $B \in \mathbb{R}^{m \times s}$, $A \in \mathbb{R}^{m \times k}$, $n \in \mathbb{R}^m$, $z \in \Delta^k$, and $f$ satisfy Assumption 2. Then
$$f\left(\mathcal{R}^f_B(Az + n)\right) \le f\left(\mathcal{R}^f_B(Az) + n\right) \quad \text{and} \quad f\left(\mathcal{R}^f_B(Az + n)\right) \le f\left(\mathcal{R}^f_B(A)z + n\right).$$
Proof. Let us denote $y^* = \operatorname{argmin}_{y \in \Delta} f(Az - By)$ and $Y(:,j) = \operatorname{argmin}_{y \in \Delta} f(A(:,j) - By)$ for all $j$,
that is, $\mathcal{R}^f_B(A) = A - BY$. We have
$$f\left(\mathcal{R}^f_B(Az + n)\right) = \min_{y \in \Delta} f(Az + n - By) \le f(Az - By^* + n) = f\left(\mathcal{R}^f_B(Az) + n\right),$$
and
$$f\left(\mathcal{R}^f_B(Az + n)\right) = \min_{y \in \Delta} f(Az + n - By) \le f(Az - BYz + n) = f\left(\mathcal{R}^f_B(A)z + n\right),$$
where the inequality follows from $y = Yz \in \Delta$, since $Y(:,j) \in \Delta$ for all $j$ and $z \in \Delta$.
Let us also recall two useful lemmas from [18].
Lemma 11 ([18], Lemma 3). Let the function $f$ satisfy Assumption 2. Then, for any $\|x\|_2 \le K$ and
$\|n\|_2 \le \epsilon \le K$, we have
$$f(x) - \epsilon K L \ \le\ f(x + n) \ \le\ f(x) + \frac{3}{2}\epsilon K L.$$
Lemma 12 ([18], Lemma 2). Let $Z = [P, Q]$ where $P \in \mathbb{R}^{m \times k}$ and $Q \in \mathbb{R}^{m \times s}$, and let $f$ satisfy
Assumption 2. If $\nu(P) > 2\sqrt{\frac{L}{\mu}}\,K(Q)$, then, for any $0 \le \delta \le \frac{1}{2}$,
$$f^* = \max_{x \in \Delta} f(Zx) \quad \text{such that } x_i \le 1 - \delta \text{ for } 1 \le i \le k,$$
satisfies
$$f^* \le \max_i f(p_i) - \frac{1}{2}\mu(1 - \delta)\delta\,\omega(P)^2.$$
Moreover, the maximum is attained only at a point $x$ such that $x_i = 1 - \delta$ for some $1 \le i \le k$.
3.4 Robustness of SNPA when $W$ is full rank
Theorem 3 shows that if SNPA has already extracted some columns of $W$ up to error $\bar\epsilon$, then the next
extracted column of $\tilde M$ will be close to a column of $W$ not extracted yet. This will allow us to prove
inductively that SNPA is robust to noise; see Theorem 4.
Theorem 3. Let
• $f$ satisfy Assumption 2, with strong convexity parameter $\mu$, and its gradient have Lipschitz constant $L$;
• $\tilde M$ satisfy Assumption 1 with $\tilde M = M + N = WH + N$, where $W = [A, B]$, $A \in \mathbb{R}^{m \times k}$,
$B \in \mathbb{R}^{m \times s}$, $\|N\|_{1,2} \le \epsilon$, and $H = [I_r, H'] \in \mathbb{R}^{r \times n}_+$ where $H(:,j) \in \Delta$ for all $j$;
• $\tilde B \in \mathbb{R}^{m \times s}$ satisfy $\|B - \tilde B\|_{1,2} \le \bar\epsilon = C\epsilon$, for some $C \ge 0$;
• $W = [A, B]$ be such that $\sigma(W) = \sigma > 0$. We denote $\alpha = \alpha(W)$ and $K = K(W)$;
• $\epsilon$ be sufficiently small so that
$$\epsilon < \min\left(\frac{\sigma^2\mu^{3/2}}{144\,K L^{3/2}},\ \frac{\alpha\mu}{4LC},\ \frac{\sigma}{2C\sqrt{2s}}\right).$$
Then the index $i$ corresponding to a column $\tilde m_i$ of $\tilde M$ that maximizes the function $f\!\left(\mathcal{R}^f_{\tilde B}(\cdot)\right)$ satisfies
$$m_i = Wh_i = [A, B]\,h_i, \quad \text{where } h_i(\ell) \ge 1 - \delta \text{ with } 1 \le \ell \le k, \qquad (3)$$
and $\delta = \frac{72\,\epsilon K L^{3/2}}{\sigma^2\mu^{3/2}}$, which implies
$$\|\tilde m_i - w_\ell\|_2 = \|\tilde m_i - a_\ell\|_2 \le \epsilon + 2K\delta = \epsilon\left(1 + \frac{144\,K^2}{\sigma^2}\frac{L^{3/2}}{\mu^{3/2}}\right). \qquad (4)$$
Proof. First note that $\epsilon \le \frac{\sigma^2\mu^{3/2}}{144\,K L^{3/2}}$ implies $\delta \le \frac{1}{2}$. Then, let us show that
$$\nu\left(\mathcal{R}^f_{\tilde B}(A)\right) > 2\sqrt{\frac{L}{\mu}}\,K\!\left(\mathcal{R}^f_{\tilde B}(B)\right),$$
so that Lemma 12 will apply to $P = \mathcal{R}^f_{\tilde B}(A)$ and $Q = \mathcal{R}^f_{\tilde B}(B)$. Since $\|B - \tilde B\|_{1,2} \le \bar\epsilon$, by Lemma 3,
we have $K\!\left(\mathcal{R}^f_{\tilde B}(B)\right) \le \sqrt{\frac{L}{\mu}}\,\bar\epsilon$. Therefore,
$$\nu\left(\mathcal{R}^f_{\tilde B}(A)\right) \overset{\text{(Corollary 1)}}{\ge} \alpha - 2\bar\epsilon \overset{\left(\bar\epsilon = C\epsilon < \frac{\alpha\mu}{4L},\ L \ge \mu\right)}{>} 2\frac{L}{\mu}\bar\epsilon \overset{\text{(Lemma 3)}}{\ge} 2\sqrt{\frac{L}{\mu}}\,K\!\left(\mathcal{R}^f_{\tilde B}(B)\right).$$
Let us also show that $\omega\left(\mathcal{R}^f_{\tilde B}(A)\right) \ge \frac{\sigma}{2}$. By Corollary 2 and the assumption that $\bar\epsilon = C\epsilon \le \frac{\sigma}{2\sqrt{2s}}$, we
have
$$\omega\left(\mathcal{R}^f_{\tilde B}(A)\right) \ge \sigma - \sqrt{2s}\,\bar\epsilon \ge \frac{\sigma}{2}. \qquad (5)$$
We can now prove Equation (3) by contradiction. Assume the extracted index, say the $i$th, which
maximizes $f\!\left(\mathcal{R}^f_{\tilde B}(\cdot)\right)$ among the columns of $\tilde M$, satisfies $\tilde m_i = m_i + n_i = Wh_i + n_i$ with $h_i(\ell) < 1 - \delta$
for $1 \le \ell \le k$. We have
$$\begin{aligned}
f\left(\mathcal{R}^f_{\tilde B}(\tilde m_i)\right)
&\overset{\text{(Lemma 10)}}{\le} f\left(\mathcal{R}^f_{\tilde B}(W)h_i + n_i\right)\\
&\overset{\text{(Lemma 11)}}{\le} f\left(\mathcal{R}^f_{\tilde B}(W)h_i\right) + \frac{3}{2}\epsilon\, K\!\left(\mathcal{R}^f_{\tilde B}(W)\right) L\\
&\overset{\text{(Ass. 2)}}{<} \max_{x \in \Delta^r,\ x(\ell) \le 1-\delta\ (1 \le \ell \le k)} f\left(\mathcal{R}^f_{\tilde B}(W)x\right) + \frac{3}{2}\epsilon K \sqrt{\frac{L}{\mu}}\,L\\
&\overset{\text{(Lemma 12)}}{\le} \max_j f\left(\mathcal{R}^f_{\tilde B}(a_j)\right) - \frac{1}{2}\mu\delta(1-\delta)\,\omega\!\left(\mathcal{R}^f_{\tilde B}(A)\right)^2 + \frac{3}{2}\epsilon K \frac{L^{3/2}}{\mu^{1/2}}\\
&\overset{\text{(Lem. 10, Eq. (5))}}{\le} \max_j f\left(\mathcal{R}^f_{\tilde B}(\tilde a_j) - n_j\right) - \frac{1}{8}\mu\delta(1-\delta)\sigma^2 + \frac{3}{2}\epsilon K \frac{L^{3/2}}{\mu^{1/2}}\\
&\overset{\text{(Lemma 11)}}{\le} \max_j f\left(\mathcal{R}^f_{\tilde B}(\tilde a_j)\right) - \frac{1}{8}\mu\delta(1-\delta)\sigma^2 + \frac{9}{2}\epsilon K \frac{L^{3/2}}{\mu^{1/2}}, \qquad (6)
\end{aligned}$$
where $\tilde a_j$ is the perturbed column of $M$ corresponding to $w_j$ (that is, $\tilde a_j = w_j + n_j$). The second
inequality follows from Lemma 11 since, by convexity of $\|\cdot\|_2$ and by Lemma 2, we have
$$\left\|\mathcal{R}^f_{\tilde B}(W)h_i\right\|_2 \le \max_i \left\|\mathcal{R}^f_{\tilde B}(w_i)\right\|_2 \le \sqrt{\frac{L}{\mu}}\,K.$$
The third inequality is strict since, by strong convexity of $f$, the maximum is attained at a vertex
with $x(\ell) = 1 - \delta$ for some $1 \le \ell \le k$ at optimality, while we assumed $h_i(\ell) < 1 - \delta$ for $1 \le \ell \le k$. The
last inequality follows from Lemma 11 since
$$\left\|\mathcal{R}^f_{\tilde B}(\tilde a_j)\right\|_2 \le \sqrt{\frac{L}{\mu}}\,\|\tilde a_j\|_2 \le \sqrt{\frac{L}{\mu}}\,(K + \epsilon) \le 2\sqrt{\frac{L}{\mu}}\,K.$$
As $\delta \le \frac{1}{2}$, we have
$$\frac{1}{8}\mu\delta(1-\delta)\sigma^2 \ge \frac{1}{16}\mu\sigma^2\delta = \frac{1}{16}\mu\sigma^2\left(\frac{72\,\epsilon K L^{3/2}}{\sigma^2\mu^{3/2}}\right) = \frac{9}{2}\epsilon K \frac{L^{3/2}}{\mu^{1/2}}.$$
Combining this inequality with Equation (6), we obtain $f\left(\mathcal{R}^f_{\tilde B}(\tilde m_i)\right) < \max_j f\left(\mathcal{R}^f_{\tilde B}(\tilde a_j)\right)$, a contradiction
since $\tilde m_i$ should maximize $f\!\left(\mathcal{R}^f_{\tilde B}(\cdot)\right)$ among the columns of $\tilde M$ and the $\tilde a_j$'s are among the
columns of $\tilde M$.

To prove Equation (4), we use Equation (3) and observe that
$$m_i = (1 - \delta')\,w_\ell + \sum_{k \ne \ell} \beta_k w_k \quad \text{for some } \ell \text{ and } 1 - \delta' \ge 1 - \delta,$$
so that $\sum_{k \ne \ell} \beta_k \le \delta' \le \delta$. Therefore,
$$\|m_i - w_\ell\|_2 = \left\|-\delta' w_\ell + \sum_{k \ne \ell} \beta_k w_k\right\|_2 \le 2\delta \max_j \|w_j\|_2 \le 2K\delta,$$
which gives
$$\|\tilde m_i - w_\ell\|_2 \le \|\tilde m_i - m_i\|_2 + \|m_i - w_\ell\|_2 \le \epsilon + 2K\delta,$$
for some $1 \le \ell \le k$.
We can now prove robustness of SNPA when $W$ is full rank.
Theorem 4. Let $\tilde M = WH + N \in \mathbb{R}^{m \times n}$ satisfy Assumption 1, and let $f$ satisfy Assumption 2 with
strong convexity parameter $\mu$ and its gradient with Lipschitz constant $L$. Let us denote $K = K(W)$,
$\sigma = \sigma(W)$, and let $\|N\|_{1,2} \le \epsilon$ with
$$\epsilon < \min\left(\frac{\alpha\mu}{4L},\ \frac{\sigma}{2\sqrt{2r}}\right)\left(1 + \frac{144\,K^2}{\sigma^2}\frac{L^{3/2}}{\mu^{3/2}}\right)^{-1}. \qquad (7)$$
Let also $\mathcal{K}$ be the index set of cardinality $r$ extracted by Algorithm SNPA. Then there exists a permutation
$\pi$ of $\{1,2,\dots,r\}$ such that
$$\max_{1 \le j \le r} \|\tilde m_{\mathcal{K}(j)} - w_{\pi(j)}\|_2 \le \bar\epsilon = \epsilon\left(1 + \frac{144\,K^2}{\sigma^2}\frac{L^{3/2}}{\mu^{3/2}}\right).$$
Proof. The result follows by using Theorem 3 inductively with
$$C = \left(1 + \frac{144\,K^2}{\sigma^2}\frac{L^{3/2}}{\mu^{3/2}}\right).$$
The matrix $B$ in Theorem 3 corresponds to the columns of $W$ extracted so far by SNPA (note that
at the first step $B$ is the empty matrix), while the columns of $A$ correspond to the columns of $W$ not
extracted yet. By Theorem 3, if the columns of $B$ are at distance at most $\bar\epsilon = C\epsilon$ of some columns of
$W$, then the next extracted column will be a column of the matrix $A$ and be at distance at most $\bar\epsilon$ of
another column of $W$. The result therefore follows by induction.

Note that $\epsilon < \frac{\sigma}{2C\sqrt{2r}}$ implies $\epsilon < \frac{\sigma^2\mu^{3/2}}{144\,K L^{3/2}}$ since $\sigma \le K$ and
$$\epsilon < \frac{\sigma}{C} = \frac{\sigma}{1 + \frac{144\,K^2}{\sigma^2}\frac{L^{3/2}}{\mu^{3/2}}} \le \frac{\sigma^3\mu^{3/2}}{144\,K^2 L^{3/2}} \le \frac{\sigma^2\mu^{3/2}}{144\,K L^{3/2}}.$$
Finally, SNPA with $f(\cdot) = \|\cdot\|_2^2$ satisfies the same error bound as SPA when $W$ is full rank:

Theorem 5. Let $\tilde M$ be a near-separable matrix satisfying Assumption 1 with the matrix $W$ full rank.
If $\epsilon \le \mathcal{O}\left(\frac{\sigma_{\min}(W)}{\sqrt{r}\,\kappa^2(W)}\right)$, then Algorithm SNPA with $f(\cdot) = \|\cdot\|_2^2$ identifies all the columns of $W$ up to
error $\mathcal{O}\left(\epsilon\,\kappa^2(W)\right)$.
Proof. This follows directly from $K(W) \le \sigma_{\max}(W)$, $\alpha(W) \ge \sigma(W)$ (Lemma 6), Theorem 3 and the
fact that $\mu = L = 2$ for $f(x) = \|x\|_2^2$.
3.5 Generalization to Rank-Deficient $W$

SNPA can be applied to a broader class of near-separable matrices: in fact, the full rank assumption
on the matrix $W \in \mathbb{R}^{m \times r}$ in Theorem 5 is not necessary for SNPA to be robust to noise. We now define
the parameter $\beta(W)$, which will replace $\sigma(W)$ in the robustness analysis of SNPA.
Definition (Parameter $\beta$). Given $W \in \mathbb{R}^{m \times r}$ and $f$ satisfying Assumption 2, we define
$$\nu_\beta(W) = \min_{i,\ \mathcal{J} \subseteq \{1,2,\dots,r\}\setminus\{i\}} \left\|\mathcal{R}^f_{W(:,\mathcal{J})}(w_i)\right\|_2,$$
$$\gamma_\beta(W) = \min_{i \ne j,\ \mathcal{J} \subseteq \{1,2,\dots,r\}\setminus\{i, j\}} \left\|\mathcal{R}^f_{W(:,\mathcal{J})}(w_i) - \mathcal{R}^f_{W(:,\mathcal{J})}(w_j)\right\|_2,$$
and
$$\beta(W) = \min\left(\nu_\beta(W),\ \tfrac{1}{2}\gamma_\beta(W)\right).$$
The quantity $\beta(W)$ is the minimum between
• the norms of the residuals of the projections of the columns of $W$ onto the convex hull of any
subset of other columns of $W$, and
• the distances between these residuals.
For example, if the columns of $W$ are the vertices of a triangle in the plane ($W \in \mathbb{R}^{2 \times 3}$), SPA can
only extract two columns (the residual will be equal to zero after two steps because $\operatorname{rank}(W) = 2$)
while, in most cases, $\beta(W) > 0$ and SNPA is able to correctly identify the three vertices, even in the
presence of noise. In particular, $\beta(W)$ is positive for full rank matrices:

Lemma 13. For any $W \in \mathbb{R}^{m \times r}$, $\beta(W) \ge \sigma(W)$.
Proof. This follows directly from Lemma 7.

However, the condition $\beta(W) > 0$ is not necessarily satisfied for any set of vertices $w_i$, hence
SNPA is not robust to noise for all matrices $W$ with $\alpha(W) > 0$. For example, in the case of a triangle
in the plane, $\beta(W) = 0$ if and only if the residuals of the projections of two columns of $W$ onto the
segment joining the origin and the last column of $W$ are equal to one another (this requires that they
are on the same side of, and at the same distance from, that segment). This is the case for example for the
following matrix
$$W = \begin{pmatrix} 4 & 1 & 3\\ 0 & 1 & 1 \end{pmatrix},$$
with $\beta(W) = 0$ while $\alpha(W) > 0$, and any data point on the segment $[W(:,2),\ W(:,3)]$ could be
extracted at the second step of SNPA. In fact, we have
$$\mathcal{R}^f_{W(:,1)}\big(W(:,2)\big) = \mathcal{R}^f_{W(:,1)}\big(W(:,3)\big) = \begin{pmatrix}0\\1\end{pmatrix}.$$
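This degenerate case is easy to check numerically: for $f = \|\cdot\|_2^2$, projecting a point onto the convex hull of the origin and a single column reduces to a clipped one-dimensional least squares problem. The following sketch (with an illustrative helper name, not code from the paper) reproduces the two identical residuals for the matrix above.

```python
import numpy as np

def residual_on_segment(x, w):
    """Residual of projecting x onto conv{0, w} for f = ||.||_2^2."""
    t = np.clip(np.dot(x, w) / np.dot(w, w), 0.0, 1.0)  # optimal weight in [0, 1]
    return x - t * w

W = np.array([[4.0, 1.0, 3.0],
              [0.0, 1.0, 1.0]])
r2 = residual_on_segment(W[:, 1], W[:, 0])
r3 = residual_on_segment(W[:, 2], W[:, 0])
print(r2, r3)   # both equal [0. 1.]: beta(W) = 0 although alpha(W) > 0
```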
We can link $\beta(W)$ and $\alpha(W)$ as follows.

Lemma 14. For any $W \in \mathbb{R}^{m \times r}$, $\alpha(W) \ge \sqrt{\frac{\mu}{L}}\,\beta(W)$.
Proof. Denoting $\mathcal{J} = \{1,2,\dots,r\}\setminus\{j\}$, we have
$$\begin{aligned}
\alpha(W) &= \min_{1 \le j \le r}\min_{x \in \Delta} \|w_j - W(:,\mathcal{J})x\|_2\\
&\ge \sqrt{\frac{2}{L}}\,\min_{1 \le j \le r}\min_{x \in \Delta} \sqrt{f\big(w_j - W(:,\mathcal{J})x\big)}\\
&= \sqrt{\frac{2}{L}}\,\min_{1 \le j \le r} \sqrt{f\left(\mathcal{R}^f_{W(:,\mathcal{J})}(w_j)\right)}\\
&\ge \sqrt{\frac{\mu}{L}}\,\min_{1 \le j \le r}\left\|\mathcal{R}^f_{W(:,\mathcal{J})}(w_j)\right\|_2 \ \ge\ \sqrt{\frac{\mu}{L}}\,\beta(W).
\end{aligned}$$
In the following, we get rid of $\sigma(W)$ in the robustness analysis of SNPA and replace it with $\beta(W)$.
In Theorems 3 and 4, $\sigma(W)$ was used to lower bound $\omega\left(\mathcal{R}^f_{\tilde B}(A)\right)$; see in particular Equation (5).
Hence we need to lower bound $\omega\left(\mathcal{R}^f_{\tilde B}(A)\right)$ using $\beta(W)$. The remainder of the proofs follows exactly
the same steps as the proofs of Theorems 3 and 4, and we do not repeat them here.
Theorem 6. Let $\tilde M = WH + N \in \mathbb{R}^{m \times n}$ be a near-separable matrix satisfying Assumption 1, and let
$f$ satisfy Assumption 2 with strong convexity parameter $\mu$ and its gradient with Lipschitz constant $L$.
Let us denote $K = K(W)$, $\beta = \beta(W)$, and let $\|N\|_{1,2} \le \epsilon$ with
$$\epsilon < \min\left(\frac{\beta^2\mu^{3/2}}{144\,K L^{3/2}},\ \frac{\alpha\mu}{4LC},\ \frac{\beta^2\mu}{128\,K L C}\right), \qquad (8)$$
with $C = 1 + \frac{144\,K^2}{\beta^2}\frac{L^{3/2}}{\mu^{3/2}}$. Let also $\mathcal{K}$ be the index set of cardinality $r$ extracted by Algorithm SNPA.
Then there exists a permutation $\pi$ of $\{1,2,\dots,r\}$ such that
$$\max_{1 \le j \le r} \|\tilde m_{\mathcal{K}(j)} - w_{\pi(j)}\|_2 \le \bar\epsilon = C\epsilon.$$
Proof. Using the first four assumptions of Theorem 3 (that is, all assumptions of Theorem 3 but the
upper bound on $\epsilon$) and denoting $\beta = \beta(W)$, we show in Appendix B that
$$\omega\left(\mathcal{R}^f_{\tilde B}(A)\right) \ge \beta - 2\sqrt{\frac{6KL\bar\epsilon}{\mu}}.$$
Therefore, if
$$2\sqrt{\frac{6KL\bar\epsilon}{\mu}} \le \frac{\beta}{2}, \quad \text{that is, if } \bar\epsilon = C\epsilon \le \frac{\beta^2\mu}{96\,KL},$$
then $\omega\left(\mathcal{R}^f_{\tilde B}(A)\right) \ge \frac{\beta}{2}$. Hence $\sigma(W)$ can be replaced with $\beta(W)$ in Equations (5), (6) and in the
following derivations, and Theorem 3 applies under the same conditions except for $\delta = \frac{72\,\epsilon K L^{3/2}}{\beta^2\mu^{3/2}}$, and
the condition on $\epsilon$ given by Equation (8).
Let us denote $1 \le \kappa_\beta(W) = \frac{K(W)}{\beta(W)} \le \kappa(W)$.
Theorem 7. Let $\tilde M$ be a near-separable matrix (see Assumption 1) where $W$ satisfies $\beta(W) > 0$.
If $\epsilon \le \mathcal{O}\left(\frac{\beta(W)}{\kappa_\beta^3(W)}\right)$, then Algorithm SNPA with $f(\cdot) = \|\cdot\|_2^2$ identifies all the columns of $W$ up to error
$\mathcal{O}\left(\epsilon\,\kappa_\beta^3(W)\right)$.
Proof. This follows from Theorem 6, Lemma 14 and $\mu = L = 2$ for $f(\cdot) = \|\cdot\|_2^2$.

Note that the bounds are cubic in $\kappa_\beta(W)$ while they were quadratic in $\kappa(W)$. The reason is that
the lower bound on $\omega\left(\mathcal{R}^f_{\tilde B}(A)\right)$ based on $\beta(W)$ is of the form $\beta(W) - \mathcal{O}(\sqrt{\bar\epsilon})$, while the one based on
$\sigma(W)$ was of the form $\sigma(W) - \mathcal{O}(\sqrt{r}\,\bar\epsilon)$. However, the dependence on $r$ has disappeared.
Remark 1. Because SNPA extracts the columns of $W$ in a specific order, the parameter $\beta(W)$ could
be replaced by the following larger parameter. Assume without loss of generality that the columns of
$W$ are ordered in such a way that the $k$th column of $W$ is extracted at the $k$th step of SNPA. Then,
$\beta(W)$ in the robustness analysis of SNPA can be replaced with the larger
$$\beta'(W) = \min\left(\min_{i < k}\left\|\mathcal{R}^f_{W(:,1:i)}(w_k)\right\|_2,\ \frac{1}{2}\min_{1 \le i \le r-2,\ i < k \ne l}\left\|\mathcal{R}^f_{W(:,1:i)}(w_k) - \mathcal{R}^f_{W(:,1:i)}(w_l)\right\|_2\right).$$
It is also interesting to notice that $\beta(W)$ could be equal to zero for some functions $f$ while being positive
for others. Hence, ideally, the function $f$ should be chosen such that $\beta(W)$ is maximized. Note however
that $W$ is unknown and $\beta(W)$ is expensive to compute (although $\beta'(W)$ could be used instead), so the
problem seems rather challenging. This is a topic for further research.
3.6 Improvements using Post-Processing, Pre-Conditioning and Outlier Detection

It is possible to improve the performance of SNPA in the same way as was done for SPA:

• A first possibility is to use post-processing [2, Alg. 4]. Let $\mathcal{K}$ be the set of indices extracted by
SNPA, and denote $\mathcal{K}(k)$ the index extracted at step $k$. For $k = 1, 2, \dots, r$, the post-processing
  - projects each column of the data matrix onto the convex hull of $M(:,\mathcal{K}\setminus\{\mathcal{K}(k)\})$,
  - identifies the column of the corresponding projected matrix maximizing $f(\cdot)$ (say the $k'$th), and
  - updates $\mathcal{K} = \mathcal{K}\setminus\{\mathcal{K}(k)\} \cup \{k'\}$.
  This allows one to improve the bound on the error of Theorem 1 from $\mathcal{O}\left(\epsilon\,\kappa^2(W)\right)$ to $\mathcal{O}(\epsilon\,\kappa(W))$.

• A second possibility is to pre-condition the input near-separable matrix [19], making the condition
number of $W$ constant, while multiplying the error by a factor of at most $\sigma_{\min}^{-1}(W)$. This allows
one to improve the bound on the noise of Theorem 1 from $\epsilon \le \mathcal{O}\left(\frac{\sigma_{\min}(W)}{\sqrt{r}\,\kappa^2(W)}\right)$ to $\epsilon \le \mathcal{O}\left(\frac{\sigma_{\min}(W)}{r\sqrt{r}}\right)$
and the bound on the error from $\mathcal{O}\left(\epsilon\,\kappa^2(W)\right)$ to $\mathcal{O}(\epsilon\,\kappa(W))$.

• A third possibility for improvement is to deal with outliers. They will be identified by SNPA
along with the columns of $W$, and can be discarded in a second step by computing the optimal
weights needed to reconstruct all columns of the input matrix with the extracted columns [18, 17].
We do not focus in this paper on these improvements as they are straightforward applications of
existing techniques. Our focus in the next section is rather to show the better performance of SNPA
compared to the original SPA.
4 Numerical Experiments
In this section, we compare the following algorithms:

• SPA: the successive projection algorithm; see Algorithm SPA.
• SNPA: the successive nonnegative projection algorithm; see Algorithm SNPA with $f(x) = \|x\|_2^2$.
• XRAY: a recursive algorithm similar to SNPA [22]. It extracts columns recursively, and projects
the data points onto the convex cone generated by the columns extracted so far. (We use in this
paper the variant referred to as max.)

Our goal is two-fold:
1. Illustrate our theoretical result, namely that SNPA is more robust than SPA; see Section 4.1.
2. Show that SNPA can be used successfully on real-world hyperspectral images; see Section 4.2,
where the popular Urban data set is used for comparison.

The Matlab code is available at https://sites.google.com/site/nicolasgillis/. All tests
are performed using Matlab on a laptop with an Intel CORE i5-3210M CPU @ 2.5GHz and 6GB of RAM.
4.1 Synthetic Data Sets
For well-conditioned matrices, $\beta(W)$ and $\sigma(W)$ are close to one another (in fact, $\beta(I_r) = \sigma(I_r) = 1$), in
which case we observed that SPA and SNPA provide very similar results (in most cases, they extract
the same index set). Therefore, in this section, our focus is on ill-conditioned instances, and we perform
an experiment very similar to the third and fourth experiments in [18] to assess the robustness to noise
of the different algorithms. We generate the near-separable matrices as follows.
• We take $m = r = 20$.
• Each entry of the matrix $W \in [0,1]^{m \times r}$ is drawn uniformly at random from the interval $[0,1]$. Then
the compact singular value decomposition $(U, S, V^T)$ of $W$ is computed (using the function
svds(M,r) of Matlab), and $W$ is replaced with $U\Sigma V^T$ where $\Sigma$ is a diagonal matrix with
$\Sigma(i,i) = \alpha^{i-1}$ ($1 \le i \le r$) and $\alpha^{r-1} = 1000$, so that $\kappa(W) = 1000$. (Note that $W$ is not
necessarily nonnegative, a situation handled by the above near-separable NMF algorithms; only
$H$ needs to be nonnegative, see Assumption 1.)
• The matrices $H'$ and $N$ are generated in two different ways:
  1. Dirichlet. $H = [I_r, I_r, H']$ so that each column of $W$ is repeated twice, while $H'$ has 200
  columns (hence $n = 240$) generated at random following a Dirichlet distribution with
  parameters drawn uniformly at random from the interval $[0,1]$. Each entry of the noise $N$ is
  drawn following a normal distribution $\mathcal{N}(0,1)$ and then multiplied by the parameter $\delta$.
  2. Middle Points. $H = [I_r, H']$ where each column of $H'$ has exactly two nonzero entries
  equal to 0.5 (hence $n = r + \binom{r}{2} = 210$). The columns of $W$ are not perturbed (that is,
  $N(:,1{:}r) = 0$) while the remaining ones are perturbed towards the outside of the convex
  hull of the columns of $W$, that is, $N(:,i) = \delta\,(M(:,i) - \bar w)$ where $\bar w = \frac{1}{r}\sum_{i=1}^r w_i$ and $\delta$ is a
  parameter. (A short generation sketch for this setting is given below.)
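The following minimal sketch illustrates this data generation for the 'Middle points' setting; the variable names and the use of NumPy are illustrative, and this is not the exact script used for the figures.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
m = r = 20

# Ill-conditioned W: prescribe singular values 1, alpha, ..., alpha^(r-1)
# with alpha^(r-1) = 1000 so that kappa(W) = 1000.
W = rng.uniform(0.0, 1.0, size=(m, r))
U, _, Vt = np.linalg.svd(W, full_matrices=False)
alpha = 1000.0 ** (1.0 / (r - 1))
W = U @ np.diag(alpha ** np.arange(r)) @ Vt

# 'Middle points': the identity columns plus all mid-points of pairs of columns.
pairs = list(combinations(range(r), 2))
Hp = np.zeros((r, len(pairs)))
for c, (i, j) in enumerate(pairs):
    Hp[i, c] = Hp[j, c] = 0.5
H = np.hstack([np.eye(r), Hp])
M = W @ H                               # n = r + r(r-1)/2 = 210 columns

# Perturb only the mid-points towards the outside of conv(W).
delta = 0.01
wbar = W.mean(axis=1, keepdims=True)
N = np.zeros_like(M)
N[:, r:] = delta * (M[:, r:] - wbar)
M_tilde = M + N
```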
For different values of the noise parameter $\delta$ (using logspace(-4,-0.5,100)), we generate 25
matrices of the two types: Figure 1 (resp. Figure 2) displays the percentage of columns correctly
identified by the different algorithms for the experiment 'Dirichlet' (resp. 'Middle points'). Table 1 and
Table 2 give the robustness and the average running time for both experiments.

Table 1: Robustness for the 'Dirichlet' experiment, that is, largest value of $\epsilon$ for which all (resp. 95%
of the) columns of $W$ are correctly identified, and average running time in seconds of the different
near-separable NMF algorithms.

                        SPA            SNPA           XRAY
Robustness (100%)   1.92 · 10^-4   3.05 · 10^-3   2.39 · 10^-3
Robustness (95%)    5.95 · 10^-4   8.21 · 10^-3   6.89 · 10^-3
Time (s.)              <0.01           8.53           1.27

SPA is significantly faster than XRAY and SNPA since its projection at each step simply amounts to a matrix-vector
product while, for SNPA and XRAY, the projection requires solving $n$ linearly constrained least
squares problems in $r$ variables. XRAY is faster than SNPA because its projections are simpler to
compute. (Also, a different algorithm is used: XRAY requires solving least squares problems over the
nonnegative orthant, which are solved with an efficient coordinate descent method from [16].)
Figure 1: Comparison of the different near-separable NMF algorithms on ill-conditioned data sets
(‘Dirichlet’ type).
Figure 2: Comparison of the different near-separable NMF algorithms on ill-conditioned data sets
(‘Middle points’ type).
Table 2: Robustness and average running time for the 'Middle points' experiment.

                        SPA            SNPA           XRAY
Robustness (100%)   5.86 · 10^-3   2.34 · 10^-2   2.08 · 10^-4
Robustness (95%)    6.08 · 10^-2   7.37 · 10^-2   4.67 · 10^-2
Time (s.)              <0.01          11.67           1.55

For both experiments ('Dirichlet' and 'Middle points'), SNPA outperforms SPA in terms of robustness,
as expected from our theoretical findings. In fact, SNPA is about ten (resp. five) times more
robust than SPA for the experiment 'Dirichlet' (resp. 'Middle points'), that is, it correctly identifies
all columns of $W$ for a noise parameter $\delta$ ten (resp. five) times larger; see Tables 1 and 2. Moreover,
for the 'Dirichlet' experiment, SNPA identifies significantly more columns of $W$, even for larger noise
levels (which fall outside the scope of our analysis); for example, for $\delta = 0.01$, SNPA correctly identifies
more than 90% of the columns of $W$ while SPA identifies less than 70%.
For the 'Dirichlet' experiment, SNPA is slightly more robust than XRAY while, for the 'Middle
points' experiment, it is significantly more robust. The reason is that XRAY does not deal very well
with data points lying on the faces of the convex hull of the columns of $W$ (in this case, on the middle of
the segment joining two vertices); see the discussion in [22]. In fact, SPA is also more robust than
XRAY on this experiment. However, the three algorithms overall perform similarly on the 'Middle
points' experiment, in the sense that the percentages of columns of $W$ correctly identified do not differ
by more than about five percent for all $\delta \le 0.1$.
4.2 Real-World Hyperspectral Image
In this section, we analyze the Urban data set¹ with $m = 162$ and $n = 307^2 = 94249$. It is mainly
constituted of grass, trees, dirt, road and different roof and metallic surfaces; see Figure 3.
Figure 3: Urban data set taken from an aircraft (army geospatial center) with road surfaces (1), roofs
1 (2), dirt (3), grass (4), trees (5) and roofs 2 (6).
We run the near-separable NMF algorithms to extract $r = 8$ endmembers. On this data set, SPA
took less than half a second to run, SNPA about one minute, and XRAY about half a minute.
Figure 4 displays the extracted spectral signatures, and Figure 5 displays the corresponding abundance
maps, that is, the rows of
$$H^* = \operatorname{argmin}_{H \ge 0} \|M - M(:,\mathcal{K})\,H\|_F,$$
where $\mathcal{K}$ is the set of indices extracted by a given algorithm.

¹Available at http://www.agc.army.mil/.
Figure 4: Spectral signatures of the extracted endmembers (reflectance as a function of the band number), for SPA, SNPA and XRAY.
SPA and SNPA extract six common indices (out of the eight; the first extracted index that differs is the fourth).
Figure 5: Abundances maps corresponding to the extracted indices. From top to bottom: SPA, SNPA
and XRAY.
SNPA performs better than both SPA and XRAY as it is the only algorithm able to distinguish
the grass and the trees (3rd and 8th extracted endmembers), while also identifying the road surfaces (1st), the
dirt (6th) and the roof tops (7th). In particular, the relative error in percent, that is,
$$100 \cdot \min_{H \ge 0} \frac{\|M - M(:,\mathcal{K})\,H\|_F}{\|M\|_F} \in [0, 100],$$
is 9.45 for SPA, 6.82 for XRAY and 5.64 for SNPA. In other words, SNPA is able to identify eight
columns of $M$ which reconstruct $M$ better.
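Given an index set K returned by one of the algorithms, the relative error above can be computed with a column-wise nonnegative least squares fit, for instance as in the following sketch (NNLS via SciPy, column by column; illustrative and slow for very large n, not the Matlab code used for the reported numbers).

```python
import numpy as np
from scipy.optimize import nnls

def relative_error_percent(M, K):
    """100 * min_{H >= 0} ||M - M[:, K] H||_F / ||M||_F."""
    B = M[:, K]
    H = np.column_stack([nnls(B, M[:, j])[0] for j in range(M.shape[1])])
    return 100.0 * np.linalg.norm(M - B @ H, 'fro') / np.linalg.norm(M, 'fro')
```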
Further research includes the comparison of SNPA with other endmember extraction algorithms,
and its incorporation into more sophisticated techniques, e.g., where pre-processing is used to remove
outliers and noise, or where endmember extraction algorithms are used as an initialization for iterative
methods not relying on the pure-pixel assumption; see, e.g., [7] where SPA is used.
4.3 Document Data Sets
Because, as for SPA, SNPA requires normalizing input near-separable matrices that do not satisfy
Assumption 1 (that is, near-separable matrices for which the columns of $H$ do not belong to $\Delta$), it
may introduce distortion in the data set [22]. In particular, the normalization amplifies the noise of
the columns of $M$ with small norm (see the discussion in [17]).

In document data sets, the columns of the matrix $H$ are usually not assumed to belong to the unit
simplex, hence normalization is necessary before applying SNPA. Therefore, XRAY should be preferred,
and it has been observed that, for document data sets, SNPA and SPA perform similarly while XRAY
performs better [21]; see also [22].
5 Conclusion and Further Research
In this paper, we have proposed a new fast and robust recursive algorithm for near-separable NMF,
which we referred to as the successive nonnegative projection algorithm (SNPA). Although computa-
tionally more expensive than the successive projection algorithm (SPA), SNPA can be used to solve
large-scale problems, running in O(mnr) operations, while being more robust and applicable to a
broader class of nonnegative matrices. In particular, SNPA seems to be a good alternative to SPA for
real-world hyperspectral images.
There exist several algorithms that are robust for any near-separable matrix requiring only that $\alpha(W) > 0$
[3, 14, 17], and which are therefore more general than SNPA, which requires $\beta(W) > 0$. In fact, under
Assumption 1, $\alpha(W) > 0$ is a necessary condition for being able to identify the columns of $W$ among
the columns of $\tilde M$. However, these algorithms are computationally much more expensive ($n$ linear
programs in $n$ variables or a single linear program in $n^2$ variables have to be solved). Therefore, it
would be an interesting direction for further research to develop, if possible, an efficient (recursive?)
algorithm provably robust for any near-separable matrix $M = W\,[I_r, H'] + N$ with $\alpha(W) > 0$.
Acknowledgment
The author would like to thank Abhishek Kumar and Vikas Sindhwani (IBM T.J. Watson Research
Center) for motivating him to study the robustness of algorithms based on nonnegative projections,
for insightful discussions and for performing some numerical experiments on document data sets.
The author would also like to thank Wing-Kin Ma (Chinese University of Hong Kong) for insightful
discussions and for suggesting to analyze the noiseless case separately.
A Fast Gradient Method for Least Squares on the Simplex
Algorithm FGM is a fast gradient method to solve
$$\min_{x \in \Delta^r} f(Ax - y), \qquad (9)$$
where $y \in \mathbb{R}^m$ and $A \in \mathbb{R}^{m \times r}$. To achieve an accuracy of $\epsilon$ in the objective function, the algorithm
requires $\mathcal{O}\left(\frac{1}{\sqrt{\epsilon}}\right)$ iterations. In other words, the objective function converges to the optimal value at a
rate $\mathcal{O}\left(\frac{1}{k^2}\right)$, where $k$ is the iteration number.
Algorithm FGM Fast Gradient Method for Solving (9); see [26, p.90]

Input: A point $y \in \mathbb{R}^m$, a matrix $A \in \mathbb{R}^{m \times r}$, a function $f$ whose gradient is Lipschitz continuous
with constant $L_f$, and an initial guess $x \in \Delta^r$.
Output: An approximate projection $Ax \approx \mathcal{P}^f_A(y)$ with $x \approx \operatorname{argmin}_{z \in \Delta^r} f(Az - y)$.

1: $\alpha_0 \in (0,1)$; $z = x$; $L = L_f\,\sigma_{\max}(A)^2$.
2: for $k = 1 : \text{maxiter}$ do
3:   $\bar x = x$.  % Keep the previous iterate in memory.
4:   $x = \mathcal{P}_\Delta\!\left(z - \frac{1}{L}\,A^T\nabla f(Az - y)\right)$.  % $\mathcal{P}_\Delta$ is the projection onto $\Delta$; see Appendix A.1.
5:   $z = x + \beta_k(x - \bar x)$, where $\beta_k = \frac{\alpha_k(1 - \alpha_k)}{\alpha_k^2 + \alpha_{k+1}}$ with $\alpha_{k+1} \ge 0$ s.t. $\alpha_{k+1}^2 = (1 - \alpha_{k+1})\alpha_k^2$.
6: end for
Remark 2. Note that the function $g(x) = f(Ax - y)$ is not necessarily strongly convex, even if $f$ is:
this would require $A$ to be full rank, which we do not assume here. If it were the case, then even
faster methods could be used, and the convergence of Algorithm FGM would become linear [26].
Remark 3 (Stopping Condition).For the numerical experiments in Section 4, we used maxiter =
500 and combined it with a stopping condition based on the evolution of the iterates; see the online
code for more details.
A.1 Projection on the Unit Simplex
In Algorithm FGM, the projection onto the unit simplex $\Delta$ needs to be computed, that is, given
$y \in \mathbb{R}^r$, we have to compute
$$\mathcal{P}_\Delta(y) = x^* = \operatorname{argmin}_{x \in \Delta} \frac{1}{2}\|x - y\|^2.$$
Let us construct the Lagrangian dual corresponding to the sum-to-one constraint (since the problem
above has a Slater point, there is no duality gap):
$$\max_{\mu \ge 0}\ \min_{x \ge 0}\ \frac{1}{2}\|x - y\|^2 - \mu(1 - e^Tx),$$
where $e$ is the all-one vector and $\mu \ge 0$ the Lagrangian multiplier. For $\mu$ fixed, the optimal solution
in $x$ is given by
$$x^* = \max(0,\ y - \mu e).$$
If the sum-to-one constraint is not active, that is, $\sum_i x_i^* < 1$, we must have $\mu = 0$ hence $x^* = \max(0, y)$
(this happens if and only if $\max(0, y) \in \Delta$). Otherwise the value of $\mu$ can be computed by solving the
system $\sum_i x_i^* = 1$ and $x^* = \max(0,\ y - \mu e)$, equivalent to finding $\mu$ satisfying $\sum_{i=1}^r \max(0,\ y_i - \mu) = 1$
($\mu$ can be found easily after having sorted the entries of $y$).
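A compact sketch of this projection and of Algorithm FGM for the particular choice $f = \|\cdot\|_2^2$ (so that $\nabla f(z) = 2z$ and $L_f = 2$) follows; the function names and the fixed iteration count are illustrative, and this is not the Matlab code released with the paper.

```python
import numpy as np

def project_simplex(y):
    """Projection onto Delta = {x >= 0, sum(x) <= 1} (see Appendix A.1)."""
    x = np.maximum(y, 0.0)
    if x.sum() <= 1.0:
        return x
    # The sum-to-one constraint is active: find mu such that
    # sum(max(0, y - mu)) = 1, by scanning the sorted entries of y.
    u = np.sort(y)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u - (css - 1.0) / np.arange(1, len(y) + 1) > 0)[0][-1]
    mu = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(y - mu, 0.0)

def fgm_simplex_ls(A, y, maxiter=500):
    """Fast gradient method (sketch) for min_{x in Delta} ||A x - y||_2^2."""
    r = A.shape[1]
    L = 2.0 * np.linalg.norm(A, 2) ** 2     # L_f * sigma_max(A)^2 with L_f = 2
    x = np.zeros(r)
    z = x.copy()
    alpha = 0.5                              # alpha_0 in (0, 1)
    for _ in range(maxiter):
        x_prev = x
        grad = 2.0 * A.T @ (A @ z - y)       # gradient of ||Az - y||_2^2
        x = project_simplex(z - grad / L)
        # Nesterov momentum parameters (see Algorithm FGM).
        alpha_next = 0.5 * (np.sqrt(alpha ** 4 + 4 * alpha ** 2) - alpha ** 2)
        beta = alpha * (1 - alpha) / (alpha ** 2 + alpha_next)
        z = x + beta * (x - x_prev)
        alpha = alpha_next
    return x
```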
B Lower Bound for $\omega\left(\mathcal{R}^f_{\tilde B}(A)\right)$ depending on $\beta([A, B])$

In this appendix, we derive a lower bound on $\omega\left(\mathcal{R}^f_{\tilde B}(A)\right)$ based on $\beta([A, B])$.
Lemma 15. Let $x \in \mathbb{R}^m$, $B$ and $\tilde B \in \mathbb{R}^{m \times s}$ be such that $\|B - \tilde B\|_{1,2} \le \bar\epsilon \le \|B\|_{1,2}$, and $f$ satisfy
Assumption 2. Then
$$\left\|\mathcal{R}^f_B(x) - \mathcal{R}^f_{\tilde B}(x)\right\|_2^2 \le \frac{12\,L}{\mu}\,\bar\epsilon\,\|B\|_{1,2}.$$
Proof. Let us denote $z = \mathcal{P}^f_B(x)$, $\tilde z = \mathcal{P}^f_{\tilde B}(x)$, and $z^* = \mathcal{P}^f_{[B,\tilde B]}(x)$. We have
$$\left\|\mathcal{R}^f_B(x) - \mathcal{R}^f_{\tilde B}(x)\right\|_2 = \left\|\big(x - \mathcal{P}^f_B(x)\big) - \big(x - \mathcal{P}^f_{\tilde B}(x)\big)\right\|_2 = \left\|\mathcal{P}^f_B(x) - \mathcal{P}^f_{\tilde B}(x)\right\|_2 = \|z - \tilde z\|_2.$$
Since $z^* = [B, \tilde B]\,w^*$ for some $w^* \in \Delta$, there exist $y = Bw$ and $\tilde y = \tilde B\tilde w$ with $w, \tilde w \in \Delta$ such that
$\|z^* - y\|_2 \le \bar\epsilon$ and $\|z^* - \tilde y\|_2 \le \bar\epsilon$. In fact, it suffices to take $w = \tilde w = w^*(1{:}s) + w^*(s{+}1{:}2s)$ since
$$\begin{aligned}
\|z^* - y\|_2 &= \|[B, \tilde B]\,w^* - Bw\|_2\\
&= \|Bw^*(1{:}s) + \tilde Bw^*(s{+}1{:}2s) - Bw^*(1{:}s) - Bw^*(s{+}1{:}2s)\|_2\\
&= \|(\tilde B - B)\,w^*(s{+}1{:}2s)\|_2 \le \|B - \tilde B\|_{1,2} \le \bar\epsilon,
\end{aligned}$$
and similarly for $\tilde y$. Therefore, there exist some $n, \tilde n$ such that $z^* = y + n = \tilde y + \tilde n$ with $\|n\|_2, \|\tilde n\|_2 \le \bar\epsilon$.
By Lemma 11 and the fact that $\|y\|_2 \le \|B\|_{1,2}$ and $\|\tilde y\|_2 \le \|\tilde B\|_{1,2} \le \|B\|_{1,2} + \bar\epsilon \le 2\|B\|_{1,2}$, we have
$$f(z^*) = f(y + n) \ge f(y) - \|B\|_{1,2}\,L\,\bar\epsilon \quad \text{and} \quad f(z^*) = f(\tilde y + \tilde n) \ge f(\tilde y) - 2\|B\|_{1,2}\,L\,\bar\epsilon.$$
Therefore, by definition of $z$ and $\tilde z$,
$$f(z^*) \ge \frac{1}{2}f(y) + \frac{1}{2}f(\tilde y) - \frac{3}{2}\|B\|_{1,2}\,L\,\bar\epsilon \ge \frac{1}{2}f(z) + \frac{1}{2}f(\tilde z) - \frac{3}{2}\|B\|_{1,2}\,L\,\bar\epsilon.$$
Moreover, by definition of $z^*$ and strong convexity of $f$, we obtain
$$f(z^*) \le f\left(\frac{1}{2}z + \frac{1}{2}\tilde z\right) \le \frac{1}{2}f(z) + \frac{1}{2}f(\tilde z) - \frac{\mu}{8}\|z - \tilde z\|_2^2.$$
Hence, combining the above two inequalities, $\|z - \tilde z\|_2^2 \le \frac{12\,L}{\mu}\|B\|_{1,2}\,\bar\epsilon$.
Lemma 16. Let $x, y \in \mathbb{R}^m$, $B$ and $\tilde B \in \mathbb{R}^{m \times s}$ be such that $\|B - \tilde B\|_{1,2} \le \bar\epsilon \le \|B\|_{1,2}$, and $f$ satisfy
Assumption 2. Then
$$\left\|\mathcal{R}^f_{\tilde B}(x) - \mathcal{R}^f_{\tilde B}(y)\right\|_2 \ge \left\|\mathcal{R}^f_B(x) - \mathcal{R}^f_B(y)\right\|_2 - 4\sqrt{\frac{3\,\|B\|_{1,2}\,L}{\mu}\,\bar\epsilon}.$$
Proof. This follows directly from Lemma 15:
$$\begin{aligned}
\left\|\mathcal{R}^f_{\tilde B}(x) - \mathcal{R}^f_{\tilde B}(y)\right\|_2 &= \left\|\mathcal{R}^f_{\tilde B}(x) - \mathcal{R}^f_B(x) + \mathcal{R}^f_B(x) - \mathcal{R}^f_{\tilde B}(y) + \mathcal{R}^f_B(y) - \mathcal{R}^f_B(y)\right\|_2\\
&\ge \left\|\mathcal{R}^f_B(x) - \mathcal{R}^f_B(y)\right\|_2 - 2\sqrt{\frac{12\,\|B\|_{1,2}\,L\,\bar\epsilon}{\mu}}.
\end{aligned}$$
Lemma 17. Let $A \in \mathbb{R}^{m \times k}$, $B$ and $\tilde B \in \mathbb{R}^{m \times s}$ be such that $\|B - \tilde B\|_{1,2} \le \bar\epsilon \le \|B\|_{1,2}$, and $f$ satisfy
Assumption 2. Then
$$\omega\left(\mathcal{R}^f_{\tilde B}(A)\right) \ge \beta([A, B]) - 2\sqrt{\frac{6\,L\,\|B\|_{1,2}\,\bar\epsilon}{\mu}}.$$
Proof. This follows directly from Lemmas 15 and 16. In fact, for all $i$,
$$\beta([A, B]) - \left\|\mathcal{R}^f_{\tilde B}(a_i)\right\|_2 \le \left\|\mathcal{R}^f_B(a_i)\right\|_2 - \left\|\mathcal{R}^f_{\tilde B}(a_i)\right\|_2 \le \left\|\mathcal{R}^f_B(a_i) - \mathcal{R}^f_{\tilde B}(a_i)\right\|_2 \le \sqrt{\frac{12\,L\,\bar\epsilon\,\|B\|_{1,2}}{\mu}},$$
while, for all $i, j$,
$$\frac{1}{2}\left\|\mathcal{R}^f_{\tilde B}(a_i) - \mathcal{R}^f_{\tilde B}(a_j)\right\|_2 \ge \frac{1}{2}\left\|\mathcal{R}^f_B(a_i) - \mathcal{R}^f_B(a_j)\right\|_2 - \frac{4}{2}\sqrt{\frac{3\,\|B\|_{1,2}\,L\,\bar\epsilon}{\mu}} \ge \beta([A, B]) - 2\sqrt{\frac{6\,\|B\|_{1,2}\,L\,\bar\epsilon}{\mu}}.$$
References
[1] Araújo, U., Saldanha, B., Galvão, R., Yoneyama, T., Chame, H., Visani, V.: The successive projections
algorithm for variable selection in spectroscopic multicomponent analysis. Chemometrics and Intelligent
Laboratory Systems 57(2), 65–73 (2001)
[2] Arora, S., Ge, R., Halpern, Y., Mimno, D., Moitra, A., Sontag, D., Wu, Y., Zhu, M.: A practical algorithm
for topic modeling with provable guarantees. In: International Conference on Machine Learning (ICML ’13),
vol. 28, pp. 280–288 (2013)
[3] Arora, S., Ge, R., Kannan, R., Moitra, A.: Computing a nonnegative matrix factorization – provably. In:
Proceedings of the 44th symposium on Theory of Computing, STOC ’12, pp. 145–162 (2012)
[4] Arora, S., Ge, R., Moitra, A.: Learning topic models - going beyond SVD. In: Proceedings of the 53rd
Annual IEEE Symposium on Foundations of Computer Science, FOCS ’12, pp. 1–10 (2012)
[5] Bioucas-Dias, J., Plaza, A., Dobigeon, N., Parente, M., Du, Q., Gader, P., Chanussot, J.: Hyper-
spectral Unmixing Overview: Geometrical, Statistical, and Sparse Regression-Based Approaches (2012).
arXiv:1202.6294v2
[6] Bittorf, V., Recht, B., Ré, E., Tropp, J.: Factoring nonnegative matrices with linear programs. In: Advances
in Neural Information Processing Systems (NIPS ’12), pp. 1223–1231 (2012)
[7] Chan, T.H., Ma, W.K., Ambikapathi, A., Chi, C.Y.: A simplex volume maximization framework for
hyperspectral endmember extraction. IEEE Trans. on Geoscience and Remote Sensing 49(11), 4177–4193
(2011)
[8] Chan, T.H., Ma, W.K., Chi, C.Y., Wang, Y.: A convex analysis framework for blind separation of non-
negative sources. IEEE Trans. on Signal Processing 56(10), 5120–5134 (2008)
[9] Devarajan, K.: Nonnegative Matrix Factorization: An Analytical and Interpretive Tool in Computational
Biology. PLoS Computational Biology 4(7), e1000029 (2008)
[10] Ding, W., Rohban, M., Ishwar, P., Saligrama, V.: Topic discovery through data dependent and random
projections. In: International Conference on Machine Learning (ICML ’13), vol. 28, pp. 471–479 (2013)
[11] Elhamifar, E., Sapiro, G., Vidal, R.: See all by looking at a few: Sparse modeling for finding representative
objects. In: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR ’12), pp. 1600–1607 (2012)
[12] Esser, E., Moller, M., Osher, S., Sapiro, G., Xin, J.: A convex model for nonnegative matrix factorization
and dimensionality reduction on physical space. IEEE Trans. on Image Processing 21(7), 3239–3252 (2012)
[13] Fu, X., Ma, W.K., Chan, T.H., Bioucas-Dias, J., Iordache, M.D.: Greedy algorithms for pure pixel iden-
tification in hyperspectral unmixing: A multiple-measurement vector viewpoint. In: Proc. of European
Signal Processing Conference (EUSIPCO ’13) (2013)
[14] Gillis, N.: Sparse and unique nonnegative matrix factorization through data preprocessing. Journal of
Machine Learning Research 13(Nov), 3349–3386 (2012)
[15] Gillis, N.: Robustness analysis of Hottopixx, a linear programming model for factoring nonnegative matri-
ces. SIAM J. Mat. Anal. Appl. 34(3), 1189–1212 (2013)
[16] Gillis, N., Glineur, F.: Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative
matrix factorization. Neural Computation 24(4), 1085–1105 (2012)
[17] Gillis, N., Luce, R.: Robust near-separable nonnegative matrix factorization using linear optimization
(2013). arXiv:1302.4385
[18] Gillis, N., Vavasis, S.: Fast and robust recursive algorithms for separable nonnegative matrix factorization.
IEEE Trans. Pattern Anal. Mach. Intell. (2013). DOI:10.1109/TPAMI.2013.226
[19] Gillis, N., Vavasis, S.: Semidefinite programming based preconditioning for more robust near-separable
nonnegative matrix factorization (2013). arXiv:1310.2273
[20] Hiriart-Urruty, J.B., Lemaréchal, C.: Fundamentals of Convex Analysis. Springer, Berlin (2001)
[21] Kumar, A.: private communication (2013)
[22] Kumar, A., Sindhwani, V., Kambadur, P.: Fast conical hull algorithms for near-separable non-negative
matrix factorization. In: Int. Conf. on Machine Learning (ICML ’13), vol. 28, pp. 231–239 (2013)
[23] Lee, D., Seung, H.: Learning the Parts of Objects by Nonnegative Matrix Factorization. Nature 401,
788–791 (1999)
[24] Ma, W.K., Bioucas-Dias, J., Chan, T.H., Gillis, N., Gader, P., Plaza, A., Ambikapathi, A., Chi, C.Y.:
A Signal processing perspective on hyperspectral unmixing (2014). To appear in IEEE Signal Processing
Magazine, Special Issue on Signal and Image Processing in Hyperspectral Remote Sensing
[25] Mizutani, T.: Ellipsoidal Rounding for Nonnegative Matrix Factorization Under Noisy Separability (2013).
arXiv:1309.5701
[26] Nesterov, Y.: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publish-
ers (2004)
[27] Pauca, V., Piper, J., Plemmons, R.: Nonnegative matrix factorization for spectral data analysis. Linear
Algebra and its Applications 406 (1), 29–47 (2006)
[28] Ren, H., Chang, C.I.: Automatic spectral target recognition in hyperspectral imagery. IEEE Trans. on
Aerospace and Electronic Systems 39(4), 1232–1249 (2003)
[29] Shahnaz, F., Berry, M., Pauca, V., Plemmons, R.: Document clustering using nonnegative matrix
factorization. Information Processing and Management 42, 373–386 (2006)
[30] Vavasis, S.: On the complexity of nonnegative matrix factorization. SIAM Journal on Optimization 20(3),
1364–1377 (2009)
[31] Wang, F., Li, T., Wang, X., Zhu, S., Ding, C.: Community discovery using nonnegative matrix factoriza-
tion. Data Mining and Knowledge Discovery 22(3), 493–521 (2011)
... This assumptions implies that the spectra in M form a simplex with r vertices. These geometric algorithms include vertex component analysis (VCA), successive projection algorithm (SPA), successive nonnegative projection algorithm (SNPA) and NFINDR, to cite a few, which aim at finding these vertices by resorting to geometric tools, such as projections [11]- [13]. These methods are usually very fast even for very large data set, but are restricted to the pure-pixel case and may not behave well when the data is grossly corrupted. ...
... This section is divided in two parts. The first one is a comparison of M2PNALS (the non-negative version of M2PALS) on synthetic data with state-of-the-art spectral unmixing algorithms, namely SPA [25], SNPA [13], NFINDR [11], Fast Gradient for Nonnegative Sparse Regression (FGNSR) [7], Group Lasso Unit sum Positivity constraints [6], SDSOMP [15] and MPALS [16]. A recent segmentation algorithm, hierarchical rank-one NMF (H2NMF) [26], is also used as a preprocessing tool for M2PNALS to compute a set of dictionaries in an unsupervised manner. ...
Preprint
Spectral unmixing aims at recovering the spectral signatures of materials, called endmembers, mixed in a hyperspectral or multispectral image, along with their abundances. A typical assumption is that the image contains one pure pixel per endmember, in which case spectral unmixing reduces to identifying these pixels. Many fully automated methods have been proposed in recent years, but little work has been done to allow users to select areas where pure pixels are present manually or using a segmentation algorithm. Additionally, in a non-blind approach, several spectral libraries may be available rather than a single one, with a fixed number (or an upper or lower bound) of endmembers to chose from each. In this paper, we propose a multiple-dictionary constrained low-rank matrix approximation model that address these two problems. We propose an algorithm to compute this model, dubbed M2PALS, and its performance is discussed on both synthetic and real hyperspectral images.
... SSMF algorithms. We compare MV-Dual to six state-of-the-art algorithms: \bullet SNPA [16] is based on the separability assumption and presents a robust extension to the successive projection algorithm (SPA) [2,16] by taking advantage of the nonnegativity constraint in the decomposition. \bullet Simplex volume minimization (Min-Vol) fits a simplex with minimum volume to the data points using the following optimization problem [24]: ...
... SSMF algorithms. We compare MV-Dual to six state-of-the-art algorithms: \bullet SNPA [16] is based on the separability assumption and presents a robust extension to the successive projection algorithm (SPA) [2,16] by taking advantage of the nonnegativity constraint in the decomposition. \bullet Simplex volume minimization (Min-Vol) fits a simplex with minimum volume to the data points using the following optimization problem [24]: ...
Article
Full-text available
. Simplex-structured matrix factorization (SSMF), closely related to nonnegative matrix factorization, is a fundamental interpretable data analysis model and has applications in hyperspectral unmixing and topic modeling. To obtain identifiable solutions, a standard approach is to find minimum- volume solutions. By taking advantage of the duality/polarity concept for polytopes, we convert minimum-volume SSMF in the primal space to a maximum-volume problem in the dual space. We first prove the identifiability of this maximum-volume dual problem. Then, we use this dual formulation to provide a novel optimization approach which bridges the gap between two existing families of algorithms for SSMF, namely volume minimization and facet identification. Numerical experiments show that the proposed approach performs favorably compared to the state-of-the-art SSMF algorithms.
... ‚ Check the identification performances of the DCPD model on synthetic data with respect to the CPD model. We also use DCPD in the matrix case to perform spectral unmixing under pure-pixel assumption of the Urban and Terrain data sets 1 and compare our results with state-of-the-art methods [10]- [13]. ...
... For this communication on dictionary based tensor factorization models, we choose to try out the different models and algorithms on a well-known sparse coding problem, namely spectral unmixing under the pure-pixel assumption. As explained in Section II, the data itself can be used as the dictionary, and models described in this paper for matrices can be straightforwardly compared with state-of-the-art sparse coding approaches for spectral unmixing, namely the succesive projection algorithm (SPA) [38], the successive nonnegative projection algorithm (SNPA) [10], the Hierarchical Clustering algorithm (H2NMF) [65] and FGNSR [41]. Below, the spectra contained in the data are used as atoms, so that the dictionary has row dimension equals to the number of spectral bands (« 150), and number of atoms equals to the number of pixels in the hyperspectral images (HSI) (« 10 5 ). ...
Preprint
To ensure interpretability of the extracted sources in tensor decomposition, we introduce in this paper a dictionary-based tensor canonical polyadic decomposition which enforces one factor to belong exactly to a known dictionary. A new formulation of sparse coding is proposed which enables dictionary-based canonical polyadic decomposition of high-dimensional tensors. The benefits of using a dictionary in tensor decomposition models are explored both in terms of parameter identifiability and estimation accuracy. The performance of the proposed algorithms is evaluated on the decomposition of simulated data and the unmixing of hyperspectral images.
... Successive nonnegative projection algorithm (SNPA). A variant of SPA using the nonnegativity constraints in the projection step [15]. To the best of our knowledge, it is the provably most robust sequential algorithm for separable NMF (in particular, it does not need M(:,K) to be full rank). ...
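To make the difference with SPA concrete, here is a simplified sketch of an SNPA-style selection loop in which the projection step is a nonnegative least squares fit onto the already-selected columns. This is illustrative only: the actual SNPA of [15] projects onto the convex hull of the selected columns and the origin, whereas this simplification keeps only the nonnegativity constraint.

import numpy as np
from scipy.optimize import nnls

def snpa_like_selection(M, r):
    # Simplified, illustrative SNPA-style column selection.
    m, n = M.shape
    K = []
    R = M.astype(float).copy()                          # residual matrix
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))   # most "extreme" remaining column
        K.append(j)
        W = M[:, K]
        # Nonnegative projection: H[:, i] = argmin_{h >= 0} ||M[:, i] - W h||_2
        H = np.column_stack([nnls(W, M[:, i])[0] for i in range(n)])
        R = M - W @ H
    return K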
Preprint
A nonnegative matrix factorization (NMF) can be computed efficiently under the separability assumption, which asserts that all the columns of the given input data matrix belong to the cone generated by a (small) subset of them. The provably most robust methods to identify these conic basis columns are based on nonnegative sparse regression and self dictionaries, and require the solution of large-scale convex optimization problems. In this paper we study a particular nonnegative sparse regression model with self dictionary. As opposed to previously proposed models, this model yields a smooth optimization problem where the sparsity is enforced through linear constraints. We show that the Euclidean projection on the polyhedron defined by these constraints can be computed efficiently, and propose a fast gradient method to solve our model. We compare our algorithm with several state-of-the-art methods on synthetic data sets and real-world hyperspectral images.
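For context, one way such a self-dictionary model can look (a hedged sketch under assumed constraints; the exact model in the cited preprint may differ) is

% Assumed illustrative form of a self-dictionary model with sparsity via linear constraints
\min_{X \ge 0} \; \|M - MX\|_F^2 + \lambda\, p^\top \operatorname{diag}(X)
\quad \text{such that} \quad X_{ij} \le X_{ii} \le 1 \;\; \text{for all } i, j,

where the sparsity-inducing constraints are linear and the selected columns of M correspond to indices i with large diagonal entries X_{ii}.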
... In satellite imaging, each pixel contains sensor readings, at multiple wavelengths, for a different patch of land. Individual sources correspond to reflectances of materials at different wavelengths that are mixed according to the material composition of each pixel [1][2][3][4][5]. ...
Preprint
Identifying concentrations of components from an observed mixture is a fundamental problem in signal processing. It has diverse applications in fields ranging from hyperspectral imaging to denoising biomedical sensors. This paper focuses on in-silico deconvolution of signals associated with complex tissues into their constitutive cell-type specific components, along with a quantitative characterization of the cell-types. Deconvolving mixed tissues/cell-types is useful in the removal of contaminants (e.g., surrounding cells) from tumor biopsies, as well as in monitoring changes in the cell population in response to treatment or infection. In these contexts, the observed signal from the mixture of cell-types is assumed to be a linear combination of the expression levels of genes in constitutive cell-types. The goal is to use known signals corresponding to individual cell-types along with a model of the mixing process to cast the deconvolution problem as a suitable optimization problem. In this paper, we present a survey of models, methods, and assumptions underlying deconvolution techniques. We investigate the choice of the different loss functions for evaluating estimation error, constraints on solutions, preprocessing and data filtering, feature selection, and regularization to enhance the quality of solutions, along with the impact of these choices on the performance of regression-based methods for deconvolution. We assess different combinations of these factors and use detailed statistical measures to evaluate their effectiveness. We identify shortcomings of current methods and avenues for further investigation. For many of the identified shortcomings, such as normalization issues and data filtering, we provide new solutions. We summarize our findings in a prescriptive step-by-step process, which can be applied to a wide range of deconvolution problems.
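A minimal sketch of the regression-based deconvolution step described above (names are illustrative; the survey compares many loss functions, constraints, and preprocessing choices beyond this one):

import numpy as np
from scipy.optimize import nnls

def estimate_proportions(S, b):
    # S: genes x cell-types signature matrix; b: mixed expression profile.
    # Nonnegative least squares fit, then normalization to proportions.
    w, _ = nnls(S, b)
    total = w.sum()
    return w / total if total > 0 else w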
... In order to extract more vertices, one can take the nonnegativity of H into account in the projection step, leading to the successive nonnegative projection algorithm (SNPA). SNPA can extract all the vertices of conv(W), regardless of the dimensions; see [12] for more details. However, when rk(W) = r, SPA and SNPA have the same error bounds, while SPA is significantly faster. ...
Preprint
Full-text available
The successive projection algorithm (SPA) is a workhorse algorithm to learn the r vertices of the convex hull of a set of (r-1)-dimensional data points, a.k.a. a latent simplex, which has numerous applications in data science. In this paper, we revisit the robustness to noise of SPA and several of its variants. In particular, when r ≥ 3, we prove the tightness of the existing error bounds for SPA and for two more robust preconditioned variants of SPA. We also provide significantly improved error bounds for SPA, by a factor proportional to the conditioning of the r vertices, in two special cases: for the first extracted vertex, and when r ≤ 2. We then provide further improvements for the error bounds of a translated version of SPA proposed by Arora et al. (''A practical algorithm for topic modeling with provable guarantees'', ICML, 2013) in two special cases: for the first two extracted vertices, and when r ≤ 3. Finally, we propose a new more robust variant of SPA that first shifts and lifts the data points in order to minimize the conditioning of the problem. We illustrate our results on synthetic data.
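For reference, the standard SPA recursion is short enough to state as a sketch (textbook form; the preconditioned and translated variants discussed above modify the data or the selection step beforehand):

import numpy as np

def spa(M, r):
    # Successive projection algorithm: pick the column with largest norm,
    # then project all columns onto the orthogonal complement of the pick.
    R = M.astype(float).copy()
    K = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))
        K.append(j)
        u = R[:, j] / np.linalg.norm(R[:, j])
        R = R - np.outer(u, u @ R)
    return K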
... To simplify the presentation, we use the approach proposed in [6]. It initializes with the successive nonnegative projection algorithm (SNPA) [5], which identifies a subset of columns of the data matrix that represent well-spread data points in the data set. ...
Preprint
Full-text available
Orthogonal nonnegative matrix factorization (ONMF) has become a standard approach for clustering. As far as we know, most works on ONMF rely on the Frobenius norm to assess the quality of the approximation. This paper presents a new model and algorithm for ONMF that minimizes the Kullback-Leibler (KL) divergence. As opposed to the Frobenius norm, which assumes Gaussian noise, the KL divergence is the maximum likelihood estimator for Poisson-distributed data, which can better model vectors of word counts in document data sets and photon-counting processes in imaging. We have developed an algorithm based on alternating optimization, KL-ONMF, and show that it performs favorably compared with Frobenius-norm-based ONMF for document classification and hyperspectral image unmixing.
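To illustrate why the KL divergence leads to different updates than the Frobenius norm, here is a generic sketch of the classical multiplicative KL-NMF updates (Lee and Seung); this is not the orthogonal KL-ONMF algorithm of the paper, only the underlying KL fitting step:

import numpy as np

def kl_nmf(M, r, n_iter=200, eps=1e-9, seed=0):
    # Multiplicative updates minimizing the KL divergence D(M || WH).
    m, n = M.shape
    rng = np.random.default_rng(seed)
    W = rng.random((m, r)) + eps
    H = rng.random((r, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ (M / (W @ H + eps))) / (W.sum(axis=0)[:, None] + eps)
        W *= ((M / (W @ H + eps)) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    return W, H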
... It can be used for tasks like text mining, sentiment analysis, and news clustering. To solve the NMF problem in polynomial time, a separability assumption was proposed in [10], i.e., W is composed of some columns of A (a small numerical illustration is given below). Several algorithms have since been proposed to solve the separable NMF problem [2,15,12]. Recently, Pan and Ng generalized the separability assumption to coseparability in [30], which assumes that W is composed of a subset of rows and columns of A. In other words, W is a submatrix of A. It provides a more compact core matrix to represent the original data matrix A. ...
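The numerical illustration mentioned above: a tiny separable matrix in which W reappears, up to permutation, among the columns of A (all values are random placeholders):

import numpy as np

rng = np.random.default_rng(1)
W = rng.random((5, 3))                 # the 3 "pure" columns
Hp = rng.random((3, 4))
A = np.hstack([W, W @ Hp])             # A = W [I_3, H'] with identity permutation
A = A[:, rng.permutation(A.shape[1])]  # hide the pure columns among the mixtures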
Preprint
The key challenge of time-resolved Raman spectroscopy is the identification of the constituent species and the analysis of the kinetics of the underlying reaction network. In this work we present an integral approach that allows for determining both the component spectra and the rate constants simultaneously from a series of vibrational spectra. It is based on an algorithm for non-negative matrix factorization which is applied to the experimental data set following a few pre-processing steps. As a prerequisite for physically unambiguous solutions, each component spectrum must include one vibrational band that does not significantly interfere with vibrational bands of other species. The approach is applied to synthetic "experimental" spectra derived from model systems comprising a set of species with component spectra differing with respect to their degree of spectral interferences and signal-to-noise ratios. In each case, the species involved are connected via monomolecular reaction pathways. The potential and limitations of the approach for recovering the respective rate constants and component spectra are discussed.
Article
Full-text available
Blind hyperspectral unmixing (HU), also known as unsupervised HU, is one of the most prominent research topics in signal processing (SP) for hyperspectral remote sensing [1], [2]. Blind HU aims at identifying materials present in a captured scene, as well as their compositions, by using high spectral resolution of hyperspectral images. It is a blind source separation (BSS) problem from a SP viewpoint. Research on this topic started in the 1990s in geoscience and remote sensing [3]-[7], enabled by technological advances in hyperspectral sensing at the time. In recent years, blind HU has attracted much interest from other fields such as SP, machine learning, and optimization, and the subsequent cross-disciplinary research activities have made blind HU a vibrant topic. The resulting impact is not just on remote sensing - blind HU has provided a unique problem scenario that inspired researchers from different fields to devise novel blind SP methods. In fact, one may say that blind HU has established a new branch of BSS approaches not seen in classical BSS studies. In particular, the convex geometry concepts - discovered by early remote sensing researchers through empirical observations [3]-[7] and refined by later research - are elegant and very different from statistical independence-based BSS approaches established in the SP field. Moreover, the latest research on blind HU is rapidly adopting advanced techniques, such as those in sparse SP and optimization. The present development of blind HU seems to be converging to a point where the lines between remote sensing-originated ideas and advanced SP and optimization concepts are no longer clear, and insights from both sides would be used to establish better methods.
Article
Full-text available
Hyperspectral endmember extraction is a process to estimate endmember signatures from the hyperspectral observations, in an attempt to study the underlying mineral composition of a landscape. However, estimating the number of endmembers, which is usually assumed to be known a priori in most endmember estimation algorithms (EEAs), still remains a challenging task. In this paper, assuming hyperspectral linear mixing model, we propose a hyperspectral data geometry-based approach for estimating the number of endmembers by utilizing successive endmember estimation strategy of an EEA. The approach is fulfilled by two novel algorithms, namely geometry-based estimation of number of endmembers—convex hull (GENE-CH) algorithm and affine hull (GENE-AH) algorithm. The GENE-CH and GENE-AH algorithms are based on the fact that all the observed pixel vectors lie in the convex hull and affine hull of the endmember signatures, respectively. The proposed GENE algorithms estimate the number of endmembers by using the Neyman–Pearson hypothesis testing over the endmember estimates provided by a successive EEA until the estimate of the number of endmembers is obtained. Since the estimation accuracies of the proposed GENE algorithms depend on the performance of the EEA used, a reliable, reproducible, and successive EEA, called p-norm-based pure pixel identification (TRI-P) algorithm is then proposed. The performance of the proposed TRI-P algorithm, and the estimation accuracies of the GENE algorithms are demonstrated through Monte Carlo simulations. Finally, the proposed GENE and TRI-P algorithms are applied to real AVIRIS hyperspectral data obtained over the Cuprite mining site, Nevada, and some conclusions and future directions are provided.
Article
Full-text available
Nonnegative matrix factorization (NMF) has become a widely used tool for the analysis of high-dimensional data as it automatically extracts sparse and meaningful features from a set of nonnegative data vectors. We first illustrate this property of NMF on three applications, in image processing, text mining and hyperspectral imaging --this is the why. Then we address the problem of solving NMF, which is NP-hard in general. We review some standard NMF algorithms, and also present a recent subclass of NMF problems, referred to as near-separable NMF, that can be solved efficiently (that is, in polynomial time), even in the presence of noise --this is the how. Finally, we briefly describe some problems in mathematics and computer science closely related to NMF via the nonnegative rank.
Article
Full-text available
We present algorithms for topic modeling based on the geometry of cross-document word-frequency patterns. This perspective gains significance under the so-called separability condition, which requires the existence of novel words that are unique to each topic. We present a suite of highly efficient algorithms based on data-dependent and random projections of word-frequency patterns to identify novel words and associated topics. We also discuss the statistical guarantees of the data-dependent projection method based on two mild assumptions on the prior density of the topic-document matrix. Our key insight is that the maximum and minimum values of cross-document frequency patterns projected along any direction are associated with novel words. While our sample complexity bounds for topic recovery are similar to the state of the art, the computational complexity of our random projection scheme scales linearly with the number of documents and the number of words per document. We present several experiments on synthetic and real-world datasets to demonstrate qualitative and quantitative merits of our scheme.
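A hedged sketch of the random-projection idea stated above (function and variable names are illustrative): row-normalized cross-document frequency patterns are projected onto random directions, and the extreme points are kept as candidate novel words.

import numpy as np

def candidate_novel_words(X, n_projections=100, seed=0):
    # X: words x documents count matrix; returns candidate novel-word indices.
    rng = np.random.default_rng(seed)
    P = X / np.maximum(X.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    candidates = set()
    for _ in range(n_projections):
        d = rng.standard_normal(P.shape[1])
        proj = P @ d
        candidates.add(int(np.argmax(proj)))   # extreme point in direction d
        candidates.add(int(np.argmin(proj)))   # extreme point in direction -d
    return sorted(candidates)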
Article
Full-text available
Nonnegative matrix factorization (NMF) under the separability assumption can provably be solved efficiently, even in the presence of noise, and has been shown to be a powerful technique in document classification and hyperspectral unmixing. This problem is referred to as near-separable NMF and requires that there exists a cone spanned by a small subset of the columns of the input nonnegative matrix approximately containing all columns. In this paper, we propose a preconditioning based on semidefinite programming making the input matrix well-conditioned. This in turn can significantly improve the performance of near-separable NMF algorithms, which is illustrated on the popular successive projection algorithm (SPA). The new preconditioned SPA is provably more robust to noise, and outperforms SPA on several synthetic data sets. We also show how an active-set method allows us to apply the preconditioning to large-scale real-world hyperspectral images.
Conference Paper
Full-text available
This paper studies a multiple-measurement vector (MMV)-based sparse regression approach to blind hyperspectral unmixing. In general, sparse regression requires a dictionary. The considered approach uses the measured hyperspectral data as the dictionary, thereby intending to represent the whole measured data using the fewest number of measured hyperspectral vectors. We tackle this self-dictionary MMV (SD-MMV) approach using greedy pursuit. It is shown that the resulting greedy algorithms are identical or very similar to some representative pure pixel identification algorithms, such as vertex component analysis. Hence, our study provides a new dimension on understanding and interpreting pure pixel identification methods. We also prove that in the noiseless case, the greedy SD-MMV algorithms guarantee perfect identification of pure pixels when the pure pixel assumption holds.
Article
Full-text available
This paper presents a new framework for blind source separation (BSS) of non-negative source signals. The proposed framework, referred to herein as convex analysis of mixtures of non-negative sources (CAMNS), is deterministic, requiring no source independence assumption, the entrenched premise in many existing (usually statistical) BSS frameworks. The development is based on a special assumption called local dominance. It is a good assumption for source signals exhibiting sparsity or high contrast, and thus is considered realistic for many real-world problems such as multichannel biomedical imaging. Under local dominance and several standard assumptions, we apply convex analysis to establish a new BSS criterion, which states that the source signals can be perfectly identified (in a blind fashion) by finding the extreme points of an observation-constructed polyhedral set. Methods for fulfilling the CAMNS criterion are also derived, using either linear programming or simplex geometry. Simulation results on several data sets are presented to demonstrate the efficacy of the proposed method over several other reported BSS methods.
Chapter
We have mentioned in our preamble to Chap. C that sublinearity permits the approximation of convex functions to first order around a given point. In fact, we will show here that, if $f : \mathbb{R}^n \to \mathbb{R}$ is convex and $x \in \mathbb{R}^n$ is fixed, then the function $d \mapsto f'(x, d) := \lim_{t \downarrow 0} \frac{f(x + td) - f(x)}{t}$ exists and is finite sublinear. Furthermore, $f'$ approximates $f$ around $x$ in the sense that $f(x + h) = f(x) + f'(x, h) + o(\|h\|)$. (0.1)
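As a quick one-dimensional sanity check of these statements, take the convex function $f(x) = |x|$ at $x = 0$:

f'(0, d) \;=\; \lim_{t \downarrow 0} \frac{|0 + t d| - |0|}{t} \;=\; |d|,

which is indeed finite and sublinear (positively homogeneous and subadditive); moreover $f(0 + h) = f(0) + f'(0, h)$ holds exactly, so the $o(\|h\|)$ term in (0.1) vanishes in this case.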
Article
We present a numerical algorithm for nonnegative matrix factorization (NMF) problems under noisy separability. An NMF problem under separability can be stated as one of finding all vertices of the convex hull of data points. The research interest of this paper is to find the vectors as close to the vertices as possible in a situation in which noise is added to the data points. Our algorithm is designed to capture the shape of the convex hull of data points by using its enclosing ellipsoid. We show that the algorithm has correctness and robustness properties from theoretical and practical perspectives; correctness here means that if the data points do not contain any noise, the algorithm can find the vertices of their convex hull; robustness means that if the data points contain noise, the algorithm can find the near-vertices. Finally, we apply the algorithm to document clustering, and report the experimental results.
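A hedged sketch of the enclosing-ellipsoid idea (a generic Khachiyan-type iteration, not the paper's exact algorithm): compute a minimum-volume enclosing ellipsoid of the data points; near-vertices of their convex hull then lie close to its boundary.

import numpy as np

def enclosing_ellipsoid(P, tol=1e-7):
    # P: d x n data matrix. Returns (A, c) with (p - c)^T A (p - c) <= 1 for all columns p.
    d, n = P.shape
    Q = np.vstack([P, np.ones(n)])          # lift to homogeneous coordinates
    u = np.full(n, 1.0 / n)
    err = tol + 1.0
    while err > tol:
        X = Q @ np.diag(u) @ Q.T
        M = np.einsum('ij,ji->i', Q.T, np.linalg.solve(X, Q))  # q_i^T X^{-1} q_i
        j = int(np.argmax(M))
        step = (M[j] - d - 1.0) / ((d + 1.0) * (M[j] - 1.0))
        new_u = (1.0 - step) * u
        new_u[j] += step
        err = np.linalg.norm(new_u - u)
        u = new_u
    c = P @ u
    A = np.linalg.inv(P @ np.diag(u) @ P.T - np.outer(c, c)) / d
    return A, c

Columns p with (p - c)^T A (p - c) close to 1 are then natural candidates for the near-vertices.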