All content in this area was uploaded by Nicolas Gillis on Feb 06, 2019
MINIMUM-VOLUME RANK-DEFICIENT NONNEGATIVE MATRIX FACTORIZATIONS
Valentin Leplat, Andersen M.S. Ang, Nicolas Gillis
University of Mons, Rue de Houdain 9, 7000 Mons, Belgium
ABSTRACT
In recent years, nonnegative matrix factorization (NMF) with volume regularization has been shown to be a powerful identifiable model, for example for hyperspectral unmixing, document classification, community detection and hidden Markov models. In this paper, we show that minimum-volume NMF (min-vol NMF) can also be used when the basis matrix is rank deficient, which is a reasonable scenario for some real-world NMF problems (e.g., for unmixing multispectral images). We propose an alternating fast projected gradient method for min-vol NMF and illustrate its use on rank-deficient NMF problems, namely a synthetic data set and a multispectral image.
Index Terms: nonnegative matrix factorization, minimum volume, identifiability, rank deficiency
1. INTRODUCTION
Given a nonnegative matrix $X \in \mathbb{R}^{m \times n}_+$ and a factorization rank $r$, nonnegative matrix factorization (NMF) consists in finding two nonnegative matrices $W \in \mathbb{R}^{m \times r}_+$ and $H \in \mathbb{R}^{r \times n}_+$ such that $X \approx WH$. For simplicity, we will use the Frobenius norm, which is arguably the most widely used, to assess the error of an NMF solution and consider the following optimization problem
$$\min_{W \in \mathbb{R}^{m \times r},\, H \in \mathbb{R}^{r \times n}} \|X - WH\|_F^2 \quad \text{s.t.} \quad W \geq 0 \text{ and } H \geq 0.$$
NMF is in most cases ill-posed because the optimal solution is not unique. In order to make the solution of the above problem unique (up to permutation and scaling of the columns of $W$ and rows of $H$), hence making the problem well-posed and the parameters $(W, H)$ of the problem identifiable, a key idea is to look for a solution $W$ with minimum volume; see [1] and the references therein. A possible formulation for minimum-volume NMF (min-vol NMF) is as follows
$$\min_{W \geq 0,\; H(:,j) \in \Delta^r \;\forall j} \|X - WH\|_F^2 + \lambda \, \mathrm{vol}(W), \qquad (1)$$
where $\Delta^r = \{x \in \mathbb{R}^r_+ \mid \sum_i x_i \leq 1\}$, $\lambda$ is a penalty parameter, and $\mathrm{vol}(W)$ is a function that measures the volume of the columns of $W$. Note that $H$ needs to be normalized, otherwise $W$ would go to zero since $WH = (cW)(H/c)$ for any $c > 0$.

Acknowledgments. The authors acknowledge the support of the European Research Council (ERC starting grant no. 679515) and of the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no. O005318F-RG47.

In this paper, we will use $\mathrm{vol}(W) = \log\det(W^T W + \delta I)$, where $I$ is the identity matrix of appropriate dimensions. The reason for using such a measure is that $\sqrt{\det(W^T W)}/r!$ is the volume of the convex hull of the columns of $W$ and the origin. Under some appropriate conditions on $X = WH$, this model will provably recover the true underlying $(W, H)$ that generated $X$. These recovery conditions require that the columns of $X$ are sufficiently well spread in the convex hull generated by the columns of $W$ [2, 3, 4]; this is the so-called sufficiently scattered condition. In particular, data points need to be located on the facets of this convex hull, hence $H$ needs to be sufficiently sparse. A few remarks are in order:
• The ideas behind min-vol NMF were introduced in the hyperspectral imaging community and date back to the paper [5]; see also the discussions in [6, 1].
• As far as we know, these theoretical results only apply in noiseless conditions, hence the robustness to noise of model (1) still needs to be rigorously analyzed (this is a very promising but difficult direction of further research).
• The sufficiently scattered condition is a generalization of the separability condition, which requires $W = X(:, \mathcal{K})$ for some index set $\mathcal{K}$ of size $r$. Separability makes the NMF problem easily solvable, and efficient and robust algorithms exist; see, e.g., [7, 6, 8] and the references therein. Note that although min-vol NMF guarantees identifiability, the corresponding optimization problem (1) is still hard to solve in general, as is the original NMF problem [9].
Another key assumption that is used in min-vol NMF is that the basis matrix $W$ is full rank, that is, $\operatorname{rank}(W) = r$; otherwise $\det(W^T W) = 0$. However, there are situations when the matrix $W$ is not full rank: this happens in particular when $\operatorname{rank}(X) \neq \operatorname{rank}_+(X)$, where $\operatorname{rank}_+(X)$ is the nonnegative rank of $X$, which is the smallest $r$ such that $X$ has an exact NMF decomposition (that is, $X = WH$). Here is a simple example:
$$X = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}, \qquad (2)$$
for which $\operatorname{rank}(X) = 3 < \operatorname{rank}_+(X) = 4$. The columns of the matrix $X$ are the vertices of a square in a 2-dimensional affine subspace; see Fig. 2 for an illustration. A practical situation where this could happen is in multispectral imaging. Let us construct the matrix $X$ such that each column $X(:, j) \geq 0$ is the spectral signature of a pixel. Then, under the linear mixing model, each column of $X$ is a nonnegative linear combination of the spectral signatures of the constitutive materials present in the image, referred to as endmembers: we have $X(:, j) = \sum_{k=1}^{r} W(:, k) H(k, j)$, where $W(:, k)$ is the spectral signature of the $k$th endmember, and $H(k, j)$ is the abundance of the $k$th endmember in the $j$th pixel; see [6] for more details. For multispectral images, the number of materials within the scene being imaged can be larger than the number of spectral bands, meaning that $r > m$, hence $\operatorname{rank}(W) \leq m < r$.
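The geometric claims about the matrix in (2) can be checked numerically. The following is a minimal sketch using NumPy (our choice of tool, not the authors' code): it verifies that the rank is 3 and that the four columns form a square, with equal edge lengths and orthogonal adjacent edges. It does not certify the nonnegative rank, which is much harder to compute.

```python
import numpy as np

# Matrix X from (2); its columns are claimed to be the vertices of a square.
X = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 1]], dtype=float)

rank_X = np.linalg.matrix_rank(X)

# Edges between consecutive columns (the last column wraps around to the first).
edges = [X[:, (j + 1) % 4] - X[:, j] for j in range(4)]
edge_lengths = np.array([np.linalg.norm(e) for e in edges])
# Inner products of adjacent edges; zero means a right angle.
edge_angles = np.array([edges[j] @ edges[(j + 1) % 4] for j in range(4)])

print("rank(X) =", rank_X)
print("edge lengths:", edge_lengths)
print("inner products of adjacent edges:", edge_angles)
```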
In this paper, we focus on the min-vol NMF formulation in the rank-deficient scenario, that is, when $\operatorname{rank}(W) < r$. The main contribution of this paper is three-fold: (i) We explain why min-vol NMF (1) can be used meaningfully when the basis matrix $W$ is not full rank. This is, as far as we know, the first time this observation is made in the literature. (ii) We propose an algorithm based on an alternating projected fast gradient method to tackle this problem. (iii) We illustrate our results on a synthetic data set and a multispectral image.
2. MIN-VOL NMF IN THE RANK-DEFICIENT CASE
Let us discuss the min-vol NMF model we consider in this paper, namely,
$$\min_{W \geq 0,\; H(:,j) \in \Delta^r \;\forall j} \|X - WH\|_F^2 + \lambda \log\det(W^T W + \delta I), \qquad (3)$$
which has three key ingredients: the choice of the volume regularizer, that is, $\log\det(W^T W + \delta I)$, and the parameters $\delta$ and $\lambda$. They are discussed in the next three paragraphs.
Choice of the volume regularizer. Most functions used to minimize the volume of the columns of $W$ are based on the Gram matrix $W^T W$; in particular, $\det(W^T W)$ and $\log\det(W^T W + \delta I)$ for some $\delta > 0$ are the most widely used measures; see, e.g., [10, 11]. Note that $\det(W^T W) = \prod_{i=1}^{r} \sigma_i^2(W)$, hence the log term down-weights large singular values and has been observed to work better in practice; see, e.g., [12]. When $W$ is rank deficient (that is, $\operatorname{rank}(W) < r$), some singular values of $W$ are equal to zero, hence $\det(W^T W) = 0$. Therefore, the function $\det(W^T W)$ cannot distinguish between different rank-deficient solutions (see footnote 1). However, we have $\log\det(W^T W + \delta I) = \sum_{i=1}^{r} \log(\sigma_i^2(W) + \delta)$. Hence if $W$ has one (or more) singular value equal to zero, this measure still makes sense: among two rank-deficient solutions belonging to the same low-dimensional subspace, minimizing $\log\det(W^T W + \delta I)$ will favor a solution whose convex hull has a smaller volume within that subspace, since decreasing the non-zero singular values of $W$ will decrease $\log\det(W^T W + \delta I)$.

Footnote 1. Of course, one could also use the measure $\det(W^T W + \delta I)$ meaningfully in the rank-deficient case. However, it would be numerically more challenging since, for each singular value of $W$ equal to zero, the objective is multiplied by $\delta$, which should be chosen relatively small.

In mathematical terms, let $W \in \mathbb{R}^{m \times r}$ belong to a $k$-dimensional subspace with $k < r$ so that $W = US$, where $U \in \mathbb{R}^{m \times k}$ is an orthonormal basis of that subspace and $S \in \mathbb{R}^{k \times r}$ contains the coordinates of the columns of $W$ in that subspace. Then, $\log\det(W^T W + \delta I) = \sum_{i=1}^{k} \log(\sigma_i^2(S) + \delta) + (r - k) \log(\delta)$. The min-vol criterion $\log\det(W^T W + \delta I)$ with $\delta > 0$ is therefore meaningful even when $W$ does not have rank $r$.
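The decomposition of the logdet criterion in the rank-deficient case can be verified numerically. The sketch below is an illustration under our own assumptions (random $U$ and $S$, sizes chosen arbitrarily, NumPy as the tool); nonnegativity of $W$ is not needed for the identity, so it is not enforced here.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, k, delta = 6, 4, 2, 0.1  # illustrative sizes, with k < r

# U: orthonormal basis of a k-dimensional subspace; S: coordinates in that basis.
U, _ = np.linalg.qr(rng.standard_normal((m, k)))
S = rng.random((k, r))
W = U @ S  # rank(W) <= k < r

# Left-hand side: logdet(W^T W + delta I), computed in a numerically stable way.
sign, lhs = np.linalg.slogdet(W.T @ W + delta * np.eye(r))

# Right-hand side: sum over the k singular values of S, plus (r - k) log(delta).
sigma_S = np.linalg.svd(S, compute_uv=False)
rhs = np.sum(np.log(sigma_S**2 + delta)) + (r - k) * np.log(delta)

# det(W^T W) itself carries no information in the rank-deficient case.
det_gram = np.linalg.det(W.T @ W)

print(lhs, rhs, det_gram)
```

The two values agree up to rounding, while `det_gram` is numerically zero, which is why the plain determinant cannot rank rank-deficient solutions.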
Choice of $\delta$. The function $\log\det(W^T W + \delta I)$, which is equal to $\sum_{i=1}^{r} \log(\sigma_i^2(W) + \delta)$, is a non-convex surrogate for the $\ell_0$ norm of the vector of singular values of $W$ (up to constant factors), that is, of $\operatorname{rank}(W)$ [13, 14]. It is sharper than the $\ell_1$ norm of the vector of singular values (that is, the nuclear norm) for $\delta$ sufficiently small; see Fig. 1. Therefore, if one wants to promote rank-deficient solutions, $\delta$ should not be chosen too large, say $\delta \leq 0.1$. Moreover, $\delta$ should not be chosen too small, otherwise $W^T W + \delta I$ might be badly conditioned, which makes the optimization problem harder to solve (see Section 3); also, this could give too much importance to zero singular values, which might not be desirable. Therefore, in practice, we recommend using a value of $\delta$ between $10^{-3}$ and $0.1$. We will use $\delta = 0.1$ in this paper. Note that in previous works, $\delta$ was chosen very small (e.g., $10^{-8}$ in [11]), which, as explained above, is not a desirable choice, at least in the rank-deficient case. Even in the full-rank case, we argue that choosing $\delta$ too small is also not desirable since it promotes rank-deficient solutions.
Fig. 1. Function $\frac{\log(x^2 + \delta) - \log(\delta)}{\log(1 + \delta) - \log(\delta)}$ for different values of $\delta$, $\ell_1$ norm ($= |x|$) and $\ell_0$ norm ($= 0$ for $x = 0$, $= 1$ otherwise).
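To make the role of $\delta$ concrete, the normalized function from the caption of Fig. 1 can be evaluated directly. This is a small sketch; the function name `logdet_surrogate` and the sample points are ours, chosen only for illustration.

```python
import numpy as np

def logdet_surrogate(x, delta):
    """Normalized surrogate from Fig. 1: equals 0 at x = 0 and 1 at x = 1."""
    x = np.asarray(x, dtype=float)
    return (np.log(x**2 + delta) - np.log(delta)) / (np.log(1.0 + delta) - np.log(delta))

xs = np.array([0.0, 0.1, 0.5, 1.0])
values = {delta: logdet_surrogate(xs, delta) for delta in (1.0, 0.1, 1e-3)}
for delta, v in values.items():
    print(f"delta = {delta:g}:", np.round(v, 4))
```

For a fixed small singular value such as $x = 0.1$, the surrogate grows as $\delta$ decreases, that is, it moves closer to the $\ell_0$ norm, which is consistent with the discussion above.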
Choice of $\lambda$. The choice of $\delta$ will influence the choice of $\lambda$. In fact, the smaller $\delta$, the larger $|\log(\delta)|$, hence to balance the two terms in the objective (3), $\lambda$ should be smaller. For the practical implementation, we will initialize $W^{(0)} = X(:, \mathcal{K})$, where $\mathcal{K}$ is computed with the successive nonnegative projection algorithm (SNPA), which can handle the rank-deficient separable NMF problem [15]. Note that SNPA also provides the matrix $H^{(0)}$ so as to minimize $\|X - W^{(0)} H^{(0)}\|_F^2$ while $H^{(0)}(:, j) \in \Delta^r$ for all $j$. Finally, we will choose
$$\lambda = \tilde{\lambda} \, \frac{\|X - W^{(0)} H^{(0)}\|_F^2}{|\log\det(W^{(0)T} W^{(0)} + \delta I)|},$$
where we recommend choosing $\tilde{\lambda}$ between $10^{-3}$ and $1$ depending on the noise level (the noisier the input matrix, the larger $\lambda$ should be).
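The scaling rule for $\lambda$ is a one-line computation once an initial pair $(W^{(0)}, H^{(0)})$ is available. The sketch below assumes such a pair is given (SNPA itself is not implemented here); the function name `initial_lambda` and the toy data are hypothetical. Note that the rule breaks down if the logdet term happens to be zero.

```python
import numpy as np

def initial_lambda(X, W0, H0, delta=0.1, lam_tilde=0.01):
    """lambda = lam_tilde * ||X - W0 H0||_F^2 / |logdet(W0^T W0 + delta I)|."""
    r = W0.shape[1]
    fit = np.linalg.norm(X - W0 @ H0) ** 2  # squared Frobenius norm
    logdet = np.linalg.slogdet(W0.T @ W0 + delta * np.eye(r))[1]
    return lam_tilde * fit / abs(logdet)

# Toy data (illustrative only): W0 is the matrix from (2), H0 has columns on the simplex.
rng = np.random.default_rng(0)
W0 = np.array([[1, 1, 0, 0],
               [0, 0, 1, 1],
               [0, 1, 1, 0],
               [1, 0, 0, 1]], dtype=float)
H0 = rng.dirichlet(np.ones(4), size=20).T
X = np.maximum(0, W0 @ H0 + 0.05 * rng.standard_normal((4, 20)))

lam = initial_lambda(X, W0, H0, delta=0.1)
lam_small_delta = initial_lambda(X, W0, H0, delta=1e-3)
print(lam, lam_small_delta)
```

On this example, decreasing $\delta$ from $0.1$ to $10^{-3}$ increases $|\log\det(\cdot)|$ and therefore decreases $\lambda$, as described in the text.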
3. ALGORITHM FOR MIN-VOL NMF
Most algorithms for NMF optimize alternately over $W$ and $H$, and we adopt this strategy in this paper. For the update of $H$, we will use the projected fast gradient method (PFGM) from [15]. Note that, as opposed to previously proposed methods for min-vol NMF, we assume that the sum of the entries of each column of $H$ is smaller than or equal to one, not equal to one, which is more general. For the update of $W$, we use a PFGM applied to a strongly convex upper approximation of the objective function, similarly to what is done in [11], although in that paper the authors did not explicitly consider the case $W \geq 0$ ($W$ is unconstrained in their model) and did not explicitly write down a PFGM taking advantage of strong convexity. For the sake of completeness, we briefly recall this approach. The following upper bound for the logdet term holds: for any $Q \succ 0$ and $S \succ 0$, we have
$$\log\det(Q) \leq g(Q, S) = \log\det(S) + \operatorname{trace}\left(S^{-1}(Q - S)\right) = \operatorname{trace}\left(S^{-1} Q\right) + \log\det(S) - r.$$
This follows from the concavity of $\log\det(\cdot)$, as $g(Q, S)$ is the first-order Taylor approximation of $\log\det(Q)$ around $S$; it has also been used for example in [16]. Taking $Q = W^T W + \delta I$ and $S = Y^{-1}$ gives
$$\log\det(W^T W + \delta I) \leq \operatorname{trace}(Y W^T W) + \delta \operatorname{trace}(Y) + \log\det(Y^{-1}) - r$$
for any $W$ and any $Y = (Z^T Z + \delta I)^{-1}$ with $\delta > 0$, where the last three terms do not depend on $W$. Plugging this in the original objective function, and denoting by $w_i^T$ the $i$th row of the matrix $W$, by $c_i^T$ the $i$th row of $C$, and by $\langle \cdot, \cdot \rangle$ the Frobenius inner product of two matrices, we obtain
$$\begin{aligned}
\ell(W) &= \|X - WH\|_F^2 + \lambda \log\det(W^T W + \delta I) \\
&= \|X\|_F^2 - 2 \langle X H^T, W \rangle + \langle W^T W, H H^T \rangle + \lambda \log\det(W^T W + \delta I) \\
&\leq \langle W^T W, H H^T + \lambda Y \rangle - 2 \langle C, W \rangle + b \\
&= 2 \sum_{i=1}^{m} \left( \tfrac{1}{2} w_i^T A w_i - c_i^T w_i \right) + b = \bar{\ell}(W),
\end{aligned}$$
where $Y = (Z^T Z + \delta I)^{-1}$ and $A = H H^T + \lambda Y$ are positive definite for $\delta, \lambda > 0$, $C = X H^T$, and $b$ is a constant independent of $W$. Note that $\bar{\ell}(W) = \ell(W)$ for $Z = W$. Minimizing the upper bound $\bar{\ell}(W)$ of $\ell(W)$ requires solving $m$ independent strongly convex optimization problems with Hessian matrix $A$. Using PFGM on this problem, we obtain a linearly convergent method with rate $\frac{1 - \sqrt{\kappa^{-1}}}{1 + \sqrt{\kappa^{-1}}}$, where $\kappa$ is the condition number of $A$ [17]. Note that the subproblem in variable $H$ is not strongly convex when $W$ is rank deficient, in which case PFGM converges sublinearly, in $O(1/k^2)$, where $k$ is the iteration number. PFGM is an optimal first-order method in both cases [17], that is, no first-order method can have a faster convergence rate. When $W$ is rank deficient, we have $\frac{\lambda}{\delta} \leq L = \lambda_{\max}(A) \leq \|H\|_2^2 + \frac{\lambda}{\delta}$, where $L$ is the largest eigenvalue of $A$. This shows the importance of not choosing $\delta$ too small, since the smaller $\delta$, the larger the condition number of $A$, hence the slower the PFGM. Note that $L$ is the Lipschitz constant of the gradient of the objective function and controls the step size, which is equal to $1/L$. Our proposed algorithm is summarized in Alg. 1. We will use 10 inner iterations for the PFGM on $W$ and $H$.
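Both the logdet upper bound and the bounds on $L$ can be checked numerically on small random instances. The following sketch makes its own choices of sizes, random seed, and value of $\lambda$; the helper names `logdet` and `logdet_upper_bound` are hypothetical, and the bound is written with the constant term $\delta\,\mathrm{trace}(Y)$ included so that it is tight at $Z = W$.

```python
import numpy as np

def logdet(M):
    """Log-determinant of a symmetric positive definite matrix."""
    return np.linalg.slogdet(M)[1]

def logdet_upper_bound(W, Z, delta):
    """g(Q, S) with Q = W^T W + delta I and S = Z^T Z + delta I = Y^{-1}."""
    r = W.shape[1]
    Y = np.linalg.inv(Z.T @ Z + delta * np.eye(r))
    # trace(Y Q) + logdet(Y^{-1}) - r, using logdet(Y^{-1}) = -logdet(Y)
    return np.trace(Y @ (W.T @ W + delta * np.eye(r))) - logdet(Y) - r

rng = np.random.default_rng(1)
m, r, n, delta, lam = 5, 4, 50, 0.1, 0.5
W, Z = rng.random((m, r)), rng.random((m, r))

exact = logdet(W.T @ W + delta * np.eye(r))
bound_at_Z = logdet_upper_bound(W, Z, delta)  # majorizer built at another point
bound_at_W = logdet_upper_bound(W, W, delta)  # majorizer built at W itself

# Bounds on L = lambda_max(A) for the rank-deficient matrix from (2).
W_def = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1],
                  [0, 1, 1, 0],
                  [1, 0, 0, 1]], dtype=float)
H = rng.random((r, n))
A = H @ H.T + lam * np.linalg.inv(W_def.T @ W_def + delta * np.eye(r))
L = np.linalg.eigvalsh(A)[-1]
lower, upper = lam / delta, np.linalg.norm(H, 2) ** 2 + lam / delta

print(exact, bound_at_Z, bound_at_W)
print(lower, L, upper)
```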
Algorithm 1 Min-vol NMF using alternating PFGM
Require: Input matrix $X \in \mathbb{R}^{m \times n}_+$, the factorization rank $r$, $\delta > 0$, $\tilde{\lambda} > 0$, number of iterations maxiter.
Ensure: $(W, H)$ is an approximate solution of (3).
1: Initialize $(W, H)$ using SNPA [15].
2: Let $\lambda = \tilde{\lambda} \, \frac{\|X - WH\|_F^2}{|\log\det(W^T W + \delta I)|}$.
3: for $k = 1, 2, \dots,$ maxiter do
4:   % Update W
5:   Let $A = H H^T + \lambda (W^T W + \delta I)^{-1}$ and $C = X H^T$.
6:   Perform a few steps of PFGM on the problem $\min_{U \geq 0} \frac{1}{2} \langle U^T U, A \rangle - \langle U, C \rangle$, with initialization $U = W$. Set $W$ as the last iterate.
7:   % Update H
8:   Perform a few steps of PFGM on the problem $\min_{H(:,j) \in \Delta^r \;\forall j} \|X - WH\|_F^2$ as in [15].
9: end for
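The steps of Alg. 1 can be sketched in Python as follows. This is an illustrative reading of the algorithm, not the authors' Matlab code: the initial pair $(W, H)$ is supplied by the caller instead of being computed by SNPA, the momentum rules are standard Nesterov choices, and the names `project_capped_simplex`, `pfgm` and `minvol_nmf` are ours.

```python
import numpy as np

def project_capped_simplex(V):
    """Project each column of V onto {x >= 0, sum(x) <= 1}."""
    P = np.maximum(V, 0.0)
    for j in np.flatnonzero(P.sum(axis=0) > 1.0):
        # Sum constraint is active: project onto {x >= 0, sum(x) = 1} instead.
        u = np.sort(V[:, j])[::-1]
        css = np.cumsum(u) - 1.0
        k = np.arange(1, u.size + 1)
        rho = k[u - css / k > 0][-1]
        P[:, j] = np.maximum(V[:, j] - css[rho - 1] / rho, 0.0)
    return P

def pfgm(grad, project, x0, L, mu=0.0, n_iter=10):
    """Projected fast gradient method with step size 1/L."""
    x, y, t = x0, x0, 1.0
    for _ in range(n_iter):
        x_new = project(y - grad(y) / L)
        if mu > 0:  # strongly convex case: constant momentum
            q = np.sqrt(mu / L)
            beta = (1.0 - q) / (1.0 + q)
        else:       # convex case: FISTA-type momentum
            t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            beta, t = (t - 1.0) / t_new, t_new
        y = x_new + beta * (x_new - x)
        x = x_new
    return x

def minvol_nmf(X, W, H, delta=0.1, lam_tilde=0.01, maxiter=100, inner=10):
    """Alternating PFGM for (3), started from a user-supplied (W, H)."""
    I = np.eye(W.shape[1])
    fit = np.linalg.norm(X - W @ H) ** 2
    lam = lam_tilde * fit / abs(np.linalg.slogdet(W.T @ W + delta * I)[1])
    for _ in range(maxiter):
        # Update W: minimize 0.5 <U^T U, A> - <U, C> over U >= 0.
        A = H @ H.T + lam * np.linalg.inv(W.T @ W + delta * I)
        C = X @ H.T
        eig = np.linalg.eigvalsh(A)
        W = pfgm(lambda U: U @ A - C, lambda U: np.maximum(U, 0.0),
                 W, L=eig[-1], mu=max(eig[0], 0.0), n_iter=inner)
        # Update H: least squares with each column in the capped simplex.
        G, WtX = W.T @ W, W.T @ X
        L_H = max(np.linalg.eigvalsh(G)[-1], 1e-12)
        H = pfgm(lambda V: G @ V - WtX, project_capped_simplex,
                 H, L=L_H, mu=0.0, n_iter=inner)
    return W, H

# Small usage example on noiseless data generated from the matrix in (2).
rng = np.random.default_rng(0)
W_true = np.array([[1, 1, 0, 0], [0, 0, 1, 1],
                   [0, 1, 1, 0], [1, 0, 0, 1]], dtype=float)
H_true = rng.dirichlet(0.1 * np.ones(4), size=100).T
X = W_true @ H_true
W0 = X[:, rng.choice(100, size=4, replace=False)]  # crude stand-in for SNPA
H0 = project_capped_simplex(rng.random((4, 100)))
W_est, H_est = minvol_nmf(X, W0, H0, maxiter=20)
print(np.linalg.norm(X - W_est @ H_est) / np.linalg.norm(X))
```

The random column selection used for `W0` is only a placeholder for the SNPA initialization, so the quality of the output should not be compared with the results reported in Section 4.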
4. NUMERICAL EXPERIMENTS
We now apply our method on a synthetic and a real-world data set. All tests are performed using Matlab R2015a on a laptop with an Intel Core i7-7500U CPU @ 2.9 GHz and 24 GB of RAM. The code is available from http://bit.ly/minvolNMF.
Synthetic data set. Let us construct the matrix $X \in \mathbb{R}^{4 \times 500}$ as follows: $W$ is taken as the matrix from (2) so that $\operatorname{rank}(W) = 3 < r = 4$, and each column of $H$ is generated using the Dirichlet distribution with parameters $(0.1, \dots, 0.1)$. Each column of $H$ with an entry larger than 0.8 is resampled until none of its entries exceeds 0.8. This guarantees that no data point is close to a column of $W$ (this is sometimes referred to as the purity index). Fig. 2 illustrates this geometric problem. As observed in Fig. 2, Alg. 1 is able to perfectly recover the true columns of $W$. For this experiment, we use $\tilde{\lambda} = 0.01$. Fig. 3 illustrates the same experiment where noise is added, that is, $X = \max(0, WH + N)$, where $N = \epsilon\,$randn(m,n) in Matlab notation (i.i.d. Gaussian distribution of mean zero and standard deviation $\epsilon$). Note that the average of the entries of $X$ is 0.5 (each column is a linear combination of the columns of $W$, with weights summing to one). Fig. 3 displays the average over 20 randomly generated matrices $X$ of the relative error $d(W, \tilde{W}) = \frac{\|W - \tilde{W}\|_F}{\|W\|_F}$, where $\tilde{W}$ is the solution computed by Alg. 1, depending on the noise level $\epsilon$. This illustrates that min-vol NMF is robust against noise since $d(W, \tilde{W})$ is smaller than 1% for $\epsilon \leq 1\%$.
Fig. 2. Synthetic data set and recovery. (Only the first three entries of each four-dimensional vector are displayed.)
Fig. 3. Evolution of the recovery of the true $W$ depending on the noise $N = \epsilon\,$randn(m,n) using Alg. 1 ($\tilde{\lambda} = 0.01$, $\delta = 0.1$, maxiter = 100).
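The synthetic data described above can be generated along the following lines. This is a sketch in Python rather than the Matlab code used for the experiments; the helper names `sample_H` and `relative_error` are ours, and matching the columns of the estimate to those of $W$ by searching over permutations is an assumption about how the error is evaluated.

```python
import itertools
import numpy as np

def sample_H(r, n, alpha=0.1, purity=0.8, rng=None):
    """Dirichlet(alpha, ..., alpha) columns, resampled until max entry <= purity."""
    rng = np.random.default_rng() if rng is None else rng
    H = rng.dirichlet(alpha * np.ones(r), size=n).T
    bad = H.max(axis=0) > purity
    while bad.any():
        H[:, bad] = rng.dirichlet(alpha * np.ones(r), size=int(bad.sum())).T
        bad = H.max(axis=0) > purity
    return H

def relative_error(W, W_est):
    """||W - W_est P||_F / ||W||_F, minimized over column permutations P."""
    r = W.shape[1]
    best = min(np.linalg.norm(W - W_est[:, list(p)])
               for p in itertools.permutations(range(r)))
    return best / np.linalg.norm(W)

rng = np.random.default_rng(0)
W = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 1]], dtype=float)
H = sample_H(r=4, n=500, rng=rng)
X = W @ H

# Noisy variant: i.i.d. Gaussian noise of standard deviation eps, clipped at zero.
eps = 0.01
X_noisy = np.maximum(0, X + eps * rng.standard_normal(X.shape))

print(H.max(), X.mean(), X_noisy.min())
```

The permutation search in `relative_error` is only practical for small $r$ such as $r = 4$ here (24 permutations).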
Multispectral image. The San Diego airport image is a HYDICE hyperspectral image (HSI) containing 158 clean bands and 400 × 400 pixels for each spectral image; see, e.g., [18]. There are mainly three types of materials: road surfaces, roofs and vegetation (trees and grass). The image can be well approximated using $r = 8$. Since we are interested in the case $\operatorname{rank}(W) < r$, we select $m = 5$ spectral bands using the successive projection algorithm [19] (this is essentially Gram-Schmidt with column pivoting) applied on $X^T$. This provides bands that are representative: the selected bands are 4, 32, 116, 128, 150. Hence, we are factoring a 5-by-160000 matrix using $r = 8$. Note that we have removed outlying pixels (some spectra contain large negative entries while others have a norm orders of magnitude larger than most pixels). Fig. 4 displays the abundance maps extracted (that is, the rows of the matrix $H$): they correspond to meaningful locations of materials. Here we have used $\tilde{\lambda} = 0.1$ and 1000 iterations. From the initial solution provided by SNPA, min-vol NMF is able to reduce the error $\|X - WH\|_F$ by a factor of 11.7 while the term $\log\det(W^T W + \delta I)$ only increases by a factor of 1.06. The final relative error is $\frac{\|X - WH\|_F}{\|X\|_F} = 0.2\%$.
Fig. 4. Abundance maps extracted by min-vol NMF using only five bands of the San Diego airport HSI. From left to right, top to bottom: vegetation (grass and trees), three different types of roof tops, four different types of road surfaces.
5. CONCLUSION
In this paper, we have shown that min-vol NMF can be used meaningfully for rank-deficient NMFs. We have provided a simple algorithm to tackle this problem and have illustrated the behaviour of the method on synthetic and real-world data sets. This work is only preliminary and many important questions remain open; in particular:
• Under which conditions can we prove the identifiability of min-vol NMF in the rank-deficient case (as done in [2, 3] for the full-rank case)? Intuitively, it seems that a condition similar to the sufficiently scattered condition would be sufficient, but this has to be analysed thoroughly.
• Can we prove robustness to noise of such techniques? (The question is also open for the full-rank case.)
• Can we design faster and more robust algorithms? And algorithms taking advantage of the fact that the solution is rank-deficient?
6. REFERENCES
[1] Xiao Fu, Kejun Huang, Nicholas D Sidiropoulos, and Wing-Kin Ma, “Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications,” IEEE Signal Processing Magazine, 2018, to appear.
[2] Chia-Hsiang Lin, Wing-Kin Ma, Wei-Chiang Li, Chong-Yung Chi, and ArulMurugan Ambikapathi, “Identifiability of the simplex volume minimization criterion for blind hyperspectral unmixing: The no-pure-pixel case,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 10, pp. 5530–5546, 2015.
[3] Xiao Fu, Wing-Kin Ma, Kejun Huang, and Nicholas D Sidiropoulos, “Blind separation of quasi-stationary sources: Exploiting convex geometry in covariance domain,” IEEE Transactions on Signal Processing, vol. 63, no. 9, pp. 2306–2320, 2015.
[4] Xiao Fu, Kejun Huang, and Nicholas D Sidiropoulos, “On identifiability of nonnegative matrix factorization,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 328–332, 2018.
[5] Maurice D Craig, “Minimum-volume transforms for remotely sensed data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 32, no. 3, pp. 542–552, 1994.
[6] Wing-Kin Ma, José M Bioucas-Dias, Tsung-Han Chan, Nicolas Gillis, Paul Gader, Antonio J Plaza, ArulMurugan Ambikapathi, and Chong-Yung Chi, “A signal processing perspective on hyperspectral unmixing: Insights from remote sensing,” IEEE Signal Processing Magazine, vol. 31, no. 1, pp. 67–81, 2014.
[7] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra, “Computing a nonnegative matrix factorization - provably,” in Proceedings of the forty-fourth annual ACM symposium on Theory of computing. ACM, 2012, pp. 145–162.
[8] Nicolas Gillis, “Introduction to nonnegative matrix factorization,” SIAG/OPT Views and News, vol. 25, no. 1, pp. 7–16, 2017.
[9] Stephen A Vavasis, “On the complexity of nonnegative matrix factorization,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2010.
[10] Lidan Miao and Hairong Qi, “Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 3, pp. 765–777, 2007.
[11] Xiao Fu, Kejun Huang, Bo Yang, Wing-Kin Ma, and Nicholas D Sidiropoulos, “Robust volume minimization-based matrix factorization for remote sensing and document clustering,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6254–6268, 2016.
[12] Andersen M.S. Ang and Nicolas Gillis, “Volume regularized non-negative matrix factorizations,” in 2018 Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), 2018.
[13] Maryam Fazel, Matrix rank minimization with applications, Ph.D. thesis, Stanford University, 2002.
[14] Maryam Fazel, Haitham Hindi, and Stephen P Boyd, “Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices,” in Proceedings of the 2003 American Control Conference. IEEE, 2003, vol. 3, pp. 2156–2162.
[15] Nicolas Gillis, “Successive nonnegative projection algorithm for robust nonnegative blind source separation,” SIAM Journal on Imaging Sciences, vol. 7, no. 2, pp. 1420–1450, 2014.
[16] Kazuyoshi Yoshii, Ryota Tomioka, Daichi Mochihashi, and Masataka Goto, “Beyond NMF: Time-domain audio source separation without phase reconstruction,” in ISMIR, 2013, pp. 369–374.
[17] Yurii Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.
[18] Nicolas Gillis, Da Kuang, and Haesun Park, “Hierarchical clustering of hyperspectral images using rank-two nonnegative matrix factorization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2066–2078, 2015.
[19] Nicolas Gillis and Stephen A Vavasis, “Fast and robust recursive algorithms for separable nonnegative matrix factorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 698–714, 2014.