All content in this area was uploaded by Nicolas Gillis on Feb 06, 2019

MINIMUM-VOLUME RANK-DEFICIENT NONNEGATIVE MATRIX FACTORIZATIONS

Valentin Leplat, Andersen M.S. Ang, Nicolas Gillis

University of Mons, Rue de Houdain 9, 7000 Mons, Belgium

ABSTRACT

In recent years, nonnegative matrix factorization (NMF) with volume regularization has been shown to be a powerful identifiable model, for example for hyperspectral unmixing, document classification, community detection and hidden Markov models. In this paper, we show that minimum-volume NMF (min-vol NMF) can also be used when the basis matrix is rank deficient, which is a reasonable scenario for some real-world NMF problems (e.g., for unmixing multispectral images). We propose an alternating fast projected gradient method for min-vol NMF and illustrate its use on rank-deficient NMF problems, namely a synthetic data set and a multispectral image.

Index Terms: nonnegative matrix factorization, minimum volume, identifiability, rank deficiency

1. INTRODUCTION

Given a nonnegative matrix $X \in \mathbb{R}^{m \times n}_+$ and a factorization rank $r$, nonnegative matrix factorization (NMF) seeks two nonnegative matrices $W \in \mathbb{R}^{m \times r}_+$ and $H \in \mathbb{R}^{r \times n}_+$ such that $X \approx WH$. For simplicity, we will use the Frobenius norm, which is arguably the most widely used, to assess the error of an NMF solution and consider the following optimization problem:

$$\min_{W \in \mathbb{R}^{m \times r},\, H \in \mathbb{R}^{r \times n}} \|X - WH\|_F^2 \quad \text{s.t.} \quad W \geq 0 \text{ and } H \geq 0.$$

NMF is in most cases ill-posed because the optimal solution is not unique. To make the solution of the above problem unique (up to permutation and scaling of the columns of $W$ and rows of $H$), hence making the problem well-posed and the parameters $(W, H)$ of the problem identifiable, a key idea is to look for a solution $W$ with minimum volume; see [1] and the references therein. A possible formulation for minimum-volume NMF (min-vol NMF) is as follows:

$$\min_{W \geq 0,\; H(:,j) \in \Delta^r \, \forall j} \|X - WH\|_F^2 + \lambda \operatorname{vol}(W), \tag{1}$$

where $\Delta^r = \{x \in \mathbb{R}^r_+ \mid \sum_i x_i \leq 1\}$, $\lambda$ is a penalty parameter, and $\operatorname{vol}(W)$ is a function that measures the volume of the columns of $W$. Note that $H$ needs to be normalized, otherwise $W$ would go to zero since $WH = (cW)(H/c)$ for any $c > 0$.

Authors acknowledge the support by the European Research Council (ERC starting grant no. 679515) and by the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no. O005318F-RG47.

In this paper, we will use $\operatorname{vol}(W) = \operatorname{logdet}(W^TW + \delta I)$, where $I$ is the identity matrix of appropriate dimensions. The reason for using such a measure is that $\sqrt{\det(W^TW)}/r!$ is the volume of the convex hull of the columns of $W$ and the origin. Under some appropriate conditions on $X = WH$, this model will provably recover the true underlying $(W, H)$ that generated $X$. These recovery conditions require that the columns of $X$ are sufficiently well spread in the convex hull generated by the columns of $W$ [2, 3, 4]; this is the so-called sufficiently scattered condition. In particular, data points need to be located on the facets of this convex hull, hence $H$ needs to be sufficiently sparse. A few remarks are in order:

• The ideas behind min-vol NMF were introduced in the hyperspectral imaging community and date back to the paper [5]; see also the discussions in [6, 1].

• As far as we know, these theoretical results only apply in noiseless conditions, hence the robustness to noise of model (1) still needs to be rigorously analyzed (this is a very promising but difficult direction of further research).

• The sufficiently scattered condition is a generalization of the separability condition, which requires $W = X(:, \mathcal{K})$ for some index set $\mathcal{K}$ of size $r$. Separability makes the NMF problem easily solvable, and efficient and robust algorithms exist; see, e.g., [7, 6, 8] and the references therein. Note that although min-vol NMF guarantees identifiability, the corresponding optimization problem (1) is still hard to solve in general, as is the original NMF problem [9].
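As a small numerical illustration of the volume measure used in (1), the following minimal sketch (Python with NumPy; the function name `simplex_volume` is hypothetical and not taken from the paper or its code) evaluates $\sqrt{\det(W^TW)}/r!$ on a case where the answer is known, and shows why $H$ must be normalized: scaling $W$ by $c$ leaves $WH$ unchanged if $H$ is scaled by $1/c$, but scales the volume by $c^r$.

```python
import math
import numpy as np

def simplex_volume(W):
    """Volume of the convex hull of the origin and the columns of W,
    computed as sqrt(det(W^T W)) / r! (hypothetical helper)."""
    r = W.shape[1]
    gram = W.T @ W
    # Clip tiny negative determinants caused by round-off.
    return math.sqrt(max(np.linalg.det(gram), 0.0)) / math.factorial(r)

# Triangle with vertices 0, e1, e2 in the plane: area 1/2.
W = np.eye(2)
area = simplex_volume(W)

# Scaling ambiguity: (cW)(H/c) = WH, but the volume is multiplied by c^r.
c = 0.5
area_scaled = simplex_volume(c * W)
```

Here `area` is the area of the unit right triangle, and `area_scaled` is smaller by the factor $c^2$, which is why an unnormalized $H$ would let the regularizer drive $W$ to zero.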

Another key assumption that is used in min-vol NMF is that the basis matrix $W$ is full rank, that is, $\operatorname{rank}(W) = r$; otherwise $\det(W^TW) = 0$. However, there are situations when the matrix $W$ is not full rank: this happens in particular when $\operatorname{rank}(X) \neq \operatorname{rank}_+(X)$, where $\operatorname{rank}_+(X)$ is the nonnegative rank of $X$, which is the smallest $r$ such that $X$ has an exact NMF decomposition (that is, $X = WH$). Here is a simple example:

$$X = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{pmatrix}, \tag{2}$$

for which $\operatorname{rank}(X) = 3 < \operatorname{rank}_+(X) = 4$. The columns of the matrix $X$ are the vertices of a square in a 2-dimensional affine subspace; see Fig. 2 for an illustration. A practical situation where this could happen is in multispectral imaging. Let us construct the matrix $X$ such that each column $X(:, j) \geq 0$ is the spectral signature of a pixel. Then, under the linear mixing model, each column of $X$ is the nonnegative linear combination of the spectral signatures of the constitutive materials present in the image, referred to as endmembers: we have $X(:, j) = \sum_{k=1}^{r} W(:, k) H(k, j)$, where $W(:, k)$ is the spectral signature of the $k$th endmember, and $H(k, j)$ is the abundance of the $k$th endmember in the $j$th pixel; see [6] for more details. For multispectral images, the number of materials within the scene being imaged can be larger than the number of spectral bands, meaning that $r > m$, hence $\operatorname{rank}(W) \leq m < r$.
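The two rank-deficient situations described above can be checked numerically. The sketch below (Python with NumPy) verifies that the matrix in (2) has rank 3, and that a basis matrix with more columns than rows has a singular Gram matrix. The random $5 \times 8$ matrix is only an illustrative stand-in for a multispectral basis, not data from the paper; the nonnegative rank itself is not computed here.

```python
import numpy as np

# Matrix (2): its rank is 3 (its nonnegative rank, not computed here, is 4).
X = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 1]], dtype=float)
rank_X = np.linalg.matrix_rank(X)

# Multispectral-like setting: m bands and r > m endmembers (random stand-in).
rng = np.random.default_rng(0)
m, r = 5, 8
W = rng.random((m, r))
rank_W = np.linalg.matrix_rank(W)   # at most m, hence smaller than r
gram_det = np.linalg.det(W.T @ W)   # numerically zero since rank(W) < r
```

In both cases $\det(W^TW)$ vanishes, so a determinant-based volume carries no information.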

In this paper, we focus on the min-vol NMF formulation in the rank-deficient scenario, that is, when $\operatorname{rank}(W) < r$. The main contribution of this paper is three-fold: (i) We explain why min-vol NMF (1) can be used meaningfully when the basis matrix $W$ is not full rank. This is, as far as we know, the first time this observation is made in the literature. (ii) We propose an algorithm based on an alternating projected fast gradient method to tackle this problem. (iii) We illustrate our results on a synthetic data set and a multispectral image.

2. MIN-VOL NMF IN THE RANK-DEFICIENT CASE

Let us discuss the min-vol NMF model we consider in this paper, namely,

$$\min_{W \geq 0,\; H(:,j) \in \Delta^r \, \forall j} \|X - WH\|_F^2 + \lambda \operatorname{logdet}(W^TW + \delta I), \tag{3}$$

which has three key ingredients: the choice of the volume regularizer, that is, $\operatorname{logdet}(W^TW + \delta I)$, and the parameters $\delta$ and $\lambda$. They are discussed in the next three paragraphs.

Choice of the volume regularizer. Most functions used to minimize the volume of the columns of $W$ are based on the Gram matrix $W^TW$; in particular, $\det(W^TW)$ and $\operatorname{logdet}(W^TW + \delta I)$ for some $\delta > 0$ are the most widely used measures; see, e.g., [10, 11]. Note that $\det(W^TW) = \prod_{i=1}^{r} \sigma_i^2(W)$, hence the log term down-weights large singular values and has been observed to work better in practice; see, e.g., [12]. When $W$ is rank deficient (that is, $\operatorname{rank}(W) < r$), some singular values of $W$ are equal to zero, hence $\det(W^TW) = 0$. Therefore, the function $\det(W^TW)$ cannot distinguish between different rank-deficient solutions$^1$. However, we have $\operatorname{logdet}(W^TW + \delta I) = \sum_{i=1}^{r} \log(\sigma_i^2(W) + \delta)$. Hence, if $W$ has one (or more) singular values equal to zero, this measure still makes sense: among two rank-deficient solutions belonging to the same low-dimensional subspace, minimizing $\operatorname{logdet}(W^TW + \delta I)$ will favor a solution whose convex hull has a smaller volume within that subspace, since decreasing the non-zero singular values of $W$ will decrease $\operatorname{logdet}(W^TW + \delta I)$.

In mathematical terms, let $W \in \mathbb{R}^{m \times r}$ belong to a $k$-dimensional subspace with $k < r$ so that $W = US$, where $U \in \mathbb{R}^{m \times k}$ is an orthonormal basis of that subspace and $S \in \mathbb{R}^{k \times r}$ contains the coordinates of the columns of $W$ in that subspace. Then,

$$\operatorname{logdet}(W^TW + \delta I) = \sum_{i=1}^{k} \log(\sigma_i^2(S) + \delta) + (r - k) \log(\delta).$$

The min-vol criterion $\operatorname{logdet}(W^TW + \delta I)$ with $\delta > 0$ is therefore meaningful even when $W$ does not have rank $r$.

$^1$ Of course, one could also use the measure $\det(W^TW + \delta I)$ meaningfully in the rank-deficient case. However, it would be numerically more challenging since, for each singular value of $W$ equal to zero, the objective is multiplied by $\delta$, which should be chosen relatively small.
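The decomposition of the regularizer for a rank-deficient $W = US$ can be verified numerically. The following sketch (Python with NumPy) builds a random $W$ of rank $k < r$ and compares both sides of the identity. The sizes, the seed and the variable names are illustrative assumptions, and $W$ is not constrained to be nonnegative here since the identity does not depend on it.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, k, delta = 6, 4, 2, 0.1

# W = U S, with U having k orthonormal columns and S holding the coordinates.
U, _ = np.linalg.qr(rng.standard_normal((m, k)))   # reduced QR: U is m-by-k
S = rng.random((k, r))
W = U @ S

# Left-hand side: logdet(W^T W + delta I), using slogdet for stability.
sign, lhs = np.linalg.slogdet(W.T @ W + delta * np.eye(r))

# Right-hand side: k terms from the singular values of S,
# plus (r - k) copies of log(delta) for the zero singular values of W.
sigma = np.linalg.svd(S, compute_uv=False)
rhs = np.sum(np.log(sigma**2 + delta)) + (r - k) * np.log(delta)
```

The two values `lhs` and `rhs` agree up to round-off, which makes explicit that each zero singular value contributes the constant $\log(\delta)$.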

Choice of $\delta$. The function $\operatorname{logdet}(W^TW + \delta I)$, which is equal to $\sum_{i=1}^{r} \log(\sigma_i^2(W) + \delta)$, is a non-convex surrogate for the $\ell_0$ norm of the vector of singular values of $W$ (up to constant factors), that is, of $\operatorname{rank}(W)$ [13, 14]. It is sharper than the $\ell_1$ norm of the vector of singular values (that is, the nuclear norm) for $\delta$ sufficiently small; see Fig. 1. Therefore, if one wants to promote rank-deficient solutions, $\delta$ should not be chosen too large, say $\delta \leq 0.1$. Moreover, $\delta$ should not be chosen too small, otherwise $W^TW + \delta I$ might be badly conditioned, which makes the optimization problem harder to solve (see Section 3); also, this could give too much importance to zero singular values, which might not be desirable. Therefore, in practice, we recommend using a value of $\delta$ between $10^{-3}$ and $0.1$. We will use $\delta = 0.1$ in this paper. Note that in previous works, $\delta$ was chosen very small (e.g., $10^{-8}$ in [11]) which, as explained above, is not a desirable choice, at least in the rank-deficient case. Even in the full-rank case, we argue that choosing $\delta$ too small is also not desirable since it promotes rank-deficient solutions.

Fig. 1. Function $\frac{\log(x^2 + \delta) - \log(\delta)}{\log(1 + \delta) - \log(\delta)}$ for different values of $\delta$, $\ell_1$ norm ($= |x|$) and $\ell_0$ norm ($= 0$ for $x = 0$, $= 1$ otherwise).
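The normalized function shown in Fig. 1 is easy to evaluate directly. The sketch below (Python with NumPy; the function name `normalized_log_term` and the sample points are illustrative assumptions) computes it for a few values of $\delta$ and compares it with $|x|$.

```python
import numpy as np

def normalized_log_term(x, delta):
    """(log(x^2 + delta) - log(delta)) / (log(1 + delta) - log(delta)).
    Equals 0 at x = 0 and 1 at x = 1 for any delta > 0."""
    num = np.log(x**2 + delta) - np.log(delta)
    den = np.log(1.0 + delta) - np.log(delta)
    return num / den

x = np.array([0.0, 0.1, 0.5, 1.0])
l1 = np.abs(x)                       # l1 reference curve
for delta in (1e-3, 1e-1, 1.0):
    print(delta, normalized_log_term(x, delta))

small = normalized_log_term(x, 1e-3)
```

For $\delta = 10^{-3}$ the curve lies above $|x|$ at the interior sample points, that is, it rises more sharply away from zero and is closer to the $\ell_0$ norm; for larger $\delta$ this is no longer the case near zero.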

Choice of $\lambda$. The choice of $\delta$ will influence the choice of $\lambda$. In fact, the smaller $\delta$, the larger $|\operatorname{logdet}(\delta)|$, hence to balance the two terms in the objective (3), $\lambda$ should be smaller. For the practical implementation, we will initialize $W^{(0)} = X(:, \mathcal{K})$, where $\mathcal{K}$ is computed with the successive nonnegative projection algorithm (SNPA), which can handle the rank-deficient separable NMF problem [15]. Note that SNPA also provides the matrix $H^{(0)}$ so as to minimize $\|X - W^{(0)}H^{(0)}\|_F^2$ while $H^{(0)}(:, j) \in \Delta^r$ for all $j$. Finally, we will choose

$$\lambda = \tilde{\lambda} \, \frac{\|X - W^{(0)}H^{(0)}\|_F^2}{|\operatorname{logdet}(W^{(0)T}W^{(0)} + \delta I)|},$$

where we recommend choosing $\tilde{\lambda}$ between $10^{-3}$ and $1$ depending on the noise level (the noisier the input matrix, the larger $\lambda$ should be).
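The scaling rule for $\lambda$ can be written in a few lines. The sketch below (Python with NumPy) is a minimal illustration: the function name `choose_lambda` is hypothetical, and the initial pair $(W^{(0)}, H^{(0)})$ is a crude stand-in (the first columns of a random $X$ and uniform weights) rather than the output of SNPA, which is not implemented here.

```python
import numpy as np

def choose_lambda(X, W0, H0, delta=0.1, lam_tilde=0.01):
    """lambda = lam_tilde * ||X - W0 H0||_F^2 / |logdet(W0^T W0 + delta I)|.
    Assumes the logdet term is non-zero."""
    r = W0.shape[1]
    fit = np.linalg.norm(X - W0 @ H0, "fro") ** 2
    _, logdet = np.linalg.slogdet(W0.T @ W0 + delta * np.eye(r))
    return lam_tilde * fit / abs(logdet)

rng = np.random.default_rng(0)
X = rng.random((4, 50))
W0 = X[:, :4]                    # stand-in for X(:, K); NOT computed by SNPA
H0 = np.full((4, 50), 0.25)      # feasible: nonnegative, columns sum to one
lam = choose_lambda(X, W0, H0, delta=0.1, lam_tilde=0.01)
lam_double = choose_lambda(X, W0, H0, delta=0.1, lam_tilde=0.02)
```

With this rule, $\tilde{\lambda}$ expresses the weight of the regularizer relative to the initial data-fitting error, so the same $\tilde{\lambda}$ can be reused across data sets of different scales.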

3. ALGORITHM FOR MIN-VOL NMF

Most algorithms for NMF optimize alternately over $W$ and $H$, and we adopt this strategy in this paper. For the update of $H$, we will use the projected fast gradient method (PFGM) from [15]. Note that, as opposed to previously proposed methods for min-vol NMF, we assume that the sum of the entries of each column of $H$ is smaller than or equal to one, not equal to one, which is more general. For the update of $W$, we use a PFGM applied on a strongly convex upper approximation of the objective function, similarly to what is done in [11], although in that paper the authors did not explicitly consider the case $W \geq 0$ ($W$ is unconstrained in their model) and did not write down explicitly a PFGM taking advantage of strong convexity. For the sake of completeness, we briefly recall this approach. The following upper bound for the logdet term holds: for any $Q \succ 0$ and $S \succ 0$, we have

$$\operatorname{logdet}(Q) \leq g(Q, S) = \operatorname{logdet}(S) + \operatorname{trace}\big(S^{-1}(Q - S)\big) = \operatorname{trace}\big(S^{-1}Q\big) + \operatorname{logdet}(S) - r.$$

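This follows from the concavity of $\operatorname{logdet}(\cdot)$, as $g(Q, S)$ is the first-order Taylor approximation of $\operatorname{logdet}(Q)$ around $S$; it has also been used for example in [16]. Taking $Q = W^TW + \delta I$ and $S = Z^TZ + \delta I$, this gives

$$\operatorname{logdet}(W^TW + \delta I) \leq \operatorname{trace}\big(Y (W^TW + \delta I)\big) + \operatorname{logdet}(Y^{-1}) - r$$

for any $W$ and any $Y = (Z^TZ + \delta I)^{-1}$ with $\delta > 0$, where the term $\delta \operatorname{trace}(Y)$ does not depend on $W$. Plugging this in the original objective function, and denoting by $w_i^T$ the $i$th row of matrix $W$ and by $\langle \cdot, \cdot \rangle$ the Frobenius inner product of two matrices, we obtain

$$\begin{aligned}
\ell(W) &= \|X - WH\|_F^2 + \lambda \operatorname{logdet}(W^TW + \delta I) \\
&= \|X\|_F^2 - 2 \langle XH^T, W \rangle + \langle W^TW, HH^T \rangle + \lambda \operatorname{logdet}(W^TW + \delta I) \\
&\leq \langle W^TW, HH^T + \lambda Y \rangle - 2 \langle C, W \rangle + b \\
&= 2 \sum_{i=1}^{m} \left( \frac{1}{2} w_i^T A w_i - c_i^T w_i \right) + b = \bar{\ell}(W),
\end{aligned}$$

where $Y = (Z^TZ + \delta I)^{-1}$ and $A = HH^T + \lambda Y$ are positive definite for $\delta, \lambda > 0$, $C = XH^T$ with $i$th row $c_i^T$, and $b$ is a constant independent of $W$. Note that $\bar{\ell}(W) = \ell(W)$ for $Z = W$.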
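The majorization property can be checked numerically by applying the bound $g(Q, S)$ with $Q = W^TW + \delta I$ and $S = Z^TZ + \delta I$. The sketch below (Python with NumPy; function names, sizes and random data are illustrative assumptions) evaluates the original objective and its upper bound at random points $W$, and at $W = Z$ where the two should coincide.

```python
import numpy as np

def logdet(M):
    """Log-determinant of a positive definite matrix."""
    return np.linalg.slogdet(M)[1]

def objective(X, W, H, lam, delta):
    r = W.shape[1]
    fit = np.linalg.norm(X - W @ H, "fro") ** 2
    return fit + lam * logdet(W.T @ W + delta * np.eye(r))

def majorizer(X, W, H, Z, lam, delta):
    """Objective with logdet(Q) replaced by g(Q, S) = trace(S^{-1} Q) + logdet(S) - r."""
    r = W.shape[1]
    S = Z.T @ Z + delta * np.eye(r)
    Q = W.T @ W + delta * np.eye(r)
    g = np.trace(np.linalg.solve(S, Q)) + logdet(S) - r
    return np.linalg.norm(X - W @ H, "fro") ** 2 + lam * g

rng = np.random.default_rng(0)
m, n, r, lam, delta = 4, 30, 4, 0.5, 0.1
X, H, Z = rng.random((m, n)), rng.random((r, n)), rng.random((m, r))

# Gap between upper bound and objective at 100 random points, and at W = Z.
gaps = [majorizer(X, W, H, Z, lam, delta) - objective(X, W, H, lam, delta)
        for W in (rng.random((m, r)) for _ in range(100))]
gap_at_Z = majorizer(X, Z, H, Z, lam, delta) - objective(X, Z, H, lam, delta)
```

All gaps are nonnegative and the gap at $W = Z$ vanishes, so decreasing the upper bound starting from $Z$ also decreases the original objective.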

Minimizing the upper bound $\bar{\ell}(W)$ of $\ell(W)$ requires solving $m$ independent strongly convex optimization problems with Hessian matrix $A$. Using PFGM on this problem, we obtain a linearly convergent method with rate $\frac{1 - \sqrt{\kappa^{-1}}}{1 + \sqrt{\kappa^{-1}}}$, where $\kappa$ is the condition number of $A$ [17]. Note that the subproblem in variable $H$ is not strongly convex when $W$ is rank deficient, in which case PFGM converges sublinearly, in $O(1/k^2)$, where $k$ is the iteration number. PFGM is an optimal first-order method in both cases [17], that is, no first-order method can have a faster convergence rate. When $W$ is rank deficient, we have $\frac{\lambda}{\delta} \leq L = \lambda_{\max}(A) \leq \|H\|_2^2 + \frac{\lambda}{\delta}$, where $L$ is the largest eigenvalue of $A$. This shows the importance of not choosing $\delta$ too small, since the smaller $\delta$, the larger the condition number of $A$, hence the slower PFGM will be. Note that $L$ is the Lipschitz constant of the gradient of the objective function and controls the stepsize, which is equal to $1/L$. Our proposed algorithm is summarized in Alg. 1. We will use 10 inner iterations for the PFGM on $W$ and $H$.
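A possible implementation of the $W$-update is sketched below (Python with NumPy). It uses a standard constant-momentum projected fast gradient scheme for strongly convex problems, with stepsize $1/L$ and momentum $\frac{1 - \sqrt{\kappa^{-1}}}{1 + \sqrt{\kappa^{-1}}}$. The function names and the random test data are assumptions made for illustration, and the sketch is not claimed to match the authors' Matlab code line by line.

```python
import numpy as np

def w_objective(U, A, C):
    """Value of (1/2) <U^T U, A> - <U, C>."""
    return 0.5 * np.sum((U.T @ U) * A) - np.sum(U * C)

def pfgm_update_W(W, A, C, n_inner=10):
    """Projected fast gradient steps on min_{U >= 0} (1/2)<U^T U, A> - <U, C>."""
    eigvals = np.linalg.eigvalsh(A)          # ascending, A symmetric
    mu, L = eigvals[0], eigvals[-1]
    q = np.sqrt(mu / L)                      # sqrt(1 / kappa)
    beta = (1.0 - q) / (1.0 + q)             # constant momentum
    U, Y = W.copy(), W.copy()
    for _ in range(n_inner):
        # Gradient at Y is Y A - C; project onto the nonnegative orthant.
        U_next = np.maximum(Y - (Y @ A - C) / L, 0.0)
        Y = U_next + beta * (U_next - U)
        U = U_next
    return U

rng = np.random.default_rng(0)
m, n, r, lam, delta = 5, 30, 4, 0.5, 0.1
X, W, H = rng.random((m, n)), rng.random((m, r)), rng.random((r, n))
A = H @ H.T + lam * np.linalg.inv(W.T @ W + delta * np.eye(r))
C = X @ H.T
W_new = pfgm_update_W(W, A, C, n_inner=500)

# Fixed-point residual of the projected gradient map (zero at the minimizer).
L = np.linalg.eigvalsh(A)[-1]
residual = np.linalg.norm(W_new - np.maximum(W_new - (W_new @ A - C) / L, 0.0))
```

The large number of inner iterations is used here only to check convergence of the subproblem; within the alternating scheme, a few inner iterations per update are used.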

Algorithm 1 Min-vol NMF using alternating PFGM

Require: Input matrix $X \in \mathbb{R}^{m \times n}_+$, the factorization rank $r$, $\delta > 0$, $\tilde{\lambda} > 0$, number of iterations maxiter.

Ensure: $(W, H)$ is an approximate solution of (3).

1: Initialize $(W, H)$ using SNPA [15].

2: Let $\lambda = \tilde{\lambda} \, \frac{\|X - WH\|_F^2}{|\operatorname{logdet}(W^TW + \delta I)|}$.

3: for $k = 1, 2, \dots,$ maxiter do

4:   % Update $W$

5:   Let $A = HH^T + \lambda (W^TW + \delta I)^{-1}$ and $C = XH^T$.

6:   Perform a few steps of PFGM on the problem $\min_{U \geq 0} \frac{1}{2} \langle U^TU, A \rangle - \langle U, C \rangle$, with initialization $U = W$. Set $W$ as the last iterate.

7:   % Update $H$

8:   Perform a few steps of PFGM on the problem $\min_{H(:,j) \in \Delta^r \, \forall j} \|X - WH\|_F^2$ as in [15].

9: end for
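The $H$-update in step 8 relies on projecting each column onto $\Delta^r$. The sketch below (Python with NumPy) shows one way to do this, using the well-known sort-based projection onto the unit simplex whenever the sum constraint is active, followed by an accelerated projected gradient loop with a FISTA-style momentum sequence. The function names are hypothetical, and the momentum scheme is an assumption that may differ from the exact PFGM of [15].

```python
import numpy as np

def project_capped_simplex(V):
    """Project each column of V onto {x >= 0, sum(x) <= 1}."""
    P = np.maximum(V, 0.0)
    # Columns violating the sum constraint are projected onto {x >= 0, sum(x) = 1}.
    for j in np.flatnonzero(P.sum(axis=0) > 1.0):
        v = V[:, j]
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1.0
        idx = np.arange(1, v.size + 1)
        rho = np.nonzero(u - css / idx > 0)[0][-1]
        theta = css[rho] / (rho + 1)
        P[:, j] = np.maximum(v - theta, 0.0)
    return P

def pfgm_update_H(X, W, H, n_inner=10):
    """Accelerated projected gradient on min (1/2)||X - W H||_F^2, H(:,j) in Delta^r."""
    L = np.linalg.norm(W, 2) ** 2            # Lipschitz constant of the gradient
    WtW, WtX = W.T @ W, W.T @ X
    Hk, Y, t = H.copy(), H.copy(), 1.0
    for _ in range(n_inner):
        H_next = project_capped_simplex(Y - (WtW @ Y - WtX) / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Y = H_next + ((t - 1.0) / t_next) * (H_next - Hk)
        Hk, t = H_next, t_next
    return Hk

V = np.array([[0.2, 2.0, 1.0, -1.0],
              [0.3, 0.0, 1.0, 0.5]])
P = project_capped_simplex(V)

W = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 1, 0], [1, 0, 0, 1]], dtype=float)
rng = np.random.default_rng(0)
X = W @ rng.dirichlet(np.ones(4), size=50).T
H0 = np.full((4, 50), 0.25)
H1 = pfgm_update_H(X, W, H0, n_inner=200)
```

The example uses the rank-deficient $W$ from (2): the subproblem in $H$ is then convex but not strongly convex, yet the iterates remain feasible and the fitting error decreases from its initial value.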

4. NUMERICAL EXPERIMENTS

We now apply our method on a synthetic and a real-world data set. All tests are performed using Matlab R2015a on a laptop with an Intel Core i7-7500U CPU @ 2.9 GHz and 24 GB of RAM. The code is available from http://bit.ly/minvolNMF.

Synthetic data set. Let us construct the matrix $X \in \mathbb{R}^{4 \times 500}$ as follows: $W$ is taken as the matrix from (2) so that $\operatorname{rank}(W) = 3 < r = 4$, and each column of $H$ is distributed using the Dirichlet distribution of parameter $(0.1, \dots, 0.1)$. Each column of $H$ with an entry larger than 0.8 is resampled until all its entries are at most 0.8. This guarantees that no data point is close to a column of $W$ (this is sometimes referred to as the purity index). Fig. 2 illustrates this geometric problem. As observed on Fig. 2, Alg. 1 is able to perfectly recover the true columns of $W$. For this experiment, we use $\tilde{\lambda} = 0.01$. Fig. 3 illustrates the same experiment where noise is added to $X = \max(0, WH + N)$, where $N = \epsilon \,$randn(m,n) in Matlab notation (i.i.d. Gaussian distribution of mean zero and standard deviation $\epsilon$). Note that the average of the entries of $X$ is 0.5 (each column is a linear combination of the columns of $W$, with weights summing to one). Fig. 3 displays the average over 20 randomly generated matrices $X$ of the relative error $d(W, \tilde{W}) = \frac{\|W - \tilde{W}\|_F}{\|W\|_F}$, where $\tilde{W}$ is the solution computed by Alg. 1, depending on the noise level $\epsilon$. This illustrates that min-vol NMF is robust against noise since $d(W, \tilde{W})$ is smaller than 1% for $\epsilon \leq 1\%$.

Fig. 2. Synthetic data set and recovery. (Only the first three entries of each four-dimensional vector are displayed.)

Fig. 3. Evolution of the recovery of the true $W$ depending on the noise $N = \epsilon \,$randn(m,n) using Alg. 1 ($\tilde{\lambda} = 0.01$, $\delta = 0.1$, maxiter = 100).
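The construction of the synthetic data set can be sketched as follows (Python with NumPy). The function name `sample_H`, the seed, and the noise level variable `eps` (set to 0.01 here) are assumptions made for illustration; the original experiments were run in Matlab.

```python
import numpy as np

def sample_H(r, n, alpha=0.1, max_entry=0.8, rng=None):
    """Draw n Dirichlet(alpha, ..., alpha) columns, resampling any column
    with an entry larger than max_entry until none remains."""
    rng = np.random.default_rng() if rng is None else rng
    H = rng.dirichlet(alpha * np.ones(r), size=n).T      # r-by-n, columns sum to one
    bad = np.flatnonzero(H.max(axis=0) > max_entry)
    while bad.size > 0:
        H[:, bad] = rng.dirichlet(alpha * np.ones(r), size=bad.size).T
        bad = bad[H[:, bad].max(axis=0) > max_entry]
    return H

rng = np.random.default_rng(0)
W = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 1]], dtype=float)                # matrix (2)
H = sample_H(4, 500, rng=rng)
X = W @ H                                                # noiseless data, 4-by-500

eps = 0.01                                               # assumed noise level
X_noisy = np.maximum(0.0, X + eps * rng.standard_normal(X.shape))
```

Since every column of $W$ sums to 2 and every column of $H$ sums to one, each column of the noiseless $X$ sums to 2, which gives the average entry value of 0.5 mentioned above.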

Multispectral image. The San Diego airport is a HYDICE hyperspectral image (HSI) containing 158 clean bands, and 400 × 400 pixels for each spectral image; see, e.g., [18]. There are mainly three types of materials: road surfaces, roofs and vegetation (trees and grass). The image can be well approximated using $r = 8$. Since we are interested in the case $\operatorname{rank}(W) < r$, we select $m = 5$ spectral bands using the successive projection algorithm [19] (this is essentially Gram-Schmidt with column pivoting) applied on $X^T$. This provides bands that are representative: the selected bands are 4, 32, 116, 128, 150. Hence, we are factoring a 5-by-160000 matrix using $r = 8$. Note that we have removed outlying pixels (some spectra contain large negative entries while others have a norm orders of magnitude larger than most pixels). Fig. 4 displays the abundance maps extracted (that is, the rows of matrix $H$): they correspond to meaningful locations of materials. Here we have used $\tilde{\lambda} = 0.1$ and 1000 iterations. From the initial solution provided by SNPA, min-vol NMF is able to reduce the error $\|X - WH\|_F$ by a factor of 11.7 while the term $\operatorname{logdet}(W^TW + \delta I)$ only increases by a factor of 1.06. The final relative error is $\frac{\|X - WH\|_F}{\|X\|_F} = 0.2\%$.
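The band selection step can be illustrated with a basic version of the successive projection algorithm, which greedily picks the column with the largest residual norm and projects the remaining columns onto its orthogonal complement. The sketch below (Python with NumPy) uses a hypothetical function name `spa` and random stand-in data, since the San Diego image is not reproduced here; it is not the implementation used for the experiments.

```python
import numpy as np

def spa(M, k):
    """Select k column indices of M by successive projection
    (Gram-Schmidt with column pivoting)."""
    R = np.array(M, dtype=float)
    selected = []
    for _ in range(k):
        j = int(np.argmax(np.sum(R**2, axis=0)))   # column with largest residual norm
        u = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(u, u @ R)                    # remove the component along u
        selected.append(j)
    return selected

rng = np.random.default_rng(0)

# Separable toy example: the first three columns are the vertices.
W = rng.random((10, 3))
H = np.hstack([np.eye(3), rng.dirichlet(np.ones(3), size=20).T])
vertices = spa(W @ H, 3)

# Band selection: the columns of X^T are the bands (rows of X).
X = rng.random((20, 50))       # stand-in for a band-by-pixel matrix
bands = spa(X.T, 5)
```

Applied to $X^T$, the selected indices refer to rows of $X$, that is, to spectral bands, and the reduced data matrix is obtained by keeping only those rows.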

Fig. 4. Abundance maps extracted by min-vol NMF using only five bands of the San Diego airport HSI. From left to right, top to bottom: vegetation (grass and trees), three different types of roof tops, four different types of road surfaces.

5. CONCLUSION

In this paper, we have shown that min-vol NMF can be used meaningfully for rank-deficient NMFs. We have provided a simple algorithm to tackle this problem and have illustrated the behaviour of the method on synthetic and real-world data sets. This work is only preliminary and many important questions remain open; in particular:

• Under which conditions can we prove the identifiability of min-vol NMF in the rank-deficient case (as done in [2, 3] for the full-rank case)? Intuitively, it seems that a condition similar to the sufficiently scattered condition would be sufficient, but this has to be analysed thoroughly.

• Can we prove robustness to noise of such techniques? (The question is also open for the full-rank case.)

• Can we design faster and more robust algorithms, and algorithms taking advantage of the fact that the solution is rank deficient?

6. REFERENCES

[1] Xiao Fu, Kejun Huang, Nicholas D. Sidiropoulos, and Wing-Kin Ma, “Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications,” IEEE Signal Processing Magazine, 2018, to appear.

[2] Chia-Hsiang Lin, Wing-Kin Ma, Wei-Chiang Li, Chong-Yung Chi, and ArulMurugan Ambikapathi, “Identifiability of the simplex volume minimization criterion for blind hyperspectral unmixing: The no-pure-pixel case,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 10, pp. 5530–5546, 2015.

[3] Xiao Fu, Wing-Kin Ma, Kejun Huang, and Nicholas D. Sidiropoulos, “Blind separation of quasi-stationary sources: Exploiting convex geometry in covariance domain,” IEEE Transactions on Signal Processing, vol. 63, no. 9, pp. 2306–2320, 2015.

[4] Xiao Fu, Kejun Huang, and Nicholas D. Sidiropoulos, “On identifiability of nonnegative matrix factorization,” IEEE Signal Processing Letters, vol. 25, no. 3, pp. 328–332, 2018.

[5] Maurice D. Craig, “Minimum-volume transforms for remotely sensed data,” IEEE Transactions on Geoscience and Remote Sensing, vol. 32, no. 3, pp. 542–552, 1994.

[6] Wing-Kin Ma, José M. Bioucas-Dias, Tsung-Han Chan, Nicolas Gillis, Paul Gader, Antonio J. Plaza, ArulMurugan Ambikapathi, and Chong-Yung Chi, “A signal processing perspective on hyperspectral unmixing: Insights from remote sensing,” IEEE Signal Processing Magazine, vol. 31, no. 1, pp. 67–81, 2014.

[7] Sanjeev Arora, Rong Ge, Ravindran Kannan, and Ankur Moitra, “Computing a nonnegative matrix factorization–provably,” in Proceedings of the forty-fourth annual ACM symposium on Theory of computing. ACM, 2012, pp. 145–162.

[8] Nicolas Gillis, “Introduction to nonnegative matrix factorization,” SIAG/OPT Views and News, vol. 25, no. 1, pp. 7–16, 2017.

[9] Stephen A. Vavasis, “On the complexity of nonnegative matrix factorization,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2010.

[10] Lidan Miao and Hairong Qi, “Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 3, pp. 765–777, 2007.

[11] Xiao Fu, Kejun Huang, Bo Yang, Wing-Kin Ma, and Nicholas D. Sidiropoulos, “Robust volume minimization-based matrix factorization for remote sensing and document clustering,” IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6254–6268, 2016.

[12] Andersen M.S. Ang and Nicolas Gillis, “Volume regularized non-negative matrix factorizations,” in 2018 Workshop on Hyperspectral Image and Signal Processing: Evolution in Remote Sensing (WHISPERS), 2018.

[13] Maryam Fazel, Matrix rank minimization with applications, Ph.D. thesis, Stanford University, 2002.

[14] Maryam Fazel, Haitham Hindi, and Stephen P. Boyd, “Log-det heuristic for matrix rank minimization with applications to Hankel and Euclidean distance matrices,” in Proceedings of the 2003 American Control Conference. IEEE, 2003, vol. 3, pp. 2156–2162.

[15] Nicolas Gillis, “Successive nonnegative projection algorithm for robust nonnegative blind source separation,” SIAM Journal on Imaging Sciences, vol. 7, no. 2, pp. 1420–1450, 2014.

[16] Kazuyoshi Yoshii, Ryota Tomioka, Daichi Mochihashi, and Masataka Goto, “Beyond NMF: Time-domain audio source separation without phase reconstruction,” in ISMIR, 2013, pp. 369–374.

[17] Yurii Nesterov, Introductory lectures on convex optimization: A basic course, vol. 87, Springer Science & Business Media, 2013.

[18] Nicolas Gillis, Da Kuang, and Haesun Park, “Hierarchical clustering of hyperspectral images using rank-two nonnegative matrix factorization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2066–2078, 2015.

[19] Nicolas Gillis and Stephen A. Vavasis, “Fast and robust recursive algorithms for separable nonnegative matrix factorization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 4, pp. 698–714, 2014.