Inertial Majorization-Minimization Algorithm for Minimum-Volume NMF

Olivier Vu Thanh¹, Andersen Ang²,¹, Nicolas Gillis¹, Le Thi Khanh Hien¹
¹Department of Mathematics and Operational Research, Faculté Polytechnique, Université de Mons,
Rue de Houdain 9, 7000 Mons, Belgium
{olivier.vuthanh, manshun.ang, nicolas.gillis, thikhanhhien.le}
²Department of Combinatorics and Optimization, Faculty of Mathematics, University of Waterloo, Canada
Abstract—Nonnegative matrix factorization with the
minimum-volume criterion (min-vol NMF) guarantees that,
under some mild and realistic conditions, the factorization has
an essentially unique solution. This result has been successfully
leveraged in many applications, including topic modeling,
hyperspectral image unmixing, and audio source separation. In
this paper, we propose a fast algorithm to solve min-vol NMF
which is based on a recently introduced block majorization-
minimization framework with extrapolation steps. We illustrate
the effectiveness of our new algorithm compared to the state of
the art on several real hyperspectral images and document data sets.
Index Terms—nonnegative matrix factorization, minimum volume, fast gradient method, majorization-minimization, hyperspectral imaging
Nonnegative Matrix Factorization (NMF) has been an active
field of research since the seminal paper by Lee and Seung [1].
The success of NMF comes from many specific applications
since many types of data are nonnegative; for example ampli-
tude spectrograms in audio source separation, images, evalua-
tions in recommendation systems, and documents represented
by vectors of word counts; see [2] and the references therein.
Compared to other unconstrained factorization models such
as PCA/SVD, NMF requires the factors to be nonnegative.
This constraint naturally leads to factors that are more easily
interpretable [1]. Nonetheless, there are two drawbacks with
NMF: computability and identifiability.
Computability. As opposed to PCA/SVD, solving NMF is
NP-hard in general [3]. Hence most NMF algorithms rely
on standard non-linear optimization schemes without global
optimality guarantee.
Identifiability. NMF solutions are typically not unique, that is,
they are not unique even after removing the trivial scaling and
permutation ambiguities of the rank-one factors; see [4] and
the references therein. For NMF to have a unique solution, also
known as identifiability, one needs to add additional structure
to the sought solution. One way to ensure identifiability is the
min-vol criterion, which minimizes the volume of one of the factors. If the sufficiently scattered condition (SSC) is satisfied, then identifiability holds for min-vol NMF [5]–[7].
NG and LTKH acknowledge the support of the European Research Council (ERC starting grant no 679515), the Fonds de la Recherche Scientifique - FNRS, and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS project O005318F-RG47.
The names of the last three authors are in alphabetical order.
Identifiability for min-vol NMF is a strong result that has
been used successfully in many applications such as topic
modeling and hyperspectral imaging [8], and audio source sep-
aration [7]. However, min-vol NMF is computationally hard to
solve. In this paper, after introducing the considered min-vol
NMF model in Section II, we propose a fast method to solve
min-vol NMF in Section III. Our method is an application of
a recent inertial block majorization-minimization framework
called TITAN [9]. Experimental results on real data sets show
that the proposed method performs better than the state of the
art; see Section IV.
In the noiseless case, the exact NMF model is M = WH, where M ∈ R_+^{m×n} denotes the measured data and W ∈ R^{m×r} (resp. H ∈ R_+^{r×n}) denotes the left factor (resp. the right factor). The idea behind the min-vol criterion, a.k.a. Craig's belief [10], is that the convex hull spanned by the columns of W, denoted conv(W), should embrace all the data points as tightly as possible. In the absence of noise, min-vol NMF is formulated as follows:

min_{W, H} det(W^T W)   (1a)
such that M = WH, H ≥ 0,   (1b)
1^T H = 1^T,   (1c)

where 1 is a vector of appropriate size containing only ones.
The constraint (1c) ensures that every data point lies within the convex hull spanned by the columns of W, that is, M(:, j) ∈ conv(W) for all j. The volume of the convex hull of the columns of W and the origin, within the subspace spanned by the columns of W, is proportional to det(W^T W); see for example [5]. Under the sufficiently scattered condition (SSC), which requires the columns of M to be sufficiently spread within conv(W) or, equivalently, that H is sufficiently sparse, min-vol NMF has an essentially unique solution [5], [6]. A drawback of (1c) is that it requires the entries in each column of H to sum to one, which is not without loss of generality: it imposes that the columns of M belong to the convex hull of the columns of W, as opposed to the conical hull when the equality constraints of (1c) are absent; see for example [2, Chapter 4].
It was recently shown that the same model where the constraint 1^T H = 1^T is replaced with 1^T W = 1^T retains identifiability [7]. The sum-to-one constraint on the columns of W, that is, 1^T W = 1^T, can be assumed w.l.o.g. via the scaling ambiguity of the rank-one factors W(:, k)H(k, :) in any NMF decomposition. Moreover, the model with the constraint on W was shown to be numerically much more stable, as it makes W better conditioned; this is important because computing the derivative of det(W^T W) requires computing the inverse of W^T W. We refer the interested reader to [2, Chapter 4.3.3] for a discussion of these models.
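The w.l.o.g. claim above can be checked numerically: rescaling each rank-one factor W(:, k)H(k, :) leaves the product WH unchanged while making the columns of W sum to one. A minimal NumPy sketch (the variable names are ours, not from the paper's code):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 8, 3
W = rng.random((m, r)) + 0.1   # keep columns away from zero
H = rng.random((r, n))
M = W @ H

# Scaling ambiguity: W(:,k)H(k,:) = (W(:,k)/d_k)(d_k H(k,:)) for any d_k > 0.
# Choosing d_k as the sum of column k of W enforces 1^T W = 1^T.
d = W.sum(axis=0)
W_scaled = W / d           # each column of W now sums to one
H_scaled = d[:, None] * H  # the rows of H absorb the scaling

assert np.allclose(W_scaled.sum(axis=0), 1.0)
assert np.allclose(W_scaled @ H_scaled, M)
```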
In the presence of noise, min-vol NMF is typically formulated via penalization. In this paper, we consider the following min-vol NMF model:

min_{W ≥ 0, 1^T W = 1^T, H ≥ 0} (1/2) ||M − WH||_F^2 + (λ/2) log det(W^T W + δ I_r),   (2)

where ||·||_F is the Frobenius norm, λ > 0 is a parameter balancing the two terms in the objective function, I_r is the r×r identity matrix, and δ > 0 is a small parameter that prevents log det(W^T W) from going to −∞ if W is rank deficient [11]. Using the logarithm of the determinant makes the penalty less sensitive to very disparate singular values of W, leading to better performance in practice [8], [12].
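For concreteness, the objective of (2) can be evaluated as in the following sketch; this is illustrative code of ours, not the authors' implementation, and `slogdet` is used for numerical stability (the matrix W^T W + δI_r is positive definite, so the sign returned by `slogdet` is 1):

```python
import numpy as np

def minvol_objective(M, W, H, lam, delta):
    """Objective of (2): 0.5*||M - WH||_F^2 + (lam/2)*log det(W^T W + delta*I_r)."""
    r = W.shape[1]
    fit = 0.5 * np.linalg.norm(M - W @ H, "fro") ** 2
    _, logdet = np.linalg.slogdet(W.T @ W + delta * np.eye(r))
    return fit + 0.5 * lam * logdet

rng = np.random.default_rng(1)
W = rng.random((5, 3))
H = rng.random((3, 7))
M = W @ H                                   # exact factorization: fit term is zero
val = minvol_objective(M, W, H, lam=0.1, delta=0.1)
```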
Applications. In hyperspectral unmixing (HU), each column of M contains the spectral reflectance of a pixel, each row of M corresponds to the reflectance of a spectral band across all pixels, each column of W is the spectral signature of an endmember (a pure material in the image), and each column of H contains the proportion of each identified pure material in the corresponding pixel; see [13]. Geometrically, the min-vol NMF in (2) applied to HU consists of finding endmembers such that the convex hull spanned by them and the origin embraces every pixel of M as tightly as possible. This is the so-called Craig's belief [10]. In document classification, M is a word-by-document matrix, so that the columns of W correspond to topics (that is, sets of words found simultaneously in several documents) while the columns of H assign each document to the topics it discusses [8].
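This geometry can be illustrated on synthetic data: when the columns of H lie on the unit simplex, every pixel (column of M) is a convex combination of the endmember signatures, so each entry of M lies between the per-band minimum and maximum of W. A sketch with synthetic values (not real hyperspectral data):

```python
import numpy as np

rng = np.random.default_rng(2)
bands, pixels, r = 10, 50, 4
W = rng.random((bands, r))       # endmember spectral signatures
A = rng.random((r, pixels))
H = A / A.sum(axis=0)            # abundances: each column on the unit simplex
M = W @ H                        # pixels = convex combinations of endmembers

# Each M(i, j) is a convex combination of W(i, 1), ..., W(i, r).
assert np.all(M >= W.min(axis=1, keepdims=True) - 1e-12)
assert np.all(M <= W.max(axis=1, keepdims=True) + 1e-12)
```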
As far as we know, all algorithms for min-vol NMF rely
on two-block coordinate descent methods that update each
block (Wor H) by using some outer optimization algorithm
to solve the subproblems formed by restricting the min-vol
NMF problem to each block. For example, the state-of-the-art method from [11] uses Nesterov's fast gradient method to update each factor matrix, one at a time.
Our proposed algorithm for (2) is based on the TITAN framework from [9]. TITAN is an inertial block majorization-minimization framework for nonsmooth nonconvex optimization. It updates one block at a time while fixing the values of the other blocks, as previous min-vol NMF algorithms do. To update a block, TITAN chooses a block surrogate function for the corresponding objective function (a.k.a. a majorizer), embeds an inertial term in this surrogate function, and then minimizes the resulting inertial surrogate function.
When a Lipschitz gradient surrogate is used, TITAN reduces
to the Nesterov-type accelerated gradient descent step for each
block of variables [9, Section 4.2]. The difference of TITAN
compared to previous min-vol NMF algorithms is threefold:
1) The inertial force (also known as extrapolation, or momentum) is used between block updates. This is a crucial aspect that makes our proposed algorithm faster: when we start the update of a block of variables (here, W or H), we can use the inertial force (using the previous iterate) even though the other blocks have been updated in the meantime.
2) TITAN allows the surrogate to be updated after each update of W and H, which was not possible with the algorithm from [11] because the latter applies a fast gradient method from convex optimization to a fixed surrogate.
3) It has a subsequential convergence guarantee, that is, every limit point of the generated sequence is a stationary point of Problem (2). Note that the state-of-the-art algorithm from [11] does not have convergence guarantees.
Remark. The block prox-linear (BPL) method from [14] can be used to solve (2) since the block functions W ↦ (1/2)||M − WH||_F^2 and H ↦ (1/2)||M − WH||_F^2 have Lipschitz continuous gradients. However, BPL applies extrapolation to the Lipschitz gradient surrogate of these block functions and requires computing the proximal point of the regularizer (λ/2) log det(W^T W + δ I_r), which does not have a closed form. In contrast, TITAN applies extrapolation to the surrogate function of W ↦ f(W, H) built with a surrogate function for the regularizer (λ/2) log det(W^T W + δ I_r) (see Section III-A1). This allows TITAN to have closed-form solutions for the subproblems, an acceleration effect, and a convergence guarantee.
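The surrogate for the regularizer relies on the concavity of log det over positive definite matrices: the tangent plane at any point is a global upper bound. This inequality can be verified numerically (an illustrative check of ours, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(3)
r, delta = 4, 0.1

def random_gram(m, r, delta):
    """Random positive definite matrix of the form X^T X + delta*I_r."""
    X = rng.random((m, r))
    return X.T @ X + delta * np.eye(r)

A = random_gram(8, r, delta)
B = random_gram(8, r, delta)

def logdet(X):
    return np.linalg.slogdet(X)[1]

# Concavity of log det: log det(A) <= log det(B) + <B^{-1}, A - B>.
upper = logdet(B) + np.trace(np.linalg.inv(B) @ (A - B))
gap = upper - logdet(A)   # nonnegative up to rounding error
```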
A. Surrogate functions
An important step of TITAN is to define a surrogate function for each block of variables. These surrogate functions are upper approximations of the objective function at the current iterate. Denote

f(W, H) = (1/2) ||M − WH||_F^2 + (λ/2) log det(W^T W + δ I_r),

and suppose we are cyclically updating (W, H). Let us denote by u_{W_k}(W) the surrogate function of W ↦ f(W, H_k) used to update W_k, that is,

f(W, H_k) ≤ u_{W_k}(W) for all W ∈ X_W,   (3)

where u_{W_k}(W_k) = f(W_k, H_k) and X_W is the feasible domain of W. Similarly, let us denote by u_{H_k}(H) the surrogate function of H ↦ f(W_{k+1}, H) used to update H_k, that is,

f(W_{k+1}, H) ≤ u_{H_k}(H) for all H ∈ X_H,   (4)

where u_{H_k}(H_k) = f(W_{k+1}, H_k) and X_H is the feasible domain of H.
1) Surrogate function and update of W: Denote A = W^T W + δ I_r, B_k = W_k^T W_k + δ I_r, and P_k = (B_k)^{−1}. Since log det is concave, its first-order Taylor expansion around B_k leads to log det(A) ≤ log det(B_k) + ⟨(B_k)^{−1}, A − B_k⟩. Hence

f(W, H_k) ≤ f̃_{W_k}(W) := (1/2) ||M − W H_k||_F^2 + (λ/2) ⟨P_k, W^T W⟩ + C_1,   (5)

where C_1 is a constant independent of W. Note that the gradient of W ↦ f̃_{W_k}(W), being equal to

∇ f̃_{W_k}(W) = W (H_k H_k^T + λ P_k) − M H_k^T,

is L_W^k-Lipschitz continuous with L_W^k = ||H_k H_k^T + λ P_k||_2. Hence, from (5) and the descent lemma (see [15, Section 2.1]),

f(W, H_k) ≤ u_{W_k}(W) := ⟨∇ f̃_{W_k}(W_k), W − W_k⟩ + (L_W^k / 2) ||W − W_k||_F^2 + C_2,   (6)

where C_2 is a constant depending on W_k. We use the surrogate u_{W_k}(W) defined in (6) to update W_k. As TITAN recovers Nesterov-type acceleration for the update of each block of variables [9, Section 4.2], we have the following update for W:

W_{k+1} = argmin_{W ∈ X_W} ⟨∇ f̃_{W_k}(W̄_k), W⟩ + (L_W^k / 2) ||W − W̄_k||_F^2 = P( W̄_k − (1/L_W^k) ∇ f̃_{W_k}(W̄_k) ),   (7)

where P performs column-wise projections onto the unit simplex as in [16] in order to satisfy the constraint on W in (2), and where W̄_k is an extrapolated point, that is, the current point W_k plus some momentum,

W̄_k = W_k + β_W^k (W_k − W_{k−1}),   (8)

where the extrapolation parameter β_W^k is chosen as follows:

β_W^k = min( (α_{k−1} − 1)/α_k , 0.9999 √(L_W^{k−1} / L_W^k) ),   (9)

with α_0 = 1 and α_k = (1 + √(1 + 4 α_{k−1}^2))/2. This choice of parameter satisfies the conditions required for the subsequential convergence of TITAN; see Section III-C.
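One W-update can be sketched in NumPy as follows. The simplex projection below is a standard sort-based routine standing in for the method of [16], and the extrapolation weight `beta` and the problem sizes are set arbitrarily for illustration; this is not the authors' code:

```python
import numpy as np

def project_columns_to_simplex(X):
    """Project each column of X onto the unit simplex (sort-based routine)."""
    d, n = X.shape
    U = -np.sort(-X, axis=0)                           # columns sorted descending
    css = np.cumsum(U, axis=0) - 1.0
    rho = np.sum(U - css / np.arange(1, d + 1)[:, None] > 0, axis=0)
    theta = css[rho - 1, np.arange(n)] / rho           # per-column threshold
    return np.maximum(X - theta, 0.0)

rng = np.random.default_rng(4)
m, n, r, lam, delta = 8, 12, 3, 1e-2, 1e-2
M = rng.random((m, n))
H = rng.random((r, n))
W = project_columns_to_simplex(rng.random((m, r)))      # current iterate W_k
W_old = project_columns_to_simplex(rng.random((m, r)))  # previous iterate W_{k-1}
beta = 0.5                                              # illustrative weight

P = np.linalg.inv(W.T @ W + delta * np.eye(r))          # P_k = (W_k^T W_k + delta I)^{-1}
L_W = np.linalg.norm(H @ H.T + lam * P, 2)              # Lipschitz constant L_W^k
W_bar = W + beta * (W - W_old)                          # extrapolated point, cf. (8)
grad = W_bar @ (H @ H.T + lam * P) - M @ H.T            # gradient of the surrogate
W_new = project_columns_to_simplex(W_bar - grad / L_W)  # update, cf. (7)
```

The output satisfies the constraint on W in (2): nonnegative columns summing to one.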
2) Surrogate function and update of H: Since

∇_H f(W_{k+1}, H) = W_{k+1}^T (W_{k+1} H − M),

the gradient of f with respect to H is L_H^k-Lipschitz continuous with L_H^k = ||W_{k+1}^T W_{k+1}||_2. Hence, we use the following Lipschitz gradient surrogate to update H_k:

u_{H_k}(H) = ⟨∇_H f(W_{k+1}, H_k), H − H_k⟩ + (L_H^k / 2) ||H − H_k||_F^2 + C_3,   (10)

where C_3 is a constant depending on H_k. We derive our update rule for H by minimizing the surrogate function from Equation (10) embedded with extrapolation,

H_{k+1} = [ H̄_k + (1/L_H^k) W_{k+1}^T (M − W_{k+1} H̄_k) ]_+,   (11)

where [·]_+ denotes the projector setting all negative values to zero, and H̄_k is the extrapolated H_k:

H̄_k = H_k + β_H^k (H_k − H_{k−1}),   (12)

where, as for the update of W,

β_H^k = min( (α_{k−1} − 1)/α_k , 0.9999 √(L_H^{k−1} / L_H^k) ).   (13)
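The H-update (11) is a projected gradient step with step size 1/L_H^k. With the extrapolation weight set to zero it reduces to a plain projected gradient step on the convex fit term, which cannot increase the fit. A sketch under that simplification (our illustrative code, arbitrary problem sizes):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r = 8, 12, 3
M = rng.random((m, n))
W = rng.random((m, r))
H = rng.random((r, n))

L_H = np.linalg.norm(W.T @ W, 2)        # L_H^k = ||W^T W||_2 (spectral norm)
H_bar = H                               # beta_H = 0: no extrapolation
H_new = np.maximum(H_bar + W.T @ (M - W @ H_bar) / L_H, 0.0)   # update (11)

fit_before = np.linalg.norm(M - W @ H, "fro")
fit_after = np.linalg.norm(M - W @ H_new, "fro")
```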
B. Algorithm
Note that the updates of W in (7) and of H in (11) were described for the cyclic update rule. Since TITAN also allows the essentially cyclic rule [9, Section 5], we can update W several times before switching to updating H, and vice versa. This leads to our proposed method, TITANized min-vol; see Algorithm 1 for the pseudocode. The stopping criteria in lines 4 and 15 are the same as in [11]. The way λ and δ are computed is also identical to [11]. Let us mention that, technically, the main difference with [11] resides in how the extrapolation is embedded. In [11], the Nesterov sequence is restarted and evolves within each inner loop to solve the subproblem corresponding to each block. In our algorithm, the extrapolation parameter β_W (resp. β_H) for updating the block W (resp. H) is updated continuously, without restarting. This means we are accelerating the global convergence of the sequence rather than trying to accelerate the convergence of each subproblem. Moreover, TITAN allows the surrogate function to be updated at each step, while the algorithm from [11] can only update it before each subproblem is solved, as it relies on Nesterov's acceleration for convex optimization.
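The contrast between a continuously evolving extrapolation sequence and a restarted one can be made concrete with a small sketch (illustrative Python, not the authors' MATLAB implementation; the pass and step counts are arbitrary):

```python
import numpy as np

def nesterov_alphas(num_steps, alpha0=1.0):
    """Nesterov sequence: alpha_k = (1 + sqrt(1 + 4*alpha_{k-1}^2)) / 2."""
    a = [alpha0]
    for _ in range(num_steps):
        a.append((1 + np.sqrt(1 + 4 * a[-1] ** 2)) / 2)
    return np.array(a)

# One continuous sequence over 3 passes of 5 updates (as in TITANized min-vol)
# versus restarting the sequence at every pass (as in the inner loops of [11]).
continuous = nesterov_alphas(15)
restarted = np.concatenate([nesterov_alphas(5) for _ in range(3)])

betas = (continuous[:-1] - 1) / continuous[1:]   # first argument of the min in (9)
```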
C. Convergence guarantee
In order to have a convergence guarantee, TITAN requires the update of each block to satisfy the nearly sufficiently decreasing property (NSDP); see [9, Section 2]. By [9, Section 4.2.1], the update of H in TITANized min-vol satisfies the NSDP condition since it uses a Lipschitz gradient surrogate for H ↦ f(W, H) combined with Nesterov-type extrapolation, and the bounds on the extrapolation parameters in the update of H are derived similarly as in [9, Section 6.1]. However, it is important to note that the update of W in TITANized min-vol does not directly use a Lipschitz gradient surrogate for W ↦ f(W, H). We thus need to verify the NSDP condition for the update of W by another method, which is presented in the following.
The function u_{W_k}(W) is a Lipschitz gradient surrogate of f̃_{W_k}(W), and we apply the Nesterov-type extrapolation to
Algorithm 1 TITANized min-vol
1: initialize W_0 and H_0
2: α_1 = 1, α_2 = 1, W_old = W_0, H_old = H_0, L_W^prev and L_H^prev initialized as in lines 7 and 14 at (W_0, H_0)
3: repeat
4:   while stopping criteria not satisfied do
5:     α_0 = α_1, α_1 = (1 + √(1 + 4 α_0^2))/2
6:     P ← (W^T W + δ I_r)^{−1}
7:     L_W ← ||H H^T + λ P||_2
8:     β_W = min( (α_0 − 1)/α_1 , 0.9999 √(L_W^prev / L_W) )
9:     W̄ ← W + β_W (W − W_old)
10:    W_old ← W
11:    W ← P( W̄ + (M H^T − W̄ (H H^T + λ P)) / L_W )
12:    L_W^prev ← L_W
13:  end while
14:  L_H ← ||W^T W||_2
15:  while stopping criteria not satisfied do
16:    α_0 = α_2, α_2 = (1 + √(1 + 4 α_0^2))/2
17:    β_H = min( (α_0 − 1)/α_2 , 0.9999 √(L_H^prev / L_H) )
18:    H̄ ← H + β_H (H − H_old)
19:    H_old ← H
20:    H ← [ H̄ + W^T (M − W H̄) / L_H ]_+
21:    L_H^prev ← L_H
22:  end while
23: until some stopping criterion is satisfied
obtain the update in (7). Note that the feasible set of W is convex. Hence, it follows from [9, Remark 4.1] that

f̃_{W_k}(W_k) + L_W^k (β_W^k)^2 ||W_k − W_{k−1}||^2 ≥ f̃_{W_k}(W_{k+1}) + (L_W^k / 2) ||W_{k+1} − W_k||^2.   (14)

Furthermore, we note that f̃_{W_k}(W_k) = f(W_k, H_k) and f̃_{W_k}(W_{k+1}) ≥ f(W_{k+1}, H_k). Therefore, from (14) we have

f(W_k, H_k) + L_W^k (β_W^k)^2 ||W_k − W_{k−1}||^2 ≥ f(W_{k+1}, H_k) + (L_W^k / 2) ||W_{k+1} − W_k||^2,   (15)

which is the required NSDP condition of TITAN. Consequently, the choice of β_W^k in (9) satisfies the required conditions to guarantee subsequential convergence [9, Proposition 3.1].
On the other hand, we note that the error function W ↦ e_1(W) := u_{W_k}(W) − f(W, H_k) is continuously differentiable and ∇e_1(W_k) = 0; similarly for the error function H ↦ e_2(H) := u_{H_k}(H) − f(W_{k+1}, H). Hence, it follows from [9, Lemma 2.3] that Assumption 2.2 in [9] is satisfied. Applying [9, Theorem 3.2], we conclude that every limit point of the generated sequence is a stationary point of Problem (2). It is worth noting that, as TITANized min-vol does not apply a restarting step, [9, Theorem 3.5] on global convergence is not applicable.
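Putting the pieces together, the following NumPy sketch mimics Algorithm 1 under simplifying assumptions we introduce for illustration: fixed inner/outer iteration counts instead of the stopping criteria of [11], arbitrary small values of λ and δ rather than the settings of [11], and a sort-based simplex projection standing in for the method of [16]. It is not the authors' implementation:

```python
import numpy as np

def simplex_cols(X):
    """Project each column of X onto the unit simplex (sort-based routine)."""
    d, n = X.shape
    U = -np.sort(-X, axis=0)
    css = np.cumsum(U, axis=0) - 1.0
    rho = np.sum(U - css / np.arange(1, d + 1)[:, None] > 0, axis=0)
    theta = css[rho - 1, np.arange(n)] / rho
    return np.maximum(X - theta, 0.0)

def titanized_minvol(M, r, lam=1e-3, delta=1e-2, outer=30, inner=5, seed=0):
    """Sketch of Algorithm 1 with fixed iteration counts (illustrative)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    W = simplex_cols(rng.random((m, r)))
    H = rng.random((r, n))
    W_old, H_old = W.copy(), H.copy()
    a1 = a2 = 1.0
    LW_prev = LH_prev = None
    for _ in range(outer):
        for _ in range(inner):                       # W-updates (lines 4-13)
            a0, a1 = a1, (1 + np.sqrt(1 + 4 * a1 ** 2)) / 2
            P = np.linalg.inv(W.T @ W + delta * np.eye(r))
            LW = np.linalg.norm(H @ H.T + lam * P, 2)
            bW = (a0 - 1) / a1
            if LW_prev is not None:
                bW = min(bW, 0.9999 * np.sqrt(LW_prev / LW))
            Wb = W + bW * (W - W_old)                # extrapolated point
            W_old = W
            W = simplex_cols(Wb + (M @ H.T - Wb @ (H @ H.T + lam * P)) / LW)
            LW_prev = LW
        LH = np.linalg.norm(W.T @ W, 2)
        for _ in range(inner):                       # H-updates (lines 15-22)
            a0, a2 = a2, (1 + np.sqrt(1 + 4 * a2 ** 2)) / 2
            bH = (a0 - 1) / a2
            if LH_prev is not None:
                bH = min(bH, 0.9999 * np.sqrt(LH_prev / LH))
            Hb = H + bH * (H - H_old)                # extrapolated point
            H_old = H
            H = np.maximum(Hb + W.T @ (M - W @ Hb) / LH, 0.0)
            LH_prev = LH
    return W, H

rng = np.random.default_rng(7)
M = simplex_cols(rng.random((10, 3))) @ rng.random((3, 20))  # exactly factorizable
W, H = titanized_minvol(M, r=3)
rel_err = np.linalg.norm(M - W @ H, "fro") / np.linalg.norm(M, "fro")
```

On this small exactly factorizable instance the relative error drops well below that of the random initialization, and the iterates satisfy the constraints of (2) by construction.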
In this section, we compare TITANized min-vol to [11], an accelerated version of the method from [8] (for p = 2), on two NMF applications, hyperspectral unmixing and document clustering, which involve dense and sparse data sets, respectively. All tests are performed in MATLAB R2018a on a PC with an Intel® Core™ i7 6700HQ and 24 GB of RAM. The code is available from
The data sets used are shown in Table I. For each data set, each algorithm is launched with the same random initializations, for the same amount of CPU time. In order to derive some statistics, for both hyperspectral unmixing and document clustering, 20 random initializations are used (each entry of W and H is drawn from the uniform distribution on [0, 1]). The CPU time used for each data set is adjusted manually, and corresponds to the maximum displayed value on the respective time axes in Fig. 1; see also Table II.
data set          m       n      r
Urban           162    94249     6
Indian Pines    200    21025    16
Pavia Univ.     103   207400     9
San Diego       158   160000     7
Terrain         166   153500     5
20 News       61188     7505    20
Sports        14870     8580     7
Reviews       18483     4069     5

TABLE I: data sets used in our experiments and their respective dimensions.
For display purposes, for each data set we compare the average of the scaled objective functions over time, that is, the average of (f(W, H) − e_min)/||M||_F, where e_min is the minimum error obtained among the 20 different runs and among both methods. The results are presented in Fig. 1.
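The scaled-objective metric can be computed as in the following sketch; the error curves here are synthetic stand-ins, and `norm_M` stands in for ||M||_F (nothing below uses the paper's actual results):

```python
import numpy as np

rng = np.random.default_rng(6)
# Hypothetical objective values: 20 runs x 50 time samples for one method.
curves = rng.random((20, 50)).cumsum(axis=1)[:, ::-1]   # decreasing error curves
norm_M = 10.0                                           # stand-in for ||M||_F
e_min = curves.min()                                    # best error over all runs

scaled = (curves - e_min) / norm_M      # (f(W,H) - e_min) / ||M||_F per run
avg_curve = scaled.mean(axis=0)         # average over the 20 initializations
```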
On both the hyperspectral and document data sets, TITANized min-vol converges on average faster than [11], except on the San Diego data set (although TITANized min-vol converges faster initially). For most tested data sets, min-vol [11] cannot reach the same error as TITANized min-vol within the allocated time. In particular, TITANized min-vol achieves a lower error in 94 out of the 100 runs for the hyperspectral images (5 images with 20 random initializations each), and in 55 out of 60 runs for the document data sets (3 sets of documents with 20 random initializations each).
We also report in Table II TITANized min-vol's lead time over [11], that is, the time saved by TITANized min-vol to achieve the error that the method from [11] reaches only after the maximum allotted CPU time. On average, TITANized min-vol is twice as fast as [11], with an average gain of CPU time above 50%. To summarize, our experimental results show that TITANized min-vol converges faster and reaches solutions with smaller final error than [11].
data set       our method's    CPU time       saved
               lead time (s)   for [11] (s)   CPU time
Urban               44              60          73%
Indian Pines        25              30          83%
Pavia Univ.         68              90          76%
San Diego          NaN             120           0%
Terrain             44              60          73%
20News             221             300          74%
Reviews             26              30          80%
Sports              15              30          50%

TABLE II: TITANized min-vol's lead time over min-vol [11] to obtain the same minimum error.
We developed a new algorithm to solve min-vol NMF (2) based on the inertial block majorization-minimization framework of [9]. This framework, under conditions that hold for our method, guarantees subsequential convergence. Experimental results show that this acceleration strategy performs better than the state-of-the-art accelerated min-vol NMF algorithm from [11]. Future work will focus on other types of acceleration, such as Anderson's acceleration [17], and on different constraints on W and/or H to address specific applications.
[1] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, pp. 788–791, 1999.
[2] N. Gillis, Nonnegative Matrix Factorization. SIAM, 2020.
[3] S. A. Vavasis, “On the complexity of nonnegative matrix factorization,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2010.
[4] X. Fu, K. Huang, N. D. Sidiropoulos, and W.-K. Ma, “Nonnegative ma-
trix factorization for signal and data analytics: Identifiability, algorithms,
and applications,” IEEE Signal Process. Mag., vol. 36, pp. 59–80, 2019.
[5] X. Fu, W.-K. Ma, K. Huang, and N. D. Sidiropoulos, “Blind separation
of quasi-stationary sources: Exploiting convex geometry in covariance
domain,” IEEE Trans. Signal Process., vol. 63, pp. 2306–2320, 2015.
[6] C.-H. Lin, W.-K. Ma, W.-C. Li, C.-Y. Chi, and A. Ambikapathi,
“Identifiability of the simplex volume minimization criterion for blind
hyperspectral unmixing: The no-pure-pixel case,” IEEE Trans. Geosci.
Remote Sens., vol. 53, no. 10, pp. 5530–5546, 2015.
[7] V. Leplat, N. Gillis, and M. S. Ang, “Blind audio source separation with minimum-volume beta-divergence NMF,” IEEE Trans. Signal Process., vol. 68, pp. 3400–3410, 2020.
[8] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N. D. Sidiropoulos, “Robust
volume minimization-based matrix factorization for remote sensing and
document clustering,” IEEE Trans. Signal Process., vol. 64, no. 23, pp.
6254–6268, 2016.
[9] L. T. K. Hien, D. N. Phan, and N. Gillis, “An inertial block majorization
minimization framework for nonsmooth nonconvex optimization,” 2020.
[10] M. D. Craig, “Minimum-volume transforms for remotely sensed data,”
IEEE Trans. Geosci. Remote Sens., vol. 32, no. 3, pp. 542–552, 1994.
[11] V. Leplat, A. M. S. Ang, and N. Gillis, “Minimum-volume rank-deficient
nonnegative matrix factorizations,” in ICASSP, 2019, pp. 3402–3406.
[12] A. M. S. Ang and N. Gillis, “Algorithms and comparisons of nonneg-
ative matrix factorizations with volume regularization for hyperspectral
unmixing,” IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens., vol. 12,
no. 12, pp. 4843–4853, 2019.
[13] J. M. Bioucas-Dias, A. Plaza, N. Dobigeon, M. Parente, Q. Du, P. Gader,
and J. Chanussot, “Hyperspectral unmixing overview: Geometrical,
statistical, and sparse regression-based approaches,” IEEE J. Sel. Top.
Appl. Earth Obs. Remote Sens., vol. 5, no. 2, pp. 354–379, 2012.
[14] Y. Xu and W. Yin, “A globally convergent algorithm for nonconvex
optimization based on block coordinate update,” Journal of Scientific
Computing, vol. 72, no. 2, pp. 700–734, Aug 2017.
[15] Y. Nesterov, Lectures on Convex Optimization, 2nd ed. Springer
Publishing Company, Incorporated, 2018.
[16] L. Condat, “Fast projection onto the simplex and the ℓ1 ball,” Mathematical Programming, vol. 158, no. 1, pp. 575–585, 2016.
[17] D. G. Anderson, “Iterative procedures for nonlinear integral equations,”
Journal of the ACM (JACM), vol. 12, no. 4, pp. 547–560, 1965.
Fig. 1: Evolution with respect to time of the average of (f(W, H) − e_min)/||M||_F for the different data sets: (a) Urban, (b) Indian Pines, (c) Pavia Uni, (d) San Diego, (e) Terrain, (f) 20News, (g) Reviews, (h) Sports (time axes up to 60, 30, 90, 120, 60, 300, 30, and 30 s, respectively). Each panel compares TITANized min-vol with min-vol from [11].