VOLUME REGULARIZED NON-NEGATIVE MATRIX FACTORIZATIONS
Andersen M.S. Ang, Nicolas Gillis
University of Mons
Department of Mathematics and Operational Research
Rue de Houdain 9, 7000 Mons, Belgium
(This work was supported by the ERC starting grant no 679515.)
ABSTRACT
This work considers two volume regularized non-negative matrix factorization (NMF) problems that decompose a non-negative matrix $X$ into the product of two nonnegative matrices $W$ and $H$ with a regularization on the volume of the convex hull spanned by the columns of $W$. This regularizer takes two forms: the determinant (det) and the logarithm of the determinant (logdet) of the Gramian of $W$. In this paper, we explore the structure of these problems and present several algorithms, including a new algorithm based on an eigenvalue upper bound of the logdet function. Experimental results on synthetic data show that (i) the new algorithm is competitive with the standard Taylor bound, and (ii) the logdet regularizer works better than the det regularizer. We also illustrate the applicability of the new algorithm on the San Diego airport hyperspectral image.
Index Terms — Non-negative matrix factorization, volume regularizer, determinant, log-determinant, coordinate descent
1. INTRODUCTION
Given a nonnegative matrix $X \in \mathbb{R}^{m\times n}_+$ and a positive integer $r \leq \min\{m, n\}$, non-negative matrix factorization (NMF) aims to approximate $X$ as the product of two non-negative matrices $W \in \mathbb{R}^{m\times r}_+$ and $H \in \mathbb{R}^{n\times r}_+$ so that $X \approx WH^\top$. In this paper, we focus on volume regularized NMF (VR-NMF) and would like to solve the following regularized optimization problem:
$$\min_{W \geq 0,\; H \geq 0,\; H\mathbf{1}_r = \mathbf{1}_r} \; F(W, H) = f(W, H) + \lambda g(W), \qquad (1)$$
where $f(W, H) = \frac{1}{2}\|X - WH^\top\|_F^2$ is the data fitting term, $\lambda \geq 0$ is the regularization parameter and $g$ is the volume regularizer. The constraints $W \geq 0$ and $H \geq 0$ mean that $W$ and $H$ are component-wise nonnegative, where $0$ is the matrix of zeros. The constraint $H\mathbf{1}_r = \mathbf{1}_r$, where $\mathbf{1}_r$ is the vector of ones of length $r$, means that $H$ is a row-stochastic matrix, that is, the entries in each row of $H$ sum to one. This implies
that the columns of $X$ are encapsulated inside the convex hull spanned by the columns of $W$. Let $w_j$ be the $j$th column of $W$ and let $\mathcal{C}(W)$ be the convex hull spanned by the set $\{w_j\}_{j=1}^{r}$. Figure 1 illustrates the geometry of the VR-NMF problem: the $m$-dimensional data points (columns of $X$) are located inside $\mathcal{C}(W)$, whose vertices are the $w_j$'s. In this work we focus on two measures for the volume of $\mathcal{C}(W)$:
$$\text{det}: \quad g_1(W) = \tfrac{1}{2}\det\!\big(W^\top W\big), \qquad (2)$$
$$\text{logdet}: \quad g_2(W) = \tfrac{1}{2}\log\det\!\big(W^\top W + \delta I_r\big). \qquad (3)$$
VR-NMF is asymmetric with respect to (w.r.t.) $W$ and $H$: the constraints on $W$ and $H$ are different and there is no regularization on $H$. VR-NMF has several applications, in particular the unmixing of hyperspectral images, where the columns of $X$ are the spectral signatures of the pixels present in the image, the columns of $W$ are the spectral signatures of the endmembers, and the rows of $H$ are the abundances of the endmembers in each pixel; see, e.g., [1] and the references therein.
Contributions and organization. The contributions of this work are: (1) we develop a vector-wise update algorithm for VR-NMF with logdet regularization using a simple upper bound on the logdet function, and (2) we analyze and compare the det and logdet regularizers. The paper is organized as follows: §2 gives the algorithmic framework we consider in this paper, §3 and §4 discuss the two regularizers $g_1$ and $g_2$ and the corresponding algorithms, §5 contains the numerical experiments, and §6 concludes and provides directions for future research.
2. BLOCK COORDINATE DESCENT FOR VR-NMF
The optimization problem (1) has two blocks of variables, $W$ and $H$, and this work adopts the block coordinate descent (BCD) framework with these two blocks of variables; see Algorithm 1. To update $H$, which is a convex optimization problem, we use the fast gradient method (FGM) described in [2]. The update of $W$ is more difficult as this subproblem is non-convex. This work focuses on the update of $W$, for which we will use cyclic BCD with a projected gradient update.
Algorithm 1 Algorithmic framework for VR-NMF
Input: $X \in \mathbb{R}^{m\times n}_+$, $r \geq 3$ and $\lambda > 0$
Output: $W \in \mathbb{R}^{m\times r}_+$ and $H \in \mathbb{R}^{n\times r}_+$
Initialization: $W \in \mathbb{R}^{m\times r}_+$, $H \in \mathbb{R}^{n\times r}_+$
1: for $k = 1, 2, \ldots$ do
2: | update $W$
3: | $H \leftarrow \text{FGM}(W, H, X)$
4: end for
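To make the framework concrete, here is a minimal NumPy sketch of Algorithm 1. The functions `update_W` and `fgm_update_H` are placeholders of our own naming for the column-wise updates of §3 and §4 and for the FGM of [2]; this is a sketch, not the authors' implementation.

```python
def vrnmf_bcd(X, W, H, lam, update_W, fgm_update_H, n_iter=200):
    """Block coordinate descent for VR-NMF (Algorithm 1): alternate between
    the non-convex W subproblem and the convex H subproblem."""
    for _ in range(n_iter):
        W = update_W(X, W, H, lam)   # step 2: update W (Sections 3 and 4)
        H = fgm_update_H(X, W, H)    # step 3: FGM keeps H >= 0 with unit row sums
    return W, H
```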
3. VR-NMF WITH DETERMINANT REGULARIZER
Let us first consider $g_1(W) = \frac{1}{2}\det(W^\top W)$. Focusing on the $i$th column of $W$, $\det(W^\top W)$ can be written as a quadratic function of $w_i$ [3]. In fact, letting $W_i \in \mathbb{R}^{m\times(r-1)}_+$ be $W$ with the column $w_i$ removed, we have
$$\det(W^\top W) = \eta_i\, w_i^\top B_i w_i = \eta_i \|w_i\|_{B_i}^2, \qquad (4)$$
where $\eta_i = \det(W_i^\top W_i)$ and $B_i = I_m - W_i (W_i^\top W_i)^{-1} W_i^\top$ is the projection matrix onto the orthogonal complement of the column space of $W_i$. As $B_i$ and $W_i^\top W_i$ are symmetric positive semidefinite, the right-hand side of (4) is a weighted norm. This means that the determinant regularizer can be interpreted as a re-weighted $\ell_2$ norm regularization. This also shows that the problem w.r.t. each column of $W$ is convex. The objective function (1) w.r.t. $w_i$ is
$$F(w_i) = \frac{1}{2}\, w_i^\top \underbrace{\big(\|h_i\|_2^2 I_m + \lambda \eta_i B_i\big)}_{Q_i}\, w_i - \langle X_i h_i, w_i \rangle + c, \qquad (5)$$
where $h_i$ is the $i$th column of $H$ and $X_i = X - W_i H_i^\top$ is the residual with the contribution of $w_i$ removed ($H_i$ being $H$ with the column $h_i$ removed). The gradient is $\nabla F(w_i) = Q_i w_i - X_i h_i$ with Lipschitz constant $L_i = \|Q_i\|_F$, so that we can use the update
$$w_i \leftarrow \left[\left(I_m - \frac{Q_i}{\|Q_i\|_F}\right) w_i + \frac{X_i h_i}{\|Q_i\|_F}\right]_+. \qquad (6)$$
The update (6) can be used to update $W$ in Algorithm 1; we will refer to this algorithm as Det; see Algorithm 2.
Algorithm 2 – Det [3]
1: for $i = 1 : r$
2: | compute $Q_i$, update $w_i$ as in (6)
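A possible NumPy realization of one pass of Algorithm 2 is sketched below, under the convention $X \approx WH^\top$ with $H \in \mathbb{R}^{n\times r}_+$; the explicit residual $X_i$ and the projector $B_i$ follow (4)–(6). This is our sketch, not the authors' code.

```python
import numpy as np

def det_update_W(X, W, H, lam):
    """One pass of the column-wise Det update: a projected gradient step (6)
    on each column w_i, with Q_i = ||h_i||^2 I + lam * eta_i * B_i as in (5)."""
    m, r = W.shape
    for i in range(r):
        Wi = np.delete(W, i, axis=1)               # W with column i removed
        hi = H[:, i]
        Xi = X - Wi @ np.delete(H, i, axis=1).T    # residual without w_i's term
        eta = np.linalg.det(Wi.T @ Wi)
        # projector onto the orthogonal complement of the column space of Wi
        Bi = np.eye(m) - Wi @ np.linalg.solve(Wi.T @ Wi, Wi.T)
        Qi = (hi @ hi) * np.eye(m) + lam * eta * Bi
        step = 1.0 / np.linalg.norm(Qi, 'fro')     # step size 1 / L_i
        W[:, i] = np.maximum(0.0, W[:, i] - step * (Qi @ W[:, i] - Xi @ hi))
    return W
```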
4. VR-NMF WITH LOG DETERMINANT REGULARIZER
We now consider $g_2(W) = \frac{1}{2}\log\det(W^\top W + \delta I_r)$, where $I_r$ is the identity matrix of order $r$. The term $\delta I_r$ ($\delta$ is set to a small positive value; here we set $\delta = 1$ for simplicity) acts as a lower bound that prevents $\log\det(W^\top W)$ from going to $-\infty$ as $W$ tends to a rank-deficient matrix. The non-convex function $g_2$ has a tight convex upper bound (see §3.3 of [1] and the references therein), which comes from the first-order Taylor approximation of the logdet function:
$$\log\det(W^\top W + \delta I_r) \leq \operatorname{tr}\!\big(F W^\top W\big) + c, \qquad (7)$$
where $F = (Y^\top Y + \delta I_r)^{-1}$, $Y \in \mathbb{R}^{m\times r}$ is a constant matrix, and $c = \log\det(F^{-1}) - \operatorname{tr}(F Y^\top Y)$ is a constant independent of $W$. Equality holds when $Y = W$. Hence
$$F_T(W) = \frac{1}{2}\|X - WH^\top\|_F^2 + \frac{\lambda}{2}\Big(\operatorname{tr}\!\big(F W^\top W\big) + c\Big)$$
is an upper bound for the objective function of (1) using $g_2$. The gradient and its Lipschitz constant are given by
$$\nabla_W F_T(W) = W\big(H^\top H + \lambda F\big) - XH, \quad\text{and}\quad L = \|H^\top H + \lambda F\|_F.$$
Minimizing this upper bound with $Y = W$ instead of the original objective function, we obtain an inexact BCD method that we refer to as Taylor; see Algorithm 3.

Algorithm 3 – Taylor [1]
1: $W \leftarrow \left[W - \frac{1}{L}\nabla_W F_T(W)\right]_+$
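In the same setting, one Taylor step might read as follows; forming the inverse explicitly is acceptable for small $r$. Again a sketch under the conventions above, not the authors' implementation.

```python
import numpy as np

def taylor_update_W(X, W, H, lam, delta=1.0):
    """One matrix-wise Taylor update (Algorithm 3): a projected gradient step
    on the upper bound (7), linearized at Y = W (the current iterate)."""
    r = W.shape[1]
    F = np.linalg.inv(W.T @ W + delta * np.eye(r))  # F = (Y^T Y + delta I)^-1
    grad = W @ (H.T @ H + lam * F) - X @ H          # gradient of F_T at W
    L = np.linalg.norm(H.T @ H + lam * F, 'fro')    # upper bound on the Lipschitz constant
    return np.maximum(0.0, W - grad / L)
```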
As $F$ is a dense matrix, Taylor updates $W$ matrix-wise, which differs from the algorithm Det that updates $W$ column-wise. In the following, we obtain a column-wise update based on the logdet regularizer by using a simple upper bound on $g_2$. Given a matrix $A$ of rank $r$, let us denote by $\mu_i$, $i = 1, 2, \ldots, r$, the eigenvalues of $A$ and by $\mu(A)$ the vector containing these eigenvalues. We will assume that they are arranged in descending order of magnitude, $|\mu_1| \geq |\mu_2| \geq \ldots \geq |\mu_r|$.

Theorem (logdet-trace inequality). Let $A \in \mathbb{R}^{r\times r}$ be a positive definite (pd) matrix. Then
$$\log\det(A) \leq \nu\, \operatorname{tr}(A) + c, \qquad (8)$$
where $\nu = \big[\mu_r(Y^\top Y + \delta I_r)\big]^{-1}$ and $c = \sum_i \log \mu_i(Y^\top Y + \delta I_r) - r$.

Proof. Recall the log inequality $\log x \leq x - 1$ for $x > 0$, with equality when $x = 1$. Generalizing the inequality to an arbitrary point $x_0 > 0$, we get $\log x \leq x_0^{-1} x + \log x_0 - 1$. To prove the logdet-trace inequality (8), set $x = \mu_i(A)$ and $x_0 = \mu_i(Y^\top Y + \delta I_r)$ for each $i$, and use the facts that $\det(A) = \prod_i \mu_i(A)$, that $\operatorname{tr}(A) = \sum_i \mu_i(A)$, and that $\mu_r(Y^\top Y + \delta I_r)$ is the smallest eigenvalue of $Y^\top Y + \delta I_r$, so that $x_0^{-1} \leq \nu$ for every $i$.
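As a sanity check, (8) is easy to verify numerically; the snippet below draws random $W$ and $Y$, sets $\delta = 1$ as in the paper, and checks the bound (the test dimensions are arbitrary choices of ours).

```python
import numpy as np

rng = np.random.default_rng(0)
r, delta = 5, 1.0
Y, W = rng.random((10, r)), rng.random((10, r))
A = W.T @ W + delta * np.eye(r)                       # a positive definite matrix
mu = np.linalg.eigvalsh(Y.T @ Y + delta * np.eye(r))  # eigenvalues, ascending order
nu = 1.0 / mu[0]                                      # reciprocal of the smallest eigenvalue
c = np.sum(np.log(mu)) - r
assert np.linalg.slogdet(A)[1] <= nu * np.trace(A) + c   # inequality (8) holds
```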
Letting $A = W^\top W + \delta I_r$, Equation (8) becomes
$$\log\det(W^\top W + \delta I_r) \leq \nu\, \operatorname{tr}(W^\top W) + c. \qquad (9)$$
With (9), the objective function (1) using $g = g_2$ w.r.t. $w_i$ can be upper bounded using
$$F_E(w_i) = \frac{1}{2}\, w_i^\top \underbrace{\big(\|h_i\|_2^2 + \lambda\nu\big) I_m}_{Q_i}\, w_i - \langle X_i h_i, w_i \rangle + c. \qquad (10)$$
Fig. 1. The geometry of VR-NMF as convex hull fitting. Here $(m, n, r, \theta) = (10, 1500, 4, 0.8)$. For visualization, the data points were projected onto a 2-dimensional space using PCA.
We update $W$ by taking a gradient step, that is, we use (6) with $Q_i$ defined in (10), and we refer to this algorithm as Eigen; see Algorithm 4.

Algorithm 4 – Eigen
1: for $i = 1 : r$
2: | compute $Q_i$ as in (10), update $w_i$ as in (6)
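One pass of Algorithm 4 can then be sketched as follows. Since $Q_i = (\|h_i\|_2^2 + \lambda\nu) I_m$ is a multiple of the identity, the update (6) requires no $m \times m$ matrix at all; $\nu$ is recomputed from the current $W$ (i.e., $Y = W$). A sketch under the same assumptions as above.

```python
import numpy as np

def eigen_update_W(X, W, H, lam, delta=1.0):
    """One pass of the column-wise Eigen update: (6) with the diagonal Q_i of (10)."""
    m, r = W.shape
    nu = 1.0 / np.linalg.eigvalsh(W.T @ W + delta * np.eye(r))[0]
    for i in range(r):
        hi = H[:, i]
        Xi = X - np.delete(W, i, axis=1) @ np.delete(H, i, axis=1).T
        q = hi @ hi + lam * nu                  # Q_i = q * I_m
        step = 1.0 / (q * np.sqrt(m))           # 1 / ||Q_i||_F = 1 / (q * sqrt(m))
        W[:, i] = np.maximum(0.0, W[:, i] - step * (q * W[:, i] - Xi @ hi))
    return W
```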
Comparing (9) with (7) reveals that $\nu\, \operatorname{tr}(W^\top W)$ is an approximation of $\operatorname{tr}(F W^\top W)$, so that (9) is an approximation of (7). The advantage of (9) over (7) is its separable structure, which allows the logdet regularizer to have a column-wise update, as for Det.
5. NUMERICAL EXPERIMENTS
We first conduct experiments on synthetic data to compare the performance of the three algorithms Det, Taylor and Eigen, and thereby the regularizers $g_1(W)$ and $g_2(W)$. Then, we apply Eigen on the real San Diego airport hyperspectral image.
5.1. Synthetic data sets
Given integers $(m, n, r)$, each entry of the ground truth matrix $W_0 \in \mathbb{R}^{m\times r}_+$ is generated using the uniform distribution on $[0, 1]$. The matrix $H_0 \in \mathbb{R}^{n\times r}_+$ is generated in the form
$$H_0 = \Pi \begin{bmatrix} I_r \\ H_0' \end{bmatrix},$$
where $H_0' \in \mathbb{R}^{(n-r)\times r}_+$ is a row-stochastic matrix randomly generated using the Dirichlet distribution (with parameters equal to 1), and the permutation matrix $\Pi$ shuffles the order of the rows of $H_0$. The clean nonnegative matrix $X_0 = W_0 H_0^\top$ is then corrupted by additive white Gaussian noise (scaled to fit a prescribed signal-to-noise ratio) $N \in \mathbb{R}^{m\times n}$ to form $X = X_0 + N$. Note that in the generation process, the rows of $H_0'$ with an element $H_{ij}$ larger than a threshold $\theta \geq 0$ are removed and resampled, so that all the points in $X$ stay away from the generating vertices $W_0$; see Figure 1 for an illustration.
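The generation process can be sketched as follows. Two details are assumptions of ours: only the Dirichlet block is resampled against the threshold $\theta$, and the noise is scaled by Frobenius norms to match the target SNR.

```python
import numpy as np

def make_synthetic(m, n, r, theta, snr_db, seed=0):
    """Generate X = W0 H0^T + N as described above (a sketch)."""
    rng = np.random.default_rng(seed)
    W0 = rng.random((m, r))                        # entries uniform in [0, 1]
    Hp = rng.dirichlet(np.ones(r), size=n - r)     # row-stochastic block
    while np.any(Hp.max(axis=1) > theta):          # keep points away from the vertices
        bad = Hp.max(axis=1) > theta
        Hp[bad] = rng.dirichlet(np.ones(r), size=bad.sum())
    H0 = np.vstack([np.eye(r), Hp])[rng.permutation(n)]   # Pi shuffles the rows
    X0 = W0 @ H0.T
    N = rng.standard_normal((m, n))
    N *= np.linalg.norm(X0) / (np.linalg.norm(N) * 10 ** (snr_db / 20))
    return X0 + N, W0, H0
```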
We set $\lambda$ to the constant $5\, f(W^{(0)}, H^{(0)}) / g(W^{(0)})$, where $(W^{(0)}, H^{(0)})$ is the initialization. To initialize the variables, $W$ is generated using the successive nonnegative projection algorithm (SNPA) from [2], and $H$ is generated using the FGM from [2].

Fig. 2. Error curves of Taylor and Eigen on fitting $X$ and $W_0$. The maximum number of iterations is 200 (recall that Eigen uses $r$ inner iterations, one to update each column of $W$). Both curves start from the same initialization and move towards the bottom left corner. Eigen makes larger progress in every outer iteration than Taylor. The unit of the axes is percentage. The final values of Taylor and Eigen are (0.84, 4.82) and (0.01, 2.03), respectively.
5.2. An illustrative example
We first compare the use of inequalities (9) and (7) in minimizing (1) with $g_2$. Here $(m, n, r, \theta) = (20, 1000, 8, 0.8)$ and SNR = 100 dB (noiseless). Figure 2 shows the relative errors of data fitting ($\|X - WH^\top\|_F / \|X\|_F$) and vertex fitting ($\|W_0 - \hat{W}\|_F / \|W_0\|_F$, where $\hat{W}$ is the matrix $W$ scaled and matched to $W_0$ using the Hungarian algorithm). Note that Taylor uses a matrix-wise update and Eigen uses a column-wise update. Hence, to make a fair comparison, we plot the errors against the outer iterations (every $r$ inner iterations for Eigen).
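The vertex-fitting error involves a matching and a scaling step; a possible implementation using SciPy's Hungarian solver is sketched below. The cosine matching cost and the least-squares column scaling are assumptions of ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def vertex_fit_error(W0, W):
    """||W0 - What||_F / ||W0||_F, with the columns of W matched to W0 by the
    Hungarian algorithm and rescaled column-wise by least squares (a sketch)."""
    W0n = W0 / np.linalg.norm(W0, axis=0, keepdims=True)
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    _, cols = linear_sum_assignment(-W0n.T @ Wn)   # maximize column correlation
    What = W[:, cols]
    scale = np.sum(W0 * What, axis=0) / np.sum(What * What, axis=0)
    return np.linalg.norm(W0 - What * scale) / np.linalg.norm(W0)
```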
Figures 2 and 3 show that, in this example, the algorithm Eigen performs better as it achieves lower fitting errors on both $X$ and $W_0$.
5.3. Comparing Det, Taylor and Eigen
Table 1 gives the statistics of the relative errors in percent over 100 trials in the format 'average ± standard deviation' for the three algorithms. In all the experiments, we use $(m, n, r) = (20, 1000, 8)$ and a maximum number of iterations of 200. The results show that Eigen performs significantly better than Taylor and Det in the noiseless cases. In noisy settings, Taylor is better than Eigen at identifying the ground truth $W_0$. In all cases, Det performs poorly.

Fig. 3. The geometry of the fittings (PCA projected).

Table 1. Comparison of Det, Taylor and Eigen on synthetic data sets. The table reports the average and standard deviation of the relative errors in percent over 100 trials. The first (resp. second) column is the error on fitting $X$ (resp. $W$).

$\theta = 0.9$, no noise
  Det      2.49 ± 0.51    9.79 ± 1.49
  Taylor   0.46 ± 0.12    3.29 ± 0.64
  Eigen    0.01 ± 0.00    1.19 ± 0.40
$\theta = 0.9$, 10% noise
  Det     27.18 ± 0.45   36.64 ± 3.45
  Taylor  27.76 ± 0.33   25.43 ± 2.37
  Eigen   23.64 ± 0.14   33.21 ± 5.25
$\theta = 0.7$, no noise
  Det      3.36 ± 0.62   11.74 ± 2.05
  Taylor   1.76 ± 0.34    8.63 ± 1.13
  Eigen    0.02 ± 0.01    2.80 ± 1.50
$\theta = 0.7$, 10% noise
  Det     27.17 ± 0.42   39.03 ± 3.51
  Taylor  28.00 ± 0.34   27.97 ± 2.10
  Eigen   23.58 ± 0.14   37.43 ± 4.10
This experiment shows that the log determinant model produces more accurate solutions. This can be explained as follows: considering the singular value expressions of the regularizers, we have $\log\det(W^\top W + \delta I_r) = \sum_i \log\!\big(\sigma_i(W)^2 + \delta\big)$ while $\det(W^\top W) = \prod_i \sigma_i^2(W)$. Hence the det regularizer is more sensitive to the large singular values. For the logdet regularizer, the log operator reduces the effect of the large singular values, thus yielding a better fit. For example, the singular values of $W^\top W + \delta I_r$ in one trial are 37.98, 4.26, 3.47, 3.29, 2.60, 2.36, 1.73 and 1.50.
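A two-line check makes this concrete: with the values reported above, the product that underlies det is dominated by the single large value, while the sum of logs that underlies logdet dampens it.

```python
import numpy as np

mu = np.array([37.98, 4.26, 3.47, 3.29, 2.60, 2.36, 1.73, 1.50])
print(np.prod(mu))         # ~2.9e4: the factor 37.98 dominates the product
print(np.sum(np.log(mu)))  # ~10.3: the large value only contributes log(37.98) ~ 3.6
```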
In terms of computational time, the three methods have the same computational complexity, running in $O(mnr)$ operations per iteration. For instance, on average over the synthetic data sets, the matrix-wise method Taylor takes 2.70 seconds, while the column-wise methods take about 2.63 seconds.
Fig. 4. A portion of the San Diego data (band no. 35, data points no. 59000 to 60000) before and after preprocessing.
Fig. 5. The spectra of the San Diego airport image obtained by Eigen. The x-axis is the wavelength band number and the y-axis the reflectance. The relative percentage error on fitting the data is 1.76%.
5.4. San Diego airport hyperspectral image
For illustration, we apply the method Eigen on the San Diego airport image (see §3.4.3 of [4] for details) with parameters $r = 8$ and a maximum of 100 iterations (the initialization and $\lambda$ are chosen as for the synthetic data sets). The raw data, of size $(m, n) = (158, 400^2)$, is preprocessed by setting all negative values (caused by camera shaking) to zero, and spikes are corrected by a median filter with window length 20. Figure 4 shows the data before and after preprocessing. Figures 5 and 6 show the spectra and the corresponding abundance map extracted by Eigen, respectively.
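The preprocessing might be sketched as follows. The filter axis (along the pixels of each band, as suggested by Figure 4) and the odd window length (21 instead of the stated 20, since scipy.signal.medfilt requires odd kernels) are assumptions of ours.

```python
import numpy as np
from scipy.signal import medfilt

def preprocess(X, kernel=21):
    """Clip negative reflectance values to zero, then remove spikes with a
    median filter applied to each band's pixel sequence (a sketch)."""
    X = np.maximum(X, 0.0)
    return np.apply_along_axis(medfilt, 1, X, kernel)
```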
Eigen successfully decomposes the data into meaningful components: components 1 and 2 correspond to roof tops, components 3 and 8 correspond to trees and grass, and the remaining ones correspond to different road surfaces [4].
6. CONCLUSION
In this paper, we have studied two VR-NMF problems: one with the det and one with the logdet regularizer. For the logdet case, we have proposed a new column-wise update of the columns of $W$, called Eigen, and we showed that it has a better numerical performance than the matrix-wise update algorithm from [1] (Taylor) and the vector-wise update with the det regularizer from [3] (Det). We have also illustrated the ability of the method Eigen to decompose data into meaningful components on the San Diego airport image.

Fig. 6. The abundance map (matrix $H$ computed with Eigen) for the San Diego airport image.
Future directions include the following. (1) Compare Det, Taylor and Eigen on real data. (2) Design faster algorithms; e.g., the update of $w_i$ in Eigen contains many implicit steps and repeated computations, which can be improved [5]. Furthermore, as both Det and Eigen are BCD algorithms with projected gradient updates, it will be interesting to apply randomized accelerations of BCD [6]. (3) Connect to rank minimization and nuclear norm minimization. Equations (4), (7) and (8) show that there are strong connections between the different regularizers in terms of the singular values of the matrix $W$. Therefore, it will be interesting to study the connection between the volume regularizers and the (convex) nuclear norm regularizer (which is the sum of the singular values of $W$).
7. REFERENCES
[1] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N.D. Sidiropoulos, "Robust volume minimization-based matrix factorization for remote sensing and document clustering," IEEE Trans. on Signal Processing, vol. 64, no. 23, pp. 6254–6268, 2016.
[2] N. Gillis, “Successive nonnegative projection algorithm
for robust nonnegative blind source separation,” SIAM J.
on Imaging Sciences, vol. 7, no. 2, pp. 1420–1450, 2014.
[3] G. Zhou, S. Xie, Z. Yang, J.-M. Yang, and Z. He,
“Minimum-volume-constrained nonnegative matrix fac-
torization: Enhanced ability of learning parts,” IEEE
Trans. on Neural Networks, vol. 22, no. 10, pp. 1626–
1637, 2011.
[4] N. Gillis, D. Kuang, and H. Park, “Hierarchical cluster-
ing of hyperspectral images using rank-two nonnegative
matrix factorization,” IEEE Trans. on Geoscience and
Remote Sensing, vol. 53, no. 4, pp. 2066–2078, 2015.
[5] N. Gillis and F. Glineur, "Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization," Neural Computation, vol. 24, no. 4, pp. 1085–1105, 2012.
[6] Yu. Nesterov, “Efficiency of coordinate descent methods
on huge-scale optimization problems,” SIAM J. on Opti-
mization, vol. 22, no. 2, pp. 341–362, 2012.