IEEE TRANSACTIONS ON SIGNAL PROCESSING, ISSUE XX, MONTH 2020
Blind Audio Source Separation with
Minimum-Volume Beta-Divergence NMF
Valentin Leplat, Nicolas Gillis, Andersen M.S. Ang*

*Department of Mathematics and Operational Research, Faculté Polytechnique, Université de Mons, Rue de Houdain 9, 7000 Mons, Belgium. The authors acknowledge the support of the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47, and of the European Research Council (ERC starting grant no 679515). E-mails: {valentin.leplat, nicolas.gillis, manshun.ang}@umons.ac.be. Manuscript received in July 2019. Accepted April 2020.
Abstract—Considering a mixed signal composed of various audio sources and recorded with a single microphone, we consider in this paper the blind audio source separation problem, which consists in isolating and extracting each of the sources. To perform this task, nonnegative matrix factorization (NMF) based on the Kullback-Leibler and Itakura-Saito β-divergences is a standard and state-of-the-art technique that uses the time-frequency representation of the signal. We present a new NMF model better suited for this task. It is based on the minimization of β-divergences along with a penalty term that promotes the columns of the dictionary matrix to have a small volume. Under some mild assumptions and in noiseless conditions, we prove that this model is able to identify the sources. To solve this problem, we propose multiplicative updates whose derivations are based on the standard majorization-minimization framework. We show on several numerical experiments that our new model obtains more interpretable results than standard NMF models. Moreover, we show that it is able to recover the sources even when the number of sources present in the mixed signal is overestimated. In fact, our model automatically sets sources to zero in this situation, hence performs model order selection automatically.
Index Terms—nonnegative matrix factorization,
β-divergences, minimum-volume regularization,
identifiability, blind audio source separation, model
order selection
I. INTRODUCTION
Blind audio source separation concerns the techniques
used to extract unknown signals called sources from a
mixed audio signal x. In this paper, we assume that
the audio signal is recorded with a single microphone.
Considering a mixed signal composed of various audio sources, blind audio source separation consists in isolating and extracting each of the sources on the basis of the single recording. Usually, the only known information is an estimate of the number of sources present in the mixed signal. The blind source separation problem
is said to be underdetermined as there are fewer sensors
(only one in our case) than sources. It then appears necessary to find additional information to make the problem well posed. The most common technique for this kind of problem is to exploit some form of redundancy in the mixed signal in order to make it overdetermined.
This is typically done by computing the spectrogram
which represents the signal in the time and frequency
domains simultaneously (splitting the signals into over-
lapping time frames). The computation of spectrograms
can be summarized as follows: short time segments are extracted from the signal and multiplied element-wise by a window function or "smoothing" window of size F. Successive windows overlap by a fraction of their length, which is usually taken as 50%. On each of these segments, a discrete Fourier transform is computed and stacked column-by-column in a matrix X. Thus, from a one-dimensional signal x ∈ R^T, we obtain a complex matrix X ∈ C^{F×N} called the spectrogram, where F × N ≈ 2T (due to the 50% overlap between windows). Note that the length of the window determines the shape of the spectrogram. These preliminary operations correspond to computing the short time Fourier transform (STFT), which is given by the following formula: for 1 ≤ f ≤ F and 1 ≤ n ≤ N,
$$X_{f,n} = \sum_{j=0}^{F-1} w_j \, x_{nL+j} \, e^{-i 2\pi f j / F},$$
where w ∈ R^F is the smoothing window of size F, L is a shift parameter (also called hop size), and H = F − L is the overlap parameter. The number of rows corresponds to the frequency resolution. Letting f_s be the sampling rate of the audio signal, consecutive rows correspond to frequency bands that are f_s/F Hz apart.
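To make the computation above concrete, here is a minimal NumPy sketch of this STFT (our own illustration, not the authors' MATLAB code). It keeps all F DFT coefficients, as in the formula above, whereas practical implementations usually retain only the F/2+1 nonnegative frequency bins:

import numpy as np

def stft(x, F=1024, L=512):
    # Naive STFT: window of size F, hop size L (50% overlap when L = F/2).
    w = np.hamming(F)                                   # smoothing window
    N = 1 + (len(x) - F) // L                           # number of time frames
    X = np.empty((F, N), dtype=complex)
    for n in range(N):
        X[:, n] = np.fft.fft(w * x[n * L : n * L + F])  # windowed DFT, one column per frame
    return X

x = np.random.randn(75200)    # a signal of length T
V = np.abs(stft(x))           # amplitude spectrogram V = |X|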
The time-frequency representation of a signal highlights
two of its fundamental properties: sparsity and redun-
dancy. Sparsity comes from the fact that most real
signals are not active at all frequencies at all time points.
Redundancy comes from the fact that frequency patterns
of the sources repeat over time. Mathematically, this
means that the spectrogram is a low-rank matrix. These
two fundamental properties led sound source separation
techniques to integrate algorithms such as nonnegative
matrix factorization (NMF). Such techniques retrieve
sensible solutions even for single-channel signals.
A. Mixing assumptions
Given K source signals s^(k) ∈ R^T for 1 ≤ k ≤ K, we assume the acquisition process is well modelled by a linear instantaneous mixing model:
$$x(t) = \sum_{k=1}^{K} s^{(k)}(t), \quad t = 0, \ldots, T-1. \qquad (1)$$
Therefore, for each time index t, the mixed signal x(t) from a single microphone is the sum of the K source signals. It is standard to assume that microphones are linear as long as the recorded signals are not too loud. If signals are too loud, they are usually clipped. The mixing process is modelled as instantaneous, as opposed to convolutive, which is used to take into account sound effects such as reverberation. The source separation problem consists in finding estimates $\hat{s}^{(k)}$ of the sources s^(k) for all k ∈ {1, ..., K}. Let us denote by $\mathcal{S}$ the linear STFT operator, and let $\mathcal{S}^*$ be its conjugate transpose. We have $\mathcal{S}^* \mathcal{S} = F I$, where I is the identity matrix of appropriate dimension. For the remainder of this paper, $\mathcal{S}^*$ stands for the inverse short time Fourier transform. Note that the term inverse is not meant in a mathematical sense: the STFT is not a surjective transformation from R^T to C^{F×N}. In other words, a matrix with complex entries is not necessarily the STFT of a real signal; see [1] and [2] for more details. By applying the STFT operator $\mathcal{S}$ to (1), we obtain the mixing model in the time-frequency domain:
$$X = \mathcal{S}(x) = \mathcal{S}\Big( \sum_{k=1}^{K} s^{(k)} \Big) = \sum_{k=1}^{K} S^{(k)},$$
where S^(k) is the STFT of source k, that is, the spectrogram of source k. To identify the sources, we use in this paper the amplitude spectrogram V = |X| ∈ R_+^{F×N}, defined as V_{fn} = |X_{fn}| for all f, n. We assume that $V = \sum_{k=1}^{K} |S^{(k)}|$, which means that there is no sound cancellation between the sources; this is usually (approximately) the case in most signals. Finally, we assume that the source amplitude spectrograms |S^(k)| are well approximated by nonnegative rank-one matrices. This leads to the NMF model described in the next section. Note that a source can be made of several rank-one factors, in which case a post-processing step will have to recombine them a posteriori (e.g., by looking at the correspondence in the activation of the sources over time). Note also that we focus on the NMF stage of the source separation, which factorizes V into the source spectrograms. For the phase reconstruction, which is a highly non-trivial problem, we consider a naive procedure consisting in keeping the same phase as the input mixture for each source [1].
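Concretely, this naive phase-recovery step can be sketched as follows (our illustration, not the paper's companion code): the estimated source amplitudes are combined with the phase of the mixture spectrogram X before applying the inverse STFT $\mathcal{S}^*$:

import numpy as np

def attach_mixture_phase(S_amp, X):
    # Naive phase reconstruction: estimated source amplitudes S_amp
    # combined with the phase of the mixture spectrogram X.
    return S_amp * np.exp(1j * np.angle(X))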
B. NMF for audio source separation
Given a nonnegative matrix V ∈ R_+^{F×N} (the spectrogram) and a positive integer K < min(F, N) (the number of sources, called the factorization rank), NMF aims to compute two nonnegative matrices W with K columns and H with K rows such that V ≈ WH. NMF approximates each column of V by a linear combination of the columns of W weighted by the components of the corresponding column of H [3]. When the matrix V corresponds to the amplitude spectrogram or the power spectrogram of an audio signal, we have that
• W is referred to as the dictionary matrix, and each column corresponds to the spectral content of a source, and
• H is the activation matrix, specifying whether a source is active at a certain time frame and with which intensity.
In other words, each rank-one factor W(:,k) H(k,:) corresponds to a source: the kth column W(:,k) of W is the spectral content of source k, and the kth row H(k,:) of H is its activation over time. To compute W and H, NMF requires solving the following optimization problem:
$$\min_{W \geq 0,\, H \geq 0} D(V|WH) = \sum_{f,n} d(V_{fn} \,|\, [WH]_{fn}),$$
where A ≥ 0 means that A is component-wise nonnegative, and d(x|y) is an appropriate measure of fit. In audio source separation, a common measure of fit is the discrete β-divergence, denoted d_β(x|y) and equal to
$$d_\beta(x|y) = \begin{cases} \frac{1}{\beta(\beta-1)} \left( x^\beta + (\beta-1)\, y^\beta - \beta\, x y^{\beta-1} \right) & \text{for } \beta \neq 0, 1, \\[4pt] x \log \frac{x}{y} - x + y & \text{for } \beta = 1, \\[4pt] \frac{x}{y} - \log \frac{x}{y} - 1 & \text{for } \beta = 0. \end{cases}$$
For β = 2, this is the standard squared Euclidean distance, that is, the squared Frobenius norm ||V − WH||_F^2. For β = 1 and β = 0, the β-divergence corresponds to the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence, respectively. The error measure should be chosen according to the noise statistics assumed on the data: the Frobenius norm assumes i.i.d. Gaussian noise, the KL divergence assumes additive Poisson noise, and the IS divergence assumes multiplicative Gamma noise [4]. The β-divergence d_β(x|y) is homogeneous of degree β: d_β(λx|λy) = λ^β d_β(x|y). This implies that factorizations obtained with β > 0 (such as the Euclidean distance or the KL divergence) rely more heavily on the largest data values, and less precision is to be expected in the estimation of the low-power components. The IS divergence (β = 0) is scale-invariant, that is, d_IS(λx|λy) = d_IS(x|y) [5]; it is the only member of the β-divergence family to possess this property. It implies that time-frequency areas of low power are as important in the divergence computation as areas of high power. This property is interesting in audio source separation, as low-power frequency bands can perceptually contribute as much as high-power frequency bands. Note that both the KL and IS divergences are better adapted to audio source separation than the Euclidean distance, as they are built on a logarithmic scale, like human perception; see [1] and [5]. Moreover, the β-divergence is only convex with respect to W (or H) if β ∈ [1, 2]. Otherwise, the objective function is non-convex. This implies that, for β < 1, even the problem of inferring H with W fixed is non-convex. For more details on β-divergences, see [5].
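For illustration, the following small Python/NumPy function (ours, not part of the paper) evaluates the β-divergence entry-wise for the three cases above, assuming strictly positive entries to avoid the singularities at x = 0 or y = 0, and checks the homogeneity and scale-invariance properties just discussed:

import numpy as np

def beta_divergence(X, Y, beta):
    # D_beta(X|Y) summed over all entries; X and Y assumed entry-wise positive.
    if beta == 1:                                   # Kullback-Leibler
        return np.sum(X * np.log(X / Y) - X + Y)
    if beta == 0:                                   # Itakura-Saito
        return np.sum(X / Y - np.log(X / Y) - 1)
    return np.sum(X**beta + (beta - 1) * Y**beta
                  - beta * X * Y**(beta - 1)) / (beta * (beta - 1))

# Scale-invariance of IS vs homogeneity of degree beta for KL (beta = 1):
X, Y = np.random.rand(4, 5) + 0.1, np.random.rand(4, 5) + 0.1
print(np.isclose(beta_divergence(3 * X, 3 * Y, 0), beta_divergence(X, Y, 0)))      # True
print(np.isclose(beta_divergence(3 * X, 3 * Y, 1), 3 * beta_divergence(X, Y, 1)))  # True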
C. Contribution and outline of the paper
In Section II, we propose a new NMF model, referred to as minimum-volume β-NMF (min-vol β-NMF), to tackle the audio source separation problem. This model penalizes the columns of the dictionary matrix W so that their convex hull has a small volume. To the best of our knowledge, this model is novel in two aspects: (1) it is the first time a minimum-volume penalty is associated with a β-divergence for β ≠ 2, and the first time such models are used in the context of audio source separation, and (2) as opposed to most previously proposed minimum-volume NMF models, our model imposes a normalization constraint on the factor W instead of H.
As far as we know, the only other paper that used a normalization of W is [6], but the authors did not justify this choice compared to the normalization of H (the choice seems arbitrary, motivated by the 'elimination of the norm indeterminacy'), nor did they provide theoretical guarantees. In this paper, we explain why the normalization of W is a better choice in practice, and we prove that, under some mild assumptions and in the noiseless case, this model identifies the sources; see Theorem 1.
To the best of our knowledge, this is the first result of
this type in the audio source separation literature. In
Section III, we propose an algorithm to tackle min-vol
β-NMF, focusing on the KL and IS divergences. The
algorithm is based on multiplicative updates (MU) that
are derived using the standard majorization-minimization
framework, and that monotonically decrease the objec-
tive function. In Section IV, we present several numerical
experiments, comparing min-vol β-NMF with standard
NMF and sparse NMF. The two main conclusions are that (1) min-vol β-NMF consistently performs better at identifying the sources, and (2) as opposed to NMF and sparse NMF, min-vol β-NMF is able to detect when the factorization rank is overestimated by automatically setting sources to zero.
II. MINIMUM-VOLUME NMF WITH β-DIVERGENCES
In this section, we present a new model of separation
based on the minimization of β-divergences including a
penalty term promoting solutions with minimum volume
spanned by the columns of the dictionary matrix W.
Section II-A recalls the geometric interpretation of NMF
which motivated the use of a minimum volume penalty
on the dictionary W. Section II-B discusses the new
proposed normalization compared to previous minimum
volume NMF models, and proves that min-vol β-NMF recovers the true factors (W, H) under mild conditions and in the noiseless case; see Theorem 1.
A. Geometry and the min-vol β-NMF model
As mentioned earlier, V = WH means that each column of V is a linear combination of the columns of W weighted by the components of the corresponding column of H; in fact, v_n = W h_n for n = 1, ..., N, where v_n denotes the nth column of the data matrix V. This gives NMF a nice geometric interpretation: for all n,
$$v_n \in \operatorname{cone}(W) = \left\{ v \in \mathbb{R}^F \,\middle|\, v = W\theta, \ \theta \geq 0 \right\},$$
meaning that the columns of V are contained in the convex cone generated by the columns of W; see Figure 1 for an illustration. From this interpretation, it follows that, in general, NMF decompositions are not unique because there exist several (often, infinitely many) sets of columns of W that span the convex cone generated by the data points; see for example [8] for more details. Hence, NMF is in most cases ill-posed because the optimal solution is not unique. In order to make the solution unique (up to permutation and scaling of the columns of W and the rows of H), hence making the problem well-posed and the parameters (W, H) identifiable, a key idea is to look for a solution W with minimum volume. Intuitively, we will look for the cone cone(W) containing the data points that is as close as possible to these data points.
Fig. 1: Geometric interpretation of NMF for K = 3 [7].
The use of minimum-volume NMF has led to a new class of NMF methods that outperform existing ones in many applications, such as document analysis and blind hyperspectral unmixing; see the recent survey [9]. Note that minimum-volume NMF implicitly encourages the factor H to be sparse: since W has a small volume, many data points will be located on the facets of cone(W), hence H will be sparse.
Hence, in this paper, we consider the following model, referred to as min-vol β-NMF:
$$\min_{W(:,j) \in \mathcal{F}\ \forall j,\; H \geq 0} D_\beta(V|WH) + \lambda\, \operatorname{vol}(W), \qquad (2)$$
where $\mathcal{F} = \left\{ x \in \mathbb{R}^F_+ \,\middle|\, \sum_{i=1}^F x_i = 1 \right\}$ is the unit simplex, λ is a penalty parameter, and vol(W) is a function that measures the volume spanned by the columns of W. In this paper, we use vol(W) = logdet(W^T W + δI), where δ is a small positive constant that prevents logdet(W^T W) from going to −∞ when W tends to a rank-deficient matrix (that is, when r = rank(W) < K). The reason for using such a measure is that $\sqrt{\det(W^TW)}/K!$ is the volume of the convex hull of the columns of W and the origin. This measure is one of the most widely used, and has been shown to perform very well in practice [10], [11]. Moreover, the criterion logdet(W^T W + δI) is able to distinguish two rank-deficient solutions and favour solutions W with smaller volume [12]. Finally, as we will illustrate in Section IV, this criterion is able to identify the right number of sources even when K is overestimated, by setting some rank-one factors to zero.
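As a sketch of how the resulting objective of (2) can be evaluated (our own illustration, using the KL case β = 1; np.linalg.slogdet is used for numerical robustness):

import numpy as np

def minvol_objective(V, W, H, lam=0.1, delta=1.0):
    # D_1(V|WH) + lam * logdet(W^T W + delta*I), i.e., the KL instance of (2).
    WH = W @ H
    kl = np.sum(V * np.log((V + 1e-12) / (WH + 1e-12)) - V + WH)
    _, logdet = np.linalg.slogdet(W.T @ W + delta * np.eye(W.shape[1]))
    return kl + lam * logdet

Note how δ keeps the penalty bounded below when a column of W goes to zero, which is precisely what allows the model to zero out superfluous rank-one factors.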
B. Normalization and identifiability of min-vol β-NMF
As mentioned above, under some appropriate conditions on V = WH, minimum-volume NMF models provably recover the ground-truth (W, H) that generated V, up to permutation and scaling of the rank-one factors. The first identifiability results for minimum-volume NMF models assumed that the entries in each column of H sum to one, that is, H^T e = e, where e is the all-one column vector whose dimension is clear from the context, meaning that H is column stochastic [8], [13]. Under this condition, each column of V lies in the convex hull of the columns of W; see Figure 2 for an illustration.
Fig. 2: Geometric interpretation of NMF when K = 3 and H is column stochastic [7].
Under the three assumptions that (1) H is column stochastic, (2) W has full column rank, and (3) H satisfies the sufficiently scattered condition, minimizing the volume of conv(W) such that V = WH recovers the true underlying factors, up to permutation and scaling. Intuitively, the sufficiently scattered condition requires H to be sparse enough so that the data points are located on the facets of conv(W); see Appendix A for a formal definition. The sufficiently scattered condition makes sense for most audio source data sets, as it is reasonable to assume that, at most time points, only a few sources are active, hence H is sparse; see [9] for more details on the sufficiently scattered condition. Note that the sufficiently scattered condition is a generalization of the separability condition, which requires W = V(:, J) for some index set J of size K [14]. However, separability is a much stronger assumption, as it requires that, for each source, there exists a time point where only that source is active. Note that although min-vol NMF guarantees identifiability, the corresponding optimization
problem (2) is still hard to solve in general, as for the
original NMF problem [15].
Despite this nice result, the constraint H^T e = e makes the NMF model less general and does not apply to all data sets. In the case where the data does not naturally belong to a convex hull, one needs to normalize the data points so that their entries sum to one, so that H^T e = e can be assumed without loss of generality (in the noiseless case). This normalization can sometimes increase the noise and might greatly influence the solution, hence it is usually not recommended in practice; see the discussion in [9].
In [7], the authors show that identifiability still holds when the condition that H is column stochastic is relaxed to H being row stochastic. As opposed to column stochasticity, row stochasticity of H can be assumed without loss of generality, since any factorization WH can be properly normalized so that this assumption holds. In fact, $WH = \sum_{k=1}^{K} (a_k W(:,k))\,(H(k,:)/a_k)$ for any a_k > 0, k = 1, ..., K. In other terms, letting A be the diagonal matrix with $A(k,k) = a_k = \sum_{j=1}^{N} H(k,j)$ for k = 1, ..., K, we have WH = (WA)(A^{-1}H) = W'H', where H' = A^{-1}H is row stochastic.
Similarly to what is done in [7], we prove in this paper that requiring W to be column stochastic (which can also be assumed without loss of generality) also leads to identifiability. Geometrically, the columns of W are constrained to lie on the unit simplex. Minimizing the volume still makes a lot of sense: we want the columns of W to be as close as possible to one another within the unit simplex. In Appendix A, we prove the following theorem.
Theorem 1. Assume V = W#H# with rank(V) = K, W# ≥ 0, and H# satisfying the sufficiently scattered condition (Definition 2 in Appendix A). Then the optimal solution of
$$\min_{W \in \mathbb{R}^{F \times K},\, H \in \mathbb{R}^{K \times N}} \operatorname{logdet}\left(W^T W\right) \quad \text{such that } V = WH, \ W^T e = e, \ H \geq 0, \qquad (3)$$
recovers (W#, H#) up to permutation and scaling.
Proof. See Appendix A.
In noiseless conditions, replacing W^T e = e with He = e in (3) leads to the same identifiability result; see [7, Theorem 1]. Therefore, in noiseless conditions and under the conditions of Theorem 1, both models return the same solution up to permutation and scaling. However, in the presence of noise, we have observed that the two models may behave very differently. In fact, we advocate that the constraint W^T e = e is better suited for noisy real-world problems: we have observed on many numerical examples that the normalization W^T e = e is much less sensitive to noise and returns much better solutions. The reason is mostly twofold:
(i) As described above, using the normalization He = e amounts to multiplying W by a diagonal matrix whose entries are the ℓ1 norms of the rows of H. Therefore, the columns of W that correspond to dominating (resp. dominated) sources, that is, sources with much more (resp. less) power and/or active at many (resp. few) time points, will have much higher (resp. lower) norms. Therefore, the term logdet(W^T W + δI) is much more influenced by the dominating sources and has difficulty penalizing the dominated ones. In other terms, the use of the term logdet(W^T W + δI) with the normalization He = e implicitly requires the rank-one factors W(:,k) H(k,:) for k = 1, ..., K to be well balanced, that is, to have similar norms. This is not the case for many real (audio) signals.
(ii) As will be explained in Section III, the update of W requires the computation of the matrix Y, which is the inverse of W^T W + δI; this term appears in the gradient of the objective function with respect to W. The numerical stability of this operation is related to the condition number of W^T W + δI. For an ℓ1 normalization of the columns of W, the condition number is bounded above as follows:
$$\operatorname{cond}(W^TW + \delta I) = \frac{\sigma_{\max}(W^TW + \delta I)}{\sigma_{\min}(W^TW + \delta I)} = \frac{\sigma_{\max}(W)^2 + \delta}{\sigma_{\min}(W)^2 + \delta} \leq \frac{\left(\sqrt{K} \max_k \|W(:,k)\|_2\right)^2 + \delta}{\delta} \leq 1 + \frac{K}{\delta},$$
where σ_min(W) and σ_max(W) are the smallest and largest singular values of W, respectively. In the numerical experiments, we use δ = 1. On the other hand, the normalization He = e may lead to arbitrarily large values for the condition number of W^T W + δI, which we have observed numerically on several examples. This issue can be mitigated with the use of the normalization He = ρe for some sufficiently large ρ > 0, for which identifiability still holds [7]. However, it still performs worse because of the first reason explained above.
For these reasons, we believe that the model (3) would
also be better suited (compared to the normalization
on H) in other contexts; for example for document
classification [16].
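The bound in (ii) is easy to verify numerically; the following snippet (ours, for illustration) draws a random W with ℓ1-normalized columns and checks that cond(W^T W + δI) ≤ 1 + K/δ:

import numpy as np

F, K, delta = 257, 7, 1.0
W = np.random.rand(F, K)
W /= W.sum(axis=0, keepdims=True)              # l1-normalize each column
M = W.T @ W + delta * np.eye(K)
print(np.linalg.cond(M), "<=", 1 + K / delta)  # bound from Section II-B (ii)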
III. ALGORITHM FOR MIN-VOL β-NMF
Most NMF algorithms alternately update H for W fixed and vice versa, and we adopt this strategy in this paper. For W fixed, (2) is equivalent to standard NMF, and we use the MU that have already been derived in the literature [3], [5].
To tackle (2) for H fixed, let us consider
$$\min_{W \geq 0} F(W) = D_\beta(V|WH) + \lambda\, \operatorname{logdet}(W^TW + \delta I). \qquad (4)$$
Note that, for now, we have discarded the normalization on the columns of W. In our algorithm, we will use the update for W obtained by solving (4) as a descent direction, along with a line-search procedure to integrate the constraint on W. This ensures that the objective function F is non-increasing at each iteration. In the following sections, we derive MU for W that decrease the objective in (4). We follow the standard majorization-minimization framework [17]. First, an auxiliary function, which we denote $\bar{F}$, is constructed so that it majorizes the objective. An auxiliary function for F at a point $\tilde{W}$ is defined as follows.
Definition 1. The function $\bar{F}(W|\tilde{W}) : \Omega \times \Omega \to \mathbb{R}$ is an auxiliary function for $F(W) : \Omega \to \mathbb{R}$ at $\tilde{W}$ if the conditions $\bar{F}(W|\tilde{W}) \geq F(W)$ for all W and $\bar{F}(\tilde{W}|\tilde{W}) = F(\tilde{W})$ are satisfied.
Then, the optimization of F can be replaced by an iterative process that minimizes $\bar{F}$. More precisely, the new iterate W^(i+1) is computed by minimizing exactly the auxiliary function at the previous iterate W^(i). This guarantees F to decrease at each iteration.

Lemma 1. Let W, W^(i) ≥ 0, and let $\bar{F}$ be an auxiliary function for F at W^(i). Then F is non-increasing under the update $W^{(i+1)} = \operatorname{argmin}_{W \geq 0} \bar{F}(W|W^{(i)})$.

Proof. In fact, we have by definition that
$$F(W^{(i)}) = \bar{F}(W^{(i)}|W^{(i)}) \geq \min_W \bar{F}(W|W^{(i)}) = \bar{F}(W^{(i+1)}|W^{(i)}) \geq F(W^{(i+1)}).$$
The most difficult part in using the majorization-
minimization framework is to design an auxiliary func-
tion that is easy to optimize. Usually such auxiliary
functions are separable (that is, there is no interaction
between the variables so that each entry of W can be updated independently) and convex.
A. Separable auxiliary functions for β-divergences
For the sake of completeness, we briefly recall the auxiliary function proposed in [5] for the data fitting term. It consists in majorizing the convex part of the β-divergence using Jensen's inequality and majorizing the concave part by its tangent (first-order Taylor approximation). We have
$$d_\beta(x|y) = \check{d}_\beta(x|y) + \hat{d}_\beta(x|y) + \bar{d}_\beta(x|y), \qquad (5)$$
where $\check{d}$ is a convex function of y, $\hat{d}$ is a concave function of y, and $\bar{d}$ is a constant with respect to y; see Table I.
TABLE I: Differentiable convex-concave-constant decomposition of the β-divergence under the form (5) [5].

               $\check{d}(x|y)$    $\hat{d}(x|y)$    $\bar{d}(x)$
β = 0          $x y^{-1}$          $\log(y)$         $-\log(x) - 1$
β ∈ [1, 2]     $d_\beta(x|y)$      0                 0
The function D_β(V|WH) can be written as $\sum_f D_\beta(v^f | w^f H)$, where v^f and w^f are respectively the fth row of V and W. Therefore we only consider the optimization over one specific row of W. To simplify notation, we denote the iterates w^(i+1) (next iterate) and w^(i) (current iterate) as w and $\tilde{w}$, respectively.

Lemma 2 ([5]). Let $\tilde{v} = \tilde{w} H$ with $\tilde{w}$ such that $\tilde{v}_n > 0$ for all n and $\tilde{w}_k > 0$ for all k. Then the function
$$G(w|\tilde{w}) = \sum_n \left( \left[ \sum_k \frac{\tilde{w}_k h_{kn}}{\tilde{v}_n}\, \check{d}\!\left( v_n \,\Big|\, \tilde{v}_n \frac{w_k}{\tilde{w}_k} \right) \right] + \bar{d}(v_n) + \left[ \hat{d}'(v_n|\tilde{v}_n) \sum_k (w_k - \tilde{w}_k) h_{kn} + \hat{d}(v_n|\tilde{v}_n) \right] \right) \qquad (6)$$
is an auxiliary function for $\sum_n d(v_n | [wH]_n)$ at $\tilde{w}$.
B. A separable auxiliary function for the minimum-
volume regularizer
The minimum-volume regularizer logdet(W^T W + δI) is a non-convex function. However, it can be upper-bounded using the fact that logdet(·) is concave, so that its first-order Taylor approximation provides an upper bound; see for example [10]. For any positive-definite matrices A, B ∈ R^{K×K}, we have
$$\operatorname{logdet}(B) \leq \operatorname{logdet}(A) + \operatorname{trace}\left( A^{-1}(B - A) \right) = \operatorname{trace}\left( A^{-1} B \right) + \operatorname{logdet}(A) - K.$$
This implies that for any W, Z ∈ R^{F×K}, we have
$$\operatorname{logdet}(W^TW + \delta I) \leq l(W, Z), \qquad (7)$$
where $l(W, Z) = \operatorname{trace}\left( Y (W^TW + \delta I) \right) + \operatorname{logdet}(Y^{-1}) - K$ and $Y = (Z^TZ + \delta I)^{-1}$ with δ > 0. Note that Z^T Z + δI is positive definite, hence invertible, and its inverse Y is also positive definite. Finally, l(W, Z) is an auxiliary function for logdet(W^T W + δI) at Z. However, it is quadratic and not separable, hence non-trivial to optimize over the nonnegative orthant. The non-constant part of l(W, Z) can be written as $\sum_f w_f Y w_f^T$, where w_f is the fth row of W. Henceforth we focus on one particular row vector w with $l(w) = w^T Y w$, which will be further considered as a column vector of size K × 1.
Lemma 3. Let $w, \tilde{w} \in \mathbb{R}^K_+$ be such that $\tilde{w}_k > 0$ for all k, let $Y = Y^+ - Y^-$ with $Y^+ = \max(Y, 0)$ and $Y^- = \max(-Y, 0)$, let $\Phi(\tilde{w})$ be the diagonal matrix $\Phi(\tilde{w}) = \operatorname{Diag}\left( 2\, \frac{[Y^+\tilde{w} + Y^-\tilde{w}]}{[\tilde{w}]} \right)$, where $\frac{[A]}{[B]}$ is the component-wise division between A and B, and let $\Delta w = w - \tilde{w}$. Then
$$\bar{l}(w|\tilde{w}) = l(\tilde{w}) + \Delta w^T \nabla l(\tilde{w}) + \frac{1}{2} \Delta w^T \Phi(\tilde{w}) \Delta w \qquad (8)$$
is a separable auxiliary function for $l(w) = w^T Y w$ at $\tilde{w}$.
Proof. See Appendix B.
Remark 1 (Choice of the auxiliary function). A simpler choice for the auxiliary function would be to replace $\Phi(\tilde{w})$ with $2\lambda_{\max}(Y) I$, where $\lambda_{\max}(Y)$ is the largest eigenvalue of Y (the constant 2 appears because $l(w) = w^T Y w$ while there is a factor 1/2 in front of $\Phi(\tilde{w})$). However, it would lead to a worse approximation. In particular, if Y is a diagonal matrix (since Y ≻ 0, its diagonal elements are positive), our choice gives $\Phi(\tilde{w}) = 2Y$ for any $\tilde{w} > 0$, meaning that the auxiliary function matches the function l(w) exactly. This would not be the case for the choice $2\lambda_{\max}(Y) I$ (unless Y is a scaling of the identity matrix).
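The majorization of Lemma 3 and its tightness at $w = \tilde{w}$ can be checked numerically. The sketch below (ours, for illustration) builds $\Phi(\tilde{w})$ for a random positive-definite Y, as in (7), and verifies that $\bar{l}(w|\tilde{w}) \geq l(w)$ with equality at $w = \tilde{w}$:

import numpy as np

rng = np.random.default_rng(1)
K = 5
Z = rng.standard_normal((10, K))
Y = np.linalg.inv(Z.T @ Z + np.eye(K))           # positive definite, as in (7)
Yp, Ym = np.maximum(Y, 0), np.maximum(-Y, 0)

l = lambda w: w @ Y @ w                          # l(w) = w^T Y w
wt = rng.random(K) + 0.1                         # current iterate, entry-wise positive
Phi = np.diag(2 * (Yp @ wt + Ym @ wt) / wt)      # diagonal majorizer of Lemma 3

def l_bar(w):                                    # separable auxiliary function (8)
    d = w - wt
    return l(wt) + d @ (2 * Y @ wt) + 0.5 * d @ Phi @ d

w = rng.random(K)
print(l_bar(w) >= l(w) - 1e-12, np.isclose(l_bar(wt), l(wt)))  # True True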
C. Auxiliary function for min-vol β-NMF
Based on the auxiliary functions presented in Sections III-A and III-B, we can directly derive a separable auxiliary function $\bar{F}(W|\tilde{W})$ for min-vol β-NMF (2).

Corollary 1. For W, H ≥ 0, λ > 0, $Y = (\tilde{W}^T \tilde{W} + \delta I)^{-1}$ with δ > 0, and the constant $c = \delta \operatorname{trace}(Y) + \operatorname{logdet}(Y^{-1}) - K$, the function
$$\bar{F}(W|\tilde{W}) = \sum_f G(w_f|\tilde{w}_f) + \lambda \left( \sum_f \bar{l}(w_f|\tilde{w}_f) + c \right),$$
where G is given by (6) and $\bar{l}$ by (8), is a convex and separable auxiliary function for $F(W) = D_\beta(V|WH) + \lambda \operatorname{logdet}(W^TW + \delta I)$ at $\tilde{W}$.

Proof. This follows directly from Lemma 2, Equation (7), and Lemma 3.
In the following section, we explicitly provide the MU for the KL divergence (β = 1) by finding a closed-form solution for the minimization of $\bar{F}$. In Appendix C, we provide the MU for the IS divergence (β = 0). Due to the lack of space, the other cases are not treated explicitly, but they can be handled in a similar way. For the same reason, we only compare KL NMF models in the numerical experiments (Section IV).
D. Algorithm for min-vol KL-NMF
As before, let us focus on a single row of W, denoted w, as the objective function F(W) is separable by row. For β = 1, the derivative of the auxiliary function $\bar{F}(w|\tilde{w})$ with respect to a specific coefficient w_k is given by
$$\nabla_{w_k} \bar{F}(w|\tilde{w}) = \sum_n h_{kn} - \sum_n h_{kn} \frac{\tilde{w}_k v_n}{w_k \tilde{v}_n} + 2\lambda [Y\tilde{w}]_k + 2\lambda \left[ \operatorname{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_{kk} w_k - 2\lambda \left[ \operatorname{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_{kk} \tilde{w}_k.$$
Due to the separability, we set the derivative to zero to obtain the closed-form solution, which is given in Table II in matrix form. Note that although the closed-form solution has a negative term in the numerator of the multiplicative factor (see Table II), the iterates always remain nonnegative given that V, H and $\tilde{W}$ are nonnegative. In fact, the term before the minus sign is always larger than the term after it: $J_{F,N} H^T - 4\lambda(\tilde{W} Y^-)$ is squared (component-wise) and a nonnegative term is added, hence the component-wise square root of the result is larger than $J_{F,N} H^T - 4\lambda(\tilde{W} Y^-)$.
Algorithm 1 summarizes our algorithm to tackle (2) for β = 1, which we refer to as min-vol KL-NMF. Note that the update of H (step 4) is the one from [3]. More importantly, note that we have incorporated a line search for the update of W. In fact, although the MU for W are guaranteed to decrease the objective function, they do not guarantee that W remains normalized, that is, that ||W(:,k)||_1 = 1 for all k. Hence, we normalize W after it is updated (step 10), and we normalize H accordingly so that WH remains unchanged. When this normalization is performed, the β-divergence part of F is unchanged, but the minimum-volume penalty changes, so that the objective function F might increase. In order to guarantee non-increasingness, we integrate a simple backtracking line-search procedure; see steps 11-16 of Algorithm 1. In summary, our MU provide a descent direction that preserves nonnegativity of the iterates, and we use a projection and a simple backtracking line search to guarantee the monotonicity of the objective function, as in standard projected gradient descent methods.
TABLE II: Multiplicative update for min-vol KL-NMF.
$$W = \tilde{W} \circ \frac{ \left[ \left( J_{F,N} H^T - 4\lambda (\tilde{W} Y^-) \right)^{\circ 2} + 8\lambda\, \tilde{W}(Y^+ + Y^-) \circ \left( \frac{[V]}{[\tilde{W}H]} H^T \right) \right]^{\circ \frac{1}{2}} - \left( J_{F,N} H^T - 4\lambda (\tilde{W} Y^-) \right) }{ 4\lambda\, \tilde{W}(Y^+ + Y^-) },$$
where $A \circ B$ (resp. $\frac{[A]}{[B]}$) denotes the Hadamard product (resp. division) between A and B, $A^{\circ \alpha}$ is the element-wise α exponent of A, $J_{F,N}$ is the F-by-N all-one matrix, and $Y = Y^+ - Y^- = (\tilde{W}^T \tilde{W} + \delta I)^{-1}$ with δ > 0, $Y^+, Y^- \geq 0$, λ > 0.
Algorithm 1 min-vol KL-NMF
Input: A matrix V ∈ R_+^{F×N}, initializations H ∈ R_+^{K×N} and W ∈ R_+^{F×K}, a factorization rank K, a maximum number of iterations maxiter, a min-vol weight λ > 0, and δ > 0.
Output: A rank-K NMF (W, H) of V ≈ WH with W ≥ 0 and H ≥ 0.

1: γ = 1, Y = (W^T W + δI)^{-1}
2: for i = 1 : maxiter do
3:   % Update of matrix H
4:   H ← H ∘ [W^T ([V]/[WH])] / [W^T J_{F,N}]
5:   % Update of matrix W
6:   Y ← (W^T W + δI)^{-1}
7:   Y^+ ← max(Y, 0)
8:   Y^- ← max(−Y, 0)
9:   W^+ ← update according to Table II
10:  (W^+_γ, H_γ) = normalize(W^+, H)
11:  % Line-search procedure
12:  while F(W^+_γ, H_γ) > F(W, H) do
13:    γ ← 0.8 γ
14:    W^+_γ ← (1 − γ) W + γ W^+
15:    (W^+_γ, H_γ) ← normalize(W^+_γ, H)
16:  end while
17:  (W, H) ← (W^+_γ, H_γ)
18:  % Update of γ to avoid a vanishing step size
19:  γ ← min(1, 1.2 γ)
20: end for
It can be verified that the computational complexity of min-vol KL-NMF is asymptotically equivalent to that of the standard MU for β-NMF, that is, it requires O(FNK) operations per iteration. Indeed, all the main operations are matrix products with a complexity of O(FNK) and element-wise operations on matrices of size F × K or K × N. Note that the inversion of the K-by-K matrix (W^T W + δI) requires O(K^3) operations, which is dominated by O(FNK) since K ≤ min(F, N) (in fact, typically K ≪ min(F, N), hence this term is negligible). Therefore, although Algorithm 1 is slower than baseline KL-NMF (that is, the standard MU) because of the additional terms to be computed and the line search, the asymptotic computational cost is the same; see Table IV for a runtime comparison.
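For illustration, here is a compact Python/NumPy rendition of Algorithm 1 (a simplified sketch under our own conventions, not the authors' MATLAB implementation from bit.ly/minvolKLNMF). It combines the KL MU for H, the update of Table II for W, the column normalization of W, and the backtracking line search:

import numpy as np

def minvol_kl_nmf(V, K, lam=0.1, delta=1.0, maxiter=200, seed=0):
    F_, N = V.shape
    rng = np.random.default_rng(seed)
    W = rng.random((F_, K)) + 1e-3
    H = rng.random((K, N)) + 1e-3
    W /= W.sum(axis=0, keepdims=True)               # ||W(:,k)||_1 = 1
    J, I = np.ones_like(V), np.eye(K)
    def obj(W, H):                                  # objective F of (2), beta = 1
        WH = W @ H
        kl = np.sum(V * np.log((V + 1e-12) / (WH + 1e-12)) - V + WH)
        return kl + lam * np.linalg.slogdet(W.T @ W + delta * I)[1]
    gamma = 1.0
    for _ in range(maxiter):
        H *= (W.T @ (V / (W @ H))) / (W.T @ J)      # standard KL MU for H
        Y = np.linalg.inv(W.T @ W + delta * I)
        Yp, Ym = np.maximum(Y, 0), np.maximum(-Y, 0)
        A = 4 * lam * (W @ (Yp + Ym))               # denominator of Table II
        B = J @ H.T - 4 * lam * (W @ Ym)
        C = (V / (W @ H)) @ H.T
        Wp = W * (np.sqrt(B**2 + 2 * A * C) - B) / A    # MU of Table II
        f_old = obj(W, H)
        while True:                                 # normalization + backtracking
            Wg = (1 - gamma) * W + gamma * Wp
            s = Wg.sum(axis=0, keepdims=True)
            Wn, Hn = Wg / s, H * s.T                # rescale H so that W H is unchanged
            if obj(Wn, Hn) <= f_old or gamma < 1e-8:   # guard against an endless loop
                break
            gamma *= 0.8
        W, H = Wn, Hn
        gamma = min(1.0, gamma * 1.2)               # avoid a vanishing step size
    return W, H

# Example usage: W, H = minvol_kl_nmf(V, K=7, lam=0.5)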
IV. NUMERICAL EXPERIMENTS
In this section, we report an experimental comparative study of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18] applied to the spectrograms of two monophonic piano sequences and a synthetic mix of a bass and drums. For the two monophonic piano sequences, the audio signals are real-life signals with standard quality. Since the sequences are made of pure piano notes, the number K should correspond to the number of notes present in the mixed signals. The comparative study is performed for several values of K, with a focus on the case where the factorization rank K is overestimated. For all simulations, random initializations are used for W and H, and the best results among 5 runs are kept for the comparative study. In all cases, we use a Hamming window of size F = 1024 and 50% overlap between two frames. Sparse KL-NMF has a similar structure as min-vol KL-NMF, with a penalty parameter for the sparsity-enforcing regularization. To tune these two penalty parameters, we used the same strategy for both methods: we manually tried a wide range of values and report the best results. The code is available from bit.ly/minvolKLNMF (code written in MATLAB R2017a) and can be used to directly rerun all the experiments below. They were run on a laptop with an Intel Core i7-7500U CPU @ 2.70GHz and 32GB of RAM.
a) Mary had a little lamb: The first audio sample is the first measure of "Mary had a little lamb". The sequence is composed of three notes, E4, D4 and C4, played all at once. The recorded signal is 4.7 seconds long and downsampled to f_s = 16000 Hz, yielding T = 75200 samples. The STFT of the input signal x yields a temporal resolution of 16 ms and a frequency resolution of 31.25 Hz, so that the amplitude spectrogram V has N = 294 frames and F = 257 frequency bins. The musical score is shown in Figure 3. All NMF algorithms were run for 200 iterations, which allowed them to converge. Figure 4 presents the columns of W (dictionary matrix) and the rows of H for baseline KL-NMF and min-vol KL-NMF with K = 3.
Fig. 3: Musical score of “Mary had a little lamb”.
Fig. 4: Comparative study of baseline KL-NMF (top), min-vol KL-NMF (middle) and sparse KL-NMF (bottom) applied to the "Mary had a little lamb" amplitude spectrogram with K = 3. (a) Columns of W; (b) rows of H.
Figure 5 presents the time-frequency masking coefficients. These coefficients are computed as follows:
$$\operatorname{mask}^{(k)}_{f,n} = \frac{\hat{X}^{(k)}_{f,n}}{\sum_{k'} \hat{X}^{(k')}_{f,n}} \quad \text{with } k = 1, \ldots, K,$$
where $\hat{X}^{(k)} = W(:,k)\, H(k,:)$ is the estimated source k. The masks are nonnegative and sum to one for each pair (f, n).
This representation makes it possible to visually identify whether the NMF algorithm was able to separate the sources properly. All the simulations give a nice separation, with similar results for W and H. The activations are coherent with the sequences of the notes. However, Figure 5 shows that min-vol KL-NMF and sparse KL-NMF provide a better separation in terms of time-frequency localization compared to baseline KL-NMF.
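In code, the masks and the rank-one source estimates can be obtained from the factors as follows (a sketch in our own notation); multiplying each mask with the complex spectrogram X and keeping the mixture phase, as described in Section I-A, yields the time-domain estimates after inverse STFT:

import numpy as np

def tf_masks(W, H):
    # Time-frequency masks: mask^(k) = Xhat^(k) / sum_k' Xhat^(k').
    Xhat = np.stack([np.outer(W[:, k], H[k, :]) for k in range(W.shape[1])])
    return Xhat / (Xhat.sum(axis=0) + 1e-12)   # nonnegative, sum to one over k

# masks = tf_masks(W, H); S_k = masks[k] * X   # keep the mixture phase of X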
Fig. 5: Masking coefficients obtained with (a) baseline KL-NMF (top), (b) min-vol KL-NMF (middle) and (c) sparse KL-NMF (bottom) applied to the "Mary had a little lamb" amplitude spectrogram with K = 3.
We now perform the same experiment but using K = 7. Figure 6 presents the results. This corresponds to the situation where the factorization rank is overestimated. Figure 7 presents the time-frequency masking coefficients. We observe that min-vol KL-NMF is able to extract the three notes correctly and automatically sets three source estimates to zero (more precisely, three rows of H are set to zero, while the corresponding columns of W have entries equal to one another since ||W(:,k)||_1 = 1 for all k), while baseline KL-NMF and sparse KL-NMF split the notes across all the sources. One can observe that a fourth note is identified in all simulations (see the isolated peaks in Figure 7-(b), second row of H from the top); it corresponds to the noise within the piano just before a specific note is triggered (in particular, the hammer noise).
Fig. 6: Comparative study of baseline KL-NMF (top), min-vol KL-NMF (middle) and sparse KL-NMF (bottom) applied to the "Mary had a little lamb" amplitude spectrogram with K = 7. (a) Columns of W; (b) rows of H.
This observation is confirmed by the fact that its amplitude is proportional to the natural strength of the fingers playing the notes. In this scenario, where K is overestimated, min-vol KL-NMF outperforms baseline KL-NMF and sparse KL-NMF.
b) Prelude of Bach: The second audio sample corresponds to the first 30 seconds of "Prelude and Fugue No. 1 in C major" by J. S. Bach, played by Glenn Gould¹. The audio sample is a sequence of 13 notes: B3, C4, D4, E4, F#4, G4, A4, C5, D5, E5, F5, G5, A5. The recorded signal is downsampled to f_s = 11025 Hz, yielding T = 330750 samples. The STFT of the input signal x yields a temporal resolution of 46 ms and a frequency resolution of 10.76 Hz, so that the amplitude spectrogram V has N = 647 frames and F = 513 frequency bins. The musical score is presented in Figure 8. All NMF algorithms were run for 300 iterations, which allowed
1https://www.youtube.com/watch?v=ZlbK5r5mBH4
Fig. 7: Masking coefficients obtained with (a) baseline KL-NMF (top), (b) min-vol KL-NMF (middle) and (c) sparse KL-NMF (bottom) applied to the "Mary had a little lamb" amplitude spectrogram with K = 7.
them to converge. Figure 9 presents the results obtained for W and H with a factorization rank K = 16, hence overestimated. We observe that min-vol KL-NMF automatically sets three components to zero (marked with a * symbol in Figure 9) while 13 source estimates are determined. The fundamentals (maximum peak frequencies) of the 13 source estimates correspond to the theoretical fundamentals of the 13 notes mentioned earlier. Note that using baseline KL-NMF or sparse KL-NMF led to the same conclusions as for the first audio sample: these two algorithms generate as many source estimates as imposed by the factorization rank, while the min-vol KL-NMF algorithm preserves the integrity of the 13 sources.
Fig. 8: Musical score of the sample “Prelude and Fugue
No.1 in C major”.
Additionally, the activations are coherent with the sequences of the notes. Figure 10 shows (on a limited time interval) that the estimated sequence follows the sequence defined in the score. Note that a threshold and permutations of the rows of H were used to improve visibility.
c) Bass and drums: The third audio signal is a synthetic mix of a bass and drums². The audio signal is downsampled to f_s = 16000 Hz, yielding T = 104821 samples. The STFT of the input signal x yields a temporal resolution of 32 ms and a frequency resolution of 15.62 Hz, so that the amplitude spectrogram V has N = 206 frames and F = 513 frequency bins. For this synthetic mix, we have access to the true sources in the form of two audio files. Therefore, we can estimate the quality of the separation with standard metrics, namely the signal-to-distortion ratio (SDR), the source-to-interference ratio (SIR) and the source-to-artifacts ratio (SAR) [19]. They have been computed with the toolbox BSS Eval³. The metrics are expressed in dB, and the higher they are, the better the separation. The algorithms min-vol KL-NMF, baseline KL-NMF and sparse KL-NMF are considered for this comparative study. A factorization rank equal to two is used. The rank-one approximation is clearly too simplistic for these sources, but the goal is to compare the algorithms and show that min-vol KL-NMF is able to find a better solution even in this simplified context. All NMF algorithms were run for 400 iterations, which allowed them to converge. Table III shows the results. Except for the SAR metric on the second source (drums), min-vol KL-NMF outperforms baseline KL-NMF and sparse KL-NMF.
d) Runtime performance: Let us compare the runtimes of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18].
2http://isse.sourceforge.net/demos.html
3http://bass-db.gforge.inria.fr/bss_eval/
Fig. 9: Factor matrices W and H obtained with min-vol KL-NMF with factorization rank K = 16 on the sample "Prelude and Fugue No. 1 in C major". (a) Columns of W; (b) rows of H.
The algorithms are compared on the three examples presented in paragraphs IV-a and IV-b:
• Setup #1: sample "Mary had a little lamb" with K = 3, 200 iterations.
• Setup #2: sample "Mary had a little lamb" with K = 7, 200 iterations.
• Setup #3: "Prelude and Fugue No. 1 in C major" with K = 16, 300 iterations.
For each test setup, the algorithms are run with the same 20 random initializations of W and H. Table IV reports the average and standard deviation of the runtime (in seconds) over these 20 runs. We observe that min-vol KL-NMF (Algorithm 1) is slower, but not significantly so, as expected. In particular, on the larger setup #3, it is less than three times slower than the standard MU.
TABLE III: SDR, SIR and SAR metrics for the results obtained with baseline KL-NMF, min-vol KL-NMF and sparse KL-NMF on a synthetic mix of bass and drums.

                       Source 1: bass                 Source 2: drums
Algorithms         SDR(dB)  SIR(dB)  SAR(dB)     SDR(dB)  SIR(dB)  SAR(dB)
min-vol KL-NMF      -1.14     0.12     7.78        9.60    19.8     10.09
baseline KL-NMF     -4.26    -1.39     2.64        7.97     9.00    15.25
sparse KL-NMF       -4.69    -1.73     2.33        7.89     8.96    14.98
Fig. 10: Validation of the estimated sequence obtained with min-vol KL-NMF with factorization rank K = 16 on the sample "Prelude and Fugue No. 1 in C major".
TABLE IV: Runtime performance in seconds of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18]. The table reports the average and standard deviation over 20 random initializations for the three experimental setups described in the text.

Algorithms          setup #1      setup #2      setup #3
baseline KL-NMF     0.44±0.03     0.43±0.01     3.81±0.19
min-vol KL-NMF      3.79±0.13     2.39±0.30     10.19±1.28
sparse KL-NMF       0.20±0.02     0.20±0.01     2.21±0.26
V. CONCLUSION AND PERSPECTIVES
In this paper, we have presented a new NMF model for audio source separation based on the minimization of a cost function that includes a β-divergence (data fitting term) and a penalty term that promotes solutions W with minimum volume. We have proved the identifiability of the model in the exact case, under the sufficiently scattered condition on the activation matrix H. We have provided multiplicative updates to tackle this problem and have illustrated the behaviour of the method on real-world audio signals. We have highlighted the capacity of the model to deal with the case where K is overestimated, by automatically setting some components to zero while giving good source estimates.
Further work includes tackling the following questions:
• Under which conditions can we prove the identifiability of min-vol β-NMF in the presence of noise, and in the rank-deficient case?
• Can we prove that min-vol β-NMF performs model order selection automatically? Under which conditions? We have observed this behaviour on many examples, but the proof remains elusive.
• Can we design more efficient algorithms?
Further work also includes the use of our new model, min-vol β-NMF, in other applications, and the design of more efficient algorithms (for example, algorithms that avoid using a line-search procedure) with stronger convergence guarantees (beyond the monotonicity of the objective function).
Acknowledgments: We thank Kejun Huang and Xiao Fu for helpful discussions on Theorem 1, and for giving us the insight to adapt their proof from [7] to our model (2). We also thank the reviewers for their insightful comments that helped us improve the paper.
APPENDIX
A. Sufficiently scattered condition and identifiability
Before giving the definition of the sufficiently scattered
condition from [8], let us first recall an important prop-
erty of the duals of nested cones.
Lemma 4. Let $\mathcal{C}_1$ and $\mathcal{C}_2$ be convex cones such that $\mathcal{C}_1 \subseteq \mathcal{C}_2$. Then $\mathcal{C}_2^* \subseteq \mathcal{C}_1^*$, where $\mathcal{C}_1^*$ and $\mathcal{C}_2^*$ are respectively the dual cones of $\mathcal{C}_1$ and $\mathcal{C}_2$. The dual of a cone $\mathcal{C}$ is defined as $\mathcal{C}^* = \{ y \mid x^T y \geq 0 \text{ for all } x \in \mathcal{C} \}$.
Definition 2 (Sufficiently Scattered). A matrix $H \in \mathbb{R}^{K \times N}_+$ is sufficiently scattered if
1) $\mathcal{C} \subseteq \operatorname{cone}(H)$, and
2) $\operatorname{cone}(H)^* \cap \operatorname{bd}\,\mathcal{C}^* = \{ \lambda e_k \mid \lambda \geq 0,\ k = 1, \ldots, K \}$,
where $\mathcal{C} = \{ x \mid x^T e \geq \sqrt{K-1}\, \|x\|_2 \}$ is a second-order cone, $\mathcal{C}^* = \{ x \mid x^T e \geq \|x\|_2 \}$, $\operatorname{cone}(H) = \{ x \mid x = H\theta,\ \theta \geq 0 \}$ is the conic hull of the columns of H, and bd denotes the boundary of a set.
We can now prove Theorem 1.

Proof of Theorem 1. Recall that W# and H# are the true latent factors that generated V, with rank(V) = K and H# sufficiently scattered. Let $\hat{W}$ and $\hat{H}$ be a feasible solution of (3). Since rank(V) = K and $V = \hat{W}\hat{H}$, we must have $\operatorname{rank}(\hat{W}) = \operatorname{rank}(\hat{H}) = K$. Hence there exists an invertible matrix A ∈ R^{K×K} such that $\hat{W} = W^\# A^{-1}$ and $\hat{H} = A H^\#$. Since $\hat{W}$ is a feasible solution of problem (3), we have
$$e^T \hat{W} = e^T W^\# A^{-1} = e^T A^{-1} = e^T,$$
where we assumed $e^T W^\# = e^T$ without loss of generality since W# ≥ 0 and rank(W#) = K. Note that $e^T A^{-1} = e^T$ is equivalent to $e^T A = e^T$. This means that the matrix A is column stochastic, and therefore $e^T A e = K$. Since $\hat{H}$ is a feasible solution, we also have $\hat{H} = A H^\# \geq 0$. Let us denote by $a_j$ the jth row of A, and by $a_k^T$ the kth column of $A^T$. By the definition of the dual cone, $A H^\# \geq 0$ means that the rows $a_j \in \operatorname{cone}(H^\#)^*$ for j = 1, ..., K. Since H# is sufficiently scattered, $\operatorname{cone}(H^\#)^* \subseteq \mathcal{C}^*$ (by Lemma 4), hence $a_j \in \mathcal{C}^*$. Therefore we have $\|a_j\|_2 \leq a_j e$ by definition of $\mathcal{C}^*$. This leads to the following:
$$|\det(A)| = |\det(A^T)| \leq \prod_{k=1}^K \|a_k^T\|_2 = \prod_{j=1}^K \|a_j\|_2 \leq \prod_{j=1}^K a_j e \leq \left( \frac{\sum_j a_j e}{K} \right)^K = \left( \frac{e^T A e}{K} \right)^K = 1.$$
The first inequality is the Hadamard inequality, the second is due to $a_j \in \mathcal{C}^*$, and the third is the arithmetic-geometric mean inequality. Now we can conclude exactly as in [7, Theorem 1] by showing that the matrix A can only be a permutation matrix for an optimal solution $(\hat{W}, \hat{H})$ of (3), and therefore identifiability of model (3) holds.
B. Proof of Lemma 3

Separability of $\bar{l}(w|\tilde{w})$ holds since $\Phi(\tilde{w})$ is diagonal. The condition $\bar{l}(\tilde{w}|\tilde{w}) = l(\tilde{w})$ from Definition 1 can be checked easily. It remains to prove that $\bar{l}(w|\tilde{w}) \geq l(w)$ for all w. Let us first rewrite the quadratic function l(w) using its Taylor expansion at $w = \tilde{w}$:
$$l(w) = l(\tilde{w}) + (w - \tilde{w})^T \nabla l(\tilde{w}) + \frac{1}{2} (w - \tilde{w})^T \nabla^2 l(\tilde{w}) (w - \tilde{w}) = l(\tilde{w}) + (w - \tilde{w})^T 2Y\tilde{w} + \frac{1}{2} (w - \tilde{w})^T 2Y (w - \tilde{w}).$$
Proving that $\bar{l}(w|\tilde{w}) \geq l(w)$ is equivalent to proving that $\frac{1}{2} (w - \tilde{w})^T \left[ \Phi(\tilde{w}) - 2Y \right] (w - \tilde{w}) \geq 0$, which boils down to proving that the matrix $\Phi(\tilde{w}) - 2Y$ is positive semi-definite. We have $\Phi_{ij}(\tilde{w}) = 2\delta_{ij} \frac{(Y^+\tilde{w})_i + (Y^-\tilde{w})_i}{\tilde{w}_i}$, where $\delta_{ij}$ is the Kronecker symbol. Let us consider the following matrix: $M_{ij}(\tilde{w}) = \tilde{w}_i \left[ \Phi(\tilde{w}) - 2Y \right]_{ij} \tilde{w}_j$, which is a rescaling of $\Phi(\tilde{w}) - 2Y$. It remains to show that M is positive semi-definite.⁴ Since M is symmetric and its diagonal entries are nonnegative, it is sufficient to show that M is diagonally dominant [?, Proposition 7.2.3], that is,
$$|M_{ii}| \geq \sum_{j \neq i} |M_{ij}| \quad \text{for all } i.$$
We have, for all i,
$$M_{ii} = 2\tilde{w}_i \sum_j \left( Y^+_{ij} + Y^-_{ij} \right) \tilde{w}_j - 2\tilde{w}_i Y_{ii} \tilde{w}_i, \quad \text{and} \quad M_{ij} = -2\tilde{w}_i Y_{ij} \tilde{w}_j \ \text{for } j \neq i.$$
Since $Y^+_{ij} + Y^-_{ij} = |Y_{ij}|$, we have
$$M_{ii} - \sum_{j \neq i} |M_{ij}| = 2\tilde{w}_i \sum_j |Y_{ij}| \tilde{w}_j - 2\tilde{w}_i Y_{ii} \tilde{w}_i - 2\tilde{w}_i \sum_{j \neq i} |Y_{ij}| \tilde{w}_j = 2\tilde{w}_i |Y_{ii}| \tilde{w}_i - 2\tilde{w}_i Y_{ii} \tilde{w}_i \geq 0,$$
implying that M is diagonally dominant.

⁴The remainder of the proof was suggested to us by one of the reviewers; it is more elegant and simpler than our original proof.
C. Algorithm for min-vol IS-NMF
For β = 0 (IS divergence), the derivative of the auxiliary function $\bar{F}(w|\tilde{w})$ with respect to a specific coefficient w_k is given by:
$$\nabla_{w_k} \bar{F}(w|\tilde{w}) = \sum_n \frac{h_{kn}}{\tilde{v}_n} - \sum_n h_{kn} \frac{\tilde{w}_k^2 v_n}{w_k^2 \tilde{v}_n^2} + 2\lambda [Y\tilde{w}]_k + 2\lambda \left[ \operatorname{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_{kk} w_k - 2\lambda \left[ \operatorname{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_{kk} \tilde{w}_k.$$
Let
$$\tilde{a} = 2\lambda \left[ \operatorname{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_{kk}, \quad \tilde{b} = \sum_n \frac{h_{kn}}{\tilde{v}_n} - 4\lambda (Y^-\tilde{w})_k, \quad \tilde{d} = -\sum_n h_{kn} \frac{\tilde{w}_k^2 v_n}{\tilde{v}_n^2}. \qquad (9)$$
Setting the derivative to zero requires computing the roots of the degree-three polynomial $\tilde{a} w_k^3 + \tilde{b} w_k^2 + \tilde{d}$. We used the procedure developed in [20], which is based on the explicit calculation of the intermediary root of a canonical form of the cubic. This procedure provides highly accurate numerical results even for badly conditioned polynomials. The algorithm for min-vol IS-NMF follows the same steps as min-vol KL-NMF: only the two steps corresponding to the updates of W and H have to be modified. For the update of H (step 4), use the standard MU. For the update of W (step 9), use:

for f = 1 to F
  for k = 1 to K
    Compute $\tilde{a}$, $\tilde{b}$ and $\tilde{d}$ according to equations (9)
    Compute the roots of $\tilde{a} w_k^3 + \tilde{b} w_k^2 + \tilde{d}$
    Pick y among these roots and zero that minimizes the objective
    $W^+_{f,k} \leftarrow \max(10^{-16}, y)$
  end for
end for
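In place of the specialized solver of [20], this update can be sketched with NumPy's generic polynomial root finder; the helper below (ours, hypothetical, for illustration) solves $\tilde{a} w_k^3 + \tilde{b} w_k^2 + \tilde{d} = 0$ for one coefficient and keeps the best nonnegative candidate:

import numpy as np

def is_update_entry(a, b, d, obj):
    # Minimize `obj` over the real nonnegative roots of a*w^3 + b*w^2 + d
    # and zero, then floor the result at 1e-16 as in the pseudocode above.
    roots = np.roots([a, b, 0.0, d])          # cubic with no linear term
    cands = [r.real for r in roots if abs(r.imag) < 1e-10 and r.real > 0]
    y = min(cands + [0.0], key=obj)           # obj(0) may be np.inf for IS
    return max(1e-16, y)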
REFERENCES
[1] A. Lefèvre, "Méthode d'apprentissage de dictionnaire pour la séparation de sources audio avec un seul capteur," Ph.D. dissertation, École Normale Supérieure de Cachan, 2012.
[2] P. Magron, "Reconstruction de phase par modèles de signaux : application à la séparation de sources audio," Ph.D. dissertation, Télécom ParisTech, 2016.
[3] D. Lee and H. Seung, "Algorithms for non-negative matrix factorization," in Proceedings of the 13th International Conference on Neural Information Processing Systems (NIPS). MIT Press, 2000, pp. 535–541.
[4] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[5] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the β-divergence," Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
[6] G. Zhou, S. Xie, Z. Yang, J.-M. Yang, and Z. He, "Minimum-volume-constrained nonnegative matrix factorization: Enhanced ability of learning parts," IEEE Transactions on Neural Networks, vol. 22, no. 10, pp. 1626–1637, 2011.
[7] X. Fu, K. Huang, and N. D. Sidiropoulos, "On identifiability of nonnegative matrix factorization," IEEE Signal Processing Letters, vol. 25, no. 3, pp. 328–332, 2018.
[8] K. Huang, N. Sidiropoulos, and A. Swami, "Non-negative matrix factorization revisited: Uniqueness and algorithm for symmetric decomposition," IEEE Transactions on Signal Processing, vol. 62, no. 1, pp. 211–224, 2014.
[9] X. Fu, K. Huang, N. Sidiropoulos, and W.-K. Ma, "Nonnegative matrix factorization for signal and data analytics: Identifiability, algorithms, and applications," IEEE Signal Processing Magazine, vol. 36, pp. 59–80, 2019.
[10] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N. Sidiropoulos, "Robust volume minimization-based matrix factorization for remote sensing and document clustering," IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6254–6268, 2016.
[11] A. Ang and N. Gillis, "Algorithms and comparisons of nonnegative matrix factorization with volume regularization for hyperspectral unmixing," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019, to appear.
[12] V. Leplat, A. Ang, and N. Gillis, "Minimum-volume rank-deficient nonnegative matrix factorizations," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 3402–3406.
[13] C.-H. Lin, W.-K. Ma, W.-C. Li, C.-Y. Chi, and A. Ambikapathi, "Identifiability of the simplex volume minimization criterion for blind hyperspectral unmixing: The no-pure-pixel case," IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 10, pp. 5530–5546, 2015.
[14] S. Arora, R. Ge, R. Kannan, and A. Moitra, "Computing a nonnegative matrix factorization—provably," SIAM Journal on Computing, vol. 45, no. 4, pp. 1582–1611, 2016.
[15] S. Vavasis, "On the complexity of nonnegative matrix factorization," SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–1377, 2010.
[16] X. Fu, K. Huang, N. D. Sidiropoulos, Q. Shi, and M. Hong, "Anchor-free correlated topic modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 41, no. 5, pp. 1056–1071, 2019.
[17] Y. Sun, P. Babu, and D. Palomar, "Majorization-minimization algorithms in signal processing, communications, and machine learning," IEEE Transactions on Signal Processing, vol. 65, no. 3, pp. 794–816, 2017.
[18] J. Le Roux, F. J. Weninger, and J. R. Hershey, "Sparse NMF – half-baked or well done?" Mitsubishi Electric Research Laboratories (MERL), Tech. Rep., 2015.
[19] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[20] E. Rechtschaffen, "Real roots of cubics: explicit formula for quasi-solutions," The Mathematical Gazette, no. 524, pp. 268–276, 2008.
14
ResearchGate has not been able to resolve any citations for this publication.
Conference Paper
Full-text available
ABSTRACT In recent years, nonnegative matrix factorization (NMF) with volume regularization has been shown to be a powerful identifiable model; for example for hyperspectral unmixing, document classification, community detection and hidden Markov models. In this paper, we show that minimum-volume NMF (min-volNMF) can also be used when the basis matrix is rank deficient, which is a reasonable scenario for some real-world NMF problems (e.g., for unmixing multispectral images). We propose an alternating fast projected gradient method for minvol NMF and illustrate its use on rank-deficient NMF problems; namely a synthetic data set and a multispectral image. Index Terms— nonnegative matrix factoriztion, minimum volume, identifiability, rank deficiency
Article
Full-text available
In topic modeling, identifiability of the topics is an essential issue. Many topic modeling approaches have been developed under the premise that each topic has an anchor word, which may be fragile in practice, because words and terms have multiple uses; yet it is commonly adopted because it enables identifiability guarantees. Remedies in the literature include using three- or higher-order word co-occurence statistics to come up with tensor factorization models, but identifiability still hinges on additional assumptions. In this work, we propose a new topic identification criterion using second order statistics of the words. The criterion is theoretically guaranteed to identify the underlying topics even when the anchor-word assumption is grossly violated. An algorithm based on alternating optimization, and an efficient primal-dual algorithm are proposed to handle the resulting identification problem. The former exhibits high performance and is completely parameter-free; the latter affords up to 200 times speedup relative to the former, but requires step-size tuning and a slight sacrifice in accuracy. A variety of real text copora are employed to showcase the effectiveness of the approach, where the proposed anchor-free method demonstrates substantial improvements compared to a number of anchor-word based approaches under various evaluation metrics.
Article
Full-text available
Nonnegative matrix factorization (NMF) has become a workhorse for signal and data analytics, triggered by its model parsimony and interpretability. Perhaps a bit surprisingly, the understanding to its model identifiability---the major reason behind the interpretability in many applications such as topic mining and hyperspectral imaging---had been rather limited until recent years. Beginning from the 2010s, the identifiability research of NMF has progressed considerably: Many interesting and important results have been discovered by the signal processing (SP) and machine learning (ML) communities. NMF identifiability has a great impact on many aspects in practice, such as ill-posed formulation avoidance and performance-guaranteed algorithm design. On the other hand, there is no tutorial paper that introduces NMF from an identifiability viewpoint. In this paper, we aim at filling this gap by offering a comprehensive and deep tutorial on model identifiability of NMF as well as the connections to algorithms and applications. This tutorial will help researchers and graduate students grasp the essence and insights of NMF, thereby avoiding typical `pitfalls' that are often times due to unidentifiable NMF formulations. This paper will also help practitioners pick/design suitable factorization tools for their own problems.
Article
In this letter, we propose a new identification criterion that guarantees the recovery of the low-rank latent factors in the nonnegative matrix factorization (NMF) model, under mild conditions. Specifically, using the proposed criterion, it suffices to identify the latent factors if the rows of one factor are sufficiently scattered over the nonnegative orthant, while no structural assumption is imposed on the other factor except being full rank. This is by far the mildest condition under which the latent factors are provably identifiable from the NMF model.
Thesis
Many audio signal processing methods operate on a time-frequency (TF) representation of the data. When the output of these algorithms is a magnitude spectral field, the question arises, in order to reconstruct a time-domain signal, of estimating the corresponding phase field. This is for instance the case in source separation applications, which estimate the spectrograms of the individual sources from the mixture; the so-called Wiener filtering method, widely used in practice, provides satisfactory results but breaks down when the sources overlap in the TF plane. This thesis addresses the problem of phase reconstruction of signals in the TF domain, applied to audio source separation. A preliminary study reveals the need to develop new phase reconstruction techniques to improve the quality of source separation. We propose to base these techniques on signal models. Our approach consists in exploiting information from models underlying the data, such as mixtures of sinusoids. Taking this information into account preserves certain desirable properties, such as temporal continuity or the sharpness of attacks. We integrate these constraints into mixture models for source separation, where the phase of the mixture is exploited. The source magnitudes may be assumed known, or estimated jointly in a model inspired by complex nonnegative matrix factorization. Finally, a probabilistic model of sources with non-uniform phase is developed. It makes it possible to exploit priors stemming from signal modeling and to account for uncertainty about them. These methods are tested on numerous databases of realistic music signals. Their performance, in terms of the quality of the estimated signals and of computation time, exceeds that of traditional methods. In particular, we observe a reduction in interference between estimated sources and a reduction of artifacts in the low frequencies, which confirms the value of signal models for phase reconstruction.
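As background for the thesis summary above, the Wiener filtering baseline it refers to can be sketched in a few lines of numpy: only the magnitudes are separated, and every source estimate inherits the mixture phase, which is exactly the limitation the thesis targets. The function name and the small constant are illustrative.

```python
import numpy as np

def wiener_separate(X, powers, eps=1e-12):
    """Textbook soft Wiener-filter separation in the TF domain: X is the
    complex mixture STFT and `powers` is a list of estimated source power
    spectrograms (nonnegative arrays of the same shape as X). Each source
    estimate keeps the mixture phase; names and eps are illustrative."""
    total = sum(powers) + eps                 # mixture power estimate
    return [(P / total) * X for P in powers]  # complex STFTs of the sources
```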
Article
In blind hyperspectral unmixing (HU), the pure-pixel assumption is well known to enable simple and effective blind HU solutions. However, the pure-pixel assumption is not always satisfied in an exact sense, especially in scenarios where the pixels are all intimately mixed. In the no-pure-pixel case, a good blind HU approach to consider is the minimum-volume enclosing simplex (MVES). Empirical experience has suggested that MVES algorithms can perform well without pure pixels, although it was not entirely clear from a theoretical viewpoint why this is true. This paper aims to address the latter issue. We develop an analysis framework wherein the perfect identifiability of MVES is studied in the noiseless case. We prove that MVES is indeed robust against the lack of pure pixels, as long as the pixels are not too heavily mixed and not too asymmetrically spread. Also, our analysis reveals a surprising and counter-intuitive result: MVES becomes more robust against the lack of pure pixels as the number of endmembers increases. The theoretical results are verified by numerical simulations.
Article
In this paper, we consider nonnegative matrix factorization (NMF) with a regularization that promotes a small volume of the convex hull spanned by the basis matrix. We present highly efficient algorithms for three different volume regularizers and compare them on endmember recovery in hyperspectral unmixing. The NMF algorithms developed in this paper are shown to outperform the state-of-the-art volume-regularized NMF methods, and to produce meaningful decompositions on real-world hyperspectral images in situations where the endmembers are highly mixed (no pure pixels). Furthermore, our extensive numerical experiments show that when the data is highly separable, meaning that there are data points close to the true endmembers, and there are few endmembers, the regularizer based on the determinant of the Gramian produces the best results in most cases. For data that is less separable and/or contains more endmembers, the regularizer based on the logarithm of the determinant of the Gramian performs best in general.
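The two Gramian-based volume regularizers compared above are easy to state in numpy. A minimal sketch follows; delta is an illustrative shift that keeps the logdet argument positive definite, and the paper's exact parameterization may differ.

```python
import numpy as np

def vol_det(W):
    """Volume surrogate det(W^T W), reported above to work best for
    nearly separable data with few endmembers."""
    return np.linalg.det(W.T @ W)

def vol_logdet(W, delta=1.0):
    """Volume surrogate logdet(W^T W + delta*I), reported above to work
    best for less separable data or more endmembers; delta is an
    illustrative shift keeping the argument positive definite."""
    _, logabsdet = np.linalg.slogdet(W.T @ W + delta * np.eye(W.shape[1]))
    return logabsdet
```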
Article
This paper gives an overview of the majorization-minimization (MM) algorithmic framework, which can provide guidance in deriving problem-driven algorithms with low computational cost. A general introduction to MM is presented, including a description of the basic principle and its convergence results. Extensions, acceleration schemes, and connections to other algorithmic frameworks are also covered. To bridge the gap between theory and practice, upper bounds for a large number of basic functions, derived from the Taylor expansion, convexity, and special inequalities, are provided as ingredients for constructing surrogate functions. With these prerequisites established, the application of MM to specific problems is elaborated through a wide range of examples in signal processing, communications, and machine learning.
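A classic instance of the MM recipe described above is the multiplicative update for KL-divergence NMF: Jensen's inequality yields a separable surrogate of the objective, and minimizing that surrogate in closed form gives the familiar update. A minimal numpy sketch, with illustrative naming and a small eps for numerical safety:

```python
import numpy as np

def kl_nmf_update_H(V, W, H, eps=1e-12):
    """Classic multiplicative update for KL-divergence NMF, a standard
    example of an MM-derived algorithm: a separable surrogate of the KL
    objective (via Jensen's inequality) is minimized in closed form,
    which yields the update below (eps avoids division by zero)."""
    WH = W @ H + eps
    return H * (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
```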
Article
In the nonnegative matrix factorization (NMF) problem we are given an n × m nonnegative matrix M and an integer r > 0. Our goal is to express M as AW, where A and W are nonnegative matrices of size n × r and r × m, respectively. In some applications, it makes sense to ask instead for the product AW to approximate M, i.e., to (approximately) minimize ‖M − AW‖_F, where ‖·‖_F denotes the Frobenius norm; we refer to this as approximate NMF. This problem has a rich history spanning quantum mechanics, probability theory, data analysis, polyhedral combinatorics, communication complexity, demography, chemometrics, etc. In the past decade NMF has become enormously popular in machine learning, where A and W are computed using a variety of local search heuristics. Vavasis recently proved that this problem is NP-hard. (Without the restriction that A and W be nonnegative, both the exact and approximate problems can be solved optimally via the singular value decomposition.) We initiate a study of when this problem is solvable in polynomial time. Our results are the following: 1. We give a polynomial-time algorithm for exact and approximate NMF for every constant r. Indeed, NMF is most interesting in applications precisely when r is small. 2. We complement this with a hardness result: if exact NMF can be solved in time (nm)^{o(r)}, then 3-SAT has a subexponential-time algorithm. This rules out substantial improvements to the above algorithm. 3. We give an algorithm that runs in time polynomial in n, m, and r under the separability condition identified by Donoho and Stodden in 2003. The algorithm may be practical since it is simple and noise tolerant (under benign assumptions). Separability is believed to hold in many practical settings. To the best of our knowledge, this last result is the first example of a polynomial-time algorithm that provably works under a non-trivial condition on the input, and we believe this will be an interesting and important direction for future work.
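One well-known polynomial-time method that works under the separability condition mentioned in the third result is the successive projection algorithm (SPA). The sketch below is a standard version from the separable-NMF literature, shown for illustration rather than as the exact algorithm of the paper.

```python
import numpy as np

def spa(M, r, eps=1e-12):
    """Successive Projection Algorithm: greedily pick the column of
    largest norm, then project that direction out of the residual.
    Under separability, the selected columns index (scaled) columns
    of the basis matrix A. A standard sketch from the separable-NMF
    literature, not necessarily the procedure of the paper."""
    R = M.astype(float).copy()
    cols = []
    for _ in range(r):
        j = int(np.argmax(np.linalg.norm(R, axis=0)))  # most energetic column
        cols.append(j)
        u = R[:, j] / (np.linalg.norm(R[:, j]) + eps)
        R -= np.outer(u, u @ R)                        # remove its direction
    return cols
```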
Article
This paper considers volume minimization (VolMin)-based structured matrix factorization (SMF). VolMin is a factorization criterion that decomposes a given data matrix into a basis matrix times a structured coefficient matrix by finding the minimum-volume simplex that encloses all the columns of the data matrix. Recent work showed that VolMin guarantees the identifiability of the factor matrices under mild conditions that are realistic in a wide variety of applications. This paper focuses on both theoretical and practical aspects of VolMin. On the theory side, the exact equivalence of two independently developed sufficient conditions for VolMin identifiability is proven here, thereby providing a more comprehensive understanding of this aspect of VolMin. On the algorithm side, computational complexity and sensitivity to outliers are two key challenges associated with real-world applications of VolMin. These are addressed here via a new VolMin algorithm that handles volume regularization in a computationally simple way and, simultaneously, automatically detects and iteratively downweights outliers. Simulations and real-data experiments using a remotely sensed hyperspectral image and the Reuters document corpus are employed to showcase the effectiveness of the proposed algorithm.
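The automatic outlier downweighting described above can be mimicked in spirit with IRLS-style weights that shrink the influence of data columns with large residuals. The sketch below is our illustrative approximation, not the paper's exact weighting rule.

```python
import numpy as np

def residual_weights(X, W, H, p=1.0, eps=1e-12):
    """IRLS-style column weights: data columns with large residuals get
    small weights, mimicking in spirit the automatic outlier
    downweighting described above (the paper's exact rule may differ).
    With p = 1 this corresponds to an l1-type loss on the columns."""
    res = np.linalg.norm(X - W @ H, axis=0)  # per-column residual norms
    return 1.0 / (res ** (2.0 - p) + eps)
```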