Blind Audio Source Separation with
Minimum-Volume Beta-Divergence NMF
Valentin Leplat, Nicolas Gillis, Andersen M.S. Ang*
Abstract—Given a mixed signal composed of various audio sources and recorded with a single microphone, we consider in this paper the blind audio source separation problem, which consists in isolating and extracting each of the sources. To perform this task, nonnegative matrix factorization (NMF) based on the Kullback-Leibler and Itakura-Saito β-divergences is a standard and state-of-the-art technique that uses the time-frequency representation of the signal. We present a new NMF model better suited for this task. It is based on the minimization of β-divergences along with a penalty term that promotes the columns of the dictionary matrix to have a small volume. Under some mild assumptions and in noiseless conditions, we prove that this model is able to identify the sources. In order to solve this problem, we propose multiplicative updates whose derivations are based on the standard majorization-minimization framework. We show on several numerical experiments that our new model is able to obtain more interpretable results than standard NMF models. Moreover, we show that it is able to recover the sources even when the number of sources present in the mixed signal is overestimated. In fact, our model automatically sets sources to zero in this situation, hence it performs model order selection automatically.
Index Terms—nonnegative matrix factorization,
β-divergences, minimum-volume regularization,
identifiability, blind audio source separation, model
order selection
I. INTRODUCTION
Blind audio source separation concerns the techniques
used to extract unknown signals called sources from a
mixed audio signal x. In this paper, we assume that
the audio signal is recorded with a single microphone.
Considering a mixed signal composed of various audio sources, blind audio source separation consists in isolating and extracting each of the sources on the basis of the single recording.

* Department of Mathematics and Operational Research, Faculté Polytechnique, Université de Mons, Rue de Houdain 9, 7000 Mons, Belgium. The authors acknowledge the support of the Fonds de la Recherche Scientifique - FNRS and the Fonds Wetenschappelijk Onderzoek - Vlaanderen (FWO) under EOS Project no O005318F-RG47, and of the European Research Council (ERC starting grant no 679515). E-mails: {valentin.leplat, nicolas.gillis, manshun.ang}@umons.ac.be. Manuscript received in July 2019. Accepted April 2020.

Usually, the only known
information is the estimated number of sources present in the mixed signal. The blind source separation problem is said to be underdetermined as there are fewer sensors (only one in our case) than sources. It is therefore necessary to use additional information to make the problem well posed. The most common approach for this kind of problem is to exploit some form of redundancy in the mixed signal in order to make it overdetermined.
This is typically done by computing the spectrogram, which represents the signal in the time and frequency domains simultaneously (splitting the signal into overlapping time frames). The computation of spectrograms can be summarized as follows: short time segments are extracted from the signal and multiplied element-wise by a window function or "smoothing" window of size $F$. Successive windows overlap by a fraction of their length, which is usually taken as 50%. On each of these segments, a discrete Fourier transform is computed and stacked column-by-column in a matrix $X$. Thus, from a one-dimensional signal $x \in \mathbb{R}^T$, we obtain a complex matrix $X \in \mathbb{C}^{F \times N}$ called the spectrogram, where $F \times N \approx 2T$ (due to the 50% overlap between windows). Note that the length of the window determines the shape of the spectrogram. These preliminary operations correspond to computing the short-time Fourier transform (STFT), which is given by the following formula: for $1 \le f \le F$ and $1 \le n \le N$,
$$X_{f,n} = \sum_{j=0}^{F-1} w_j \, x_{nL+j} \, e^{-i 2\pi f j / F},$$
where $w \in \mathbb{R}^F$ is the smoothing window of size $F$, $L$ is a shift parameter (also called the hop size), and $H = F - L$ is the overlap parameter. The number of rows corresponds to the frequency resolution. Letting $f_s$ be the sampling rate of the audio signal, consecutive rows correspond to frequency bands that are $f_s/F$ Hz apart.
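For concreteness, the following minimal sketch (our own illustration, not code from the paper; the window size, hop size and test signal are arbitrary choices) computes an amplitude spectrogram with a 50% overlap following the formula above. We keep only the nonnegative frequencies of the real FFT, so the spectrogram has $F/2+1$ rows, consistent with the bin counts reported in Section IV.

```python
import numpy as np

def amplitude_spectrogram(x, F=1024):
    """|STFT| of a 1-D signal x, Hamming window of size F, 50% overlap (L = F // 2)."""
    L = F // 2                                # hop size (50% overlap)
    w = np.hamming(F)                         # smoothing window
    n_frames = (len(x) - F) // L + 1
    X = np.empty((F // 2 + 1, n_frames), dtype=complex)
    for n in range(n_frames):
        segment = w * x[n * L : n * L + F]    # windowed time segment
        X[:, n] = np.fft.rfft(segment)        # keep nonnegative frequencies only
    return np.abs(X)                          # amplitude spectrogram V = |X|

# Example: a 3-second signal sampled at 16 kHz
fs = 16000
t = np.arange(3 * fs) / fs
x = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 880 * t)
V = amplitude_spectrogram(x, F=1024)          # 513 frequency bins
```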
The time-frequency representation of a signal highlights
two of its fundamental properties: sparsity and redun-
dancy. Sparsity comes from the fact that most real
signals are not active at all frequencies at all time points.
Redundancy comes from the fact that frequency patterns
of the sources repeat over time. Mathematically, this
means that the spectrogram is a low-rank matrix. These
two fundamental properties led sound source separation
techniques to integrate algorithms such as nonnegative
matrix factorization (NMF). Such techniques retrieve
sensible solutions even for single-channel signals.
A. Mixing assumptions
Given $K$ source signals $s^{(k)} \in \mathbb{R}^T$ for $1 \le k \le K$, we assume the acquisition process is well modelled by a linear instantaneous mixing model:
$$x(t) = \sum_{k=1}^{K} s^{(k)}(t) \quad \text{with } t = 0, \ldots, T-1. \qquad (1)$$
Therefore, for each time index $t$, the mixed signal $x(t)$ from a single microphone is the sum of the $K$ source signals. It is standard to assume that microphones are linear as long as the recorded signals are not too loud; if signals are too loud, they are usually clipped. The mixing process is modelled as instantaneous, as opposed to convolutive models, which are used to take into account sound effects such as reverberation. The source separation problem consists in finding source estimates $\hat{s}^{(k)}$ of the sources $s^{(k)}$ for all $k \in \{1, \ldots, K\}$. Let us denote by $\mathcal{S}$ the linear STFT operator, and let $\mathcal{S}^\dagger$ be its conjugate transpose. We have $\mathcal{S}^\dagger \mathcal{S} = F I$, where $I$ is the identity matrix of appropriate dimension. For the remainder of this paper, $\mathcal{S}^\dagger$ stands for the inverse short-time Fourier transform. Note that the term inverse is not meant in a mathematical sense: the STFT is not a surjective transformation from $\mathbb{R}^T$ to $\mathbb{C}^{F \times N}$, that is, a matrix with complex entries is not necessarily the STFT of a real signal; see [1] and [2] for more details. By applying the STFT operator $\mathcal{S}$ to (1), we obtain the mixing model in the time-frequency domain:
$$X = \mathcal{S}(x(t)) = \mathcal{S}\!\left(\sum_{k=1}^{K} s^{(k)}(t)\right) = \sum_{k=1}^{K} S^{(k)},$$
where $S^{(k)}$ is the STFT of source $k$, that is, the spectrogram of source $k$. To identify the sources, we use in this paper the amplitude spectrogram $V = |X| \in \mathbb{R}^{F \times N}_+$ defined as $V_{fn} = |X_{fn}|$ for all $f, n$. We assume that $V = \sum_{k=1}^{K} |S^{(k)}|$, which means that there is no sound cancellation between the sources; this is usually the case in most signals. Finally, we assume that the source amplitude spectrograms $|S^{(k)}|$ are well approximated by nonnegative rank-one matrices. This leads to the NMF model described in the next section. Note that a source can be made of several rank-one factors, in which case a post-processing step will have to recombine them a posteriori (e.g., by looking at the correspondence in the activation of the sources over time). Note also that we focus on the NMF stage of the source separation, which factorizes $V$ into the source spectrograms. For the phase reconstruction, which is a highly non-trivial problem, we consider a naive reconstruction procedure consisting in keeping the same phase as the input mixture for each source [1].
B. NMF for audio source separation
Given a nonnegative matrix $V \in \mathbb{R}^{F \times N}_+$ (the spectrogram) and a positive integer $K \ll \min(F, N)$ (the number of sources, called the factorization rank), NMF aims to compute two nonnegative matrices $W$ with $K$ columns and $H$ with $K$ rows such that $V \approx WH$. NMF approximates each column of $V$ by a linear combination of the columns of $W$ weighted by the components of the corresponding column of $H$ [3]. When the matrix $V$ corresponds to the amplitude spectrogram or the power spectrogram of an audio signal, we have that
• $W$ is referred to as the dictionary matrix, and each of its columns corresponds to the spectral content of a source, and
• $H$ is the activation matrix, specifying whether a source is active at a certain time frame and with which intensity.
In other words, each rank-one factor $W(:,k)H(k,:)$ will correspond to a source: the $k$th column $W(:,k)$ of $W$ is the spectral content of source $k$, and the $k$th row $H(k,:)$ of $H$ is its activation over time. To compute $W$ and $H$, NMF requires solving the following optimization problem:
$$\min_{W \ge 0,\, H \ge 0} D(V|WH) = \sum_{f,n} d(V_{fn} \,|\, [WH]_{fn}),$$
where $A \ge 0$ means that $A$ is component-wise nonnegative, and $d(x|y)$ is an appropriate measure of fit. In audio source separation, a common measure of fit is the discrete $\beta$-divergence, denoted $d_\beta(x|y)$ and equal to
$$d_\beta(x|y) = \begin{cases} \frac{1}{\beta(\beta-1)}\left(x^\beta + (\beta-1)\,y^\beta - \beta x y^{\beta-1}\right) & \text{for } \beta \neq 0, 1, \\[2pt] x \log\frac{x}{y} - x + y & \text{for } \beta = 1, \\[2pt] \frac{x}{y} - \log\frac{x}{y} - 1 & \text{for } \beta = 0. \end{cases}$$
For $\beta = 2$, this is the standard squared Euclidean distance, that is, the squared Frobenius norm $\|V - WH\|_F^2$ (up to a factor $1/2$). For $\beta = 1$ and $\beta = 0$, the $\beta$-divergence corresponds to the Kullback-Leibler (KL) divergence and the Itakura-Saito (IS) divergence, respectively. The error measure should be chosen according to the noise statistics assumed on the data: the Frobenius norm assumes i.i.d. Gaussian noise, the KL divergence assumes additive Poisson noise, and the IS divergence assumes multiplicative Gamma noise [4]. The $\beta$-divergence $d_\beta(x|y)$ is homogeneous of degree $\beta$: $d_\beta(\lambda x|\lambda y) = \lambda^\beta d_\beta(x|y)$. This implies that factorizations obtained with $\beta > 0$ (such as the Euclidean distance or the KL divergence) will rely more heavily on the largest data values, and less precision is to be expected in the estimation of the low-power components. The IS divergence ($\beta = 0$) is scale-invariant, that is, $d_{IS}(\lambda x|\lambda y) = d_{IS}(x|y)$ [5]; it is the only member of the $\beta$-divergence family to possess this property. It implies that time-frequency areas of low power are as important in the divergence computation as areas of high power. This property is interesting in audio source separation as low-power frequency bands can perceptually contribute as much as high-power frequency bands. Note that both the KL and IS divergences are better adapted to audio source separation than the Euclidean distance as they are built on a logarithmic scale, like human perception; see [1] and [5]. Moreover, the $\beta$-divergence is only convex with respect to $W$ (or $H$) for $\beta \in [1, 2]$. Otherwise, the objective function is non-convex; in particular, for $\beta < 1$, even the problem of inferring $H$ with $W$ fixed is non-convex. For more details on $\beta$-divergences, see [5].
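As an illustration of these special cases, the following sketch (ours, not taken from the paper's MATLAB package; the test values are arbitrary) evaluates the $\beta$-divergence and shows the scale-invariance of IS versus the homogeneity of KL and the Euclidean distance.

```python
import numpy as np

def beta_divergence(x, y, beta):
    """Sum of element-wise beta-divergences d_beta(x | y).
    beta = 2: (half) squared Euclidean distance; beta = 1: KL; beta = 0: IS."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if beta == 1:                      # Kullback-Leibler
        d = x * np.log(x / y) - x + y
    elif beta == 0:                    # Itakura-Saito
        d = x / y - np.log(x / y) - 1
    else:                              # generic case, beta != 0, 1
        d = (x**beta + (beta - 1) * y**beta
             - beta * x * y**(beta - 1)) / (beta * (beta - 1))
    return d.sum()

x, y = np.array([1.0, 2.0]), np.array([1.5, 1.0])
for lam in (1.0, 10.0):
    print(lam,
          beta_divergence(lam * x, lam * y, 0),   # unchanged (scale-invariant)
          beta_divergence(lam * x, lam * y, 1),   # scales like lam
          beta_divergence(lam * x, lam * y, 2))   # scales like lam**2
```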
C. Contribution and outline of the paper
In Section II, we propose a new NMF model, referred
to as minimum-volume β-NMF (min-vol β-NMF), to
tackle the audio source separation problem. This model
penalizes the columns of the dictionary matrix $W$ so that their convex hull has a small volume. To the best of our knowledge, this model is novel in two aspects: (1) it is the first time a minimum-volume penalty is associated with a $\beta$-divergence for $\beta \neq 2$, and the first time such models are used in the context of audio source separation, and (2) as opposed to most previously proposed minimum-volume NMF models, our model imposes a normalization constraint on the factor $W$ instead of $H$.
As far as we know, the only other paper that used a normalization of $W$ is [6], but the authors did not justify this choice compared to the normalization of $H$ (the choice seems arbitrary, motivated by the 'elimination of the norm indeterminacy'), nor did they provide theoretical guarantees. In this paper, we explain why the normalization of $W$ is a better choice in practice, and we prove that, under some mild assumptions and in the noiseless case, this model provably identifies the sources; see Theorem 1.
To the best of our knowledge, this is the first result of
this type in the audio source separation literature. In
Section III, we propose an algorithm to tackle min-vol
β-NMF, focusing on the KL and IS divergences. The
algorithm is based on multiplicative updates (MU) that
are derived using the standard majorization-minimization
framework, and that monotonically decrease the objec-
tive function. In Section IV, we present several numerical
experiments, comparing min-vol β-NMF with standard
NMF and sparse NMF. The two main conclusions are that (1) min-vol $\beta$-NMF consistently performs better at identifying the sources, and (2) as opposed to NMF
and sparse NMF, min-vol β-NMF is able to detect when
the factorization rank is overestimated by automatically
setting sources to zero.
II. MINIMUM-VOLUME NMF WITH β-DIVERGENCES
In this section, we present a new model of separation
based on the minimization of β-divergences including a
penalty term promoting solutions with minimum volume
spanned by the columns of the dictionary matrix W.
Section II-A recalls the geometric interpretation of NMF
which motivated the use of a minimum volume penalty
on the dictionary W. Section II-B discusses the new
proposed normalization compared to previous minimum
volume NMF models, and shows that min-vol $\beta$-NMF provably recovers the true factors $(W, H)$ under mild conditions and in the noiseless case; see Theorem 1.
A. Geometry and the min-vol β-NMF model
As mentioned earlier, $V = WH$ means that each column of $V$ is a linear combination of the columns of $W$ weighted by the components of the corresponding column of $H$; in fact, $v_n = W h_n$ for $n = 1, \ldots, N$, where $v_n$ denotes the $n$th column of the data matrix $V$. This gives NMF a nice geometric interpretation: for all $n$,
$$v_n \in \text{cone}(W) = \left\{ v \in \mathbb{R}^F \;\middle|\; v = W\theta,\; \theta \ge 0 \right\},$$
meaning that the columns of $V$ are contained in the convex cone generated by the columns of $W$; see Figure 1
for an illustration. From this interpretation, it follows
that, in general, NMF decompositions are not unique because there exist several (often infinitely many) sets of columns of $W$ that span the convex cone generated by the data points; see for example [8] for more details. Hence, NMF is in most cases ill-posed because the optimal solution is not unique. In order to make the solution unique (up to permutation and scaling of the columns of $W$ and the rows of $H$), hence making the problem well posed and the parameters $(W, H)$ of the problem identifiable, a key idea is to look for a solution $W$ with minimum volume. Intuitively, we will look for
the cone $\text{cone}(W)$ containing the data points that is as close as possible to these data points.

Fig. 1: Geometric interpretation of NMF for $K = 3$ [7].

The use of minimum-volume NMF has led to a new class of NMF methods that outperform existing ones in many applications such as document analysis and blind hyperspectral unmixing; see the recent survey [9]. Note that minimum-volume NMF implicitly promotes the factor $H$ to be sparse: the fact that $W$ has a small volume implies that many data points will be located on the facets of $\text{cone}(W)$, hence $H$ will be sparse.
Hence, in this paper, we consider the following model, referred to as min-vol $\beta$-NMF:
$$\min_{W(:,j) \in \Delta^F \,\forall j,\; H \ge 0} \; D_\beta(V|WH) + \lambda \,\text{vol}(W), \qquad (2)$$
where $\Delta^F = \left\{ x \in \mathbb{R}^F_+ \;\middle|\; \sum_{i=1}^F x_i = 1 \right\}$ is the unit simplex, $\lambda$ is a penalty parameter, and $\text{vol}(W)$ is a function that measures the volume spanned by the columns of $W$. In this paper, we use $\text{vol}(W) = \log\det(W^TW + \delta I)$, where $\delta$ is a small positive constant that prevents $\log\det(W^TW)$ from going to $-\infty$ when $W$ tends to a rank-deficient matrix (that is, when $r = \text{rank}(W) < K$). The reason for using such a measure is that $\sqrt{\det(W^TW)}/K!$ is the volume of the convex hull of the columns of $W$ and the origin. This measure is one of the most widely used ones, and has been shown to perform very well in practice [10], [11]. Moreover, the criterion $\log\det(W^TW + \delta I)$ is able to distinguish two rank-deficient solutions and favour the solution for $W$ with smaller volume [12]. Finally, as we will illustrate in Section IV, this criterion is able to identify the right number of sources even when $K$ is overestimated, by setting some rank-one factors to zero.
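To make the penalized objective of (2) concrete, here is a small sketch (ours; the values of λ and δ and the synthetic data are illustrative) that evaluates it for the KL divergence.

```python
import numpy as np

def minvol_kl_objective(V, W, H, lam=0.1, delta=1.0):
    """Objective of (2) for beta = 1: D_KL(V | WH) + lam * logdet(W^T W + delta I).
    Assumes V, W, H > 0 and the columns of W on the unit simplex."""
    WH = W @ H
    kl = np.sum(V * np.log(V / WH) - V + WH)                     # KL data-fitting term
    K = W.shape[1]
    _, logdet = np.linalg.slogdet(W.T @ W + delta * np.eye(K))   # stable log-determinant
    return kl + lam * logdet

# Small random example with simplex-normalized W
rng = np.random.default_rng(0)
F, N, K = 50, 30, 4
W = rng.random((F, K)); W /= W.sum(axis=0)      # columns sum to one
H = rng.random((K, N))
V = W @ H + 1e-3 * rng.random((F, N))           # nearly low-rank nonnegative data
print(minvol_kl_objective(V, W, H))
```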
B. Normalization and identifiability of min-vol β-NMF
As mentioned above, under some appropriate conditions on $V = WH$, minimum-volume NMF models will provably recover the ground truth $(W, H)$ that generated $V$, up to permutation and scaling of the rank-one factors. The first identifiability results for minimum-volume NMF models assumed that the entries in each column of $H$ sum to one, that is, that $H^Te = e$, where $e$ is the all-one column vector whose dimension is clear from the context, meaning that $H$ is column stochastic [8], [13]. Under this condition, each column of $V$ lies in the convex hull of the columns of $W$; see Figure 2 for an
illustration.
Fig. 2: Geometric interpretation of NMF when $K = 3$ and $H$ is column stochastic [7].
Under the three assumptions that (1) $H$ is column stochastic, (2) $W$ has full column rank, and (3) $H$ satisfies the sufficiently scattered condition, minimizing the volume of $\text{conv}(W)$ such that $V = WH$ recovers the true underlying factors, up to permutation and scaling. Intuitively, the sufficiently scattered condition requires $H$ to be sparse enough so that data points are located on the facets of $\text{conv}(W)$; see Appendix A for a formal definition. The sufficiently scattered condition makes sense for most audio source data sets as it is reasonable to assume that, for most time points, only a few sources are active, hence $H$ is sparse; see [9] for more details on the sufficiently scattered condition. Note that the sufficiently scattered condition is a generalization of the separability condition, which requires $W = V(:,\mathcal{J})$ for some index set $\mathcal{J}$ of size $K$ [14]. However, separability is a much stronger assumption as it requires that, for each source, there exists a time point where only that source is active. Note that although min-vol NMF guarantees identifiability, the corresponding optimization
problem (2) is still hard to solve in general, as is the original NMF problem [15].
Despite this nice result, the constraint $H^Te = e$ makes the NMF model less general and does not apply to all data sets. In the case where the data do not naturally belong to a convex hull, one needs to normalize the data points so that their entries sum to one, so that $H^Te = e$ can be assumed without loss of generality (in the noiseless case). This normalization can sometimes increase the noise and might greatly influence the solution, hence it is usually not recommended in practice; see the discussion in [9].
In [7], the authors show that identifiability still holds when the condition that $H$ is column stochastic is relaxed to $H$ being row stochastic. As opposed to column stochasticity, row stochasticity of $H$ can be assumed without loss of generality since any factorization $WH$ can be properly normalized so that this assumption holds. In fact, $WH = \sum_{k=1}^{K} \left( a_k W(:,k) \right) \left( H(k,:)/a_k \right)$ for any $a_k > 0$, $k = 1, \ldots, K$. In other terms, letting $A$ be the diagonal matrix with $A(k,k) = a_k = \sum_{j=1}^{N} H(k,j)$ for $k = 1, \ldots, K$, we have $WH = (WA)(A^{-1}H) = W'H'$, where $H' = A^{-1}H$ is row stochastic.
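The following toy snippet (added by us for clarity) shows both rescalings: making $H$ row stochastic, and putting the columns of $W$ on the unit simplex as used in this paper; in both cases the product $WH$ is unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.random((6, 3))
H = rng.random((3, 8))

# Row-stochastic H: W H = (W A)(A^{-1} H) with A = Diag of the row sums of H
a = H.sum(axis=1)                           # a_k = sum_j H(k, j)
W_row, H_row = W * a, H / a[:, None]
assert np.allclose(W @ H, W_row @ H_row)
assert np.allclose(H_row.sum(axis=1), 1)    # rows of H now sum to one

# Column-normalized W (this paper): columns of W on the unit simplex
b = W.sum(axis=0)                           # b_k = ||W(:, k)||_1
W_col, H_col = W / b, H * b[:, None]
assert np.allclose(W @ H, W_col @ H_col)
assert np.allclose(W_col.sum(axis=0), 1)    # columns of W now sum to one
```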
Similarly to what is done in [7], we prove in this paper that requiring $W$ to be column stochastic (which can also be assumed without loss of generality) also leads to identifiability. Geometrically, the columns of $W$ are constrained to be on the unit simplex. Minimizing the volume still makes a lot of sense: we want the columns of $W$ to be as close as possible to one another within the unit simplex. In Appendix A, we prove the following theorem.

Theorem 1. Assume $V = W^\# H^\#$ with $\text{rank}(V) = K$, $W^\# \ge 0$, and $H^\#$ satisfying the sufficiently scattered condition (Definition 2 in Appendix A). Then the optimal solution of
$$\min_{W \in \mathbb{R}^{F \times K},\, H \in \mathbb{R}^{K \times N}} \log\det\left(W^TW\right) \quad \text{such that} \quad V = WH,\; W^Te = e,\; H \ge 0, \qquad (3)$$
recovers $(W^\#, H^\#)$ up to permutation and scaling.
Proof. See Appendix A.
In noiseless conditions, replacing $W^Te = e$ with $He = e$ in (3) leads to the same identifiability result; see [7, Theorem 1]. Therefore, in noiseless conditions and under the conditions of Theorem 1, both models return the same solution up to permutation and scaling. However, in the presence of noise, we have observed that the two models may behave very differently. In fact, we advocate that the constraint $W^Te = e$ is better suited for noisy real-world problems: on many numerical examples, we have observed that the normalization $W^Te = e$ is much less sensitive to noise and returns much better solutions. The reason is mostly twofold:
(i) As described above, using the normalization $He = e$ amounts to multiplying $W$ by a diagonal matrix whose entries are the $\ell_1$ norms of the rows of $H$. Therefore, the columns of $W$ that correspond to dominating (resp. dominated) sources, that is, sources with much more (resp. less) power and/or active at many (resp. few) time points, will have a much higher (resp. lower) norm. Therefore, the term $\log\det(W^TW + \delta I)$ is much more influenced by the dominating sources and will have difficulties penalizing the dominated sources. In other terms, the use of the term $\log\det(W^TW + \delta I)$ with the normalization $He = e$ implicitly requires that the rank-one factors $W(:,k)H(k,:)$ for $k = 1, \ldots, K$ are well balanced, that is, have similar norms. This is not the case for many real (audio) signals.
(ii) As will be explained in Section III, the update of $W$ requires computing the matrix $Y$, the inverse of $W^TW + \delta I$; this term appears in the gradient of the objective function with respect to $W$. The numerical stability of such operations is related to the condition number of $W^TW + \delta I$. For an $\ell_1$ normalization of the columns of $W$, the condition number is bounded above as follows:
$$\text{cond}(W^TW + \delta I) = \frac{\sigma_{\max}(W^TW + \delta I)}{\sigma_{\min}(W^TW + \delta I)} = \frac{\sigma_{\max}(W)^2 + \delta}{\sigma_{\min}(W)^2 + \delta} \le \frac{\left(\sqrt{K} \max_k \|W(:,k)\|_2\right)^2 + \delta}{\delta} \le 1 + \frac{K}{\delta},$$
where $\sigma_{\min}(W)$ and $\sigma_{\max}(W)$ are the smallest and largest singular values of $W$, respectively. In the numerical experiments, we use $\delta = 1$. On the other hand, the normalization $He = e$ may lead to arbitrarily large values for the condition number of $W^TW + \delta I$, which we have observed numerically on several examples. This issue can be mitigated with the use of the normalization $He = \rho e$ for some sufficiently large $\rho > 0$, for which identifiability still holds [7]. However, this normalization still performs worse because of the first reason explained above.
For these reasons, we believe that the model (3) would
also be better suited (compared to the normalization
on H) in other contexts; for example for document
classification [16].
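A quick numerical illustration of point (ii) above (our own synthetic example; the amount by which one source dominates is arbitrary) compares the condition number of $W^TW + \delta I$ under the two normalizations.

```python
import numpy as np

rng = np.random.default_rng(2)
F, K, delta = 200, 5, 1.0
W = rng.random((F, K))
H = rng.random((K, 400))
H[0] *= 100.0                         # make source 1 strongly dominating

def cond_after_normalization(W, H, delta, normalize="W"):
    if normalize == "W":              # columns of W sum to one (this paper)
        Wn = W / W.sum(axis=0)
    else:                             # rows of H sum to one (He = e)
        Wn = W * H.sum(axis=1)
    return np.linalg.cond(Wn.T @ Wn + delta * np.eye(W.shape[1]))

print("W-normalized:", cond_after_normalization(W, H, delta, "W"))   # bounded by 1 + K/delta
print("H-normalized:", cond_after_normalization(W, H, delta, "H"))   # typically orders of magnitude larger
```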
III. ALGORITHM FOR MIN-VOL β-NMF
Most NMF algorithms alternately update $H$ for $W$ fixed and vice versa, and we adopt this strategy in this paper. For $W$ fixed, (2) is equivalent to standard NMF and we will use the MU that have already been derived in the literature [3], [5].
To tackle (2) for $H$ fixed, let us consider
$$\min_{W \ge 0} F(W) = D_\beta(V|WH) + \lambda \log\det(W^TW + \delta I). \qquad (4)$$
Note that, for now, we have discarded the normalization on the columns of $W$. In our algorithm, we will use the update for $W$ obtained by solving (4) as a descent direction along with a line-search procedure to integrate the constraint on $W$. This will ensure that the objective function $F$ is non-increasing at each iteration. In the following sections, we derive MU for $W$ that decrease the objective in (4). We follow the standard majorization-minimization framework [17]. First, an auxiliary function, which we denote $\bar{F}$, is constructed so that it majorizes the objective. An auxiliary function for $F$ at a point $\tilde{W}$ is defined as follows.

Definition 1. The function $\bar{F}(W|\tilde{W}) : \Omega \times \Omega \to \mathbb{R}$ is an auxiliary function for $F(W) : \Omega \to \mathbb{R}$ at $\tilde{W} \in \Omega$ if the conditions $\bar{F}(W|\tilde{W}) \ge F(W)$ for all $W \in \Omega$ and $\bar{F}(\tilde{W}|\tilde{W}) = F(\tilde{W})$ are satisfied.
Then, the optimization of $F$ can be replaced by an iterative process that minimizes $\bar{F}$. More precisely, the new iterate $W^{(i+1)}$ is computed by minimizing exactly the auxiliary function at the previous iterate $W^{(i)}$. This guarantees that $F$ decreases at each iteration.

Lemma 1. Let $W, W^{(i)} \ge 0$, and let $\bar{F}$ be an auxiliary function for $F$ at $W^{(i)}$. Then $F$ is non-increasing under the update $W^{(i+1)} = \arg\min_{W \ge 0} \bar{F}(W|W^{(i)})$.

Proof. In fact, we have by definition that
$$F(W^{(i)}) = \bar{F}(W^{(i)}|W^{(i)}) \ge \min_{W} \bar{F}(W|W^{(i)}) = \bar{F}(W^{(i+1)}|W^{(i)}) \ge F(W^{(i+1)}).$$
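As a toy illustration of this majorization-minimization scheme (our own one-dimensional example, unrelated to the specific auxiliary functions derived below), the loop below minimizes a non-convex function by repeatedly minimizing a quadratic majorizer built at the current iterate; the concave log term is majorized by its tangent, the same idea as the logdet majorization (7) in Section III-B.

```python
import numpy as np

# Toy MM: minimize f(w) = (w - 1)^2 + lam * log(1 + w^2).
lam = 2.0
f = lambda w: (w - 1.0) ** 2 + lam * np.log(1.0 + w ** 2)

w = 5.0                                               # initial iterate
for _ in range(50):
    # majorize log(1 + w^2) by its tangent at the current iterate,
    # then minimize the resulting convex quadratic in closed form
    w_new = (1.0 + w ** 2) / (1.0 + w ** 2 + lam)
    assert f(w_new) <= f(w) + 1e-12                   # monotone decrease (Lemma 1)
    w = w_new
print(w, f(w))
```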
The most difficult part in using the majorization-
minimization framework is to design an auxiliary func-
tion that is easy to optimize. Usually such auxiliary
functions are separable (that is, there is no interaction between the variables, so that each entry of $W$ can be updated independently) and convex.
A. Separable auxiliary functions for β-divergences
For the sake of completeness, we briefly recall the auxiliary function proposed in [5] for the data-fitting term. It consists in majorizing the convex part of the $\beta$-divergence using Jensen's inequality and majorizing the concave part by its tangent (first-order Taylor approximation). We have
$$d_\beta(x|y) = \check{d}_\beta(x|y) + \hat{d}_\beta(x|y) + \bar{d}_\beta(x|y), \qquad (5)$$
where $\check{d}$ is a convex function of $y$, $\hat{d}$ is a concave function of $y$, and $\bar{d}$ is constant with respect to $y$; see Table I.
TABLE I: Differentiable convex-concave-constant decomposition of the $\beta$-divergence under the form (5) [5].

                       $\check{d}(x|y)$      $\hat{d}(x|y)$     $\bar{d}(x)$
$\beta = 0$            $x y^{-1}$            $\log(y)$          $x(\log(x) - 1)$
$\beta \in [1, 2]$     $d_\beta(x|y)$        $0$                $0$
The function $D_\beta(V|WH)$ can be written as $\sum_f D_\beta(v^f | w^f H)$, where $v^f$ and $w^f$ are the $f$th rows of $V$ and $W$, respectively. Therefore, we only consider the optimization over one specific row of $W$. To simplify notation, we denote the iterates $w^{(i+1)}$ (next iterate) and $w^{(i)}$ (current iterate) by $w$ and $\tilde{w}$, respectively.

Lemma 2 ([5]). Let $\tilde{v} = \tilde{w}H$ and $\tilde{w}$ be such that $\tilde{v}_n > 0$ for all $n$ and $\tilde{w}_k > 0$ for all $k$. Then the function
$$G(w|\tilde{w}) = \sum_n \left( \left[ \sum_k \frac{\tilde{w}_k h_{kn}}{\tilde{v}_n} \, \check{d}\!\left(v_n \,\Big|\, \tilde{v}_n \frac{w_k}{\tilde{w}_k}\right) \right] + \bar{d}(v_n) + \left[ \hat{d}'(v_n|\tilde{v}_n) \sum_k (w_k - \tilde{w}_k) h_{kn} + \hat{d}(v_n|\tilde{v}_n) \right] \right) \qquad (6)$$
is an auxiliary function for $\sum_n d(v_n | [wH]_n)$ at $\tilde{w}$.
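For the KL divergence, Table I gives $\check{d} = d_\beta$, $\hat{d} = 0$, $\bar{d} = 0$, so $G$ reduces to a Jensen-type majorizer. The following sanity check (ours, on random positive data) verifies the majorization numerically.

```python
import numpy as np

rng = np.random.default_rng(3)
K, N = 4, 10
H = rng.random((K, N)) + 0.1
w_tilde = rng.random(K) + 0.1          # current iterate (one row of W)
w = rng.random(K) + 0.1                # arbitrary candidate row
v = rng.random(N) + 0.1                # corresponding row of V
v_tilde = w_tilde @ H

d_kl = lambda x, y: x * np.log(x / y) - x + y

F_true = sum(d_kl(v[n], w @ H[:, n]) for n in range(N))
G = sum((w_tilde[k] * H[k, n] / v_tilde[n])
        * d_kl(v[n], v_tilde[n] * w[k] / w_tilde[k])
        for n in range(N) for k in range(K))
assert G >= F_true - 1e-10             # G majorizes the KL data-fitting term
print(F_true, G)
```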
B. A separable auxiliary function for the minimum-
volume regularizer
The minimum-volume regularizer $\log\det(W^TW + \delta I)$ is a non-convex function. However, it can be upper bounded using the fact that $\log\det(\cdot)$ is a concave function, so that its first-order Taylor approximation provides an upper bound; see for example [10]. For any positive-definite matrices $A, B \in \mathbb{R}^{K \times K}$, we have
$$\log\det(B) \le \log\det(A) + \text{trace}\left(A^{-1}(B - A)\right) = \text{trace}\left(A^{-1}B\right) + \log\det(A) - K.$$
This implies that for any $W, Z \in \mathbb{R}^{F \times K}$, we have
$$\log\det(W^TW + \delta I) \le l(W, Z), \qquad (7)$$
where $l(W, Z) = \text{trace}\left(Y W^T W\right) + \log\det\left(Y^{-1}\right) - K$ and $Y = (Z^TZ + \delta I)^{-1}$ with $\delta > 0$. Note that $Z^TZ + \delta I$ is positive definite, hence it is invertible and its inverse $Y$ is also positive definite. Finally, $l(W, Z)$ is an auxiliary function for $\log\det(W^TW + \delta I)$ at $Z$. However, it is quadratic and not separable, hence non-trivial to optimize over the nonnegative orthant. The non-constant part of $l(W, Z)$ can be written as $\sum_f w^f Y (w^f)^T$, where $w^f$ is the $f$th row of $W$. Henceforth we will focus on one particular row vector $w$ with $l(w) = w^T Y w$, which will further be considered as a column vector of size $K \times 1$.
Lemma 3. Let $w, \tilde{w} \in \mathbb{R}^K_+$ be such that $\tilde{w}_k > 0$ for all $k$, let $Y = Y^+ - Y^-$ with $Y^+ = \max(Y, 0)$ and $Y^- = \max(-Y, 0)$, let $\Phi(\tilde{w})$ be the diagonal matrix $\Phi(\tilde{w}) = \text{Diag}\left( \frac{2\,[Y^+\tilde{w} + Y^-\tilde{w}]}{[\tilde{w}]} \right)$, where $\frac{[A]}{[B]}$ is the component-wise division between $A$ and $B$, and let $\Delta w = w - \tilde{w}$. Then
$$\bar{l}(w|\tilde{w}) = l(\tilde{w}) + \Delta w^T \nabla l(\tilde{w}) + \frac{1}{2} \Delta w^T \Phi(\tilde{w}) \Delta w \qquad (8)$$
is a separable auxiliary function for $l(w) = w^T Y w$ at $\tilde{w}$.
Proof. See Appendix B.
Remark 1 (Choice of the auxiliary function). A simpler choice for the auxiliary function would be to replace $\Phi(\tilde{w})$ with $2\lambda_{\max}(Y) I$, where $\lambda_{\max}(Y)$ is the largest eigenvalue of $Y$ (the factor $2$ appears because $l(w) = w^T Y w$ while there is a factor $1/2$ in front of $\Phi(\tilde{w})$). However, it would lead to a worse approximation. In particular, if $Y$ is a diagonal matrix (since $Y \succ 0$, its diagonal elements are positive), our choice gives $\Phi(\tilde{w}) = 2Y$ for any $\tilde{w} > 0$, meaning that the auxiliary function matches the function $l(w)$ exactly. This would not be the case for the choice $2\lambda_{\max}(Y) I$ (unless $Y$ is a scaling of the identity matrix).
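The following small check (our own; $Z$, $\tilde{w}$ and the candidate points are random) verifies numerically that the separable quadratic of Lemma 3 upper-bounds $l(w) = w^T Y w$ for $Y = (Z^TZ + \delta I)^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(4)
K, delta = 5, 1.0
Z = rng.random((20, K))
Y = np.linalg.inv(Z.T @ Z + delta * np.eye(K))   # positive definite, mixed-sign entries
Yp, Ym = np.maximum(Y, 0), np.maximum(-Y, 0)     # Y = Y+ - Y-

w_tilde = rng.random(K) + 0.1
phi = 2.0 * (Yp @ w_tilde + Ym @ w_tilde) / w_tilde   # diagonal of Phi(w_tilde)

for _ in range(1000):
    w = rng.random(K) * 2.0                      # arbitrary nonnegative candidate
    dw = w - w_tilde
    l = w @ Y @ w
    l_bar = w_tilde @ Y @ w_tilde + dw @ (2 * Y @ w_tilde) + 0.5 * dw @ (phi * dw)
    assert l_bar >= l - 1e-10                    # Lemma 3 majorization
```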
C. Auxiliary function for min-vol β-NMF
Based on the auxiliary functions presented in Sections III-A and III-B, we can directly derive a separable auxiliary function $\bar{F}(W|\tilde{W})$ for min-vol $\beta$-NMF (2).

Corollary 1. For $W, H \ge 0$, $\lambda > 0$, $Y = (\tilde{W}^T\tilde{W} + \delta I)^{-1}$ with $\delta > 0$, and the constant $c = \log\det(Y^{-1}) + K$, the function
$$\bar{F}(W|\tilde{W}) = \sum_f G(w^f|\tilde{w}^f) + \lambda \left( \sum_f \bar{l}(w^f|\tilde{w}^f) + c \right),$$
where $G$ is given by (6) and $\bar{l}$ by (8), is a convex and separable auxiliary function for $F(W) = D_\beta(V|WH) + \lambda \log\det(W^TW + \delta I)$ at $\tilde{W}$.

Proof. This follows directly from Lemma 2, Equation (7) and Lemma 3.
In the following section, we explicitly provide the MU for the KL divergence ($\beta = 1$) by finding a closed-form solution for the minimization of $\bar{F}$. In Appendix C, we provide the MU for the IS divergence ($\beta = 0$). Due to the lack of space, the other cases are not treated explicitly but can be derived in a similar way. For the same reason, we only compare KL-based NMF models in the numerical experiments (Section IV).
D. Algorithm for min-vol KL-NMF
As before, let us focus on a single row of $W$, denoted $w$, as the objective function $F(W)$ is separable by rows. For $\beta = 1$, the derivative of the auxiliary function $\bar{F}(w|\tilde{w})$ with respect to a specific coefficient $w_k$ is given by
$$\nabla_{w_k} \bar{F}(w|\tilde{w}) = \sum_n h_{kn} - \sum_n \frac{h_{kn} \tilde{w}_k v_n}{w_k \tilde{v}_n} + 2\lambda [Y\tilde{w}]_k + 2\lambda \left[ \text{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_k w_k - 2\lambda \left[ \text{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_k \tilde{w}_k.$$
Due to the separability, we set the derivative to zero to obtain the closed-form solution, which is given in Table II in matrix form.
Note that although the closed-form solution has a negative term in the numerator of the multiplicative factor (see Table II), the iterates always remain nonnegative given that $V$, $H$ and $\tilde{W}$ are nonnegative. In fact, the term before the minus sign is always larger than the term after the minus sign: $J_{F,N}H^T - 4\lambda(\tilde{W}Y^-)$ is squared (component-wise) and a nonnegative term is added to it, hence the component-wise square root of the result is larger than $J_{F,N}H^T - 4\lambda(\tilde{W}Y^-)$.
Algorithm 1 summarizes our algorithm to tackle (2) for $\beta = 1$, which we refer to as min-vol KL-NMF. Note that the update of $H$ (step 4) is the one from [3]. More importantly, note that we have incorporated a line search for the update of $W$. In fact, although the MU for $W$ are guaranteed to decrease the objective function, they do not guarantee that $W$ remains normalized, that is, that $\|W(:,k)\|_1 = 1$ for all $k$. Hence, we normalize $W$ after it is updated (step 10), and we normalize $H$ accordingly so that $WH$ remains unchanged. When this normalization is performed, the $\beta$-divergence part of $F$ is unchanged but the minimum-volume penalty changes, so that the objective function $F$ might increase. In order to guarantee non-increasingness, we integrate a simple backtracking line-search procedure; see steps 11-16 of Algorithm 1. In summary, our MU provide a descent direction that preserves the nonnegativity of the iterates, and we use a projection and a simple backtracking line search to guarantee the monotonicity of the objective
function, as in standard projected gradient descent methods.

TABLE II: Multiplicative update for min-vol KL-NMF.
$$W = \tilde{W} \odot \frac{\left( \left[ J_{F,N}H^T - 4\lambda(\tilde{W}Y^-) \right]^{.2} + 8\lambda\, \tilde{W}(Y^+ + Y^-) \odot \left( \frac{[V]}{[\tilde{W}H]} H^T \right) \right)^{.\frac{1}{2}} - \left( J_{F,N}H^T - 4\lambda(\tilde{W}Y^-) \right)}{\left[ 4\lambda\, \tilde{W}(Y^+ + Y^-) \right]},$$
where $A \odot B$ (resp. $\frac{[A]}{[B]}$) is the Hadamard product (resp. division) between $A$ and $B$, $A^{.\alpha}$ is the element-wise $\alpha$ exponent of $A$, $J_{F,N}$ is the $F$-by-$N$ all-one matrix, and $Y = Y^+ - Y^- = (\tilde{W}^T\tilde{W} + \delta I)^{-1}$ with $\delta > 0$, $Y^+ \ge 0$, $Y^- \ge 0$, $\lambda > 0$.
Algorithm 1 min-vol KL-NMF
Input: A matrix $V \in \mathbb{R}^{F \times N}_+$, an initialization $W \in \mathbb{R}^{F \times K}_+$, an initialization $H \in \mathbb{R}^{K \times N}_+$, a factorization rank $K$, a maximum number of iterations maxiter, a min-vol weight $\lambda > 0$ and $\delta > 0$
Output: A rank-$K$ NMF $(W, H)$ of $V \approx WH$ with $W \ge 0$ and $H \ge 0$.
1:  $\gamma = 1$, $Y = (W^TW + \delta I)^{-1}$
2:  for $i = 1$ : maxiter do
3:      % Update of matrix H
4:      $H \leftarrow H \odot \frac{[W^T([V]/[WH])]}{[W^T J_{F,N}]}$
5:      % Update of matrix W
6:      $Y \leftarrow (W^TW + \delta I)^{-1}$
7:      $Y^+ \leftarrow \max(Y, 0)$
8:      $Y^- \leftarrow \max(-Y, 0)$
9:      $W^+$ is updated according to Table II
10:     $(W^+_\gamma, H_\gamma) = \text{normalize}(W^+, H)$
11:     % Line-search procedure
12:     while $F(W^+_\gamma, H_\gamma) > F(W, H)$ do
13:         $\gamma \leftarrow \gamma \times 0.8$
14:         $W^+_\gamma \leftarrow (1 - \gamma)W + \gamma W^+$
15:         $(W^+_\gamma, H_\gamma) \leftarrow \text{normalize}(W^+_\gamma, H)$
16:     end while
17:     $(W, H) \leftarrow (W^+_\gamma, H_\gamma)$
18:     % Update of $\gamma$ to avoid a vanishing step size
19:     $\gamma \leftarrow \min(1, \gamma \times 1.2)$
20: end for
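To illustrate the structure of Algorithm 1, here is a compact NumPy sketch (ours, not the authors' MATLAB implementation; the values of λ and δ, the synthetic data, and the cap on the number of backtracking steps are our own choices). The normalize step rescales the columns of $W$ to the unit simplex and rescales $H$ accordingly.

```python
import numpy as np

def kl_div(V, WH):
    return np.sum(V * np.log(V / WH) - V + WH)

def objective(V, W, H, lam, delta):
    _, logdet = np.linalg.slogdet(W.T @ W + delta * np.eye(W.shape[1]))
    return kl_div(V, W @ H) + lam * logdet

def normalize(W, H):
    """Rescale so that each column of W sums to one, keeping W H unchanged."""
    s = W.sum(axis=0)
    return W / s, H * s[:, None]

def minvol_kl_nmf(V, W, H, lam=0.1, delta=1.0, maxiter=200):
    F, N = V.shape
    J = np.ones((F, N))
    W, H = normalize(W, H)
    gamma = 1.0
    for _ in range(maxiter):
        # Step 4: standard KL multiplicative update of H [3]
        H = H * ((W.T @ (V / (W @ H))) / (W.T @ J))
        # Steps 6-9: update of W according to Table II
        Y = np.linalg.inv(W.T @ W + delta * np.eye(W.shape[1]))
        Yp, Ym = np.maximum(Y, 0), np.maximum(-Y, 0)
        B = J @ H.T - 4 * lam * (W @ Ym)
        C = 8 * lam * (W @ (Yp + Ym)) * ((V / (W @ H)) @ H.T)
        W_plus = W * (np.sqrt(B ** 2 + C) - B) / (4 * lam * (W @ (Yp + Ym)))
        # Steps 10-16: normalization and backtracking line search
        f_curr = objective(V, W, H, lam, delta)
        Wg, Hg = normalize(W_plus, H)
        backtracks = 0
        while objective(V, Wg, Hg, lam, delta) > f_curr and backtracks < 50:
            gamma *= 0.8                        # cap on backtracks added in this sketch
            Wg, Hg = normalize((1 - gamma) * W + gamma * W_plus, H)
            backtracks += 1
        W, H = Wg, Hg
        gamma = min(1.0, gamma * 1.2)           # step 19: avoid a vanishing step size
    return W, H

# Toy usage on synthetic nonnegative data
rng = np.random.default_rng(5)
F, N, K = 60, 40, 3
V = rng.random((F, K)) @ rng.random((K, N)) + 1e-6
W0, H0 = rng.random((F, K)) + 0.1, rng.random((K, N)) + 0.1
W, H = minvol_kl_nmf(V, W0, H0, lam=0.05, delta=1.0, maxiter=100)
```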
It can be verified that the computational complexity of min-vol KL-NMF is asymptotically equivalent to that of the standard MU for $\beta$-NMF, that is, it requires $O(FNK)$ operations per iteration. Indeed, all the main operations are matrix products with a complexity of $O(FNK)$ and element-wise operations on matrices of size $F \times K$ or $K \times N$. Note that the inversion of the $K$-by-$K$ matrix $(W^TW + \delta I)$ requires $O(K^3)$ operations, which is dominated by $O(FNK)$ since $K \le \min(F, N)$ (in fact, typically $K \ll \min(F, N)$, hence this term is negligible). Therefore, although Algorithm 1 will be slower than the baseline KL-NMF (that is, the standard MU) because of the additional terms to be computed and the line search, the asymptotic computational cost is the same; see Table IV for a runtime comparison.
IV. NUMERICAL EXPERIMENTS
In this section, we report an experimental comparative study of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18] applied to the spectrograms of two monophonic piano sequences and a synthetic mix of a bass and drums. For the two monophonic piano sequences, the audio signals are real-life signals of standard quality. Since the sequences are made of pure piano notes, the number $K$ should correspond to the number of notes present in the mixed signals. The comparative study is performed for several values of $K$, with a focus on the case where the factorization rank $K$ is overestimated. For all simulations, random initializations are used for $W$ and $H$, and the best result among 5 runs is kept for the comparative study. In all cases, we use a Hamming window of size $F = 1024$ and a 50% overlap between two frames. Sparse KL-NMF has a similar structure to min-vol KL-NMF, with a penalty parameter for the sparsity-enforcing regularization. To tune these two parameters, we have used the same strategy for both methods: we manually tried a wide range of values and report the best results. The code is available from bit.ly/minvolKLNMF (code written in MATLAB R2017a), and can be used to directly rerun all the experiments below. They were run on a laptop computer with an Intel Core i7-7500U CPU @ 2.70GHz and 32GB of memory.
a) Mary had a little lamb: The first audio sample is the first measure of "Mary had a little lamb". The sequence is composed of three notes: $E_4$, $D_4$ and $C_4$. The recorded signal is 4.7 seconds long and downsampled to $f_s = 16000$ Hz, yielding $T = 75200$ samples. The STFT of the input signal $x$ yields a temporal resolution of 16 ms and a frequency resolution of 31.25 Hz, so that the amplitude spectrogram $V$ has $N = 294$ frames and $F = 257$ frequency bins. The musical score is shown in Figure 3. All NMF algorithms were run for 200 iterations, which allowed them to converge.
Fig. 3: Musical score of “Mary had a little lamb”.
Fig. 4: Comparative study of baseline KL-NMF (top), min-vol KL-NMF (middle) and sparse KL-NMF (bottom) applied to the "Mary had a little lamb" amplitude spectrogram with $K = 3$: (a) columns of $W$; (b) rows of $H$.
Figure 4 presents the columns of $W$ (dictionary matrix) and the rows of $H$ for baseline KL-NMF and min-vol KL-NMF with $K = 3$. Figure 5 presents the time-frequency masking coefficients. These coefficients are computed as follows:
$$\text{mask}^{(k)}_{f,n} = \frac{\hat{X}^{(k)}_{f,n}}{\sum_k \hat{X}^{(k)}_{f,n}} \quad \text{with } k = 1, \ldots, K,$$
where $\hat{X}^{(k)} = W(:,k)H(k,:)$ is the estimated source $k$. The masks are nonnegative and sum to one for each pair $(f, n)$. This representation allows one to identify visually whether the NMF algorithm was able to separate the sources properly. All the simulations give a nice separation with similar results for $W$ and $H$. The activations are coherent with the sequences of the notes. However, Figure 5 shows that min-vol KL-NMF and sparse KL-NMF provide a better separation in terms of time-frequency localization compared to baseline KL-NMF.
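A possible implementation of this masking step (our sketch; the phase handling follows the naive procedure of Section I-A, keeping the phase of the input mixture) is given below.

```python
import numpy as np

def source_masks(W, H):
    """Time-frequency masks mask^(k) = Xhat^(k) / sum_k Xhat^(k),
    where Xhat^(k) = W(:, k) H(k, :)."""
    K = W.shape[1]
    Xhat = np.stack([np.outer(W[:, k], H[k, :]) for k in range(K)])   # K x F x N
    return Xhat / (Xhat.sum(axis=0, keepdims=True) + 1e-12)

def separate(X_complex, W, H):
    """Apply the masks to the complex mixture spectrogram X; each estimated
    source keeps the phase of the input mixture (naive phase reconstruction)."""
    return [m * X_complex for m in source_masks(W, H)]
```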
Fig. 5: Masking coefficients obtained with (a) baseline KL-NMF, (b) min-vol KL-NMF and (c) sparse KL-NMF applied to the "Mary had a little lamb" amplitude spectrogram with $K = 3$.
Fig. 6: Comparative study of baseline KL-NMF (top), min-vol KL-NMF (middle) and sparse KL-NMF (bottom) applied to the "Mary had a little lamb" amplitude spectrogram with $K = 7$: (a) columns of $W$; (b) rows of $H$.

We now perform the same experiment but using $K = 7$. Figure 6 presents the results. This corresponds to the situation where the factorization rank is overestimated. Figure 7 presents the time-frequency masking coefficients. We observe that min-vol KL-NMF is able to extract the three notes correctly and automatically sets three source estimates to zero (more precisely, three rows of $H$ are set to zero, while the corresponding columns of $W$ have entries equal to one another as $\|W(:,k)\|_1 = 1$ for all $k$), while baseline KL-NMF and sparse KL-NMF split the notes among all the sources. One can observe that a fourth note is identified in all simulations (see the isolated peaks on Figure 7-(b), second row of $H$ from the top)
and corresponds to the noise within the piano just before
triggering a specific note (in particular, the hammer
noise). This observation is confirmed by the fact that
the amplitude is proportional to the natural strength of
the fingers playing the notes. In this scenario, where $K$ is overestimated, min-vol KL-NMF outperforms baseline KL-NMF and sparse KL-NMF.
b) Prelude of Bach: The second audio sample corresponds to the first 30 seconds of the "Prelude and Fugue No. 1 in C major" by J. S. Bach, played by Glenn Gould¹. The audio sample is a sequence of 13 notes: $B_3$, $C_4$, $D_4$, $E_4$, $F^{\sharp}_4$, $G_4$, $A_4$, $C_5$, $D_5$, $E_5$, $F_5$, $G_5$, $A_5$. The recorded signal is downsampled to $f_s = 11025$ Hz, yielding $T = 330750$ samples. The STFT of the input signal $x$ yields a temporal resolution of 46 ms and a frequency resolution of 10.76 Hz, so that the amplitude spectrogram $V$ has $N = 647$ frames and $F = 513$ frequency bins. The musical score is presented in Figure 8. All NMF algorithms were run for 300 iterations, which allowed them to converge.

¹ https://www.youtube.com/watch?v=ZlbK5r5mBH4

Fig. 7: Masking coefficients obtained with (a) baseline KL-NMF, (b) min-vol KL-NMF and (c) sparse KL-NMF applied to the "Mary had a little lamb" amplitude spectrogram with $K = 7$.

Figure 9 presents the results obtained
for $W$ and $H$ with a factorization rank $K = 16$, hence overestimated. We observe that min-vol KL-NMF automatically sets three components to zero (marked with a * symbol in Figure 9) while 13 source estimates are determined. The fundamental frequencies (maximum peak frequency) of the 13 source estimates correspond to the theoretical fundamentals of the 13 notes mentioned earlier. Note that using baseline KL-NMF or sparse KL-NMF led to the same conclusions as for the first audio
sample; these two algorithms generate as many source estimates as imposed by the factorization rank, while the min-vol KL-NMF algorithm preserves the integrity of the 13 sources. Additionally, the activations are coherent with the sequences of the notes. Figure 10 shows (on a limited time interval) that the estimated sequence follows the sequence defined in the score. Note that a threshold and permutations of the rows of $H$ were used to improve visibility.

Fig. 8: Musical score of the sample "Prelude and Fugue No. 1 in C major".
c) Bass and drums: The third audio signal is a synthetic mix of a bass and drums². The audio signal is downsampled to $f_s = 16000$ Hz, yielding $T = 104821$ samples. The STFT of the input signal $x$ yields a temporal resolution of 32 ms and a frequency resolution of 15.62 Hz, so that the amplitude spectrogram $V$ has $N = 206$ frames and $F = 513$ frequency bins. For this synthetic mix, we have access to the true sources in the form of two audio files. Therefore, we can estimate the quality of the separation with standard metrics, namely the signal-to-distortion ratio (SDR), the source-to-interference ratio (SIR) and the source-to-artifacts ratio (SAR) [19]. They have been computed with the toolbox BSS Eval³. The metrics are expressed in dB, and the higher they are, the better the separation. Min-vol KL-NMF, baseline KL-NMF and sparse KL-NMF have been considered for this comparative study, with a factorization rank equal to two. It is clear that a rank-one approximation is too simplistic for these sources, but the goal is to compare the algorithms and show that min-vol KL-NMF is able to find a better solution even in this simplified context. All NMF algorithms were run for 400 iterations, which allowed them to converge. Table III shows the results. Except for the SAR metric of the second source (drums), min-vol KL-NMF outperforms baseline KL-NMF and sparse KL-NMF.
² http://isse.sourceforge.net/demos.html
³ http://bass-db.gforge.inria.fr/bss_eval/

Fig. 9: Factor matrices $W$ and $H$ obtained with min-vol KL-NMF with factorization rank $K = 16$ on the sample "Prelude and Fugue No. 1 in C major": (a) columns of $W$; (b) rows of $H$.

d) Runtime performance: Let us compare the runtime of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18]. The algorithms are compared
on the three examples presented in paragraphs a) and b) of this section:
• Setup 1: sample "Mary had a little lamb" with $K = 3$, 200 iterations.
• Setup 2: sample "Mary had a little lamb" with $K = 7$, 200 iterations.
• Setup 3: "Prelude and Fugue No. 1 in C major" with $K = 16$, 300 iterations.
For each test setup, the algorithms are run with the same 20 random initializations of $W$ and $H$. Table IV reports the average and standard deviation of the runtime (in seconds) over these 20 runs. We observe that min-vol KL-NMF (Algorithm 1) is slower, but not dramatically so, as expected. In particular, on the larger Setup 3, it is less than three times slower than the standard MU.
TABLE III: SDR, SIR and SAR metrics for the results obtained with min-vol KL-NMF, baseline KL-NMF and sparse KL-NMF on a synthetic mix of bass and drums.

                        Source 1: bass                      Source 2: drums
Algorithms              SDR(dB)   SIR(dB)   SAR(dB)         SDR(dB)   SIR(dB)   SAR(dB)
min-vol KL-NMF          -1.14      0.12      7.78            9.60     19.8      10.09
baseline KL-NMF         -4.26     -1.39      2.64            7.97      9.00     15.25
sparse KL-NMF           -4.69     -1.73      2.33            7.89      8.96     14.98
Fig. 10: Validation of the estimated sequence obtained with min-vol KL-NMF with factorization rank $K = 16$ on the sample "Prelude and Fugue No. 1 in C major".
TABLE IV: Runtime performance (in seconds) of baseline KL-NMF, min-vol KL-NMF (Algorithm 1) and sparse KL-NMF [18]. The table reports the average and standard deviation over 20 random initializations for the three experimental setups described in the text.

Algorithms              Setup 1          Setup 2          Setup 3
baseline KL-NMF         0.44 ± 0.03      0.43 ± 0.01       3.81 ± 0.19
min-vol KL-NMF          3.79 ± 0.13      2.39 ± 0.30      10.19 ± 1.28
sparse KL-NMF           0.20 ± 0.02      0.20 ± 0.01       2.21 ± 0.26
V. CONCLUSION AND PERSPECTIVES
In this paper, we have presented a new NMF model for audio source separation based on the minimization of a cost function that includes a $\beta$-divergence (data-fitting term) and a penalty term that promotes solutions $W$ with minimum volume. We have proved the identifiability of the model in the exact case, under the sufficiently scattered condition for the activation matrix $H$. We have provided multiplicative updates to tackle this problem and have illustrated the behaviour of the method on real-world audio signals. We highlighted the capacity of the model to deal with the case where $K$ is overestimated: it automatically sets some components to zero and gives good results for the source estimates.
Further work includes tackling the following questions:
• Under which conditions can we prove the identifiability of min-vol $\beta$-NMF in the presence of noise, and in the rank-deficient case?
• Can we prove that min-vol $\beta$-NMF performs model order selection automatically? Under which conditions? We have observed this behaviour on many examples, but the proof remains elusive.
• Can we design more efficient algorithms?
Further work also includes the use of our new model
min-vol β-NMF for other applications and the design
of more efficient algorithms (for example, that avoid
using a line-search procedure) with stronger convergence
guarantees (beyond the monotonicity of the objective
function).
Acknowledgments: We thank Kejun Huang and Xiao Fu
for helpful discussion on Theorem 1, and giving us the
insight to adapt their proof from [7] to our model (2). We
also thank the reviewers for their insightful comments
that helped us improve the paper.
APPENDIX
A. Sufficiently scattered condition and identifiability
Before giving the definition of the sufficiently scattered
condition from [8], let us first recall an important prop-
erty of the duals of nested cones.
Lemma 4. Let $\mathcal{C}_1$ and $\mathcal{C}_2$ be convex cones such that $\mathcal{C}_1 \subseteq \mathcal{C}_2$. Then $\mathcal{C}_2^* \subseteq \mathcal{C}_1^*$, where $\mathcal{C}_1^*$ and $\mathcal{C}_2^*$ are the dual cones of $\mathcal{C}_1$ and $\mathcal{C}_2$, respectively. The dual of a cone $\mathcal{C}$ is defined as $\mathcal{C}^* = \left\{ y \mid x^T y \ge 0 \text{ for all } x \in \mathcal{C} \right\}$.
Definition 2 (Sufficiently Scattered). A matrix $H \in \mathbb{R}^{K \times N}_+$ is sufficiently scattered if
1) $\mathcal{C} \subseteq \text{cone}(H)$, and
2) $\text{cone}(H)^* \cap \text{bd}\,\mathcal{C}^* = \left\{ \lambda e_k \mid \lambda \ge 0,\; k = 1, \ldots, K \right\}$,
where $\mathcal{C} = \left\{ x \mid x^T e \ge \sqrt{K - 1}\, \|x\|_2 \right\}$ is a second-order cone, $\mathcal{C}^* = \left\{ x \mid x^T e \ge \|x\|_2 \right\}$, $\text{cone}(H) = \{ x \mid x = H\theta,\; \theta \ge 0 \}$ is the conic hull of the columns of $H$, and $\text{bd}$ denotes the boundary of a set.
We can now prove Theorem 1.

Proof of Theorem 1. Recall that $W^\#$ and $H^\#$ are the true latent factors that generated $V$, with $\text{rank}(V) = K$ and $H^\#$ sufficiently scattered. Let us consider $\hat{W}$ and $\hat{H}$, a feasible solution of (3). Since $\text{rank}(V) = K$ and $V = \hat{W}\hat{H}$, we must have $\text{rank}(\hat{W}) = \text{rank}(\hat{H}) = K$. Hence there exists an invertible matrix $A \in \mathbb{R}^{K \times K}$ such that $\hat{W} = W^\# A^{-1}$ and $\hat{H} = A H^\#$. Since $\hat{W}$ is a feasible solution of problem (3), we have
$$e^T \hat{W} = e^T W^\# A^{-1} = e^T A^{-1} = e^T,$$
where we assumed $e^T W^\# = e^T$ without loss of generality since $W^\# \ge 0$ and $\text{rank}(W^\#) = K$. Note that $e^T A^{-1} = e^T$ is equivalent to $e^T A = e^T$, meaning that the entries in each column of $A$ sum to one. Therefore we have $e^T A e = K$. Since $\hat{H}$ is a feasible solution, we also have $\hat{H} = A H^\# \ge 0$. Let us denote by $a_j$ the $j$th row of $A$, and by $a^T_k$ the $k$th column of $A^T$. By the definition of a dual cone, $A H^\# \ge 0$ means that the rows $a_j \in \text{cone}(H^\#)^*$ for $j = 1, \ldots, K$. Since $H^\#$ is sufficiently scattered, $\text{cone}(H^\#)^* \subseteq \mathcal{C}^*$ (by Lemma 4), hence $a_j \in \mathcal{C}^*$. Therefore we have $\|a_j\|_2 \le a_j e$ by definition of $\mathcal{C}^*$. This leads to the following:
$$|\det(A)| = |\det(A^T)| \le \prod_{k=1}^{K} \|a^T_k\|_2 = \prod_{j=1}^{K} \|a_j\|_2 \le \prod_{j=1}^{K} a_j e \le \left( \frac{\sum_j a_j e}{K} \right)^K = \left( \frac{e^T A e}{K} \right)^K = 1.$$
The first inequality is Hadamard's inequality, the second inequality is due to $a_j \in \mathcal{C}^*$, and the third inequality is the arithmetic-geometric mean inequality. Now we can conclude exactly as is done in [7, Theorem 1] by showing that the matrix $A$ can only be a permutation matrix for an optimal solution $(\hat{W}, \hat{H})$ of (3), and therefore identifiability for model (3) holds.
B. Proof of Lemma 3
Separability of $\bar{l}(w|\tilde{w})$ holds since $\Phi(\tilde{w})$ is diagonal. The condition $\bar{l}(\tilde{w}|\tilde{w}) = l(\tilde{w})$ from Definition 1 can be checked easily. It remains to prove that $\bar{l}(w|\tilde{w}) \ge l(w)$ for all $w$. Let us first rewrite the quadratic function $l(w)$ using its Taylor expansion at $w = \tilde{w}$:
$$l(w) = l(\tilde{w}) + (w - \tilde{w})^T \nabla l(\tilde{w}) + \frac{1}{2} (w - \tilde{w})^T \nabla^2 l(\tilde{w}) (w - \tilde{w}) = l(\tilde{w}) + (w - \tilde{w})^T 2Y\tilde{w} + \frac{1}{2} (w - \tilde{w})^T 2Y (w - \tilde{w}).$$
Proving that $\bar{l}(w|\tilde{w}) \ge l(w)$ is equivalent to proving that $\frac{1}{2}(w - \tilde{w})^T \left[\Phi(\tilde{w}) - 2Y\right] (w - \tilde{w}) \ge 0$, which boils down to proving that the matrix $\left[\Phi(\tilde{w}) - 2Y\right]$ is positive semi-definite. We have $\Phi_{ij}(\tilde{w}) = 2\delta_{ij} \frac{(Y^+\tilde{w})_i + (Y^-\tilde{w})_i}{\tilde{w}_i}$, where $\delta_{ij}$ is the Kronecker symbol. Let us consider the following matrix: $M_{ij}(\tilde{w}) = \tilde{w}_i \left[\Phi(\tilde{w}) - 2Y\right]_{ij} \tilde{w}_j$, which is a rescaling of $\left[\Phi(\tilde{w}) - 2Y\right]$. It remains to show that $M$ is positive semi-definite⁴. Since $M$ is symmetric and its diagonal entries are non-negative, it is sufficient to show that $M$ is diagonally dominant [?, Proposition 7.2.3], that is,
$$|M_{ii}| \ge \sum_{j \ne i} |M_{ij}| \quad \text{for all } i.$$
We have for all $i$ that
$$M_{ii} = 2\tilde{w}_i \sum_j \left( Y^+_{ij} + Y^-_{ij} \right) \tilde{w}_j - 2\tilde{w}_i Y_{ii} \tilde{w}_i, \quad \text{and} \quad M_{ij} = -2\tilde{w}_i Y_{ij} \tilde{w}_j \text{ for } j \ne i.$$
Since $Y^+_{ij} + Y^-_{ij} = |Y_{ij}|$, we have
$$M_{ii} - \sum_{j \ne i} |M_{ij}| = 2\tilde{w}_i \sum_j |Y_{ij}| \tilde{w}_j - 2\tilde{w}_i Y_{ii} \tilde{w}_i - 2\tilde{w}_i \sum_{j \ne i} |Y_{ij}| \tilde{w}_j = 2\tilde{w}_i |Y_{ii}| \tilde{w}_i - 2\tilde{w}_i Y_{ii} \tilde{w}_i \ge 0,$$
implying that $M$ is diagonally dominant.

⁴ The remainder of the proof was suggested to us by one of the reviewers; it is more elegant and simpler than our original proof.
C. Algorithm for min-vol IS-NMF
For $\beta = 0$ (IS divergence), the derivative of the auxiliary function $\bar{F}(w|\tilde{w})$ with respect to a specific coefficient $w_k$ is given by:
$$\nabla_{w_k} \bar{F}(w|\tilde{w}) = \sum_n \frac{h_{kn}}{\tilde{v}_n} - \sum_n \frac{h_{kn} \tilde{w}_k^2 v_n}{w_k^2 \tilde{v}_n^2} + 2\lambda [Y\tilde{w}]_k + 2\lambda \left[ \text{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_k w_k - 2\lambda \left[ \text{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_k \tilde{w}_k.$$
Let
$$\tilde{a} = 2\lambda \left[ \text{Diag}\left( \frac{Y^+\tilde{w} + Y^-\tilde{w}}{\tilde{w}} \right) \right]_k, \quad \tilde{b} = \sum_n \frac{h_{kn}}{\tilde{v}_n} - 4\lambda [Y^-\tilde{w}]_k, \quad \tilde{d} = -\sum_n \frac{h_{kn} \tilde{w}_k^2 v_n}{\tilde{v}_n^2}. \qquad (9)$$
Setting the derivative to zero requires computing the roots of the degree-three polynomial $\tilde{a} w_k^3 + \tilde{b} w_k^2 + \tilde{d}$. We use the procedure developed in [20], which is based on the explicit calculation of the intermediary root of a canonical form of the cubic. This procedure is able to provide highly accurate numerical results even for badly conditioned polynomials. The algorithm for min-vol IS-NMF follows the same steps as min-vol KL-NMF: only the two steps corresponding to the updates of $W$ and $H$ have to be modified. For the update of $H$ (step 4), use the standard MU. For the update of $W$ (step 9), use
for $f \leftarrow 1$ to $F$
    for $k \leftarrow 1$ to $K$
        Compute $\tilde{a}$, $\tilde{b}$ and $\tilde{d}$ according to equations (9)
        Compute the roots of $\tilde{a} w_k^3 + \tilde{b} w_k^2 + \tilde{d}$
        Pick, among these roots and zero, the value $y$ that minimizes the objective
        $W^+_{f,k} \leftarrow \max(10^{-16}, y)$
    end for
end for
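A direct (if unoptimized) way to realize this inner step is sketched below (our own NumPy version; np.roots is used in place of the specialized cubic solver of [20], and the entries of $V$ and $\tilde{W}$ are assumed strictly positive).

```python
import numpy as np

def update_W_minvol_IS(V, W, H, lam, delta):
    """One update of W for min-vol IS-NMF: for each entry, zero the gradient of the
    auxiliary function by solving the cubic a*w^3 + b*w^2 + d = 0 of eq. (9)."""
    F, K = W.shape
    Y = np.linalg.inv(W.T @ W + delta * np.eye(K))
    Yp, Ym = np.maximum(Y, 0), np.maximum(-Y, 0)
    W_new = W.copy()
    V_tilde = W @ H                                   # current approximation
    for f in range(F):
        w_t = W[f, :]
        for k in range(K):
            a = 2 * lam * (Yp @ w_t + Ym @ w_t)[k] / w_t[k]
            b = np.sum(H[k, :] / V_tilde[f, :]) - 4 * lam * (Ym @ w_t)[k]
            d = -np.sum(H[k, :] * w_t[k] ** 2 * V[f, :] / V_tilde[f, :] ** 2)
            roots = np.roots([a, b, 0.0, d])          # roots of a w^3 + b w^2 + d
            candidates = [0.0] + [r.real for r in roots
                                  if abs(r.imag) < 1e-8 and r.real > 0]
            def aux(w):
                # antiderivative of the zeroed gradient a*w + b + d/w^2
                if w == 0.0:
                    return 0.0 if d == 0 else np.inf
                return 0.5 * a * w ** 2 + b * w - d / w
            y = min(candidates, key=aux)              # best among the roots and zero
            W_new[f, k] = max(1e-16, y)
    return W_new
```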
REFERENCES
[1] A. Lefèvre, "Méthode d'apprentissage de dictionnaire pour la séparation de sources audio avec un seul capteur," Ph.D. dissertation, École Normale Supérieure de Cachan, 2012.
[2] P. Magron, "Reconstruction de phase par modèles de signaux : application à la séparation de sources audio," Ph.D. dissertation, TELECOM ParisTech, 2016.
[3] D. Lee and H. Seung, “Algorithms for non-negative matrix
factorization,” in NIPS’00 Proceedings of the 13th International
Conference on Neural Information Processing Systems, NIPS.
MIT Press Cambridge, 2000, pp. 535–541.
[4] C. Févotte, N. Bertin, and J.-L. Durrieu, "Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis," Neural Computation, vol. 21, no. 3, pp. 793–830, 2009.
[5] C. Févotte and J. Idier, "Algorithms for nonnegative matrix factorization with the β-divergence," Neural Computation, vol. 23, no. 9, pp. 2421–2456, 2011.
[6] G. Zhou, S. Xie, Z. Yang, J.-M. Yang, and Z. He, “Minimum-
volume-constrained nonnegative matrix factorization: Enhanced
ability of learning parts,” IEEE Transactions on Neural Networks,
vol. 22, no. 10, pp. 1626–1637, 2011.
[7] X. Fu, K. Huang, and N. D. Sidiropoulos, “On identifiability
of nonnegative matrix factorization,” IEEE Signal Processing
Letters, vol. 25, no. 3, pp. 328–332, 2018.
[8] K. Huang, N. Sidiropoulos, and A. Swami, “Non-negative matrix
factorization revisited: Uniqueness and algorithm for symmet-
ric decomposition,” IEEE Transactions on Signal Processing,
vol. 62, no. 1, pp. 211–224, 2014.
[9] X. Fu, K. Huang, N. Sidiropoulos, and W.-K. Ma, “Nonnegative
matrix factorization for signal and data analytics: Identifiability,
algorithms, and applications,” IEEE Signal Processing Magazine,
vol. 36, pp. 59–80, 2019.
[10] X. Fu, K. Huang, B. Yang, W.-K. Ma, and N. Sidiropoulos, "Robust volume minimization-based matrix factorization for remote sensing and document clustering," IEEE Transactions on Signal Processing, vol. 64, no. 23, pp. 6254–6268, 2016.
[11] A. Ang and N. Gillis, “Algorithms and comparisons of non-
negative matrix factorization with volume regularization for
hyperspectral unmixing,” Journal of Selected Topics in Applied
Earth Observations and Remote Sensing, 2019, to appear.
[12] V. Leplat, A. Ang, and N. Gillis, “Minimum-volume rank-
deficient nonnegative matrix factorizations,” in IEEE Interna-
tional Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2019, pp. 3402–3406.
[13] C.-H. Lin, W.-K. Ma, W.-C. Li, C.-Y. Chi, and A. Ambikapathi,
“Identifiability of the simplex volume minimization criterion for
blind hyperspectral unmixing: The no-pure-pixel case,” IEEE
Transactions on Geoscience and Remote Sensing, vol. 53, no. 10,
pp. 5530–5546, 2015.
[14] S. Arora, R. Ge, R. Kannan, and A. Moitra, “Computing a
nonnegative matrix factorization—provably,” SIAM Journal on
Computing, vol. 45, no. 4, pp. 1582–1611, 2016.
[15] S. Vavasis, “On the complexity of nonnegative matrix factoriza-
tion,” SIAM Journal on Optimization, vol. 20, no. 3, pp. 1364–
1377, 2010.
[16] X. Fu, K. Huang, N. D. Sidiropoulos, Q. Shi, and M. Hong,
“Anchor-free correlated topic modeling,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 41, no. 5, pp.
1056–1071, 2019.
[17] Y. Sun, P. Babu, and D. Palomar, “Majorization-minimization
algorithms in signal processing, communications, and machine
learning,” IEEE Transactions on Signal Processing, vol. 65, no. 3,
pp. 794–816, 2017.
[18] J. L. Roux, F. J. Weninger, and J. R. Hershey, “Sparse NMF –
half-baked or well done?” Mitsubishi Electric Research Labora-
tories (MERL), Tech. Rep., 2015.
[19] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, June 2006.
[20] E. Rechtschaffen, "Real roots of cubics: explicit formula for quasi-solutions," The Mathematical Gazette, no. 524, pp. 268–276, 2008.