Diarization of Telephone Conversations Using Factor Analysis
IEEE JOURNAL OF SPECIAL TOPICS IN SIGNAL PROCESSING
Patrick Kenny, Douglas Reynolds and Fabio Castaldo
EDICS Category: SPE-SPKR
Abstract—We report on work on speaker diarization of tele-
phone conversations which was begun at the Robust Speaker
Recognition Workshop held at Johns Hopkins University in 2008.
Three diarization systems were developed and experiments were
conducted using the summed-channel telephone data from the
2008 NIST speaker recognition evaluation. The systems are a
Baseline agglomerative clustering system, a Streaming system
which uses speaker factors for speaker change point detection
and traditional methods for speaker clustering, and a Variational
Bayes system designed to exploit a large number of speaker
factors as in state of the art speaker recognition systems. The
Variational Bayes system proved to be the most effective, achiev-
ing a diarization error rate of 1.0% on the summed-channel data.
This represents an 85% reduction in errors compared with the
Baseline agglomerative clustering system. An interesting aspect
of the Variational Bayes approach is that it implicitly performs
speaker clustering in a way which avoids making premature
hard decisions. This type of soft speaker clustering can be
incorporated into other diarization systems (although causality
has to be sacrificed in the case of the Streaming system). With
this modification, the Baseline system achieved a diarization error
rate of 3.5% (a 50% reduction in errors).
Index Terms—Diarization, speaker recognition, speaker seg-
mentation, clustering, speaker factors, channel factors, varia-
tional Bayes
I. INTRODUCTION
In recent years, factor analysis methods have proved to be
very effective in speaker recognition. This is particularly true
of telephone speech, thanks to the availability of large tele-
phone speech corpora for training factor analysis models [1],
[2], [3], [4]. It is therefore natural to try to bring factor analysis
methods to bear on the problem of diarization of telephone
conversations. This problem was chosen as one of the themes
of the Robust Speaker Recognition Workshop held at Johns
Hopkins University in the summer of 2008.
Three diarization systems were developed in the workshop:
a Baseline system, and two factor analysis based systems
which we refer to as the Streaming system and the Variational
Bayes system. We will describe these systems in detail in
Sections II, III and IV. The Baseline system uses the tradi-
tional approach of speaker segmentation with the Bayesian
information criterion (BIC) followed by agglomerative speaker
clustering [5]. It is non-causal in the sense that an entire
speech file has to be processed before any diarization decisions
can be made. The Streaming system performs speaker change
point detection using a sliding window and a small number
of speaker factors (e.g. 20) [6]. It has the advantage that it
can be configured to run in real time (with a latency) and
it determines the number of speakers dynamically. Like the
Streaming system, the Variational Bayes system was inspired
by the success of factor analysis in speaker recognition, but
it is designed to exploit much larger numbers of speaker
factors and it is non-causal. It incorporates some aspects of
the Baseline system and it builds on Valente's pioneering work
on Variational Bayes speaker diarization [7]. The Variational
Bayes system proved to be (by far) the most effective of the
three and it is the principal focus of this article.
Copyright (c) 2008 IEEE. Personal use of this material is permitted.
However, permission to use this material for any other purposes must be
obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
Patrick Kenny is with the Centre de recherche informatique de Montréal,
Douglas Reynolds is with MIT Lincoln Labs and Fabio Castaldo is with the
Politecnico di Torino.
Bayesian speaker diarization posits a hierarchical generative
model for turn taking in a given conversation and aims to
infer the number of participants and who is speaking at a
given time using only probabilistic methods, all of which
ultimately reduce to the sum and product rules for combining
probabilities. In practice such an approach quickly runs into in-
tractable integrals and posteriors so fast approximate inference
methods such as Variational Bayes or Expectation Propagation
have to be invoked [7], [8], [9]. Our principal contribution in
this article is to show how the prior distribution on Gaussian
Mixture Models (GMMs) used by Valente in constructing
the hierarchical generative model for a conversation in his
Variational Bayes approach to speaker diarization can be
replaced by the eigenvoice and eigenchannel priors used in
factor analysis based speaker recognition, and that this leads
to excellent results in diarizing two-party telephone conver-
sations. Although eigenchannels are widely used in speaker
recognition, eigenvoices have a longer pedigree in speech
recognition [10], and the results we will present indicate
that eigenvoices are the key ingredient in our approach to
diarization.
In the first four subsections of Section IV, we provide suffi-
cient background material and mathematical detail to explain
how our approach can be implemented. Readers unfamiliar
with the Variational Bayes method will find a more leisurely
account, including complete mathematical derivations, in the
report [11] which was made available on-line prior to the
workshop. Readers familiar with Variational Bayes will rec-
ognize that these derivations are essentially mechanical, as is
the case in most applications of the Variational Bayes method
[12].
Very similar derivations can be found in [13] where the
Variational Bayes method is brought to bear on the problem
of calculating the posterior distribution of the hidden variables
in joint factor analysis. Joint factor analysis involves three
types of hidden variable — speaker factors, channel factors
and indicator variables which account for the alignment of
frames with GMM mixture components. The alignment of
frames with mixture components is assumed to be inherited
from a Universal Background Model (UBM) in [1]; this has
the advantage that the joint posterior of the other hidden variables
can be calculated in closed form (it is actually Gaussian) [14].
However the calculation is both complicated and compu-
tationally burdensome and the assumption that frames can
be aligned using a UBM seems unsatisfactory — a speaker
and channel dependent GMM ought to be used instead. The
authors in [13] show how a Variational Bayes approach deals
handily with both of these problems. Even in the case where
the alignment is carried out with a UBM (for computational
reasons, for example) the Variational Bayes approach turns
out to be much less computationally demanding than the exact
(non-iterative) approach and the Variational approximation is
of very high quality in practice. (It can be shown that, at
convergence, the Variational posterior has the same mean as
the exact posterior. It is only the posterior covariance that
is not calculated exactly.) In fact, the Variational posterior
calculations in this situation are the same as the Gauss-
Seidel calculations presented by Vogt in [4] (but casting
these calculations in the Variational Bayes framework has the
advantage of guaranteeing convergence, a question which was
left open in [4]). Thus Variational Bayes seems to provide
the best way of formulating joint factor analysis for speaker
recognition as well as for speaker diarization.
As for our experiments, we used summed channel telephone
data from the NIST 2008 speaker recognition evaluation (SRE)
as a test set. This consists of 2215 telephone conversations,
each involving just two speakers, of approximately five min-
utes duration (≈ 200 hours in total). This data was chosen
since it enabled us to derive reference diarizations, needed for
measuring diarization error rates, by using time marks from
the speech recognition transcripts produced on each channel
separately. Since this data served as the test set for one of the
speaker detection tasks in the 2008 SRE we were also able to
measure the effect of diarization errors on speaker recognition
performance (Section V).
II. BASELINE SYSTEM
The Baseline system consists of three stages: speaker
change point detection, speaker clustering and Viterbi re-
segmentation. This is the most widely used approach to
speaker diarization [5]. The acoustic features for the Baseline
system consisted of 13 raw cepstral coefficients c0,...,c12
(without any type of normalization).
A. Speaker change point detection
In the first stage, speaker change points are detected using a
Bayesian Information Criterion (BIC) based distance between
abutting windows of feature vectors. This technique searches
for change points within a window using a penalized likelihood
ratio test to determine whether the data in the window is better
modeled by a single distribution (no change point) or by two
different distributions (change point). If a speaker change point
is found, the window is reset and the search restarted. If no
change point is found, the window is increased and the search
is redone. Full covariance Gaussians are used as distribution
models.
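The change point search described above can be sketched as follows. This is an illustrative sketch rather than the workshop implementation: the function names, the window sizes, the margin, the search step and the penalty weight are all assumptions, and the penalized likelihood ratio is written in the commonly used ΔBIC form for full covariance Gaussians.

```python
import numpy as np

def _n_logdet(x):
    """N * log|Sigma| for the maximum likelihood full covariance Gaussian fit to x."""
    _, logdet = np.linalg.slogdet(np.cov(x, rowvar=False, bias=True))
    return len(x) * logdet

def delta_bic(window, split, penalty_weight=1.0):
    """Penalized likelihood ratio: one Gaussian versus two Gaussians split at `split`.

    A positive value favours the two-distribution (change point) hypothesis.
    `penalty_weight` is a hypothetical tuning constant.
    """
    n, d = window.shape
    gain = 0.5 * (_n_logdet(window)
                  - _n_logdet(window[:split])
                  - _n_logdet(window[split:]))
    n_extra_params = d + d * (d + 1) / 2      # extra mean + full covariance
    return gain - 0.5 * penalty_weight * n_extra_params * np.log(n)

def detect_change_points(features, init_win=200, grow=100, margin=50, step=10):
    """Growing-window search; all sizes are in frames and are illustrative."""
    n = len(features)
    changes, start, end = [], 0, min(init_win, n)
    while True:
        window = features[start:end]
        scores = [(delta_bic(window, i), i)
                  for i in range(margin, len(window) - margin, step)]
        best_score, best_split = max(scores, default=(-np.inf, None))
        if best_score > 0:            # change point found: reset the window
            start += best_split
            changes.append(start)
            end = min(start + init_win, n)
        elif end >= n:                # end of file reached without a change
            return changes
        else:                         # no change point: enlarge the window
            end = min(end + grow, n)
```

The margin keeps enough frames on each side of a candidate split for the full covariance estimates to be well conditioned; with 13-dimensional features it would need to be comfortably larger than 13.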
B. Agglomerative speaker clustering
The purpose of the speaker clustering stage is to associate
or cluster segments from the same speaker together. The
clustering ideally produces one cluster for each speaker in
the audio with all segments from a given speaker in a single
cluster. Agglomerative speaker clustering is a hierarchical
procedure consisting of the following steps:
0) Initialize leaf clusters of tree with speech segments.
1) Compute pair-wise distances between all clusters.
2) Merge closest clusters.
3) Update distances of remaining clusters to new cluster.
4) Iterate steps 1-3 until a stopping criterion is met.
Each cluster is represented by a single full covariance
Gaussian. Since we have prior knowledge that there are just
two speakers present in the audio, we stop when we reach two
clusters. (Otherwise a BIC based stopping criterion is used.)
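The steps above can be illustrated with the following sketch. The text does not specify the inter-cluster distance, so a generalized likelihood ratio between full covariance Gaussians is assumed here; the function names are hypothetical, and for simplicity all pair-wise distances are recomputed on every pass instead of being updated incrementally.

```python
import numpy as np

def glr_distance(a, b):
    """Generalized likelihood ratio distance between two clusters, each
    modelled by a single full covariance Gaussian (assumed distance)."""
    def half_n_logdet(x):
        _, logdet = np.linalg.slogdet(np.cov(x, rowvar=False, bias=True))
        return 0.5 * len(x) * logdet
    return half_n_logdet(np.vstack([a, b])) - half_n_logdet(a) - half_n_logdet(b)

def agglomerative_cluster(segments, n_speakers=2):
    """Merge the closest pair of clusters until `n_speakers` clusters remain.

    `segments` is a list of (frames x dims) arrays; the result is a list of
    clusters, each a list of segment indices.
    """
    clusters = [[i] for i in range(len(segments))]     # step 0: leaf clusters
    frames = [np.asarray(s, dtype=float) for s in segments]
    while len(clusters) > n_speakers:                  # stopping criterion
        best = None
        for i in range(len(clusters)):                 # step 1: distances
            for j in range(i + 1, len(clusters)):
                dist = glr_distance(frames[i], frames[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best                                 # step 2: merge closest
        clusters[i] += clusters[j]
        frames[i] = np.vstack([frames[i], frames[j]])
        del clusters[j], frames[j]                     # j > i, so i stays valid
    return clusters
```

Because each merge is irrevocable, an early wrong merge cannot be undone later, which is the weakness that the soft clustering of Section II-D is meant to address.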
C. Viterbi re-segmentation
The Viterbi re-segmentation stage uses the Baum-Welch
algorithm and the speaker clusters produced by agglomerative
clustering to train Gaussian Mixture Models for each speaker,
re-segments the data using the Viterbi algorithm and iterates
this procedure until convergence. We used 32 Gaussians for
these GMMs. (It is well known in speaker diarization that
a small number of Gaussians is sufficient to characterize a
speaker in a given acoustic environment. This is different from
the situation in speaker recognition where very large numbers
of Gaussians have proved to be necessary to characterize
speakers in arbitrary acoustic environments.) The total number
of re-segmentation iterations performed was 20 and, on each
Viterbi iteration, 5 iterations of Baum-Welch re-estimation
were performed.
Viterbi re-segmentation (using unnormalized cepstral coef-
ficients without derivatives) was found to significantly help all
three diarization systems, although the improvement was less
dramatic in the case of the Streaming system.
D. Soft speaker clustering
Since it is a greedy method, agglomerative speaker cluster-
ing is prone to making premature hard decisions and so can
lead to sub-optimal results. This tendency can be mitigated by
a soft speaker clustering method inspired by the Variational
Bayes framework (equations (16) and (17) below).
After an initial diarization has been carried out we have two
GMMs (each having 32 components), one for each speaker.
Using these GMMs, for each speaker segment we can calculate
a likelihood for each of the two speakers participating in the
conversation. By applying Bayes rule and assuming equal prior
probabilities, we can convert these likelihoods into posterior
probabilities and use these posteriors to weight the Baum-
Welch statistics extracted from the segment.
In this way we obtain, for each speaker, a set of weighted
Baum-Welch statistics, one for each segment. Pooling across
segments, we synthesize a set of Baum-Welch statistics for
the speaker and use these synthetic Baum-Welch statistics to
re-estimate the speaker’s GMM.
Whereas Viterbi re-segmentation makes hard decisions as
to which segments are assigned to each speaker, this type
of GMM training avoids such hard decisions. In practice, its
effectiveness depends critically on appropriately normalizing
the segment likelihoods referred to above before converting
them to posterior probabilities using Bayes rule. The nor-
malization procedure that we actually used was to divide the
log likelihoods by the number of frames in the segment and
multiply the result by 3.
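The normalization and weighting just described can be sketched as follows. The factor of 3, the per-frame normalization and the equal priors come from the text; the function names and array layouts are assumptions, and for simplicity a single set of Baum-Welch statistics per segment is shared by both speakers.

```python
import numpy as np

def segment_posteriors(loglik, n_frames, scale=3.0):
    """Convert segment log likelihoods into speaker posteriors.

    loglik   : (n_segments, 2) total log likelihood of each segment under
               each of the two speaker GMMs
    n_frames : (n_segments,) number of frames in each segment
    """
    loglik = np.asarray(loglik, dtype=float)
    n_frames = np.asarray(n_frames, dtype=float)
    # Normalize: divide by the segment length, then multiply by the scale.
    normalized = scale * loglik / n_frames[:, None]
    # Bayes rule with equal priors (softmax over the two speakers);
    # subtracting the row maximum avoids overflow in exp.
    normalized -= normalized.max(axis=1, keepdims=True)
    post = np.exp(normalized)
    return post / post.sum(axis=1, keepdims=True)

def pool_weighted_stats(post, zero_order, first_order):
    """Weight per-segment Baum-Welch statistics by the posteriors and pool.

    zero_order  : (n_segments, n_gaussians)
    first_order : (n_segments, n_gaussians, n_dims)
    Returns one set of synthetic statistics per speaker.
    """
    N = np.einsum('ms,mc->sc', post, zero_order)
    F = np.einsum('ms,mcd->scd', post, first_order)
    return N, F
```

Without the per-frame normalization, the total log likelihoods of long segments would differ by so much that the posteriors would saturate at 0 or 1, and the procedure would revert to hard decisions.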
E. Protocol
The primary performance measure that we used in evaluat-
ing the diarization systems is the NIST diarization error rate
(DER). This is calculated by aligning a reference diarization
output with a system diarization output and computing a time
weighted combination of miss, false alarm and speaker error.1
In evaluating DERs we took the reference speech activity
marks as given, we ignored intervals containing overlapped
speech and we ignored errors of less than 250 ms in the
locations of segment boundaries. These are the traditional
conventions used in evaluating diarization performance on
two-way telephone conversations [15]. Note that, although
overlapped speech intervals do not count in evaluating DERs,
the diarization systems do have to contend with overlapped
speech in performing speaker segmentation and clustering.
It is well known that diarization systems typically exhibit
wide variations in performance across test files. Thus, in
reporting results for a given system on the NIST 2008 summed
channel data, we will show the standard deviation of the DER
(calculated over all test files) in addition to the mean DER.
F. Baseline Results
DER results for the Baseline system are reported in Table I.
Incorporating the soft clustering procedure in the Baseline
system is seen to reduce the diarization error rate by almost
50%.
TABLE I
MEAN AND STANDARD DEVIATION OF DIARIZATION ERROR RATES (DER)
ON THE NIST 2008 SUMMED CHANNEL TELEPHONE DATA FOR THE
BASELINE SYSTEM; σ REFERS TO THE STANDARD DEVIATION OF THE
DIARIZATION ERRORS.
System                              mean DER (%)    σ (%)
Baseline without soft-clustering    6.8             12.3
Baseline with soft-clustering       3.5             8.0
1 DER scoring code available at www.nist.gov/speech/tests/rt/2006-spring/code/md-eval-v21.pl.
III. STREAMING SYSTEM
Speaker diarization using factor analysis was first introduced
in [6] using a stream-based approach. This technique performs
an on-line diarization where a conversation is seen as a stream
of fixed duration time slices. The system operates in a causal
fashion by producing segmentation and clustering for a given
slice without requiring the following slices. Speakers detected
in the current slice are compared with previously detected
speakers to determine if a new speaker has been detected or
previous models should be updated.
Given an audio slice, a stream of cepstral coefficients
and their first derivatives are extracted. With a small sliding
window (about one second), a new stream of speaker factors
is computed and used to perform the slice segmentation. The
dimension of the speaker factor space is quite small (e.g. 20)
compared with the number used in speaker recognition (e.g.
300) due to the short estimation window. Figure 1 shows
the stream of speaker factors in a slice where two different
speakers are present.
In this new space, a clustering of the stream of speaker
factors is done, producing a single multivariate Gaussian for
each speaker. A BIC criterion is used to determine how many
speakers there are in the slice. A Hidden Markov Model
(HMM) using the Gaussian for each state associated to a
speaker is built and through the Viterbi algorithm a slice
segmentation is obtained.
In addition to the segmentation, a 256-component Gaussian
Mixture Model (GMM) in the acoustic space is created for
each speaker found in the audio slice. These models are used in
the last step, slice clustering, where we determine if a speaker
in the current audio slice was present in previous slices, or is
a new one.
Fig. 1. The Streaming system computes the trajectory of a vector of
speaker factors in a slice using a sliding window (the first two parameters
are shown here). The distribution of points shows a bimodal behavior when
two speakers are present so each speaker can be effectively modeled by means
of a single Gaussian. (In this instance, the trajectory for one of the speakers
is concentrated on the left and the trajectory for the other on the right.) The
final slice labeling is obtained using the Viterbi algorithm.
Using an approximation to the Kullback-Leibler divergence,
we find the closest speaker model built in previous slices to
each speaker model in the current slice. If the divergence is
below a threshold the previous model is adapted using the
model created in the current slice, otherwise the current model
is added to the set of speaker models found in the audio.
The final segmentation and speakers found from the on-line
processing can further be refined using Viterbi re-segmentation
over the entire file, as explained in Section II-C.
A. Streaming Results
The Streaming system was implemented using unnormalized
mel cepstral coefficients and their derivatives
c1,...,c12, ∆c0,...,∆c12 as features.
DER results obtained with the Streaming system are shown
in Table II. Of the three systems tested, the Streaming system
had the best performance out of the box (first line), with
some further gains with the non-causal Viterbi re-segmentation
(second line). The Viterbi re-segmentation step was added to
ensure a fair comparison with the other two systems although
it is inconsistent with the primary purpose of the Streaming
system, namely incremental on-line decision making. We did
not test the (equally non-causal) soft-clustering technique with
the Streaming system.
TABLE II
MEAN AND STANDARD DEVIATION OF DIARIZATION ERROR RATES (DER)
ON THE NIST 2008 SUMMED CHANNEL TELEPHONE DATA FOR THE
STREAMING SYSTEM; σ REFERS TO THE STANDARD DEVIATION OF THE
DIARIZATION ERRORS.
System                       mean DER (%)    σ (%)
Streaming without Viterbi    5.8             11.1
Streaming + Viterbi          4.6             8.8
IV. VARIATIONAL BAYES SYSTEM
A. Motivation
As we have just seen, effective speaker change point detection
can be achieved using as few as 20 speaker factors.
However the number of speaker factors used in state of
the art speaker recognition systems is typically much larger,
e.g. 300 [1]. With factor analysis models of this size, it
is difficult to process speech data incrementally; in practice
factor analysis methods have to be applied in batch mode
and extracting speaker and channel factors from a speech
file is computationally burdensome. On the other hand, most
speaker diarization algorithms work with very short intervals
of speech, typically a few seconds long. For example it may be
required to determine whether a speaker change point occurs
at a given frame or whether the same speaker is talking in
two short segments. Thus it is not obvious how the two
technologies — speaker diarization and large scale factor
analysis — can be integrated.
The Variational Bayes method of speaker diarization devel-
oped by Valente [7] is a natural way of solving this problem. In
the technical report [11], we modified Valente’s formulation
in order to incorporate the factor analysis priors defined by
eigenvoices and eigenchannels [1], and we simplified it to
take account of the fact that we are dealing with a diarization
problem in which the number of speakers is given.
The advantages of the Variational Bayes approach are that
it is fully probabilistic, it comes with EM-like convergence
guarantees and it avoids making premature hard decisions
as in agglomerative speaker clustering. Furthermore Bayesian
methods are automatically regularized in the sense that, in
theory at least, they are not subject to the overfitting prob-
lems which maximum likelihood methods are prone to. Thus
Bayesian model selection can be used to determine the number
of speakers participating in a conversation without having to
resort to BIC-like fudge factors [7]. Since our test bed consists
of two-way telephone conversations we did not get to explore
this possibility but we will return to this question briefly in
Section IV-D.
B. Informal description
We assume at the outset that we are given a conversa-
tion involving just two speakers and that speaker change
points are given. The diarization problem is formulated as
one of calculating, for each speaker segment, the posterior
probabilities of the events that one speaker or the other is
talking in the segment, as illustrated in Figure 2. The initial
speaker segmentation need not be very accurate. (To begin
with, a uniform segmentation into 1 second intervals after
removing silences can be used; this assumption can be relaxed
in a second pass after Viterbi re-segmentation, as described
in Section II-C.) We refer to these posterior probabilities
as segment posteriors. Once these segment posteriors have
been calculated, it is a straightforward matter to make a hard
decision as to which of the two speakers is talking at any given
time and this gives a solution to the diarization problem.
[Figure 2: Segment Posteriors. Each 1 sec segment carries a pair (q1, q2), where q1 = posterior probability that speaker 1 is talking.]
Fig. 2. Assume that the speech file has been partitioned into segments in
which just one speaker is talking; these segments are shown here as 1 sec
intervals. The object of the Variational Bayes algorithm is to estimate a pair
of posterior probabilities for each speaker segment. In the text, the posterior
probability that speaker s is talking in segment m is denoted by $q_{ms}$.
The Variational Bayes approach is fully probabilistic. It aims
to solve the diarization problem by a consistent application
of the rules of probability (marginalization and conditioning)
using a hierarchical generative model of the speech data that
contains three types of hidden random variable whose roles
are to specify
1) The assignment of segments to speakers.
2) The parameters of speaker GMMs.
3) The assignment of frames to Gaussians in the speaker
GMMs.
In his handling of point 2), Valente used a fully Bayesian
treatment of the problem of GMM estimation in which all
of the GMM parameters — mixture weights, mean vectors
and covariance matrices — are treated as random variables
having appropriate prior distributions. (Fully Bayesian GMM
estimation is the paradigmatic example of Variational Bayesian
inference [9], [16].) Our reason for borrowing the Variational
Bayesian framework is that it enables us to substitute eigenvoice
and eigenchannel priors on GMMs for the priors used in point 2) and
hence to bring large scale factor analysis methods to bear on
the speaker diarization problem.
The collective experience of workers in speaker recognition
using speaker GMMs with very large numbers of Gaussians
has been that mixture weights and covariance matrices can be
treated as speaker-independent but a large body of research
has been devoted to finding powerful prior distributions for
Gaussian mean vectors [1], [2], [3], [4]. We describe briefly
how these priors are constructed and how they are used in
speaker recognition. Recall that the term supervector is used
to refer to the concatenation of the mean vectors in a Gaussian
mixture model. The assumption in eigenvoice modeling is that
speaker supervectors have a Gaussian distribution of the form
$$s = m + V y. \tag{1}$$
Here s is a randomly chosen speaker dependent supervector;
m is a speaker-independent supervector; V is a rectangular
matrix of low rank whose columns are referred to as eigen-
voices; the vector y has a standard normal distribution; and
the entries of y are the speaker factors. In Bayesian terms,
(1) is a highly informative prior distribution: supervectors
are of extremely high dimension in practice but (1) confines
speaker supervectors to a low dimensional affine subspace of
the supervector space. On the other hand, the factorial priors
on GMM parameters used by Valente impose relatively weak
constraints on speaker models.
To model channel effects in speaker recognition, (1) is
modified as follows:
$$s = m + U x + V y. \tag{2}$$
Here s is a randomly chosen speaker and channel dependent
supervector; U is a rectangular matrix of low rank whose
columns are referred to as eigenchannels; the vector x has
a standard normal distribution; and the entries of x are the
channel factors.
In speaker recognition (2) is used in different ways in
enrollment and testing. In enrolling a given target speaker,
both x and y are estimated from the enrollment data but x is
discarded so as to obtain an estimate of the speaker’s super-
vector which is independent of channel effects; in matching
a target speaker against a given test utterance, x is treated as
random. It turns out that incorporating Ux into the Variational
Bayes diarization framework presents no extra difficulty since
(2) is formally equivalent to (1) as can be seen by writing it
as
$$s = m + \begin{pmatrix} U & V \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}. \tag{3}$$
So although we will perform experiments with eigenchannels
as well as eigenvoices, we will only refer to equation (1) in
the subsequent development.
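The equivalence of (2) and (3) is easy to check numerically. The sketch below uses toy dimensions and random matrices, all of them assumptions chosen for illustration (real systems use supervectors of very high dimension and on the order of 300 speaker factors); the function names are hypothetical.

```python
import numpy as np

def supervector_eq2(m, U, V, x, y):
    """Speaker and channel dependent supervector as in (2): s = m + U x + V y."""
    return m + U @ x + V @ y

def supervector_eq3(m, U, V, x, y):
    """The same supervector written as in (3): one stacked loading matrix
    (U V) applied to one stacked vector of factors (x; y)."""
    UV = np.hstack([U, V])            # (sv_dim, R_u + R_v)
    xy = np.concatenate([x, y])       # (R_u + R_v,)
    return m + UV @ xy

rng = np.random.default_rng(0)
n_gaussians, n_dims = 4, 3            # toy GMM: supervector dimension 12
sv_dim = n_gaussians * n_dims
R_v, R_u = 5, 2                       # toy numbers of speaker/channel factors

m = rng.normal(size=sv_dim)           # speaker-independent supervector
V = rng.normal(size=(sv_dim, R_v))    # columns play the role of eigenvoices
U = rng.normal(size=(sv_dim, R_u))    # columns play the role of eigenchannels
y = rng.standard_normal(R_v)          # speaker factors, standard normal prior
x = rng.standard_normal(R_u)          # channel factors, standard normal prior

assert np.allclose(supervector_eq2(m, U, V, x, y),
                   supervector_eq3(m, U, V, x, y))
```

Since the stacked vector (x; y) again has a standard normal distribution, any algorithm written for (1) applies unchanged with V replaced by the stacked matrix.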
In our version of point 2), each of the two speakers in
the given conversation will be represented by a hidden vector
of speaker factors. In the course of calculating the segment
posteriors referred to in Figure 2, we will also calculate two
speaker posteriors as illustrated in Figure 3. For each of
the two speakers, the corresponding speaker posterior is a
multivariate Gaussian distribution on speaker factors which
models the location of the speaker in the speaker factor space.
The mean of this distribution can be thought of as a point
estimate of the speaker’s location and the covariance matrix
as a measure of the uncertainty in this point estimate. The
multivariate Gaussian assumption here is the same as in the
Streaming system (illustrated in Figure 1) but it is supposed
to hold over the whole speech file rather than on a slice-by-
slice basis. Other differences are that the Variational Bayes
system uses Gaussians of dimension 300 rather than 20 and
covariance matrices which are full rather than diagonal.
[Figure 3: Speaker Posteriors for Speaker 1 and Speaker 2. Mean = point estimate of speaker factors; variance = uncertainty.]
Fig. 3. Each speaker is represented by a posterior distribution on the space
of speaker factors. The mean of this distribution provides a point estimate of
the speaker's supervector, according to (1). The covariance matrix models the
uncertainty of the speaker's location in the speaker factor space. The mean
vector and precision matrix of the posterior for speaker s are denoted by $a_s$
and $\Lambda_s$ in the text.
We referred in point 3) above to a third type of hidden
variable whose role is to specify the alignment of speech
frames with Gaussians in speaker GMMs. Unlike Valente, we
use a Universal Background Model to carry out the alignment
(thus we treat the alignment of frames with Gaussians in
speaker GMMs as deterministic). This strategy has proved
to be very successful in speaker recognition [1]. It has the
advantage that Baum-Welch statistics extracted from each of
the speaker segments with the UBM are sufficient statistics
for all of the probability calculations, which greatly alleviates
the computational burden of our version of Variational Bayes
speaker diarization.
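As an illustration of what is extracted from each segment, the following sketch computes zero and first order Baum-Welch statistics using occupation probabilities obtained from the UBM alone. A diagonal covariance UBM is assumed here for brevity, and the function name and array layout are hypothetical.

```python
import numpy as np

def baum_welch_stats(frames, weights, means, variances):
    """Zero and first order Baum-Welch statistics of one segment.

    frames    : (T, D) feature vectors of the segment
    weights   : (C,)   UBM mixture weights
    means     : (C, D) UBM mean vectors
    variances : (C, D) UBM diagonal covariances (an assumption of this sketch)
    """
    diff = frames[:, None, :] - means[None, :, :]              # (T, C, D)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2 * np.pi * variances), axis=1))
    log_post = np.log(weights) + log_gauss                     # (T, C)
    log_post -= log_post.max(axis=1, keepdims=True)            # stabilize exp
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                    # frame alignment
    N = post.sum(axis=0)                                       # zero order (C,)
    F = post.T @ frames                                        # first order (C, D)
    return N, F
```

Because the alignment depends only on the UBM, these statistics need to be computed once per segment and can then be reused unchanged on every iteration of the diarization algorithm.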
It is beyond the scope of this article to attempt a full
exploration of the ways in which the Variational Bayes method
can be applied to the diarization problem but it may be of
interest to sketch some directions for future research. A glance
at the paper [17] will be enough to convince the reader of the
power of Variational Bayes for speech processing applications.
It is shown there how a unified probabilistic framework can
be developed for denoising and dereverberation of speech
signals using an informative prior distribution on clean speech
signals and Bayesian inference. A particularly striking aspect