E-HMM approach for learning and adapting sound models for speaker indexing

Sylvain Meignier, Jean-François Bonastre, Stéphane Igounet
LIA/CERI Université d'Avignon, Agroparc, BP 1228, 84911 Avignon Cedex 9, France.
{sylvain.meignier, jean-francois.bonastre}@lia.univ-avignon.fr, stephane.igounet@univ-avignon.fr
Abstract
This paper presents an iterative process for blind speaker indexing based on an HMM. This process detects and adds speakers one after the other to an evolutive HMM (E-HMM). The HMM approach takes advantage of the different components of the AMIRAL automatic speaker recognition (ASR) system from the LIA: front-end processing, model learning, and log-likelihood ratio computation. The proposed solution reduces the missed detection of short utterances by exploiting all the available information (the detected speakers) as soon as it becomes available.

The proposed system was tested on the N-speaker segmentation task of the NIST 2001 evaluation campaign. Experiments were carried out to validate the speaker detection; moreover, these tests measure the influence of the parameters used for speaker model learning.
1. Introduction
Seeking within a recording the speech sequences uttered by a given speaker is one of the main tasks of document indexing. Segmentation systems first detect breaks in the audio stream and then cluster the segments delimited by those breaks into homogeneous sound classes.
In automatic speaker recognition, segmentation consists in finding all the speakers, as well as the beginning and the end of their contributions. The speaker segmentation problem is commonly approached by two methods.

The first method (described in [1] and [2]) is composed of two steps. The former locates the signal breaks caused by speaker changes. The latter determines and labels the utterances using a clustering algorithm.
The second method (as done in [3] and [1]) uses an automatic speaker recognition (ASR) system. Break detection and clustering are carried out simultaneously. The system has to determine the speakers present within a given message, as well as the utterances of each of them.
RAVOL project: financial support from the Conseil général de la région Provence-Alpes-Côte d'Azur and DigiFrance.
No a priori information on the speakers is used in these two approaches, i.e., the speaker models have to be built during the process. These methods are therefore well adapted to blind segmentation tasks.
In this article, we propose a system adapted from the second method for blind segmentation tasks. The conversation is modeled by a Markov model (as in [3]). During the segmentation process, the Markov model is expanded with each new detection of a sound class.

The proposed system was tested on the N-speaker segmentation task of the NIST 2001 evaluation campaign [9], which uses the CALLHOME database. The experiments in this paper use one half of this database to select the best-fitting parameters for speaker model learning; the other half is kept to validate the choice of those parameters.
2. Segmentation model
2.1. Structure of the segmentation model
The signal to segment consists of a sequence of observation vectors $Y = \{y_1, \ldots, y_T\}$.

The changes of sound classes are represented by a hidden Markov model (HMM). In this application, a sound class represents a speaker. Each HMM state characterizes a class of sound, and the transitions model the changes of classes.

The HMM $\lambda$ is defined by:

- a set of states $E = \{e_1, \ldots, e_n\}$;
- a set $A = \{a_{ij}\}$ of transition probabilities between the states;
- the set $B$ of emission probabilities. The state $e_x$ is associated with a sound model $M_x$ of the sound class $L_x$; each state is then associated with a set of emission probabilities according to $M_x$: for an observation $y_t$ of $Y$, $b_x(y_t)$ is the probability calculated from $M_x$ for $y_t$.
The HMM is fully connected. The transition probabilities are established according to a set of rules complying with the three following conditions:

$$a_{ii} = g, \qquad a_{ij} = \frac{1-g}{n-1} \;\; (i \neq j), \qquad \sum_{j=1}^{n} a_{ij} = 1 \qquad (1)$$

where $g$ is a fixed self-loop weight.
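As a concrete illustration of these rules, here is a minimal Python/NumPy sketch of how such a transition matrix can be built, assuming the self-loop reading of Eq. 1 given above; `transition_matrix` and its arguments are illustrative names, not part of the AMIRAL system.

```python
import numpy as np

def transition_matrix(n_states: int, g: float) -> np.ndarray:
    """E-HMM transition matrix under the rules of Eq. 1: self-loop
    probability g on the diagonal, the remaining mass (1 - g) shared
    uniformly among the other states, each row summing to 1."""
    if n_states == 1:
        return np.ones((1, 1))
    a = np.full((n_states, n_states), (1.0 - g) / (n_states - 1))
    np.fill_diagonal(a, g)
    return a

# Example: at stage 3 the E-HMM has 3 states; with g = 0.6 each row
# is [0.6, 0.2, 0.2] (up to permutation of the off-diagonal entries).
print(transition_matrix(3, 0.6))
```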
2.2. Detection of sound classes and segmentation model building

The segmentation model is generated by an iterative process, which detects and adds a new state $e_x$ at each stage $x$. We refer to this evolutive HMM as an E-HMM [4].

At the process initialization stage (figure 2), the HMM is $\lambda_1$. It is composed of a single state "1", which is associated with a sound model $M_1$ learned from the whole signal $Y$ ($M_1$ is the sound model for the not-yet-detected speakers, i.e., all of them). At the end of the initialization process, a first trivial segmentation $S_1$ is generated: each observation $y_t$ is simply labeled with the only sound class $L_1$. This segmentation is composed of a single segment, which will be challenged at the following stages.
Figure 1: Diagram of the segmentation process: steps 1 & 2 add a new speaker model, step 3 adapts the speaker models, and step 4 assesses the stop criterion.
The process (e.g., stages 2 and 3 in figure 2) is divided into 4 steps (figure 1) for each stage $x$ ($x > 1$):

Step 1: A new state $e_x$ is added to the set $E$ ($E \leftarrow E \cup \{e_x\}$). The transition probabilities are adapted to take the new number of states into account. Then the new HMM $\lambda_x$ is obtained.

Step 2: The sound model $M_x$ is estimated from a subset of observations $s_i$ (in this work, each subset has a fixed 3 s duration). The subset is selected such that:

$$i^* = \operatorname*{ArgMax}_i \prod_{y_t \in s_i} b_x(y_t) \qquad (2)$$

i.e., $i^*$ is the rank of the subset maximizing the product of probabilities for the sound class $L_x$. Moreover, the segmentation $S_x^0$ is computed: the subset $s_{i^*}$ is relabeled with the sound class $L_x$:

$$S_x^0(t) = \begin{cases} L_x & \text{if } y_t \in s_{i^*} \\ S_{x-1}(t) & \text{otherwise} \end{cases} \qquad (3)$$
Step 3: In this step, the process iteratively adapts the parameters of the HMM $\lambda_x$:

(a) For each $e_j$ in $E$, the sound model $M_j$ is adapted according to the data assigned to it in the current segmentation $S_x$.

(b) The set $B$ of emission probabilities is recomputed.

(c) The Viterbi algorithm is applied to obtain a new version of the segmentation $S_x$ according to the HMM. The Viterbi path is computed:

$$S_x = \operatorname*{ArgMax}_{S} P(Y, S \mid \lambda_x) \qquad (4)$$

If a gain in likelihood is observed between two loops of (a)-(c), the process returns to (a).

Step 4: Lastly, the stop criterion is assessed: if

$$P(Y, S_x \mid \lambda_x) > P(Y, S_{x-1} \mid \lambda_x) \qquad (5)$$

then a new stage starts back at step 1.

Note: the probability of $S_{x-1}$ is re-estimated with the transitions of the model $\lambda_x$, because the topologies of the segmentation models $\lambda_{x-1}$ and $\lambda_x$ must be comparable.
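To make the four steps concrete, the following is a minimal runnable sketch of the stage loop, not the authors' implementation: it substitutes single diagonal Gaussians for the paper's 128-component GMMs, uses a crude "least explained window" reading of the step-2 selection, simplifies the stop criterion to a comparison of stage scores, and reuses `transition_matrix` from the previous sketch. All names are illustrative.

```python
import numpy as np

def fit_gaussian(x):
    """Estimate a diagonal Gaussian (mean, variance) from frames x."""
    return x.mean(axis=0), x.var(axis=0) + 1e-6

def log_emission(x, model):
    """Per-frame log-likelihood of frames x under a diagonal Gaussian."""
    mu, var = model
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)

def viterbi(logB, logA):
    """Best path through a fully connected HMM (logB: T x n emissions)."""
    T, n = logB.shape
    delta, psi = logB[0].copy(), np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA          # scores[i, j]: from i to j
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + logB[t]
    path = np.empty(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path, delta.max()

def ehmm_segment(y, win=300, g=0.6, max_stages=10):
    """y: (T, dim) feature frames; win: subset length (~3 s of frames)."""
    models = [fit_gaussian(y)]              # M1 learned on the whole signal
    best_seg = np.zeros(len(y), dtype=int)  # S1: a single trivial segment
    best_score = None
    for _ in range(max_stages - 1):
        n = len(models) + 1
        # Step 2 (assumed selection rule): learn the new model on the
        # fixed-length window least well explained by the current models.
        cur = np.max([log_emission(y, m) for m in models], axis=0)
        i = min(range(0, len(y) - win + 1, win),
                key=lambda s: cur[s:s + win].sum())
        models.append(fit_gaussian(y[i:i + win]))
        seg = best_seg.copy()
        seg[i:i + win] = n - 1
        logA = np.log(transition_matrix(n, g))  # sketch after Eq. 1
        # Step 3: alternate model adaptation and Viterbi decoding while
        # the path score keeps improving.
        score = -np.inf
        while True:
            models = [fit_gaussian(y[seg == k]) if (seg == k).any()
                      else models[k] for k in range(n)]
            logB = np.stack([log_emission(y, m) for m in models], axis=1)
            path, s = viterbi(logB, logA)
            if s <= score:
                break
            seg, score = path, s
        # Step 4 (stop criterion, simplified here to a comparison of
        # stage scores): stop if adding the speaker brought no gain.
        if best_score is not None and score <= best_score:
            break
        best_seg, best_score = seg, score
    return best_seg
```

The structure mirrors the E-HMM loop: each stage adds a model, alternates adaptation and Viterbi decoding, and the new speaker is kept only while a gain is observed.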
3. Automatic speaker recognition system

The sound models and emission probabilities are calculated by the AMIRAL ASR system developed at the LIA [5]. Emission probabilities are computed on fixed-length blocks of 0.3 second. Each emission probability is normalized by the world model.
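As an illustration of this block-based normalization, here is a small sketch assuming a 10 ms frame rate (so a 0.3 s block is 30 frames) and per-frame log-likelihoods already computed; `block_llr` is an illustrative name, not an AMIRAL function.

```python
import numpy as np

def block_llr(model_llk, world_llk, block=30):
    """Sum per-frame log-likelihoods over fixed 0.3 s blocks (30 frames
    at an assumed 10 ms frame rate) and subtract the world-model score,
    yielding one normalized log-likelihood ratio per block."""
    T = (len(world_llk) // block) * block          # drop the ragged tail
    spk = model_llk[:T].reshape(-1, block).sum(axis=1)
    wld = world_llk[:T].reshape(-1, block).sum(axis=1)
    return spk - wld
```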
Figure 2: Example of segmentation for a 2-speaker test. The diagram shows the process initialization (speaker L1, learned on the whole signal, gives the best one-speaker indexing), stage 2 (the best subset is used to learn the L2 model, a new HMM is built, and the models are adapted with Viterbi decoding until no gain is observed, giving the best two-speaker indexing; a gain over the previous stage is observed, so a new speaker will be added) and stage 3 (speaker L3 is added the same way; as no gain is observed at the stop criterion, the process stops and returns the two-speaker indexing).
Acoustic parameterization (16 cepstral coefficients and 16 Δ-cepstral coefficients) is carried out using the SPRO module developed by the ELISA consortium [10]. The ELISA consortium is composed of European research laboratories which work on a common platform; its members for the participation in NIST 2001 were ENST (France), IRISA (France), LIA (France) and RMA (Belgium).

The sound classes are modeled by Gaussian mixture models (GMM) with 128 components and diagonal covariance matrices [7], adapted from a background model. The sound model $M_x$ is first estimated over a subsequence of 3 seconds (sec. 2.2, step 2). Then the sound model is adapted from the segments which are labeled with the sound class $L_x$ (sec. 2.2, step 3a).
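As a generic illustration of how such a class score can be computed, the following hedged sketch evaluates the per-frame log-likelihood of a diagonal-covariance GMM with K components; names are illustrative, not AMIRAL code.

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM,
    computed with the log-sum-exp trick for numerical stability.
    x: (T, dim); weights: (K,); means, variances: (K, dim)."""
    # (T, K) matrix of per-component log densities plus log weights
    comp = (np.log(weights)
            - 0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                     + (((x[:, None, :] - means) ** 2)
                        / variances).sum(axis=2)))
    m = comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(axis=1, keepdims=True))).ravel()
```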
The adaptation scheme for training the speaker models is based on the maximum a posteriori (MAP) method. For each Gaussian $k$, the mean $\mu_k$ of the sound model is a linear combination of the mean $\bar{\mu}_k$ estimated on the data of the sound class and the corresponding mean $\mu_k^W$ in the background model $W$:

$$\mu_k = \alpha \, \mu_k^W + (1 - \alpha) \, \bar{\mu}_k \qquad (6)$$

where $\alpha$ is the weight of the background model. Neither the weights nor the covariance matrices are adapted: the sound model uses the weights and covariance matrices of the background model.
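A minimal sketch of this mean-only adaptation, following the linear-combination rule of Eq. 6; `responsibilities` stands for the Gaussian occupation probabilities produced by one E-step over the class data, and all names are illustrative.

```python
import numpy as np

def map_adapt_means(wld_means, frames, responsibilities, alpha):
    """Adapt only the GMM means (Eq. 6): each adapted mean is a linear
    combination of the background-model mean and the mean estimated on
    the class data; weights and covariance matrices are kept from the
    background model.

    wld_means:        (n_gauss, dim) background-model means
    frames:           (n_frames, dim) data labeled with the sound class
    responsibilities: (n_frames, n_gauss) posterior occupation probs
    alpha:            weight of the background model (0 = one EM-ML step)
    """
    occ = responsibilities.sum(axis=0)                         # (n_gauss,)
    est = responsibilities.T @ frames / np.maximum(occ, 1e-10)[:, None]
    return alpha * wld_means + (1.0 - alpha) * est
```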
4. Experiments
The proposed approach was evaluated on the N-speaker segmentation task of the NIST 2001 evaluation campaign [9]. The results are shown in sec. 4.5. Moreover, development experiments on the learning method are reported in sec. 4.4.
4.1. Databases
The N-speaker evaluation corpus (described in [8] and [9]) is composed of 500 conversational speech tests drawn from the CALLHOME corpus. The tests, of varying length (up to 10 minutes), are taken from 6 languages. The exact number of speakers is not provided (but is less than 10).

The NIST corpus is divided into two parts of 250 tests, named Dev and Eva.

NIST provided a separate development corpus (named train_ch) composed of 48 conversational speech samples extracted from the CALLHOME corpus. The train_ch corpus was used to learn the background model wld_ch.

A second separate development data set (train_sb) is composed of 472 trials made of up to 100 speakers from Switchboard 2. train_sb was used to learn the background model wld_sb.
4.2. Experiments
Experiments were carried out to estimate the influence of the $\alpha$ parameter of MAP learning, applied with both the wld_ch and wld_sb background models.

Moreover, a reference experiment based on a trivial segmentation (only one segment), named trivial, was generated.

The results are obtained with the following parameters: the transition probabilities are estimated with a fixed $g$ (Eq. 1), and the MAP weight $\alpha$ (Eq. 6) is varied between 0 and 1.
4.3. Scoring measures
Two evaluation measures are considered:

- The mean $m_{diff}$ of the differences between the estimated number of speakers $\hat{n}_k$ and the real number of speakers $n_k$ over the $N$ tests:

$$m_{diff} = \frac{1}{N} \sum_{k=1}^{N} \left( \hat{n}_k - n_k \right) \qquad (7)$$

- The NIST speaker segmentation scoring described in [9]. The score is computed on the NIST reference segments during which only one speaker speaks; it corresponds to a segmentation error rate.
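In code, the first measure amounts to a one-line average (illustrative names):

```python
def mean_speaker_count_diff(estimated, reference):
    """Mean difference (Eq. 7) between the estimated and the real number
    of speakers over a set of tests; a negative value means the system
    under-estimates the speaker count on average."""
    return sum(e - r for e, r in zip(estimated, reference)) / len(reference)

# Example: correct counts on two tests, one under-estimate on a third.
print(mean_speaker_count_diff([2, 3, 3], [2, 3, 4]))  # -> -0.333...
```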
4.4. Results
In order to compare the influence of the $\alpha$ parameter, figures 3 and 4 show respectively the segmentation scores and the mean $m_{diff}$ obtained with both background models. The best results are presented in tables 1 and 2 (respectively for Dev and Eva).

The results on the Eva corpus are close to the results obtained on the Dev corpus.

When the weight $\alpha$ is close to 1, the result becomes equivalent to the trivial segmentation: the system mainly attributes the data to only one class, and the process does not add new speaker models ($m_{diff}$ is near -1.5).

When the weight is 0, the adaptation becomes equivalent to an EM-ML training with one iteration [6], but initialized with the corresponding background model. This weight gives the best result (24.01%) for the wld_sb background model. Although sufficient data is available to compute this background model, the train_sb data is very different from the CALLHOME data (Dev and Eva).

The background model wld_ch takes advantage of the MAP learning method: the best result (25.5%) is obtained for $\alpha = 0.3$. Little data is available to learn this background model, so it does not generalize well to the Dev and Eva data of the CALLHOME corpus.

For both background models, the mean $m_{diff}$ of the differences between the estimated and the real number of speakers is quite good (near 0.5 for the best results with wld_ch and wld_sb).
Corpus  background model  α    score (%)  m_diff
Dev     wld_sb            0    24.01      0.58
Dev     wld_ch            0.3  25.50      0.71

Table 1: Best results for the background models.

Corpus  background model  α    score (%)  m_diff
Eva     wld_sb            0    23.42      0.54
Eva     wld_ch            0.3  25.17      0.44

Table 2: Validation on the Eva corpus of the results obtained on the Dev corpus.
4.5. NIST 2001 results
Two systems were presented to the NIST 2001 N-speaker segmentation evaluation. Both use the wld_sb background model to adapt the speaker models, with an $\alpha$ parameter equal to 0.
Figure 3: NIST N-speaker segmentation score (%): influence of the $\alpha$ parameter. Both panels plot the score against the weight of the background model ($\alpha$ from 0 to 1), for the background model learned on SWITCHBOARD 2 (wld_sb), the background model learned on CALLHOME (wld_ch), and the trivial test.
Figure 4: NIST N-speaker segmentation: influence of the $\alpha$ parameter on MAP. Both panels plot the mean difference $m_{diff}$ (from -2 to 2) against the weight of the background model ($\alpha$ from 0 to 1), for wld_sb (learned on SWITCHBOARD 2) and wld_ch (learned on CALLHOME).
The difference between the two systems is the value of the $g$ parameter used to compute the HMM transition probabilities: in the first system, LIA10, $g$ is equal to 0.6; the second system, LIA20, uses a different value of $g$.

Note: the $g$ and $\alpha$ parameters were estimated before NIST 2001 on the train_ch corpus, according to the evaluation rules.
Tables 3 and 4 show the results of LIA10 and LIA20, as well as the results of the trivial segmentation.

Table 3 shows the scores broken down by the number of speakers present. The systems are well adapted to tests where many speakers speak, and the number of speakers is detected correctly. However, for 2-speaker tests the scores (22% and 24%) are close to the score of the trivial system (26%).

The LIA10 and LIA20 scores are almost equal across the different speaker languages. The chosen learning method is thus well adapted when the speaker language is unknown.
5. Summary
In this article, the segmentation system uses an evolutive HMM to model the conversation and to determine automatically the sound classes present in the messages. The approach is based on an iterative algorithm which detects and adds the sound models one by one. At each stage, a segmentation is proposed according to the available knowledge. This segmentation is called into question at the following iteration, until the optimal segmentation is reached.

In view of the results, the system behaves satisfactorily. Experiments showed that MAP training is well adapted to the selected segmentation task. As for the weight between the background model and the estimated sound model, it has a clear influence on the segmentation error.

Further work will focus on these two points, by adapting the CALLHOME background model data to the SWITCHBOARD background model and by introducing an explicit duration model into the HMM to improve speaker detection.
6. References
[1] P. Delacourt, D. Kryze, C.J. Wellekens, Use of second-order statistics for speaker-based segmentation, EUROSPEECH, 1999.

[2] H. Gish, H-H. Siu, R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, ICASSP, pages 873-876, 1991.

[3] L. Wilcox, D. Kimber, F. Chen, Audio indexing using speaker identification, SPIE, pages 149-157, July 1994.

[1] K. Sönmez, L. Heck, M. Weintraub, Speaker tracking and detection with multiple speakers, EUROSPEECH, 1999.

[4] S. Meignier, J.-F. Bonastre, C. Fredouille, T. Merlin, Evolutive HMM for Multi-Speaker Tracking System, ICASSP, June 2000.

[5] C. Fredouille, J.-F. Bonastre, T. Merlin, AMIRAL: a block-segmental multi-recognizer approach for Automatic Speaker Recognition, Digital Signal Processing, Vol. 10, Num. 1-3, pp. 172-197, Jan.-Apr. 2000.

[6] A. Dempster, N. Laird, D. Rubin, Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc., Vol. 39, pp. 1-38, 1977.

[7] D. A. Reynolds, Speaker identification and verification using Gaussian mixture speaker models, Speech Communication, pp. 91-108, Aug. 1995.
System   Dev+Eva  2-spkrs  3-spkrs  4-spkrs  5-spkrs  6-spkrs  7-spkrs
Files        500      303      136       43       10        6        2
trivial       38       26       39       49       50       56       61
LIA10         24       22       25       23       30       35       37
LIA20         24       24       24       22       29       38       34

Table 3: NIST 2001 scores (%) for the N-speaker segmentation task, by number of speakers in each test.

System   arabic  english  german  japanese  mandarin  spanish
Files        95       56      67        68       118       96
trivial      40       24      30        33        42       42
LIA10        24       23      19        26        25       26
LIA20        22       26      24        26        24       25

Table 4: NIST 2001 scores (%) for the N-speaker segmentation task, by language.
[8] The 2000 NIST Speaker Recognition Evaluation Plan, http://www.nist.gov/speech/tests/spk/2000/doc/spk-2000-plan-v1.0.htm.

[9] The NIST Year 2001 Speaker Recognition Evaluation Plan, http://www.nist.gov/speech/tests/spk/2001/doc/2001-spkrec-evalplan-v53.pdf.

[10] ELISA Consortium, Overview of the ELISA consortium research activities, Odyssey, 2001.