E-HMM approach for learning and adapting sound models for speaker indexing

Sylvain Meignier, Jean-François Bonastre, Stéphane Igounet
LIA/CERI Université d'Avignon, Agroparc,
BP 1228, 84911 Avignon Cedex 9, France.
{sylvain.meignier, jean-francois.bonastre}@lia.univ-avignon.fr,
stephane.igounet@univ-avignon.fr
Abstract

This paper presents an iterative process for blind speaker indexing based on an HMM. The process detects and adds speakers one after the other to an evolutive HMM (E-HMM). This HMM approach takes advantage of the components of the AMIRAL automatic speaker recognition (ASR) system from LIA: front-end processing, learning, and log-likelihood ratio computation. The proposed solution reduces missed detections of short utterances by exploiting all the information (the detected speakers) as soon as it becomes available.

The proposed system was tested on the N-speaker segmentation task of the NIST 2001 evaluation campaign. Experiments were carried out to validate the speaker detection; moreover, these tests measure the influence of the parameters used for speaker model learning.
1. Introduction
Seeking within a recording the speech sequences uttered by a given speaker is one of the main tasks of document indexing. Segmentation systems first detect breaks in the audio stream and then cluster the resulting segments into homogeneous sound classes.

In automatic speaker recognition, segmentation consists in finding all the speakers, as well as the beginning and the end of each of their contributions. The speaker segmentation problem is commonly approached by two methods.

The first method (described in [1] and [2]) is composed of two steps: the former locates the signal breaks caused by speaker changes, and the latter determines and labels the utterances using a clustering algorithm.

The second method (as done in [3] and [1]) uses an automatic speaker recognition (ASR) system. Break detection and clustering are carried out simultaneously: the system has to determine the speakers present within a given message as well as the utterances of each of them.

RAVOL project: financial support from the Conseil général de la Région Provence-Alpes-Côte d'Azur and DigiFrance.

No a priori information on the speakers is used in these two approaches, i.e. the speaker models have to be built during the process. These methods are therefore well suited to blind segmentation tasks.
In this article, we propose a system adapted from the second method for blind segmentation tasks. The conversation is modeled by a Markov model (as in [3]). During the segmentation process, the Markov model is expanded with each newly detected sound class.

The proposed system was tested on the N-speaker segmentation task of the NIST 2001 evaluation campaign [9], which uses the CALLHOME database. The experiments in this paper use one half of this database to select the best-fitting parameters for speaker model learning; the other half is kept to validate the choice of those parameters.
2. Segmentation model
2.1. Structure of the segmentation model
The signal to segment consists of a sequence of observation vectors $Y = (y_1, \dots, y_T)$.

The changes of sound classes are represented by a hidden Markov model (HMM). In this application, each sound class represents a speaker: each HMM state characterizes a class of sound, and the transitions model the changes of classes.
The HMM is defined by:

- a set $E = \{e_1, \dots, e_n\}$ of states;
- a set $A = \{a_{ij}\}$ of transition probabilities between the states;
- the set $B$ of emission probabilities. Each state $e_x$ is associated with a sound model $L_x$ of the corresponding sound class, and therefore with a set of emission probabilities computed according to $L_x$: for an observation $y_t$ of $Y$, $b_x(y_t)$ is the probability calculated from $L_x$ for $y_t$.

The HMM is fully connected. Transition probabilities are established according to a set of rules complying with the three conditions of Eq. (1).
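For illustration, the following minimal sketch (ours, not part of the original system) mirrors this structure in Python: a fully connected HMM with one sound model per state. The constant self-loop probability `stay` is an assumption standing in for the paper's actual transition rules of Eq. (1).

```python
import numpy as np

class EHMM:
    """Sketch of the E-HMM structure: one state per sound class, fully
    connected. The transition rule here (constant self-loop probability,
    remaining mass shared equally among the other states) is an assumed
    stand-in for the paper's Eq. (1)."""

    def __init__(self, stay=0.6):
        self.stay = stay      # assumed self-loop probability
        self.models = []      # one sound model (e.g. a GMM) per state
        self.A = np.zeros((0, 0))

    def add_state(self, sound_model):
        """Step 1 of a stage: add a state and rebuild the transitions."""
        self.models.append(sound_model)
        n = len(self.models)
        if n == 1:
            self.A = np.array([[1.0]])
        else:
            leave = (1.0 - self.stay) / (n - 1)
            self.A = np.full((n, n), leave)
            np.fill_diagonal(self.A, self.stay)

# e.g. building the 2-state HMM of stage 2:
hmm = EHMM(stay=0.6)
hmm.add_state("L1")
hmm.add_state("L2")
assert np.allclose(hmm.A.sum(axis=1), 1.0)  # rows are proper distributions
```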
2.2. Detection of sound classes and segmentation model building
The segmentation model is generated by an iterative process which detects and adds a new state at each stage $n$, growing the HMM from $M_{n-1}$ to $M_n$. We refer to this evolutive HMM as an E-HMM [4].

At the process initialization stage (figure 2), the HMM $M_1$ is composed of a single state "1", which is associated with a sound model $L_1$ learned from the whole signal $Y$; $L_1$ is the sound model for the not-yet-detected speakers (i.e. all of them). At the end of the initialization, a first trivial segmentation is generated: each observation $y_t$ is simply labeled with the only sound class $L_1$. This segmentation is composed of a single segment, which will be challenged at the following stages.
Figure 1: Diagram of the segmentation process (steps 1 & 2: adding a new speaker model; step 3: adapting the speaker models; step 4: assessing the stop criterion).
The process (e.g. stages 2 and 3 in figure 2) is divided into 4 steps (figure 1) for each stage $n$ ($n > 1$):

Step 1: A new state $e_n$ is added to the set $E$, and the transition probabilities are adapted to take the new number of states into account. The new HMM $M_n$ is obtained.

Step 2: The sound model $L_n$ is estimated from a subset $s_{i_0}$ of observations (in this work, each subset has a fixed 3-second duration). The subset is selected such that

$i_0 = \operatorname{ArgMax}_i \prod_{y_t \in s_i} b_n(y_t)$   (2)

i.e. $i_0$ is the rank of the subset maximizing the product of the probabilities for the sound class $L_n$. Moreover, the segmentation is updated: the subset $s_{i_0}$ is relabeled with the sound class $L_n$ (3).

Step 3: The process iteratively adapts the parameters of the HMM $M_n$:

(a) for each sound class in the segmentation, the sound model $L_x$ is adapted according to the data assigned to it;
(b) the set $B$ of emission probabilities is recomputed;
(c) the Viterbi algorithm is applied to obtain a new version of the segmentation according to the HMM, and the likelihood of the Viterbi path is computed (4).

If a gain is observed between two loops of step 3, the process returns to (a).

Step 4: Lastly, the stop criterion is assessed: if the new segmentation improves on the previous one (5), a new stage starts back at step 1.

Note: the probability of the previous segmentation is re-estimated with the transitions of the new model, because the topologies of the two segmentation models must be comparable.
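To make the four steps concrete, the toy sketch below (ours, not the AMIRAL implementation) reproduces the control flow on one-dimensional data: sound models are single Gaussians, Viterbi decoding is replaced by framewise classification, and the stop criterion of Eq. (5) is replaced by an assumed gain threshold.

```python
import numpy as np

def logpdf(y, mu, sigma):
    # log-density of a 1-D Gaussian, standing in for the GMM likelihoods
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def e_hmm_sketch(Y, win=30, max_stages=5, min_gain=0.0):
    """Toy rendition of the four steps; only the control flow of the
    E-HMM process is mirrored, all simplifications are assumptions."""
    Y = np.asarray(Y, float)
    models = [(Y.mean(), Y.std() + 1e-6)]   # initialization: L1 on all data
    best_seg = np.zeros(len(Y), int)        # trivial one-class segmentation
    best_score = logpdf(Y, *models[0]).sum()
    for _ in range(max_stages):
        # Steps 1 & 2: fixed-length subset best explained by a model of its own
        cand = [logpdf(Y[t:t + win], Y[t:t + win].mean(),
                       Y[t:t + win].std() + 1e-6).sum()
                for t in range(len(Y) - win)]
        t0 = int(np.argmax(cand))
        models.append((Y[t0:t0 + win].mean(), Y[t0:t0 + win].std() + 1e-6))
        seg = best_seg.copy()
        seg[t0:t0 + win] = len(models) - 1  # relabel the selected subset
        # Step 3: alternate model adaptation (a) and decoding (b, c)
        prev = -np.inf
        while True:
            for k, _ in enumerate(models):  # (a) re-estimate each model
                data = Y[seg == k]
                if len(data) > 1:
                    models[k] = (data.mean(), data.std() + 1e-6)
            ll = np.array([[logpdf(y, m, s) for m, s in models] for y in Y])
            new_seg = ll.argmax(axis=1)     # (b, c) framewise "decoding"
            score = ll.max(axis=1).sum()
            if score <= prev:               # no gain: stop adapting
                break
            prev, seg = score, new_seg
        # Step 4: keep the new speaker only if the indexing improved enough
        if prev - best_score <= min_gain:   # assumed stand-in for Eq. (5)
            break
        best_score, best_seg = prev, seg.copy()
    return best_seg

# e.g. two well-separated synthetic "speakers"; the assumed threshold
# rejects a spurious third model, so two labels are expected.
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(np.unique(e_hmm_sketch(y, min_gain=150.0)))
```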
3. Automatic Speaker Recognition System
The sound models and emission probabilities are calculated by the AMIRAL ASR system developed at LIA [5]. Emission probabilities are computed on fixed-length blocks of 0.3 second, and each emission probability is normalized by the world model.
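A minimal sketch of such block-based scoring follows, assuming per-frame log-likelihood functions for the speaker and world models and a 10 ms frame step (so 0.3 s is about 30 frames); both the names and the frame rate are our assumptions, not values stated by the paper.

```python
import numpy as np

def block_scores(frames, speaker_loglik, world_loglik, block=30):
    """Average log-likelihood ratio against the world model over fixed
    blocks (block=30 frames assumes a 10 ms frame step). `speaker_loglik`
    and `world_loglik` map an array of frames to per-frame log-likelihoods."""
    llr = speaker_loglik(frames) - world_loglik(frames)  # world normalization
    n = len(llr) // block * block                        # drop the remainder
    return llr[:n].reshape(-1, block).mean(axis=1)       # one score per block
```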
Figure 2: Example of segmentation for a 2-speaker test. The process initialization yields the best one-speaker indexing (L1 learned on the whole signal). At each following stage a speaker is added: in steps 1 & 2 the best subset is used to learn the new model (L2, then L3) and a new HMM is built, giving an initial indexing according to the selected subset; in step 3, Viterbi decoding and model adaptation alternate until no gain is observed; in step 4 the stop criterion is assessed. After adding L2 a gain is observed, so a new speaker is tried; after adding L3 no gain is observed, so the process stops and returns the best 2-speaker indexing.
Acoustic parameterization (16 cepstral coefficients and 16 Δ-cepstral coefficients) is carried out using the SPRO module developed by the ELISA consortium [10]. (The ELISA consortium is composed of European research laboratories working on a common platform; its members for the participation in NIST 2001 were ENST (France), IRISA (France), LIA (France) and RMA (Belgium).)

The sound classes are modeled by Gaussian mixture models (GMM) with 128 components and diagonal covariance matrices [7], adapted from a background model. A sound model is first estimated over a subsequence of 3 seconds (sec. 2.2, step 2); it is then adapted from the segments labeled with its sound class (sec. 2.2, step 3a).
The adaptation scheme for training speaker models is based on the maximum a posteriori (MAP) method. For each Gaussian $g$, the mean $\mu_g$ of the sound model is a linear combination of the mean $\hat{\mu}_g$ estimated on the data of the sound class and the corresponding mean $\mu_g^{wld}$ of the background model:

$\mu_g = \alpha \, \mu_g^{wld} + (1 - \alpha) \, \hat{\mu}_g$   (6)

where $\alpha$ is the weight of the background model. Neither the weights nor the covariance matrices are adapted: the sound model uses the weights and covariance matrices of the background model.
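A sketch of this adaptation rule is given below, assuming the per-frame Gaussian occupation probabilities (E-step posteriors) are already available; the function name and shape conventions are ours.

```python
import numpy as np

def map_adapt_means(mu_world, data, posteriors, alpha):
    """Eq. (6): the adapted mean of each Gaussian is a linear combination
    of the background-model mean and the mean estimated on the class data.
    mu_world: (G, D) background means; data: (T, D) frames;
    posteriors[t, g]: occupation probability of Gaussian g at frame t
    (assumed precomputed). Weights and covariances are not adapted."""
    occ = posteriors.sum(axis=0)                                 # (G,) counts
    mu_hat = (posteriors.T @ data) / np.maximum(occ, 1e-9)[:, None]
    return alpha * mu_world + (1.0 - alpha) * mu_hat             # Eq. (6)
```

With $\alpha = 1$ the model stays at the background model, and with $\alpha = 0$ the rule reduces to one ML re-estimation of the means, matching the behaviour discussed in sec. 4.4.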
4. Experiments
The proposed approach was evaluated on the N-speaker segmentation task of the NIST 2001 evaluation campaign [9]. The results are shown in sec. 4.5; development experiments on the learning method are reported in sec. 4.4.
4.1. Databases
The N-speaker evaluation corpus (described in [8] and [9]) is composed of 500 conversational speech tests drawn from the CALLHOME corpus. The tests are of varying length (up to about 10 minutes) and are taken from 6 languages. The exact number of speakers is not provided (but is less than 10).

The NIST corpus is divided into two parts of 250 tests each, named Dev and Eva.

NIST provided a separate development corpus (named train_ch) composed of 48 conversational speech samples extracted from the CALLHOME corpus. The train_ch corpus was used to learn the background model wld_ch.

A second separate development data set (train_sb) is composed of 472 trials involving up to 100 speakers from Switchboard 2; train_sb was used to learn the background model wld_sb.
4.2. Experiments
Experiments were carried out to estimate the influence of the $\alpha$ parameter of MAP learning (Eq. 6), applied with both the wld_ch and wld_sb background models. Moreover, a reference experiment based on a trivial segmentation (a single segment), named trivial, was generated.

The results are obtained with the transition probabilities estimated according to Eq. (1), and with the MAP parameter $\alpha$ of Eq. (6) varied between 0 and 1.
4.3. Scoring measures
Two evaluation measures are considered:

- the mean $m_{diff}$ of the differences between the estimated number of speakers and the real number of speakers over the tests (a minimal sketch of this measure follows the list):

$m_{diff} = \frac{1}{N} \sum_{k=1}^{N} (\hat{n}_k - n_k)$   (7)

where $\hat{n}_k$ and $n_k$ are the estimated and real numbers of speakers in test $k$, and $N$ is the number of tests;

- the NIST speaker segmentation scoring, described in [9]. Scoring is computed on the NIST reference segments in which only one speaker is speaking; this score corresponds to a segmentation error.
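As referenced above, here is a one-line sketch of the $m_{diff}$ measure (the NIST segmentation score itself is defined by the official scoring tools and is not re-implemented here):

```python
def mean_speaker_count_error(estimated, reference):
    """m_diff (Eq. 7): mean signed difference between the estimated and
    the real number of speakers over a set of tests."""
    return sum(e - r for e, r in zip(estimated, reference)) / len(reference)

# e.g. mean_speaker_count_error([2, 3, 2], [2, 4, 3]) == -2/3 (underestimation)
```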
4.4. Results
In order to compare the influence of the $\alpha$ parameter, figures 3 and 4 respectively show the segmentation scores and the mean $m_{diff}$ obtained with both background models. The best results are presented in tables 1 and 2 (respectively for Dev and Eva). The results on the Eva corpus are close to those obtained on the Dev corpus.

When the weight $\alpha$ is close to 1, the result becomes equivalent to the trivial segmentation: the system mainly attributes the data to a single class, and the process does not add new speaker models ($m_{diff}$ is near -1.5).

When the weight is 0, the adaptation becomes equivalent to one iteration of EM-ML training [6], initialized with the corresponding background model. This weight gives the best result (24.01%) for the wld_sb background model. Although sufficient data is available to compute this background model, the train_sb data is very different from the CALLHOME data (Dev and Eva).

The background model wld_ch takes advantage of the MAP learning method: its best result (25.5%) is obtained for $\alpha = 0.3$. Little data is available to learn this background model, which therefore does not generalize well to the Dev and Eva data of the CALLHOME corpus.

For both background models, the mean $m_{diff}$ of the differences between the estimated and real numbers of speakers is quite good (near 0.5 for the best results with wld_ch and wld_sb).
Corpus   Background model   α     Score (%)   m_diff
Dev      wld_sb             0     24.01       0.58
Dev      wld_ch             0.3   25.50       0.71

Table 1: Best results for the background models on the Dev corpus.

Corpus   Background model   α     Score (%)   m_diff
Eva      wld_sb             0     23.42       0.54
Eva      wld_ch             0.3   25.17       0.44

Table 2: Validation on the Eva corpus of the results obtained on the Dev corpus.
4.5. NIST 2001 results
Two systems were presented at the NIST 2001 N-speaker segmentation evaluation. Both use the wld_sb background model to adapt the speaker models, with the $\alpha$ parameter equal to 0.
Figure 3: NIST N-speaker segmentation score (%): influence of the $\alpha$ parameter (weight of the background model), for the background model learned on SWITCHBOARD 2 (wld_sb) and the background model learned on CALLHOME (wld_ch), with the trivial segmentation as reference.
Figure 4: Influence of the $\alpha$ parameter of MAP (weight of the background model) on the mean $m_{diff}$, for wld_sb (background model learned on SWITCHBOARD 2) and wld_ch (background model learned on CALLHOME).
The difference between the two systems is the value of the parameter used to compute the HMM transition probabilities (Eq. 1): in the first system, LIA10, it is equal to 0.6; the second system, LIA20, uses another value.

Note: these parameters were estimated before NIST 2001 on the train_ch corpus, in accordance with the evaluation rules.
Tables 3 and 4 show the results of LIA10 and LIA20, as well as the results of the trivial segmentation.

Table 3 shows the scores broken down by the number of speakers present in each test. The systems are well adapted to tests in which many speakers speak, and the number of speakers is detected correctly. However, for 2-speaker tests the scores (22% and 24%) are close to the score of the trivial system (26%).

The LIA10 and LIA20 scores are almost equal across the different speaker languages: the chosen learning method is well adapted when the speaker's language is unknown.
5. Summary
In this article, the segmentation system uses an evolutive HMM to model the conversation and to determine automatically the sound classes present in the messages. The approach is based on an iterative algorithm which detects and adds the sound models one by one. At each stage, a segmentation is proposed according to the available knowledge; this segmentation is called into question at the following iteration, until the optimal segmentation is reached.

In view of the results, the system behaves satisfactorily. Experiments showed that MAP training is well adapted to the selected segmentation task, and that the weight between the background model and the estimated sound model has an influence on the segmentation error.

Further work will focus on these two points, by adapting the SWITCHBOARD background model with CALLHOME data and by introducing an explicit duration model into the HMM to improve speaker detection.
6. References
[1] P. Delacourt, D. Kryze, C.J. Wellekens. Use of second-order statistics for speaker-based segmentation, EUROSPEECH, 1999.
[2] H. Gish, M.-H. Siu, R. Rohlicek. Segregation of speakers for speech recognition and speaker identification, ICASSP, pages 873-876, 1991.
[3] L. Wilcox, D. Kimber, F. Chen. Audio indexing using speaker identification, SPIE, pages 149-157, July 1994.
[1] K. Sönmez, L. Heck, M. Weintraub. Speaker tracking and detection with multiple speakers, EUROSPEECH, 1999.
[4] S. Meignier, J.-F. Bonastre, C. Fredouille, T. Merlin. Evolutive HMM for multi-speaker tracking system, ICASSP, June 2000.
[5] C. Fredouille, J.-F. Bonastre, T. Merlin. AMIRAL: a block-segmental multi-recognizer approach for automatic speaker recognition, Digital Signal Processing, Vol. 10, No. 1-3, pp. 172-197, Jan.-Apr. 2000.
[6] A. Dempster, N. Laird, D. Rubin. Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc., Vol. 39, pp. 1-38, 1977.
[7] D. A. Reynolds. Speaker identification and verification using Gaussian mixture speaker models, Speech Communication, pp. 91-108, Aug. 1995.
System    Dev+Eva   2-spkrs   3-spkrs   4-spkrs   5-spkrs   6-spkrs   7-spkrs
Files     500       303       136       43        10        6         2
trivial   38        26        39        49        50        56        61
LIA10     24        22        25        23        30        35        37
LIA20     24        24        24        22        29        38        34

Table 3: NIST 2001 scores (%) for the N-speaker segmentation task, by number of speakers in each test.

System    arabic   english   german   japanese   mandarin   spanish
Files     95       56        67       68         118        96
trivial   40       24        30       33         42         42
LIA10     24       23        19       26         25         26
LIA20     22       26        24       26         24         25

Table 4: NIST 2001 scores (%) for the N-speaker segmentation task, by language.
[8] The 2000 NIST Speaker Recognition Evaluation Plan, http://www.nist.gov/speech/tests/spk/2000/doc/spk-2000-plan-v1.0.htm.
[9] The NIST Year 2001 Speaker Recognition Evaluation Plan, http://www.nist.gov/speech/tests/spk/2001/doc/2001-spkrec-evalplan-v53.pdf.
[10] ELISA Consortium. Overview of the ELISA consortium research activities, Odyssey, 2001.