ViVoLAB Speaker Diarization System for the DIHARD 2019 Challenge
Ignacio Viñals, Pablo Gimeno, Alfonso Ortega, Antonio Miguel, Eduardo Lleida
ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Spain
{ivinalsb, pablogj, ortega, amiguel, lleida}@unizar.es
Abstract
This paper presents the latest improvements in speaker diarization obtained by the ViVoLAB research group for the 2019 DIHARD Diarization Challenge. This evaluation seeks the improvement of the diarization task in adverse conditions. For this purpose, the audio recordings involve multiple scenarios with no restrictions in terms of number of speakers, overlapped speech or audio quality. Our submission follows the traditional segmentation-clustering-resegmentation pipeline: speaker embeddings are extracted from acoustic segments containing a single speaker and are later clustered by means of a PLDA model. Our contribution in this work is focused on the clustering step. We present results with our Variational Bayes PLDA clustering and our tree-based clustering strategy, which sequentially assigns the embeddings to their corresponding speakers according to a PLDA model. Both strategies compare multiple diarization hypotheses and choose their candidate according to a generative criterion. We also analyze the impact of the different embeddings available in the state of the art with both clustering approaches.
Index Terms: diarization, DIHARD Challenge, PLDA, Varia-
tional Bayes, Tree search, M-algorithm
1. Introduction
The speech signal is a very rich source of information with multiple levels of knowledge, from low-level speech and speaker information to higher-level attributes such as emotion or context. Many of these are worth extracting, so speech technologies have developed multiple tasks to process each sort of desired information. Diarization is the task focused on explaining a given audio in terms of the active speaker at each time. Thus, diarization must identify the speaker every time somebody talks and differentiate his or her speech from that of the others.
Diarization is applied to multiple scenarios, such as telephone, meetings or broadcast. Each of these scenarios presents its own particularities (number of speakers, noise, etc.), making solutions domain dependent. The DIHARD 2019 evaluation seeks the improvement of diarization regardless of the scenario. For this reason, the database contains audio from multiple domains such as YouTube, court trials, meetings, etc. These domains are characterized by adverse conditions such as unknown levels of noise or reverberation. Four conditions are presented, including single-microphone (tracks 1 and 2) and multiple-microphone (tracks 3 and 4) scenarios. Odd and even tracks only differ in the Voice Activity Detection (VAD), provided by the organizers in tracks 1 and 3 and produced by the participants in tracks 2 and 4.
For the DIHARD 2019 challenge, the ViVoLAB team has prepared a submission built around the bottom-up diarization strategy: first, we divide the input audio into segments with a single speaker by means of Bayesian Information Criterion approaches [1]. These segments, converted into a compact representation either by linear solutions, e.g. i-vectors [2], or nonlinear ones, e.g. x-vectors [3], are clustered afterwards. Regarding clustering, two strategies are considered: a diarization by means of Variational Bayes (VB) resegmentation [4, 5, 6] and a new diarization approach based on tree modeling assisted by the M-algorithm [7]. The obtained labels are finally refined by means of HMMs with eigenvoice priors [8].
The paper is organized as follows. Our Variational Bayes clustering is explained in Section 2. Section 3 describes the tree-search clustering approach. The experimental setup is described in Section 4. Section 5 contains our results. Finally, we include our conclusions in Section 6.
2. Clustering by means of Variational Bayes
The bottom-up approach in diarization consists of dividing the given audio into a set of N segments S = {s_1, s_2, ..., s_N}, compactly represented by the set of embeddings Φ = {φ_1, φ_2, ..., φ_N}. These representations are clustered so that elements from the same speaker end up together. This is equivalent to labeling the embeddings with a partition Θ = {θ_1, θ_2, ..., θ_N} where elements from the same cluster share a common label. Therefore, clustering is an assignment task in which we must find those labels Θ that best explain the given data Φ. Mathematically:

\Theta_{\mathrm{diar}} = \arg\max_{\Theta} P(\Theta \mid \Phi) \quad (1)
In this clustering approach [9] we estimate the conditional distribution P(Θ|Φ) in terms of a PLDA model [10] and then carry out the maximization. This inference of the desired distribution is done by means of the Fully Bayesian PLDA [4], solved by Variational Bayes (VB).
The Fully Bayesian PLDA is a generalization of the PLDA model [10]. Given a set of N elements from M speakers, the original model assumes the assignment labels to be known. However, in our version a latent variable Θ is used instead, being in charge of the assignment. Hence, a set of N embeddings is modeled by M speakers as follows:
P(\Phi \mid \Theta) = \prod_{j=1}^{N} \prod_{i=1}^{M} \mathcal{N}\left(\phi_j \mid \mu + V y_i, W^{-1}\right)^{\theta_{ij}} \quad (2)
The hidden variable Θ, representing the speaker labels, is modeled as a multinomial distribution with a Dirichlet prior πθ. The Fully Bayesian approach also replaces the model parameters (µ, V and W), which are point estimates in the original model, with latent variables, each one with its own prior α, to gain robustness. Its Bayesian network is shown in Fig. 1, and further information about the model can be found in the original work [4].
The described model allows the definition of the joint conditional distribution P(Y, Θ|Φ), which models the speaker variables Y and the speaker labels Θ given the embeddings Φ.
[Figure 1: Bayesian network of the Fully Bayesian PLDA.]
However, the marginalization of the speaker latent variable Y is not straightforward.
In consequence, we propose to approximate the posterior by means of Variational Bayes. Its application approximates the joint posterior by a product of independent factors, each one depending on a single latent variable of our model. The considered decomposition is:

P(Y, \Theta, \pi_\theta, \mu, V, W, \alpha \mid \Phi) \quad (3)
\approx q(Y)\, q(\Theta)\, q(\pi_\theta)\, q(\mu)\, q(V)\, q(W)\, q(\alpha) \quad (4)
Despite the closed-form expression of the factor q(Θ), its maximization requires an iterative re-evaluation of the factors related to the labels (q(Y), q(Θ) and q(πθ)), obtaining the labels by a coordinate-ascent optimization. This solution requires an initial value for the distributions, with severe implications for the performance. Therefore, multiple seeds are provided, opting for one of them according to a penalized lower bound [6].
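To make the procedure concrete, the sketch below (in Python, with names of our own choosing) alternates the label-related updates for a simplified model: the PLDA parameters (µ, V, W) are kept as fixed point estimates instead of latent variables, the Dirichlet prior πθ is replaced by a uniform one, and a single random seed is shown. It is an illustrative approximation of the scheme in [4, 6], not the exact system.

```python
import numpy as np

def vb_plda_labels(phi, mu, V, W, n_spk, n_iters=20, seed=0):
    """Simplified VB label inference: alternate q(Y) and q(Theta) updates
    with fixed PLDA parameters. phi: (N, D) embeddings, V: (D, d) speaker
    loading matrix, W: (D, D) within-speaker precision."""
    rng = np.random.default_rng(seed)
    N, D = phi.shape
    d = V.shape[1]
    centered = phi - mu
    VtW = V.T @ W                                      # (d, D)
    VtWV = VtW @ V                                     # (d, d)

    # q(Theta): soft responsibilities, randomly seeded. In the full system
    # several seeds are run and compared via a penalized lower bound [6].
    r = rng.dirichlet(np.ones(n_spk), size=N)          # (N, n_spk)

    for _ in range(n_iters):
        # Update q(y_i) = N(m_i, S_i) for each candidate speaker.
        m = np.zeros((n_spk, d))
        S = np.zeros((n_spk, d, d))
        for i in range(n_spk):
            S[i] = np.linalg.inv(np.eye(d) + r[:, i].sum() * VtWV)
            m[i] = S[i] @ (VtW @ (r[:, i] @ centered))
        # Update q(Theta): expected log-likelihood of each embedding under
        # each speaker posterior (uniform prior over labels assumed here).
        logr = np.zeros((N, n_spk))
        for i in range(n_spk):
            diff = centered - m[i] @ V.T               # (N, D)
            quad = np.einsum('nd,de,ne->n', diff, W, diff)
            logr[:, i] = -0.5 * (quad + np.trace(W @ V @ S[i] @ V.T))
        logr -= logr.max(axis=1, keepdims=True)
        r = np.exp(logr)
        r /= r.sum(axis=1, keepdims=True)

    return r.argmax(axis=1)                            # hard diarization labels
```

In the actual Fully Bayesian system, q(µ), q(V), q(W) and q(πθ) would also be updated inside the loop, and several seeds would be compared through the penalized lower bound.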
3. Tree-based model for diarization
purposes
Our alternative proposal reinterprets the maximization of the posterior probability P(Θ|Φ). Instead of estimating the posterior distribution, we can carry out the maximization of the log-likelihood of the joint distribution as:

\Theta_{\mathrm{diar}} = \arg\max_{\Theta} P(\Theta \mid \Phi) = \arg\max_{\Theta} P(\Phi \mid \Theta)\, P(\Theta) \quad (5)
3.1. The model
In [7] we propose a definition of P(Φ|Θ) where the set of embeddings Φ = {φ_1, φ_2, ..., φ_N} is arranged in a sequence so that turn j is either assigned to an existing cluster or classified as the first element of a new cluster. This decision is made according to speaker models constructed with the previous embeddings in the sequence already classified. Mathematically,
P(\Phi \mid \Theta) = \prod_{j=1}^{N} \prod_{i=1}^{M} P\left(\phi_j \mid \Phi_i^{(j-1)}\right)^{\theta_{ij}} \quad (6)
where φ_j represents the j-th embedding and θ_ij the label assigning element j to speaker cluster i. For this purpose, θ_ij only takes the value one if embedding j corresponds to cluster i, being zero otherwise. The variable Φ_i^(j-1) describes the embeddings previously assigned to cluster i.
Our definition of the conditional probability P(φ_j | Φ_i^(j-1)) is done in terms of the PLDA model, which uses a latent speaker variable to model the embeddings. This latent variable is conditioned on the elements already assigned to the hypothesis cluster and is later marginalized.

[Figure 2: Bayesian network of our tree decoding model for a 4-element sequence.]
The imposition of Gaussian distributions makes our desired distribution Gaussian:

P\left(\phi_j \mid \Phi_i^{(j-1)}\right) = \int P(\phi_j \mid y_{ij})\, P\left(y_{ij} \mid \Phi_i^{(j-1)}\right) dy_{ij} \quad (7)
= \mathcal{N}\left(\phi_j \mid \mu_{ij}, \Sigma_{ij}\right) \quad (8)
whose mean µ_ij and variance Σ_ij are defined in terms of the PLDA parameters (µ, V and W) by

\mu_{ij} = \mu + V \mu_{y_{ij}} \quad (9)
\Sigma_{ij} = W^{-1} + V \Sigma_{y_{ij}} V^{T} \quad (10)
These terms depend on the posterior distribution of y_ij, estimated in terms of the previous decisions Φ_i^(j-1). Its mean µ_{y_ij} and variance Σ_{y_ij} are given by

\mu_{y_{ij}} = \Sigma_{y_{ij}} V^{T} W \sum_{k=1}^{j-1} \theta_{ik} \left(\phi_k - \mu\right) \quad (11)
\Sigma_{y_{ij}}^{-1} = I + \left(\sum_{k=1}^{j-1} \theta_{ik}\right) V^{T} W V \quad (12)
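The following minimal sketch evaluates Eqs. (7)-(12) numerically; the function name and the toy dimensions in the usage example are ours, and a full-rank within-speaker precision W is assumed.

```python
import numpy as np

def plda_predictive_loglik(phi_new, assigned, mu, V, W):
    """Log of Eq. (7): predictive Gaussian of a new embedding for cluster i,
    given the (n, D) array of embeddings already assigned to that cluster
    (n may be zero). mu: (D,), V: (D, d), W: (D, D) PLDA parameters."""
    d = V.shape[1]
    n = len(assigned)
    VtW = V.T @ W
    # Posterior of the speaker variable y_ij, Eqs. (11)-(12).
    Sy = np.linalg.inv(np.eye(d) + n * (VtW @ V))
    my = Sy @ (VtW @ (assigned - mu).sum(axis=0)) if n else np.zeros(d)
    # Predictive mean and covariance, Eqs. (9)-(10).
    mean = mu + V @ my
    cov = np.linalg.inv(W) + V @ Sy @ V.T
    diff = phi_new - mean
    _, logdet = np.linalg.slogdet(2.0 * np.pi * cov)
    return -0.5 * (logdet + diff @ np.linalg.solve(cov, diff))


# Toy usage with hypothetical dimensions (D = 64, d = 16).
rng = np.random.default_rng(1)
mu, V, W = np.zeros(64), 0.1 * rng.standard_normal((64, 16)), np.eye(64)
print(plda_predictive_loglik(rng.standard_normal(64),
                             rng.standard_normal((3, 64)), mu, V, W))
```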
The label prior P(Θ) is modeled based on the Distance-Dependent Chinese Restaurant (DDCR) process [11, 12]. This process is a probability distribution over partitions, assigning a sequence of elements to different clusters. The element θ_j, at the j-th turn, stays in the last assigned cluster with probability p_0. Otherwise, it is either assigned to an existing cluster k = 1..K proportionally to the occupation of the clusters at time j, or assigned to a new empty cluster with a certain probability. Hence,

P\left(\theta_j = k \mid \theta_1^{(j-1)}\right) \propto
\begin{cases}
p_0 & \text{if } k = \theta^{(j-1)} \\
n_k & \text{if } k \neq \theta^{(j-1)} \text{ and } k \leq K \\
\alpha & \text{if } k \neq \theta^{(j-1)} \text{ and } k = K + 1
\end{cases} \quad (13)

where n_k is the number of times variable θ was assigned to cluster k. The Bayesian network for the whole model can be seen in Fig. 2.
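A direct transcription of Eq. (13) as an unnormalized log prior could look as follows; the values of p0 and alpha are hypothetical examples.

```python
import math

def ddcr_log_prior(k, prev_label, counts, p0=0.5, alpha=1.0):
    """Unnormalized log of Eq. (13) for assigning turn j to cluster k.
    prev_label: cluster of turn j-1 (None for the first turn); counts: list
    of occupations n_k of the K existing clusters; k == len(counts) means
    opening a new cluster. p0 and alpha are hypothetical example values."""
    if k == prev_label:
        return math.log(p0)           # stay with the previous speaker
    if k < len(counts):
        return math.log(counts[k])    # existing cluster, proportional to n_k
    return math.log(alpha)            # open a new cluster
```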
The model we have presented has the shape of a decision tree in which every node represents an assignment decision. The number of branches leaving a node represents the number of decisions we can make. Therefore, each path from the root of the tree (the first element) to a leaf (the last embedding) is a possible partition of the set to cluster.
3.2. The maximization
The inference of the partition which maximizes the model is not straightforward. In fact, except for short sequences with a very low number of speakers, a brute-force approach comparing all possible partitions is infeasible, as analyzed in [13]. Moreover, efficient search algorithms such as Viterbi cannot be applied here: in our model any decision is conditioned on previous choices, so the Markov assumption does not hold.
An efficient technique to deal with tree structures is the M-algorithm, widely used in communications. At each time j we propagate a handful of L surviving paths along all possible branches from each node. Considering that each node has M branches, i.e. from every node we can transit to the M candidate speakers, we take into account up to LM candidate paths to progress through. This selection of paths is ranked in terms of likelihood, only considering the top L paths as the most promising candidates to propagate to time j+1. This process is repeated until the end of the audio, when the most promising path alive, the one with the highest log-likelihood, is chosen as the diarization labeling. By this technique, we reduce the exponential number of partitions, O(M^N), of the brute-force approach to a linear problem, O(MNL).
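The search itself can be sketched as a beam search over partial labelings. The sketch below is ours: it assumes two scoring callables shaped like the PLDA predictive and the DDCR prior sketched earlier, with `beam` playing the role of L and `max_spk` the role of M.

```python
import numpy as np

def m_algorithm_diarization(embeddings, max_spk, beam, log_lik, log_prior):
    """Beam (M-algorithm) search over the assignment tree described above.
    At every turn j each surviving path is extended towards all candidate
    clusters (the existing ones plus a new one, capped at max_spk), the
    extended paths are ranked by accumulated log P(Phi|Theta) + log P(Theta),
    and only the top `beam` paths (the L survivors) reach turn j+1.
    log_lik(phi, assigned) and log_prior(k, prev_label, counts) are expected
    to behave like the PLDA predictive and DDCR prior sketched previously."""
    embeddings = np.asarray(embeddings)
    paths = [([], 0.0)]                                    # (labels, log-score)
    for phi in embeddings:
        candidates = []
        for labels, score in paths:
            n_clusters = max(labels) + 1 if labels else 0
            prev = labels[-1] if labels else None
            counts = [labels.count(i) for i in range(n_clusters)]
            for k in range(min(n_clusters + 1, max_spk)):  # existing + new
                assigned = embeddings[[t for t, l in enumerate(labels) if l == k]]
                new_score = (score + log_lik(phi, assigned)
                             + log_prior(k, prev, counts))
                candidates.append((labels + [k], new_score))
        # Keep only the `beam` most promising paths alive.
        paths = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam]
    return paths[0][0]                                     # best labeling found
```

For instance, `log_lik` and `log_prior` could be the `plda_predictive_loglik` and `ddcr_log_prior` functions sketched above.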
4. Experimental setup
4.1. Data resources
The DIHARD 2019 challenge imposes very few restrictions regarding the data. Except for a few datasets excluded because they are part of the DIHARD data, any other source of knowledge is allowed. In our case we have constructed a data pool trying to combine as many scenarios as possible. The MGB 2015 broadcast dataset [14] contributes more than 1000 hours of labeled BBC data covering many scenarios. These data are combined with AMI [15] and ICSI [16], both meeting datasets with multiple speakers recorded with different microphones. The YouTube domain is also included in the pool via VoxCeleb1 [17] and VoxCeleb2 [18]. Finally, the DIHARD 2019 development set is also included in the data pool for adaptation purposes.
4.2. Speaker representation & clustering
Our diarization system works under the bottom-up approach, relying on the embedding-PLDA paradigm. Hence we need to convert the input audio into tractable representations with speaker awareness.
Any given audio is first converted into a stream of feature vectors: 20-coefficient MFCC vectors are extracted from 25 ms windows with 10 ms shifts. These features are then used to carry out the initial segmentation with the Bayesian Information Criterion [1]. Voice Activity Detection (VAD) information is obtained with a 2-layer BLSTM neural network trained on the DIHARD development set.
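As an illustration of this front-end configuration, a possible extraction with librosa (our toolkit choice, assuming 16 kHz audio; the paper does not specify the implementation) is:

```python
import librosa

def extract_mfcc(wav_path, sr=16000):
    """20-coefficient MFCCs over 25 ms windows with 10 ms shifts, as described
    above. The 16 kHz sampling rate and the librosa toolkit are assumptions."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20,
                                n_fft=int(0.025 * sr),       # 25 ms window
                                hop_length=int(0.010 * sr))  # 10 ms shift
    return mfcc.T                                            # (frames, 20)
```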
The obtained segments are then converted into compact representations. Two main representations are considered. On the one hand, our baseline system follows our contribution in [6] using i-vectors [2]: a 512-Gaussian GMM is followed by a 200-dimensional T matrix, both trained using MGB, AMI and ICSI. On the other hand, we also use x-vector networks [3] trained on VoxCeleb, AMI and ICSI. Multiple architectures varying both the size of the network and the exact training pool are considered. Before clustering, both sorts of embeddings are centered, whitened and length-normalized [19].
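A minimal sketch of this preprocessing, assuming the statistics are estimated on a separate training pool of embeddings, is:

```python
import numpy as np

def preprocess_embeddings(train_embs, embs):
    """Centering, whitening and length normalization [19]; the function name
    and the eigen-decomposition whitening choice are ours."""
    mean = train_embs.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(train_embs - mean, rowvar=False))
    whitener = eigvec @ np.diag(1.0 / np.sqrt(eigval + 1e-10)) @ eigvec.T
    x = (embs - mean) @ whitener                          # center + whiten
    return x / np.linalg.norm(x, axis=1, keepdims=True)   # unit length
```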
Finally, clustering is done according to PLDA models [10]. While i-vectors work with 200-dimensional PLDAs trained on AMI, MGB and ICSI, x-vectors are clustered in terms of a low-dimensional (50) PLDA trained only on AMI and ICSI.

Table 1: DER (%) results with the core systems, obtained with the Variational Bayes (VB PLDA) and the tree-based sequential clustering (Tree Clustering) approaches, comparing i-vectors and x-vectors. Results obtained on the evaluation set.

          Embedding   VB PLDA   Tree Clustering
Track 1   i-vector    25.95     25.67
          x-vector    25.70     27.50
Track 2   i-vector    38.99     38.49
          x-vector    38.29     40.07
4.3. Resegmentation
Results provided by the clustering stage are later refined by means of resegmentation in order to improve the speaker boundaries. To do so, we make use of the resegmentation by HMMs and eigenvoices proposed in [8]. The same i-vector extractor (512 Gaussians and a 200-dimensional T matrix) used for the speaker representation is considered here.
5. Results
Our experimentation is centered on the single-microphone paradigm, i.e. tracks 1 and 2. For comparison purposes, the DIHARD evaluation considers the Diarization Error Rate (DER) as the scoring metric for performance evaluation and ranking.
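For reference, DER aggregates missed speech, false alarm and speaker confusion times over the total scored speaker time; a standard formulation (not stated in the original text) is:

\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{false\ alarm}} + T_{\mathrm{speaker\ confusion}}}{T_{\mathrm{total\ scored\ speaker\ time}}}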
Our submission can be divided into two stages. Our first step is the study of the core diarization system, that is, the segmentation and clustering steps. This part includes the analysis of the speaker embeddings, representing the audio information, and of the clustering strategies. Table 1 shows the results with our candidate core systems.
According to the results in Table 1, the performance of our core systems is consistent between tracks 1 and 2. Two clustering setups, the tree-based clustering with i-vectors and the VB PLDA with x-vectors, have the best performance in both tasks, with no significant differences between them. Regarding the other two systems, while the VB PLDA with i-vectors performs slightly worse, the tree-based x-vector system obtains the worst score.
Surprisingly, i-vectors are still competitive with respect to x-vectors, often outperforming them despite the well-known capabilities of the latter. Regarding the clustering techniques, the VB PLDA seems to be more robust to configuration changes than the tree clustering, which has obtained both the best and the worst results.
Furthermore, we have observed a high domain mismatch in the development set. As we can see in Table 2, the trends of behaviour are completely different, with i-vectors always outperforming x-vectors.
Once we have analyzed the core diarization systems, we add the resegmentation stage, based on HMMs with eigenvoices. The results are included in Table 3. They show a consistent improvement over all the core systems, with gains of around 0.5% in track 1 and 1.0% in track 2. Again, our tree-based system obtains both the best and the worst performance. However, this time the Variational Bayes PLDA works slightly worse with both embeddings, i-vectors and x-vectors.
Table 2: DER (%) results with the core systems, obtained with the Variational Bayes (VB PLDA) and the tree-based sequential clustering (Tree Clustering) approaches, comparing i-vectors and x-vectors. Results obtained on the development set.

          Embedding   VB PLDA   Tree Clustering
Track 1   i-vector    24.85     24.65
          x-vector    25.03     26.72
Track 2   i-vector    31.95     31.47
          x-vector    32.55     33.20
Table 3: DER (%) results with the core systems plus resegmentation, obtained with the Variational Bayes (VB PLDA) and the tree-based sequential clustering (Tree Clustering) approaches, comparing i-vectors and x-vectors. Results obtained on the evaluation set.

          Embedding   VB PLDA   Tree Clustering
Track 1   i-vector    25.55     25.02
          x-vector    25.51     26.71
Track 2   i-vector    38.04     37.22
          x-vector    37.63     39.03
6. Conclusions
Our work provides a comparison between our two clustering approaches, the well-known Variational Bayes PLDA clustering and our new tree-based clustering. Both have been tested with two types of embeddings, i-vectors and x-vectors.
The direct comparison between the clustering methods has shown that our new sequential approach is capable of outperforming our Variational Bayes approach. However, the experiments carried out with x-vectors obtained the worst marks, illustrating a greater robustness of the VB strategy. This extra robustness of the VB approach may be caused by its lower number of tunable hyperparameters (1 vs. 3). Besides, our results with i-vectors have demonstrated strong performance with both clustering strategies, often outperforming the x-vectors.
Research carried out with the development set has shown severe domain mismatch issues, highly noticeable at the PLDA level. This type of degradation is very adverse for our two clustering approaches, both PLDA based. However, this mismatch originates earlier, in the embedding extraction step. While PLDA is commonly adapted in the state of the art, DNN embedding extraction lacks a method to adapt a trained network to mismatched data. Without it, these neural networks can introduce error terms too complex to mitigate afterwards. Further research should be done to provide methods to successfully adapt these relevant models.
7. Acknowledgments
This work has been supported by the Spanish Ministry of Economy and Competitiveness and the European Social Fund through the project TIN2017-85854-C4-1-R, the Government of Aragón (Reference Group T36 17R) and co-financed with FEDER 2014-2020 "Building Europe from Aragón". We gratefully acknowledge the support of NVIDIA Corporation with the donation of a Titan Xp GPU.
8. References
[1] S. S. Chen and P. S. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion," in DARPA Broadcast News Workshop, 1998, pp. 127–132.
[2] N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-End Factor Analysis For Speaker Verification," IEEE Transactions on Audio, Speech and Language Processing, vol. 19, no. 4, pp. 788–798, May 2011.
[3] D. Snyder, P. Ghahremani, D. Povey, D. Garcia-Romero, Y. Carmiel, and S. Khudanpur, "Deep Neural Network-based Speaker Embeddings for End-to-end Speaker Verification," IEEE Spoken Language Technology Workshop (SLT), pp. 165–170, 2016.
[4] J. Villalba and E. Lleida, “Unsupervised Adaptation of PLDA
By Using Variational Bayes Methods,” in ICASSP, IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing
- Proceedings, 2014, pp. 744–748.
[5] I. Viñals, A. Ortega, J. Villalba, A. Miguel, and E. Lleida, "Domain Adaptation of PLDA models in Broadcast Diarization by means of Unsupervised Speaker Clustering," Interspeech, pp. 2829–2833, 2017.
[6] I. Viñals, P. Gimeno, A. Ortega, A. Miguel, and E. Lleida, "Estimation of the Number of Speakers with Variational Bayesian PLDA in the DIHARD Diarization Challenge," Interspeech, pp. 2803–2807, 2018.
[7] I. Viñals, A. Ortega, A. Miguel, and E. Lleida, "Tree-based Search Strategy for Clustering in Speaker Diarization Using the M-Algorithm," Interspeech 2019 (submitted), 2019.
[8] M. Diez, L. Burget, and P. Matejka, "Speaker Diarization based on Bayesian HMM with Eigenvoice Priors," Proceedings of Odyssey 2018 - The Speaker and Language Recognition Workshop, pp. 147–154, 2018.
[9] J. Villalba, A. Ortega, A. Miguel, and E. Lleida, "Variational Bayesian PLDA for Speaker Diarization in the MGB Challenge," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015, pp. 667–674.
[10] S. J. D. Prince and J. H. Elder, “Probabilistic Linear Discrimi-
nant Analysis for Inferences About Identity,” in Proceedings of
the IEEE International Conference on Computer Vision, 2007.
[11] D. M. Blei and P. I. Frazier, “Distance Dependent Chinese Restau-
rant Processes,” Journal of Machine Learning Research (JMLR),
vol. 12, pp. 2461–2488, 2011.
[12] A. Zhang, Q. Wang, Z. Zhu, J. Paisley, and C. Wang, “Fully Su-
pervised Speaker Diarization,” pp. 2–6, 2018.
[13] N. Brümmer and E. de Villiers, "The Speaker Partitioning Problem," ODYSSEY - The Speaker and Language Recognition Workshop, pp. 194–201, 2010.
[14] P. Bell, M. J. F. Gales, T. Hain, J. Kilgour, P. Lanchantin, X. Liu, A. McParland, S. Renals, O. Saz, M. Wester, and P. C. Woodland, "The MGB Challenge: Evaluating Multi-Genre Broadcast Media Recognition," in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), Scottsdale, Arizona, USA, Dec. 2015.
[15] J. Carletta, S. Ashby, S. Bourban, M. Flynn, M. Guillemot,
T. Hain, J. Kadlec, V. Karaiskos, W. Kraaij, M. Kronenthal,
G. Lathoud, M. Lincoln, A. Lisowska, L. McCowan, W. Post,
D. Reidsma, and P. Wellner, “The AMI Meeting Corpus: A
pre-announcement,” Lecture Notes in Computer Science (includ-
ing subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics), vol. 3869 LNCS, pp. 28–39, 2006.
[16] N. Morgan, D. Baron, J. Edwards, D. Ellis, D. Gel-
bart, A. Janin, T. Pfau, E. Shriberg, and A. Stolcke,
“The meeting project at ICSI,” Proceedings of the first
international conference on Human language technology re-
search - HLT ’01, pp. 1–7, 2001. [Online]. Available:
http://portal.acm.org/citation.cfm?doid=1072133.1072203
[17] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: A large-scale speaker identification dataset," Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 2616–2620, 2017.
[18] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 1086–1090, 2018.
[19] D. Garcia-Romero and C. Y. Espy-Wilson, “Analysis of I-vector
Length Normalization in Speaker Recognition Systems,” in Pro-
ceedings of the Annual Conference of the International Speech
Communication Association, INTERSPEECH, 2011, pp. 249–
252.