
# Online speaker clustering

Authors: Daben Liu, Francis Kubala (BBN Technologies)

## Abstract

This paper describes a set of new algorithms that perform speaker clustering in an online fashion. Unlike typical clustering approaches, the proposed method does not require the presence of all the data before performing clustering. The clustering decision is made as soon as an audio segment is received. Being causal, this method enables low-latency incremental speaker adaptation in online speech-to-text systems. It also gives a speaker tracking and indexing system the ability to label speakers with cluster ID on the fly. We show that the new online speaker clustering method yields better performance compared to the traditional hierarchical speaker clustering. Evaluation metrics for speaker clustering are also discussed.
Daben Liu, Francis Kubala
BBN Technologies
10 Moulton Street, Cambridge, MA 02138
dliu@bbn.com, fkubala@bbn.com
1. INTRODUCTION
The goal of speaker clustering is to identify all speech segments
from the same speaker in an audio episode and assign them a
unique label. The true number of speakers involved in the
episode is usually unknown, making it a difficult problem. The
clustering output is commonly used in speaker adaptation for
speech recognition to collect sufficient data from the same
speaker [2][3][4][5]. The clustering information itself is very
useful in labeling and organizing speakers in speaker tracking or
audio indexing applications [1].
Existing clustering algorithms, such as the widely used
hierarchical clustering [2][3][4][5], typically require that all the
data be present before clustering can be performed. In these
algorithms, different numbers of clusters are hypothesized based
on local similarity or distance measures, e.g. building a
dendrogram. Then a global criterion is used to find the best
number of clusters. There are two drawbacks to this approach.
First, the process is non-causal: it cannot be used in an online
time-critical system or in a system that requires incremental cluster
information. Second, trying different numbers of clusters is
computationally expensive and time consuming. In dendrogram-based
clustering, for example, the computational complexity
increases exponentially with the number of audio segments.
In this paper, we propose an online speaker clustering algorithm
that does not require the presence of all the data; the clustering
decision is made as each audio segment arrives.
We start with a brief overview of a hierarchical speaker clustering
algorithm [2]. Then we introduce new algorithms for online
speaker clustering. We present further improvements to the
algorithms by studying the behavior of the decision criterion and
the distance measure. Finally, we present the evaluation metrics
and comparative results.
The input to an online speaker clustering application is usually
audio segments generated by automatic speaker segmentation,
which is outside the scope of this paper; interested readers are
referred to [1][8] for more details. All experimental results
reported here use reference segmentations.
2. HIERARCHICAL SPEAKER CLUSTERING
In 1996, we developed an automatic speaker clustering algorithm
to improve the performance of unsupervised speaker adaptation
[2]. There were other approaches [3][4][5] using different distance
measures and global criteria for determining the number of clusters,
all of which fall into the hierarchical speaker clustering
framework.
Consider a collection of segments S = {s_1, s_2, ..., s_n}, with each s_i
denoting an audio segment that is represented by a sequence of
feature vectors. A hierarchical speaker clustering (HC) algorithm
can be described as follows:
Algorithm 1 (Hierarchical Clustering)¹
1. begin initialize c ← n
2.   do
3.     find the nearest pair among the c clusters, say s_i and s_j
4.     merge s_i and s_j
5.     calculate the global criterion, save to G(c)
6.     c ← c − 1
7.   until c = 1
8.   return argmin_c G(c) and the corresponding c clusters
9. end
where c is the hypothesized number of clusters and G(c) is the
global criterion to be minimized. The hierarchical method has two
strengths: it finds the closest pairs by comparing distances among
all available segments, and it compares different clustering
hypotheses globally to pick the best. Neither of these advantages
is available in the online case.
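As a concrete illustration, Algorithm 1 can be sketched in Python on one-dimensional "segments". This is our own toy rendering, not the paper's implementation: distance between cluster means stands in for the GLR measure introduced below, the function names are ours, and g() is a one-dimensional version of the penalized within-cluster dispersion criterion.

```python
# Toy sketch of Algorithm 1 on 1-D "segments" (lists of floats).
# Distance between cluster means replaces the GLR; g() is the penalized
# within-cluster dispersion in one dimension. Illustrative only.

def var_of(x):
    mu = sum(x) / len(x)
    return sum((v - mu) ** 2 for v in x) / len(x)

def g(clusters):
    # penalized dispersion: (number of clusters) * sum_j N_j * var_j
    return len(clusters) * sum(len(x) * var_of(x) for x in clusters)

def hierarchical_cluster(segments):
    clusters = [list(s) for s in segments]
    best = (g(clusters), [list(c) for c in clusters])
    while len(clusters) > 1:
        # find and merge the nearest pair of clusters (by mean distance)
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: abs(
            sum(clusters[p[0]]) / len(clusters[p[0]])
            - sum(clusters[p[1]]) / len(clusters[p[1]])))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        cand = (g(clusters), [list(c) for c in clusters])
        if cand[0] < best[0]:
            best = cand
    return best[1]   # clustering that minimizes the global criterion
```

On toy segments drawn around two well-separated means, the minimum of the global criterion is reached at two clusters.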
To calculate the distance, we used the generalized likelihood ratio
(GLR). A Gaussian distribution N(μ_i, Σ_i) is estimated from
the sequence of feature vectors of each audio segment s_i. The
GLR between s_i and s_j can be expressed as follows:

    GLR(s_i, s_j) = L(s_c; μ_c, Σ_c) / [ L(s_i; μ_i, Σ_i) · L(s_j; μ_j, Σ_j) ]    (1)
where s_c is the union of s_i and s_j, and L(·) is the likelihood of the
data given the model.

¹The representation style for algorithms is adopted from Duda, et al. [7]
0-7803-8484-9/04/$20.00 ©2004 IEEE. ICASSP 2004.

For global cluster selection we used the within-cluster dispersion
penalized by the number of clusters, which is expressed as follows:
    G(c) = c · Σ_{j=1}^{c} N_j |Σ_j|    (2)
where c is the number of clusters, N_j is the number of feature
vectors in cluster j, and Σ_j is the covariance matrix of cluster j;
|·| denotes the determinant.
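Under the simplifying assumption of one-dimensional feature sequences, so that the determinant |Σ| reduces to a scalar variance, Eqs. (1) and (2) can be sketched as follows. The function names and the log-domain formulation are our own.

```python
import math

def fit_gauss(x):
    """ML mean/variance of a 1-D segment (variance floored so the
    likelihood stays finite on near-constant data)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return mu, max(var, 1e-6)

def loglik(x, mu, var):
    """Log-likelihood of samples x under N(mu, var)."""
    return sum(-0.5 * (math.log(2 * math.pi * var) + (v - mu) ** 2 / var)
               for v in x)

def log_glr(si, sj):
    """Log of Eq. (1): likelihood of the pooled segment s_c under one
    Gaussian, minus the likelihoods of s_i and s_j under their own models.
    Values near zero suggest one speaker; large negative values, two."""
    sc = list(si) + list(sj)
    return (loglik(sc, *fit_gauss(sc))
            - loglik(si, *fit_gauss(si))
            - loglik(sj, *fit_gauss(sj)))

def dispersion(clusters):
    """Eq. (2) in 1-D, where |Sigma_j| reduces to the variance:
    G(c) = c * sum_j N_j * var_j."""
    return len(clusters) * sum(len(x) * fit_gauss(x)[1] for x in clusters)
```

Since the pooled single-Gaussian fit can never beat the two separate ML fits in likelihood, log GLR is at most zero; systems therefore often use −log GLR as the distance, so that larger values mean more dissimilar segments.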
The HC algorithm worked well for unsupervised speaker
adaptation, improving the word error rate as much as hand-labeled
ideal clustering [2]. In this paper, we examine the usefulness of
the same GLR distance and dispersion criterion for online speaker
clustering.
3. ONLINE SPEAKER CLUSTERING
3.1. Leader-follower speaker clustering
Duda et al. [7] present a generic approach for k-means-style
clustering called leader-follower clustering (LFC). The basic idea
is to alter only the cluster centroid most similar to a new pattern,
nudging that cluster toward the pattern, or to create a new cluster
from the pattern if none of the existing clusters is similar. The
generic algorithm can be described as follows:
Algorithm 2 (LFC)
1. begin initialize η, θ
2.   w_1 ← s
3.   do accept new s
4.     j ← argmin_j' || s − w_j' ||
5.     if || s − w_j || < θ
6.       then w_j ← w_j + ηs   (update w_j)
7.       else add new w ← s   (create new cluster)
8.     w ← w / ||w||
9.   until no more patterns
10.  return w_1, w_2, ...
11. end
where w is the collection of clusters, s is the input pattern, θ is
the threshold, and η is the learning rate.
This generic algorithm can be easily adapted for online speaker
clustering. Treating s as an audio segment, substituting the ||·||
operation with the GLR measure, and updating w_j by re-estimating
a new Gaussian distribution, we can directly use the LFC algorithm
to perform online speaker clustering. The threshold θ is empirically
estimated from held-out data. The normalization in step 8 is
dropped as it is irrelevant in our case.
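A minimal sketch of this adapted LFC loop on one-dimensional toy segments: for brevity it measures distance between segment means instead of the GLR, and re-estimates the winning cluster as a running mean rather than with the learning-rate update. All names and the threshold value are illustrative.

```python
# Toy leader-follower clustering (Algorithm 2) for streaming segments.
# Euclidean distance between segment means stands in for the GLR;
# theta is assumed tuned on held-out data.

def segment_mean(seg):
    return sum(seg) / len(seg)

def lfc(segments, theta):
    """Assign each incoming segment to the nearest cluster centroid,
    or start a new cluster when no centroid is within theta."""
    centroids = []   # one running centroid per cluster
    counts = []
    labels = []
    for seg in segments:
        m = segment_mean(seg)
        if not centroids:
            centroids.append(m); counts.append(1); labels.append(0)
            continue
        j = min(range(len(centroids)), key=lambda k: abs(m - centroids[k]))
        if abs(m - centroids[j]) < theta:
            # running-mean re-estimation replaces the eta-weighted update
            counts[j] += 1
            centroids[j] += (m - centroids[j]) / counts[j]
            labels.append(j)
        else:
            centroids.append(m); counts.append(1)
            labels.append(len(centroids) - 1)
    return labels, centroids
```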
3.2. Dispersion-based speaker clustering
Data-dependent thresholds are not desirable because they reduce
the robustness of a system. Since we have a proven criterion for
model selection, the within-cluster dispersion G(c), we can
implement an online speaker clustering algorithm without the
need for a threshold. The proposed algorithm, which we call
dispersion-based speaker clustering (DSC), is as follows:
Algorithm 3 (DSC)
1. begin
2.   w_1 ← s
3.   do accept new s
4.     j ← argmin_j' GLR(s, w_j')
5.     G1 ← G(w, w_j ∪ {s})   "s is merged with w_j"
6.     G2 ← G(w, s)   "s is made a new cluster"
7.     if G1 < G2
8.       then update w_j with s
9.       else add new w ← s   (create new cluster)
10.  until no more patterns
11.  return w_1, w_2, ...
12. end
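A runnable sketch of Algorithm 3 on one-dimensional toy data. As before, the nearest cluster is found by distance between means rather than by GLR, and g() is the one-dimensional penalized dispersion; both simplifications and the function names are ours.

```python
# Toy dispersion-based speaker clustering (Algorithm 3, 1-D sketch).

def var_of(x):
    mu = sum(x) / len(x)
    return max(sum((v - mu) ** 2 for v in x) / len(x), 1e-6)

def g(clusters):
    # Eq. (2) in 1-D: c * sum_j N_j * var_j
    return len(clusters) * sum(len(x) * var_of(x) for x in clusters)

def dsc(segments):
    """Merge each new segment into its nearest cluster or open a new
    one, whichever yields the lower penalized dispersion."""
    clusters = []
    labels = []
    for seg in segments:
        if not clusters:
            clusters.append(list(seg)); labels.append(0)
            continue
        m = sum(seg) / len(seg)
        j = min(range(len(clusters)),
                key=lambda k: abs(m - sum(clusters[k]) / len(clusters[k])))
        merged = clusters[:j] + [clusters[j] + list(seg)] + clusters[j + 1:]
        g1 = g(merged)                   # s merged with w_j
        g2 = g(clusters + [list(seg)])   # s as a new cluster
        if g1 < g2:
            clusters = merged; labels.append(j)
        else:
            clusters.append(list(seg)); labels.append(len(clusters) - 1)
    return labels, clusters
```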
From experiments we observed that DSC had a tendency to
underestimate the number of clusters. To see how effective the
dispersion criterion would be, we conducted an oracle experiment
in which step 7 of Algorithm 3 was replaced by known conditions
from a reference set: when the input s truly belongs to an existing
cluster, s is merged into that cluster; otherwise a new cluster is
created from s. In every iteration, we still computed G1 and G2.
In Figure 1, we plot the difference (G2 − G1). A circle (○)
represents a merge situation in the reference, while a plus (+)
denotes a new cluster. The horizontal line is drawn at 0. Thus, if
we were to use (G2 − G1) as our criterion, any point below this
zero line would trigger creation of a new cluster and any point
above would trigger a merge. We can see that using dispersion
alone is not effective in separating the two kinds of events.
Figure 1. Dispersion difference plot for the oracle experiment.
(Audio segments are indexed in incoming order. Four audio files
were processed sequentially, visible in the figure as four
descending traces. The descent is due to the penalty on the number
of clusters: as the number of hypothesized clusters grows, the
difference (G2 − G1) decreases.)
Figure 2. Minimum GLR plot for oracle experiment
From the oracle experiment, we observed that when some of the
minimum GLR distances were very high, it was almost certain
that new clusters should be created, and vice versa for merging.
We plot the minimum GLR for every incoming audio segment in
Figure 2, again with merging and creation events labeled by
circles and pluses. It is clear that a single threshold, as in the LFC
case, is not very successful in dividing the two kinds of events.
However, decisions can be made with high confidence in the
upper and lower regions indicated in the figure.
3.3. Hybrid speaker clustering
The above observation inspired a hybrid algorithm (HSC) that
utilizes both the dispersion criterion and GLR thresholds. The
algorithm is as follows:
Algorithm 4 (HSC)
1. begin initialize θ_U, θ_L
2.   w_1 ← s
3.   do accept new s
4.     j ← argmin_j' GLR(s, w_j')
5.     D_j ← GLR(s, w_j)
6.     G1 ← G(w, w_j ∪ {s})   "s is merged with w_j"
7.     G2 ← G(w, s)   "s is made a new cluster"
8.     if D_j < θ_L
9.       then update w_j with s
10.    else if D_j > θ_U
11.      then add new w ← s   (create new cluster)
12.    else if G1 < G2
13.      then update w_j with s
14.    else add new w ← s   (create new cluster)
15.  until no more patterns
16.  return w_1, w_2, ...
17. end
where θ_U and θ_L are the upper and lower thresholds that define
the high-confidence regions. The re-introduction of thresholds is
somewhat undesirable. However, since they only act in the
high-confidence regions, they are less sensitive to the data and
thus more robust. The low-confidence region in Figure 2 is still
handled by dispersion analysis. Note that when θ_U = θ_L, HSC
becomes LFC. When θ_U → +∞ and θ_L → −∞, HSC becomes the DSC
algorithm.
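The HSC decision rule itself is small enough to isolate. The sketch below assumes the distance D_j and the dispersion values G1, G2 have already been computed; the function name and the threshold values in the example are illustrative.

```python
def hsc_decision(d_j, g1, g2, theta_l, theta_u):
    """HSC rule (Algorithm 4): trust the GLR distance in the
    high-confidence regions, fall back to dispersion in between.
    Returns True to merge with the nearest cluster, False to
    create a new one. Thresholds assumed tuned on held-out data."""
    if d_j < theta_l:
        return True        # clearly the same speaker
    if d_j > theta_u:
        return False       # clearly a new speaker
    return g1 < g2         # ambiguous region: let dispersion decide
```

Setting theta_l == theta_u makes the dispersion branch unreachable (LFC behavior), while theta_l → −∞ and theta_u → +∞ always fall through to the dispersion test (DSC behavior), mirroring the limits noted above.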
4. EXPERIMENT RESULTS AND EVALUATION
We used the four half-hour broadcast-news episodes, named file1,
file2, file3 and file4, of the NIST Hub4 1996 evaluation set as our
development data to optimize the thresholds. The evaluation data
was taken from the Hub4 1998 evaluation set, which consists of
two 1.5-hour broadcast-news episodes.
4.1. Evaluation metrics
We selected three commonly used methods: the misclassification
rate, cluster purity, and the Rand Index [6], which are described as
follows:
Consider N speakers that are clustered into C groups, where n_ij is
the number of segments from speaker j that are labeled with
cluster i. Assume that the n_ij segments contain f_ij speech frames.
Given a one-to-one speaker-to-cluster mapping, any segment
from speaker j that is not mapped is considered an error; the total
number of such errors for speaker j is denoted e_j. Then

    Misclassification rate = ( Σ_{j=1}^{N} e_j ) / ( Σ_{i=1}^{C} Σ_{j=1}^{N} n_ij )    (3)
We define the final mapping as the one that minimizes the
misclassification rate. Note that we calculate the error rate at the
segment level, where segments with different lengths are treated
equally. The insensitivity to length is desirable when speaker
clusters are the end-products of the application. For example, in a
speaker retrieval application, any incorrectly retrieved segment
should be considered as equally undesirable regardless of the
duration of the segment. Another advantage of this error measure
is that it provides a clear map of error distribution that is useful
for error analysis and debugging.
Cluster purity is calculated at the frame level. For each cluster i,
calculate the pure frames f_i by adding up the speech frames of the
majority speaker in cluster i. Then

    Cluster purity = ( Σ_{i=1}^{C} f_i ) / ( Σ_{i=1}^{C} Σ_{j=1}^{N} f_ij )    (4)
Cluster purity is of great interest when speaker clustering is used
in applications where what matters is the amount of correctly
classified data.
The Rand Index gives the probability that two randomly selected
segments are from the same speaker but hypothesized in different
clusters, or are in the same cluster but from different speakers.
Let n_i. be the number of segments in cluster i, and n_.j be the
number of segments from speaker j. Then

    Rand Index = [ (1/2) Σ_{i=1}^{C} n_i.² + (1/2) Σ_{j=1}^{N} n_.j² − Σ_{i=1}^{C} Σ_{j=1}^{N} n_ij² ] / ( n(n−1)/2 )    (5)

where n is the total number of segments.
Rand Index is a theoretical measure that has been widely used for
comparing partitions [6]. The lower the index, the higher the
agreement is between two partitions. However, it does not
provide any information on how the partitions are distributed and
how the two partitions are related.
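A segment-level sketch of the three metrics, assuming small N so the one-to-one mapping of Eq. (3) can be found exhaustively. Purity is computed over segments rather than frames here, and the Rand Index is computed by pair counting, which is algebraically equivalent to Eq. (5); all function names are ours.

```python
from itertools import permutations

def misclassification_rate(hyp, ref):
    """Eq. (3): fraction of segments whose speaker is not the one mapped
    to their hypothesized cluster, minimized over one-to-one mappings."""
    clusters = sorted(set(hyp))
    speakers = sorted(set(ref))
    k = max(len(clusters), len(speakers))   # pad so a mapping always exists
    best = len(ref)
    for perm in permutations(range(k), len(clusters)):
        errors = sum(1 for h, r in zip(hyp, ref)
                     if perm[clusters.index(h)] >= len(speakers)
                     or speakers[perm[clusters.index(h)]] != r)
        best = min(best, errors)
    return best / len(ref)

def cluster_purity(hyp, ref):
    """Eq. (4) over segments: majority-speaker fraction per cluster
    (the paper computes this over frames; segments stand in here)."""
    total_pure = 0
    for c in set(hyp):
        members = [r for h, r in zip(hyp, ref) if h == c]
        total_pure += max(members.count(s) for s in set(members))
    return total_pure / len(ref)

def rand_error(hyp, ref):
    """Eq. (5) by pair counting: probability that a random pair of
    segments is clustered inconsistently with the reference."""
    n = len(ref)
    disagree = sum(1 for i in range(n) for j in range(i + 1, n)
                   if (hyp[i] == hyp[j]) != (ref[i] == ref[j]))
    return disagree / (n * (n - 1) / 2)
```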
4.2. Experiment results
The thresholds for LFC and HSC were tuned on the Hub4 1996
test set by minimizing the overall misclassification errors; the
Hub4 1998 data set provides the fair (unseen) test. All results are
compared to the hierarchical clustering (HC) performance. In
Table 1, we compare the hypothesized number of clusters from
each algorithm to the true number of speakers in each episode.
| Episode | True # speakers | HC | LFC | DSC | HSC |
|---|---|---|---|---|---|
| file1 | 7 | 8 | 9 | 10 | 10 |
| file2 | 13 | 15 | 17 | 13 | 17 |
| file3 | 15 | 16 | 18 | 15 | 18 |
| file4 | 20 | 16 | 22 | 17 | 22 |
| eval98-1 | 79 | 45 | 69 | 39 | 67 |
| eval98-2 | 89 | 90 | 92 | 58 | 91 |

Table 1. Comparison of number of speakers and hypothesized clusters
All algorithms are able to hypothesize a reasonable number of
clusters on the Hub4 1996 set. On the fair test, LFC and HSC
perform better than HC and DSC; DSC significantly
underestimated the number of clusters on both Hub4 1998
episodes. Table 2 shows the clustering performance of the
algorithms under the three evaluation metrics described in
Section 4.1. In general, DSC performed the worst on all measures,
suggesting that the within-cluster dispersion measure alone may
not be a good choice for online speaker clustering. Both LFC and
HSC yielded comparable or better performance than the baseline
HC. Even though LFC was superior on file1, the advantage did
not hold up on other episodes. HSC, however, performed
consistently well on all data sets, suggesting that it is more robust
than LFC.
4.3. Run-time efficiency
The computational complexity for hierarchical speaker clustering
increases exponentially with number of audio segments, no matter
how many speakers there are in the data. However, the
complexity for proposed online speaker clustering algorithms is
linear to the number of input audio segments. We ran HC and
HSC on a series of test sets that contain from 80 to 1000
segments. The elapsed-time difference is shown in Figure 3. The
elapsed-time difference between online and offline clustering
starts to become significant at around 200 segments.
Figure 3. Speaker clustering elapsed-time vs. number of
segments
5. CONCLUSION AND FUTURE WORK
We have developed three algorithms for online speaker
clustering. All of them are able to automatically find a reasonable
number of clusters. LFC and HSC show promising results
compared to the existing offline speaker clustering, while running
much more efficiently.
Even though the DSC algorithm did not perform well in our
evaluation, the notion of not having a threshold is still very
attractive. Future efforts should focus on finding a more
appropriate criterion that works better in the online clustering
framework.
Intuitively, offline speaker clustering should work better, given
the complete data it has; it also has the opportunity to find the
global optimum. The test results suggest there is still room for
improvement in offline speaker clustering. A hybrid system could
be considered that uses online speaker clustering as the
foreground process to produce clusters on the fly, while offline
clustering works in the background to globally refine the
clustering results when there is sufficient data.
6. REFERENCES
1. Kubala, F., S. Colbath, D. Liu, A. Srivastava, J. Makhoul,
   "Integrated Technologies for Indexing Spoken Language,"
   Communications of the ACM, Vol. 43, No. 2, February 2000.
2. Jin, H., F. Kubala, R. Schwartz, "Automatic Speaker
   Clustering," Proceedings of the DARPA Speech Recognition
   Workshop, pp. 108-111, February 1997.
3. Siegler, M., et al., "Automatic Segmentation, Classification
   and Clustering of Broadcast News Audio," Proceedings of
   the DARPA Speech Recognition Workshop, pp. 97-99,
   February 1997.
4. Hain, T., et al., "Segment Generation and Clustering in the
   HTK Broadcast News Transcription System," Proceedings
   of the DARPA Broadcast News Transcription and
   Understanding Workshop, Lansdowne, VA, February 1998.
5. Chen, S., P. Gopalakrishnan, "Speaker, Environment, and
   Channel Change Detection and Clustering via the Bayesian
   Information Criterion," Proceedings of the DARPA Broadcast
   News Transcription and Understanding Workshop,
   Lansdowne, VA, February 1998.
6. Hubert, L., "Comparing Partitions," Journal of
   Classification, Vol. 2, pp. 193-218, 1985.
7. Duda, R., P. Hart, D. Stork, Pattern Classification, Second
   Edition, John Wiley & Sons, Inc., 2001.
8. Liu, D., F. Kubala, "Fast Speaker Change Detection for
   Broadcast News Transcription and Indexing," EUROSPEECH99,
   Budapest, Hungary, Vol. 3, pp. 1031-1034, September 5-9, 1999.
| Episode | # segments | Misclassification rate (%) | Cluster purity (%) | Rand Index (%) |
|---|---|---|---|---|
| file1 | 81 | 5 / 2 / 6 / 6 | 99.8 / 100.0 / 98.6 / 99.9 | 2.6 / 0.1 / 3.1 / 1.1 |
| file2 | 106 | 21 / 11 / 14 / 10 | 96.3 / 98.1 / 90.9 / 98.2 | 4.5 / 0.9 / 1.9 / 0.8 |
| file3 | 101 | 11 / 13 / 26 / 10 | 99.1 / 97.3 / 83.3 / 99.7 | 3.1 / 3.6 / 5.9 / 2.9 |
| file4 | 92 | 30 / 26 / 39 / 23 | 92.4 / 95.0 / 76.0 / 95.6 | 5.3 / 5.2 / 6.4 / 4.5 |
| eval98-1 | 399 | 24 / 27 / 31 / 28 | 89.7 / 94.4 / 78.6 / 93.1 | 1.5 / 1.4 / 1.6 / 1.2 |
| eval98-2 | 428 | 39 / 32 / 40 / 29 | 84.7 / 89.2 / 77.1 / 89.7 | 0.9 / 0.8 / 1.1 / 0.7 |

Table 2. Speaker clustering error analysis (each cell: HC / LFC / DSC / HSC)