© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current
or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective
works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
APSIPA ASC 2017
Blind Speaker Counting in Highly Reverberant
Environments by Clustering Coherence Features
Shahab Pasha, Jacob Donley and Christian Ritz
School of Electrical, Computer and Telecommunication Engineering, University of Wollongong, NSW, Australia
E-mail: sp900@uowmail.edu.au, jrd089@uowmail.edu.au and critz@uow.edu.au
Abstract— This paper proposes the use of the frequency-domain Magnitude Squared Coherence (MSC) between two ad-hoc recordings of speech as a reliable speaker discrimination feature for source counting applications in highly reverberant environments. The proposed source counting method does not require knowledge of the microphone spacing and does not assume any relative distance between the sources and the microphones. Source counting is based on clustering the frequency-domain MSC of the speech signals derived from short time segments. Experiments show that the frequency-domain MSC is speaker-dependent and the method was successfully used to obtain highly accurate source counting results for up to six active speakers for varying levels of reverberation and microphone spacing.
I. INTRODUCTION
The number of sound sources present in an acoustic scene is
crucial information for speech processing applications, such as
guided source separation [1], [2], source localisation [3] and
speech diarisation [4]. State-of-the-art approaches to source counting suffer from limiting assumptions and requirements, such as prior knowledge of the microphone array structure [4], [5], statistical modelling of the speech mixture [6] and full noise removal [7], which confine the target scenarios of these methods. Moreover, the performance of existing source counting methods is highly sensitive to noise and reverberation [7].
A limitation of the existing features used in the literature for source counting and discrimination is that they discriminate speakers based only on location cues and therefore cannot discriminate speakers with the same Direction of Arrival (DOA) but located at different distances [5], [8]. Hence, more effective features need to be introduced to obtain reliable source counting results for arbitrary scenarios.
The authors have previously applied the Coherent to Diffuse
Ratio (CDR) feature [9] and Magnitude Squared Coherence
(MSC) [10] for multi-talk detection in the context of ad-hoc
array ‘nodes’. Although accurate multi-talk detection results were obtained, the method proposed in [9] requires all ‘nodes’ to have the same structure, and the accuracy of the results depends heavily on an adaptively chosen threshold value for multi-talk detection (classification). In this contribution, a single two-channel (dual) microphone array with an unknown inter-channel distance, located arbitrarily within an acoustic scene, is used to discriminate speakers and estimate the number of sources.
While one approach is to derive the MSC in the time domain,
this paper proposes the use of the frequency domain MSC. This
is motivated by the wide use of frequency domain analysis of
human speech for speaker identification and verification [11],
[12]. It is known that every person has specific voice
characteristics which can be exploited for speaker
discrimination [13]. The idea is to utilise the inherent
differences in the articulatory organs (the structure of the vocal
tract, the size of the nasal cavity, and vocal cord characteristics)
for feature extraction and speaker discrimination [14].
This paper shows that different speakers have different frequency-domain MSC values when calculated using speech signals derived from dual microphone recordings. The method proposed in this paper discriminates speakers based on their unique voice characteristics as modelled by the speech signal. This provides an advantage over state-of-the-art speech
clustering and speaker discrimination methods [15], [16],
which are based on speaker location cues and cannot perform
properly if the speakers swap places or are located very close
to each other.
The remainder of the paper is organised as follows. In Section II, the mathematical model for the ad-hoc dual microphone recording is described. Section III describes the proposed clustering features. In Section IV, the proposed clustering method for source counting is explained. Sections V and VI are dedicated to the experimental evaluation and results, respectively, with the paper concluded in Section VII.
II. MATHEMATICAL MODEL FOR THE AD-HOC DUAL
MICROPHONE RECORDING
Assuming a dual-channel node [9], with omni-directional microphones and an unknown inter-channel spacing, $d$, is located within a reverberant and noisy environment, the signal recorded by such an array is modelled mathematically as

$$x_m(n) = \sum_{s=1}^{S} h_{m,s}(n) * q_s(n) + v_m(n), \qquad (1)$$

where $x_m(n)$, $q_s(n)$ and $h_{m,s}(n)$ are the real-valued recorded signals, source signals and the RIRs for multiple sources (for microphone $m \in \{1,2\}$), respectively, and an example is shown in Fig. 1. $n$ is the discrete time index, $S$ is the total number of sources (unknown), and $s \in \{1,\dots,S\}$ is the source index. $v_m(n)$ represents the diffuse noise and reverberation is modelled by $h_{m,s}(n) * q_s(n)$.

Another mathematical way to model the recorded speech in (1) is to separate the direct path (coherent) speech from the reverberation and noise as

$$x_m(n) = x_m^{\mathrm{c}}(n) + x_m^{\mathrm{r}}(n), \qquad (2)$$

where $x_m^{\mathrm{c}}(n)$ is the desired direct path speech signal whereas $x_m^{\mathrm{r}}(n)$ contains the diffuse noise and the reverberation.
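To make the model in (1) concrete, the following Python sketch (an illustrative reconstruction, not the authors' simulation code; the RIR inputs, signal lengths and noise scaling are assumptions) synthesises a dual-channel recording from a set of source signals and room impulse responses:

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_dual_channel(sources, rirs, snr_db, seed=0):
    """Mix S sources into two channels following (1):
    x_m(n) = sum_s h_{m,s}(n) * q_s(n) + v_m(n), for m in {1, 2}.

    sources: list of S 1-D source signals q_s(n).
    rirs:    list of S pairs (h_{1,s}, h_{2,s}) of room impulse responses.
    """
    rng = np.random.default_rng(seed)
    length = max(len(q) + max(len(h1), len(h2)) - 1
                 for q, (h1, h2) in zip(sources, rirs))
    x = np.zeros((2, length))
    for q, (h1, h2) in zip(sources, rirs):
        for m, h in enumerate((h1, h2)):
            y = fftconvolve(q, h)        # reverberant image h_{m,s}(n) * q_s(n)
            x[m, :len(y)] += y
    # Diffuse noise v_m(n), scaled to give the requested SNR.
    noise = rng.standard_normal(x.shape)
    noise *= np.sqrt(np.mean(x ** 2) / (10 ** (snr_db / 10))) \
             / np.sqrt(np.mean(noise ** 2))
    return x + noise
```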
III. MSC OF THE SPEECH SIGNALS

The frequency domain spectrum of the MSC calculated for a dual-channel recording (such as in (1)) shows peaks (values closer to 1) at frequencies corresponding to the coherent speech ($x_m^{\mathrm{c}}(n)$ from (2)). In contrast, values are closer to 0 at frequencies corresponding to diffuse noise and reverberation. This characteristic of the MSC in the frequency domain is used here to discriminate and count the speakers based on their location [17] and the vocal tract frequencies [12].
The MSC between the two channel signals $x_1(n)$ and $x_2(n)$ is found in the time-frequency domain as

$$C_{x_1 x_2}(l,\omega) = \frac{\left|\Phi_{x_1 x_2}(l,\omega)\right|^2}{\Phi_{x_1 x_1}(l,\omega)\,\Phi_{x_2 x_2}(l,\omega)}, \qquad (3)$$

where $\Phi_{x_1 x_2}(l,\omega)$ is the cross-power spectral density (CPSD), $l \in \{0,\dots,L-1\}$ is a segment index of $L$ total time segments for a given speech recording and $\omega \in \{0,\dots,\Omega-1\}$ is the frequency index of $\Omega$ total frequencies. Each time segment is divided into $A$ frames of $N$ samples and the CPSD is then calculated using an averaging process that follows Welch's method as follows:

$$\Phi_{x_1 x_2}(l,\omega) \triangleq \frac{1}{A} \sum_{a=0}^{A-1} \sum_{\tau=0}^{N-1} r^{\star}_{x_1 x_2}(a,\tau)\, e^{-j\frac{2\pi\omega\tau}{N}}, \qquad (4)$$

where $a \in \{0,\dots,A-1\}$ is a frame index and $j = \sqrt{-1}$ is the imaginary unit. The sample cross-correlation $r^{\star}_{x_1 x_2}$ of (4) is defined as

$$r^{\star}_{x_1 x_2}(a,\tau) \triangleq \sum_{n=0}^{N-1} x_1(a,n)\, x_2(a,n+\tau), \qquad (5)$$

where $\tau \in \{0,\dots,N-1\}$ is the displacement and the framed $x_m(n)$ is $x_m(a,n)$. The CPSD is a function of the active speaker(s), noise [18], reverberation level [19] and source to microphone distance [20].
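As an illustration of (3)-(5), the per-segment MSC can be computed with a standard Welch-style routine. The sketch below is a minimal example that assumes SciPy's coherence estimator as a stand-in for the CPSD averaging of (4) and (5), with the segment length, frame length and overlap taken from Table II:

```python
import numpy as np
from scipy.signal import coherence

def msc_per_segment(x, fs=16000, seg_len_s=2.0, n=256):
    """Frequency-domain MSC of (3), one spectrum per time segment.

    x: (2, num_samples) dual-channel recording from (1).
    Returns an (L, Omega) array: row l is the MSC of segment l.
    """
    seg = int(seg_len_s * fs)
    feats = []
    for l in range(x.shape[1] // seg):
        x1 = x[0, l * seg:(l + 1) * seg]
        x2 = x[1, l * seg:(l + 1) * seg]
        # Welch-averaged CPSD and PSDs over A frames of N samples
        # with 50% overlap, mirroring (4) and (5).
        _, msc = coherence(x1, x2, fs=fs, nperseg=n, noverlap=n // 2)
        feats.append(msc)
    return np.array(feats)
```

Each row of the returned array corresponds to one time segment's feature vector, as used for clustering in Section IV.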
IV. SOURCE COUNTING BY CLUSTERING OF THE MSC FREQUENCY FEATURES

It is observed that different speakers have different MSC frequency features [13], [17]. This observation is used for speaker discrimination and counting applications in this research. The frequency-domain transform of the MSC is represented in matrix form as

$$\mathbf{C}_{x_1 x_2} = \begin{bmatrix} C_{x_1 x_2}(0,0) & \cdots & C_{x_1 x_2}(L-1,0) \\ \vdots & \ddots & \vdots \\ C_{x_1 x_2}(0,\Omega-1) & \cdots & C_{x_1 x_2}(L-1,\Omega-1) \end{bmatrix}, \qquad (6)$$

which is useful for clustering applications.

Using $\mathbf{C}_{x_1 x_2}$ from (6), it is possible to discriminate and count the number of speakers during a meeting by clustering the extracted features. The proposed method is depicted in Fig. 2, where the extracted feature from every time segment is clustered with the other extracted features from the time segments spoken by any speaker.

The differences between the MSC of the speech signals of different speakers [17] and the similarity between the MSC of the speech signals of the same speaker are used to cluster the time segments spoken by the same speaker together. The number of formed clusters, $\hat{S}$, is the estimate of the number of sources, as described in Table I.
$\mathbf{C}_{x_1 x_2}$ is clustered into $K \in \{2,\dots,K_{\max}\}$ clusters (assuming that the maximum number of participants in the meeting is $K_{\max}$) using one minus the sample correlation, $d(\mathbf{c}_l, \hat{\mathbf{c}}_k)$, between each segment's feature vector and each cluster centroid as

$$d(\mathbf{c}_l, \hat{\mathbf{c}}_k) = 1 - \frac{(\mathbf{c}_l - \bar{c}_l \mathbf{1})\,(\hat{\mathbf{c}}_k - \bar{\hat{c}}_k \mathbf{1})^{\mathrm{T}}}{\sqrt{(\mathbf{c}_l - \bar{c}_l \mathbf{1})(\mathbf{c}_l - \bar{c}_l \mathbf{1})^{\mathrm{T}}}\,\sqrt{(\hat{\mathbf{c}}_k - \bar{\hat{c}}_k \mathbf{1})(\hat{\mathbf{c}}_k - \bar{\hat{c}}_k \mathbf{1})^{\mathrm{T}}}}, \qquad (7)$$

where $\bar{c}_l = \sum_{\omega} c_l(\omega)/\Omega$ and $\bar{\hat{c}}_k = \sum_{\omega} \hat{c}_k(\omega)/\Omega$ are the respective means, $\mathbf{1}$ is a length-$\Omega$ row vector of ones and $(\cdot)^{\mathrm{T}}$ is the transposition of the vector. Here, $\mathbf{c}_l$ is a transposed column of $\mathbf{C}_{x_1 x_2}$ and $\hat{\mathbf{c}}_k$ is a centroid (row vector) of the k-means clustering. The number of possibly different sets of centroid seeds, and the number of times the clustering is repeated, is $R$. The $K$ value that yields the best score with the clustering evaluation criterion is chosen as $\hat{S}$.

Fig. 1: Example ad-hoc dual microphone scenario around a rectangular table with $S = 5$. Black circles represent the speech sources.

Fig. 2: The proposed source counting system: compute $\mathbf{C}_{x_1 x_2}$, cluster the features, then evaluate the clustering.

TABLE I: THE PROPOSED SOURCE COUNTING METHOD
1) Start with the recorded mixture $x_m(n)$ from (1).
2) Obtain the speech signal for each time segment in the frequency domain.
3) Extract the MSC features for each time segment of the speech signal using (3) and obtain $\mathbf{C}_{x_1 x_2}$ from (6).
4) Cluster the extracted features, $\mathbf{C}_{x_1 x_2}$, from (6) into $K \in \{2,\dots,K_{\max}\}$ clusters.
5) Choose the optimal $K$ using a clustering evaluation metric, such as [22], as the number of clusters.
6) The optimal number of clusters, $\hat{S}$, is the estimate for the number of sources.
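Steps 4)-6) of Table I can be prototyped as follows. This is a sketch under stated assumptions rather than the authors' implementation: it substitutes Euclidean k-means on row-standardised feature vectors for the one-minus-correlation distance of (7) (for z-scored rows the two distances are monotonically related), and uses scikit-learn's calinski_harabasz_score as the clustering evaluation metric described in Section V:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def count_sources(feats, k_max=6, repeats=100, seed=0):
    """Estimate the number of speakers by clustering MSC spectra.

    feats: (L, Omega) matrix, one MSC spectrum per time segment.
    Returns the estimate of S (the cluster count with the best score).
    """
    # Row-wise standardisation: squared Euclidean distance between
    # z-scored rows is proportional to the correlation distance in (7).
    z = (feats - feats.mean(axis=1, keepdims=True)) \
        / feats.std(axis=1, keepdims=True)
    best_k, best_score = 2, -np.inf
    for k in range(2, k_max + 1):
        labels = KMeans(n_clusters=k, n_init=repeats,
                        random_state=seed).fit_predict(z)
        score = calinski_harabasz_score(z, labels)  # CH/VRC criterion [22]
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```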
V. RESULTS
In this section, random scenarios (in terms of the source and microphone locations) are simulated and the proposed source counting method is applied to investigate the effect of diffuse noise, reverberation level and the inter-channel distance ($d$) on the source counting performance.
A. Experimental Evaluation
Speech utterances from the TIMIT database [21] are used to simulate meeting scenarios with 2 to 6 participants for up to 6 minutes with no participants talking at the same time. Participants are seated randomly around a rectangular 1m by 3m table at the centre of the room. Two microphones are also located at random locations with an inter-channel distance of $d$. The Calinski-Harabasz (CH) criterion (also known as the Variance Ratio Criterion (VRC)) [22] is used for the cluster evaluation, where the maximum score indicates the optimal $K$. The experimental configurations and parameters are presented in Table II.
The Success Rate (SR) [9] is applied for the performance measurement. Assuming that $E$ is the number of scenarios in which the number of sources is estimated correctly (i.e. $\hat{S} = S$) and $T$ is the total number of test scenarios, the Success Rate (SR) evaluation measurement is defined as

$$\mathrm{SR} = \frac{E}{T} \times 100\%. \qquad (8)$$

An SR of 100% means that the number of speakers in all the experimental setups is counted correctly. The sampling frequency, $f_s$, the frame length, $N$, $K_{\max}$ and the frame overlap are fixed for all the experiments to the values in Table II.
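Under these definitions, (8) reduces to a simple counting loop over the simulated scenarios; a hypothetical evaluation harness (the scenario generator and counter are assumed to exist, e.g. the sketches above) might look like:

```python
def success_rate(scenarios, counter):
    """SR of (8): the percentage of test scenarios for which the
    estimated source count equals the true count.

    scenarios: iterable of (features, true_count) pairs.
    counter:   function mapping features to an estimated count.
    """
    results = [counter(feats) == s for feats, s in scenarios]
    return 100.0 * sum(results) / len(results)
```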
The performance of the proposed method is compared against a competing method which clusters the Time Difference of Arrival (TDOA) estimates from a Generalised Cross-Correlation with Phase Transform (GCC-PHAT) [23], [24]. The performance of TDOA-based methods fundamentally degrades with low SNR, small $d$ and/or when sources exist on mirroring sides of a linear array.
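For reference, the core of the GCC-PHAT baseline [23], [24] estimates a TDOA from the phase-transform-weighted cross-spectrum. A textbook sketch (not the specific baseline implementation, which additionally clusters the per-segment TDOA estimates) is:

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs=16000):
    """TDOA estimate (seconds) between two channels via GCC-PHAT."""
    nfft = 2 * max(len(x1), len(x2))
    X1 = np.fft.rfft(x1, nfft)
    X2 = np.fft.rfft(x2, nfft)
    cross = X1 * np.conj(X2)
    # PHAT weighting: keep phase only, discard magnitude.
    r = np.fft.irfft(cross / (np.abs(cross) + 1e-12), nfft)
    r = np.concatenate((r[-nfft // 2:], r[:nfft // 2]))  # centre zero lag
    lag = np.argmax(np.abs(r)) - nfft // 2
    return lag / fs
```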
B. Proposed Method Success Rates
Using the experimental configuration of Table II, Fig. 3
shows 312 data points computed from 3.12 million iterations.
The data points in Fig. 3 show that the proposed method results
in significantly higher success rates across most scenarios
compared to the TDOA method. All reverberant cases result in
an SR above 75% with correct source counts occurring more
than 80% of the time for 2, 3 and 4 sources. The proposed method results in an average SR of 86.7%, as opposed to the TDOA method with an average SR of 44.9%.
Reduction in SNR has the most impact on SR, where performance below an SNR of 20dB approaches that of the TDOA method and, theoretically, that of random guessing. However, the
proposed method still obtains an SR greater than 95% for 2
sources with an SNR of 5dB or higher and SR greater than 58%
for up to 6 sources with an SNR of 25dB or higher.
Fig. 3 shows that the proposed method is highly robust to variation in inter-channel distance up to 1m. Variation of the inter-channel distance, $d$, within this range does not affect the performance of the proposed method as the applied segment length of 2 seconds is larger than the corresponding TDOA for these microphones. The proposed method sustains an SR greater than 92% for all source counts less than or equal to 6. In contrast, the TDOA method obtains an average SR of 65.6%, compared to the proposed method which obtains an average SR of 97.6%.

Fig. 3: Results for the proposed MSC-based source counting method (dotted blue lines) compared against a TDOA-based GCC-PHAT method (dashed red lines). For the RT60 experiments, SNR = 40dB and $d$ = 0.1m; for the SNR experiments, RT60 = 400ms and $d$ = 0.1m; and for the inter-channel distance experiments, RT60 = 200ms and SNR = 40dB.

TABLE II: EXPERIMENTAL CONFIGURATION
Sampling Frequency ($f_s$): 16kHz
Frame length ($N$): 256
Segment length: 2 seconds
Frame Overlap: 50%
Signal-to-Noise Ratio (SNR): 0, 5, ..., 40dB
Reverberation time (RT60): 200, 300, ..., 800ms
Clustering algorithm: K-means
Inter-channel distance ($d$): 0.1, 0.2, ..., 1.0m
$K_{\max}$: 6
Room dimensions (x, y, z): (7m, 4m, 3m)
$R$ (centroid seed sets / clustering repeats): 100
$T$ (test scenarios per configuration): 100
VI. CONCLUSIONS
In this paper, a novel speaker counting method, based on the
MSC of short-time segment speech signals, is proposed and
evaluated. It is shown that the proposed MSC feature effectively discriminates the speakers attending an ad-hoc meeting scenario with no training or prior knowledge of the speakers' locations or the microphone locations. It is also concluded that the proposed source counting method is robust to reverberation and microphone spacing. An average source counting success rate of 83.1% is obtained under highly reverberant conditions (RT60 ≥ 500ms). Source code is available from https://doi.org/10.5281/zenodo.879279. Future work could focus on speaker movement tracking and speaker diarisation for moving sources where multi-talk may occur.
REFERENCES
[1] E. Vincent, N. Bertin, R. Gribonval and F. Bimbot, "From Blind to Guided Audio Source Separation: How models and side information can improve the separation of sound," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 107-115, May 2014.
[2] C. Rohlfing, J. M. Becker and M. Wien, "NMF-based informed source separation," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016.
[3] L. Wang, T. K. Hon, J. D. Reiss and A. Cavallaro, "An Iterative Approach to Source Counting and Localization Using Two Distant Microphones," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 6, pp. 1079-1093, June 2016.
[4] E. Zwyssig, S. Renals and M. Lincoln, "Determining the number of speakers in a meeting using microphone array features," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, 2012.
[5] D. Pavlidi, A. Griffin, M. Puigt and A. Mouchtaris, "Real-Time Multiple Sound Source Localization and Counting Using a Circular Microphone Array," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2193-2206, Oct. 2013.
[6] L. Drude, A. Chinaev, D. H. T. Vu and R. Haeb-Umbach, "Source counting in speech mixtures using a variational EM approach for complex Watson mixture models," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 2014.
[7] A. Bertrand and M. Moonen, "Energy-based multi-speaker voice activity detection with an ad hoc microphone array," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, 2010.
[8] Y. Shiiki and K. Suyama, "Omnidirectional sound source tracking based on sequential updating histogram," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Hong Kong, 2015.
[9] S. Pasha, J. Donley, C. Ritz and Y. X. Zou, "Towards real-time source counting by estimation of coherent-to-diffuse ratios from ad-hoc microphone array recordings," in Hands-free Speech Communication and Microphone Arrays (HSCMA), San Francisco, 2017.
[10] S. Pasha, C. Ritz and Y. X. Zou, "Detecting multiple, simultaneous talkers through localising speech recorded by ad-hoc microphone arrays," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju, 2016.
[11] M. Markaki and Y. Stylianou, "Evaluation of modulation frequency features for speaker verification and identification," in 17th European Signal Processing Conference, Glasgow, 2009.
[12] S. H. Chen, Y. R. Luo and R. C. Guido, "Speaker Verification Using Line Spectrum Frequency, Formant, and Support Vector Machine," in 11th IEEE International Symposium on Multimedia, San Diego, CA, 2009.
[13] W. N. Chan, N. Zheng and T. Lee, "Discrimination Power of Vocal Source and Vocal Tract Related Features for Speaker Segmentation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 6, pp. 1884-1892, Aug. 2007.
[14] R. J. Mammone, X. Zhang and R. P. Ramachandran, "Robust speaker recognition: a feature-based approach," IEEE Signal Processing Magazine, vol. 13, no. 5, p. 58, Sept. 1996.
[15] M. Souden, K. Kinoshita and T. Nakatani, "An integration of source location cues for speech clustering in distributed microphone arrays," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, 2013.
[16] M. Souden, K. Kinoshita, M. Delcroix and T. Nakatani, "Location Feature Integration for Clustering-Based Speech Separation in Distributed Microphone Arrays," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 354-367, 2014.
[17] A. Ferreira, "On the possibility of speaker discrimination using a glottal pulse phase-related feature," in International Symposium on Signal Processing and Information Technology (ISSPIT), Noida, 2014.
[18] Y. Ji, Y. Baek and Y. Park, "A priori SAP estimator based on the magnitude square coherence for dual-channel microphone system," in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, 2015.
[19] A. Schwarz and W. Kellermann, "Coherent-to-Diffuse Power Ratio Estimation for Dereverberation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 1006-1018, June 2015.
[20] S. Vesa, "Binaural Sound Source Distance Learning in Rooms," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 8, pp. 1498-1507, 2009.
[21] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus and D. S. Pallett, "TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web Download," Philadelphia: Linguistic Data Consortium, 1993.
[22] U. Maulik and S. Bandyopadhyay, "Performance evaluation of some clustering algorithms and validity indices," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, Dec. 2002.
[23] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.
[24] M. S. Brandstein and H. F. Silverman, "A robust method for speech signal time-delay estimation in reverberant rooms," in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Munich, Germany, 1997.