© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current
or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective
works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
APSIPA ASC 2017
Blind Speaker Counting in Highly Reverberant
Environments by Clustering Coherence Features
Shahab Pasha, Jacob Donley and Christian Ritz
School of Electrical, Computer and Telecommunication Engineering, University of Wollongong, NSW, Australia
E-mail: sp900@uowmail.edu.au, jrd089@uowmail.edu.au and critz@uow.edu.au
Abstract— This paper proposes the use of the frequency-
domain Magnitude Squared Coherence (MSC) between two ad-
hoc recordings of speech as a reliable speaker discrimination
feature for source counting applications in highly reverberant
environments. The proposed source counting method does not
require knowledge of the microphone spacing and does not
assume any relative distance between the sources and the
microphones. Source counting is based on clustering the
frequency domain MSC of the speech signals derived from short
time segments. Experiments show that the frequency domain
MSC is speaker-dependent and the method was successfully used
to obtain highly accurate source counting results for up to six
active speakers for varying levels of reverberation and
microphone spacing.
I. INTRODUCTION
The number of sound sources present in an acoustic scene is
crucial information for speech processing applications, such as
guided source separation [1], [2], source localisation [3] and speaker diarisation [4]. State-of-the-art approaches to source counting suffer from limiting assumptions and requirements, such as prior knowledge of the microphone array structure [4], [5], statistical modelling of the speech mixture [6] and full noise removal [7], which confine the target scenarios of these methods. Moreover, the performance of the existing source counting methods is highly sensitive to noise and reverberation [7].
A limitation of the existing features used in the literature for source counting and discrimination is that they discriminate speakers based only on location cues and therefore cannot discriminate speakers with the same Direction of Arrival (DOA) but located at different distances [5], [8]. Hence, more effective features need to be introduced to obtain robust source counting results for arbitrary scenarios.
The authors have previously applied the Coherent to Diffuse
Ratio (CDR) feature [9] and Magnitude Squared Coherence
(MSC) [10] for multi-talk detection in the context of ad-hoc
array ‘nodes’. Although accurate multi-talk detection results are obtained, the proposed method in [9] requires all the ‘nodes’ to be of the same structure, and the accuracy of the results depends highly on an adaptively chosen threshold value for multi-talk detection (classification). In this contribution, a single two-channel (dual) microphone array with an unknown inter-channel distance, located arbitrarily within an acoustic scene, is used to discriminate speakers and estimate the number of sources.
While one approach is to derive the MSC in the time domain,
this paper proposes the use of the frequency domain MSC. This
is motivated by the wide use of frequency domain analysis of
human speech for speaker identification and verification [11],
[12]. It is known that every person has specific voice
characteristics which can be exploited for speaker
discrimination [13]. The idea is to utilise the inherent
differences in the articulatory organs (the structure of the vocal
tract, the size of the nasal cavity, and vocal cord characteristics)
for feature extraction and speaker discrimination [14].
This paper shows that different speakers have different frequency-domain MSC values when calculated using speech signals derived from dual microphone recordings. The
proposed method in this paper discriminates the speakers based
on their unique voice characteristics as modelled by the speech
signal. This provides an advantage over state-of-the-art speech
clustering and speaker discrimination methods [15], [16],
which are based on speaker location cues and cannot perform
properly if the speakers swap places or are located very close
to each other.
The remainder of the paper is organised as follows. In Section II the mathematical model for the ad-hoc dual microphone recording is described. Section III describes the proposed clustering features. In Section IV the proposed clustering method for source counting is explained. Section V is dedicated to the experimental evaluation and results, with the paper concluded in Section VI.
II. MATHEMATICAL MODEL FOR THE AD-HOC DUAL
MICROPHONE RECORDING
Assuming a dual-channel node [9], with omni-directional microphones and an unknown inter-channel spacing, $d$, is located within a reverberant and noisy environment, the signal recorded by such an array is modelled mathematically as

$$x_i(n) = \sum_{s=1}^{S} q_s(n) * h_{i,s}(n) + v_i(n), \qquad (1)$$

where $x_i(n)$, $q_s(n)$ and $h_{i,s}(n)$ are the real-valued recorded signals, source signals and the RIRs for multiple sources (for microphone $i = 1, 2$), respectively; an example scenario is shown in Fig. 1. $n$ is the discrete time index, $S$ is the total number of sources (unknown), and $s = 1, \dots, S$ is the source index. $v_i(n)$ represents the diffuse noise, and reverberation is modelled by $h_{i,s}(n)$.
Another mathematical way to model the recorded speech in (1) is to separate the direct path (coherent) speech from the reverberation and noise as

$$x_i(n) = x_i^{(d)}(n) + x_i^{(r)}(n), \qquad (2)$$

where $x_i^{(d)}(n)$ is the desired direct path speech signal whereas $x_i^{(r)}(n)$ contains the diffuse noise and the reverberation.
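As an illustration of the model in (1), the following sketch simulates a two-channel ad-hoc recording with the pyroomacoustics package. This is not the paper's code; the source and microphone positions are hypothetical examples, while the room size, RT60, SNR and inter-channel spacing values follow Table II.

```python
# Minimal sketch (assumptions noted): simulating the two-channel model of (1).
import numpy as np
import pyroomacoustics as pra

fs = 16000                       # sampling frequency from Table II
rt60 = 0.4                       # reverberation time in seconds
room_dim = [7, 4, 3]             # room dimensions (x, y, z) from Table II

# Derive wall absorption and reflection order from the desired RT60 (Sabine).
e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs,
                   materials=pra.Material(e_absorption), max_order=max_order)

# Two speech sources q_s(n) (white-noise placeholders for real utterances)
# at hypothetical positions around the table.
rng = np.random.default_rng(0)
for pos in ([2.5, 1.5, 1.2], [4.5, 2.5, 1.2]):
    room.add_source(pos, signal=rng.standard_normal(fs * 2))

# Dual-channel node with inter-channel spacing d = 0.1 m (Table II range).
mic_positions = np.array([[3.0, 3.1],    # x coordinates
                          [2.0, 2.0],    # y coordinates
                          [1.0, 1.0]])   # z coordinates
room.add_microphone_array(pra.MicrophoneArray(mic_positions, fs))

room.simulate(snr=40)            # convolves RIRs h_{i,s}(n), adds noise v_i(n)
x1, x2 = room.mic_array.signals  # the two recorded channels x_1(n), x_2(n)
```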
III. MSC OF THE SPEECH SIGNALS

The frequency domain spectrum of the MSC calculated for a dual-channel recording (such as in (1)) shows peaks (values closer to 1) at frequencies corresponding to the coherent speech ($x_i^{(d)}(n)$ from (2)). In contrast, values are closer to 0 at frequencies corresponding to diffuse noise and reverberation. This characteristic of the MSC in the frequency domain is used here to discriminate and count the speakers based on their location [17] and the vocal tract frequencies [12].
The MSC between the two channel signals $x_1(n)$ and $x_2(n)$ is found in the time-frequency domain as

$$\Gamma_{x_1 x_2}(t, f) = \frac{\left|\Phi_{x_1 x_2}(t, f)\right|^2}{\Phi_{x_1 x_1}(t, f)\,\Phi_{x_2 x_2}(t, f)}, \qquad (3)$$

where $\Phi_{x_1 x_2}(t, f)$ is the cross-power spectral density (CPSD), $t \in \{0, \dots, T-1\}$ is a segment index of $T$ total time segments for a given speech recording and $f \in \{0, \dots, F-1\}$ is the frequency index of $F$ total frequencies. Each time segment is divided into $A$ frames of $N$ samples and the CPSD is then calculated using an averaging process that follows Welch's method as follows:

$$\Phi_{x_1 x_2}(t, f) = \frac{1}{A} \sum_{a=0}^{A-1} \sum_{l=0}^{N-1} r^{\star}_{x_1 x_2}(t, a, l)\, e^{-i 2\pi f l / N}, \qquad (4)$$

where $a \in \{0, \dots, A-1\}$ is a frame index and $i = \sqrt{-1}$ is the imaginary unit. The sample cross-correlation $r^{\star}_{x_1 x_2}$ of (4) is defined as

$$r^{\star}_{x_1 x_2}(t, a, l) = \sum_{n=0}^{N-1-l} x_{1,a}(n+l)\, x_{2,a}(n), \qquad (5)$$

where $l \in \{0, \dots, N-1\}$ is the displacement and $x_{i,a}(n)$ is the framed $x_i(n)$. The CPSD is a function of the active speaker(s), noise [18], reverberation level [19] and source-to-microphone distance [20].
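As a concrete sketch (not the authors' implementation), the Welch-averaged MSC of (3)-(5) can be computed per time segment with SciPy's `coherence` function, which implements exactly this estimate; the frame length and overlap follow Table II.

```python
# Sketch: per-segment frequency-domain MSC features, following (3)-(5).
import numpy as np
from scipy.signal import coherence

def msc_features(x1, x2, fs=16000, seg_len_s=2.0, n=256):
    """Return a (T x F) matrix whose row t is the MSC spectrum of time
    segment t, i.e. a transposed column of Gamma in (6)."""
    seg = int(seg_len_s * fs)            # 2 s segments (Table II)
    num_segments = min(len(x1), len(x2)) // seg
    rows = []
    for t in range(num_segments):
        sl = slice(t * seg, (t + 1) * seg)
        # Welch-averaged MSC over A frames of N samples with 50% overlap.
        _, msc = coherence(x1[sl], x2[sl], fs=fs, nperseg=n, noverlap=n // 2)
        rows.append(msc)
    return np.vstack(rows)
```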
IV. SOURCE COUNTING BY CLUSTERING OF THE MSC FREQUENCY FEATURES
It is observed that different speakers have different MSC
frequency features [13], [17]. This observation is used for
speaker discrimination and counting applications in this
research. The frequency-domain MSC is represented in matrix form as

$$\mathbf{\Gamma} = \begin{bmatrix} \Gamma_{x_1 x_2}(0,0) & \cdots & \Gamma_{x_1 x_2}(T-1,0) \\ \vdots & \ddots & \vdots \\ \Gamma_{x_1 x_2}(0,F-1) & \cdots & \Gamma_{x_1 x_2}(T-1,F-1) \end{bmatrix}, \qquad (6)$$

which is useful for clustering applications.
Using $\mathbf{\Gamma}$ from (6), it is possible to discriminate and count the number of speakers during a meeting by clustering the extracted features. The proposed method is depicted in Fig. 2, where the extracted feature from every time segment is clustered together with the extracted features from the other time segments, whichever speaker they belong to. The differences between the MSC of the speech signals of different speakers [17] and the similarity between the MSC of the speech signals of the same speaker are used to cluster together the time segments spoken by the same speaker. The number of formed clusters, $\hat{S}$, is the estimate of the number of sources, as described in Table I.
,
is clustered into 2,,

(assuming that the
maximum number of the participants in the meeting is

)
clusters using one minus the sample correlation, ,,
between each row (segment) and each cluster as
,1
̅


̅

̅


, (7)
where ̅∑
,
∑
,
is a length
row vector of ones and
T
is a transposition of the vector. Here,
is a transposed column of
,
and
is a centroid (row
vector) of the k-means clustering. The number of possibly
Compute
,
Clustering Clustering
evaluation
Fig. 2: The proposed source counting system
T
ABLE
I
T
HE PROPOSED SOURCE COUNTING METHOD
1) Start with the recorded mixture
from (1).
2) Obtain the speech signal for each time segment in the frequency
domain.
3) Extract the MSC features for each time segment of the speech
signal using (3) and obtain
,
from (6).
4) Cluster the extracted features,
,
, from (6) into
2,,

clusters.
5) Choose the optimal using a clustering evaluation metric, such as
[21], as the number of clusters.
6) The optimal number of the clusters,
, is the estimate for the
number of sources.
z
y
x
Room Walls
Fig. 1: Example ad-hoc dual microphone scenario around a rectangular table
with 5. Black circles represent the speech sources.
different sets of centroid seeds, and times to repeat clustering,
is
. The value that yields the best score with the clustering
evaluation criterion is chosen as
.
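A minimal sketch of steps 4)-6) of Table I is given below, assuming scikit-learn's k-means and Calinski-Harabasz score stand in for the paper's exact implementation. Mean-centring and unit-normalising each feature row makes Euclidean k-means approximate the correlation distance of (7), since one minus the correlation of two such vectors is proportional to their squared Euclidean distance.

```python
# Sketch of Table I steps 4-6: estimate the speaker count by clustering the
# MSC feature matrix and maximising the Calinski-Harabasz criterion [22].
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

def count_sources(Gamma, s_max=6, repeats=100, seed=0):
    """Gamma: (T x F) MSC features, one row per time segment."""
    # Row-wise centring/normalisation so Euclidean distance ~ 1 - correlation.
    G = Gamma - Gamma.mean(axis=1, keepdims=True)
    G /= np.linalg.norm(G, axis=1, keepdims=True) + 1e-12

    best_k, best_score = None, -np.inf
    for k in range(2, s_max + 1):          # K in {2, ..., S_max}
        labels = KMeans(n_clusters=k, n_init=repeats,
                        random_state=seed).fit_predict(G)
        score = calinski_harabasz_score(G, labels)   # CH/VRC criterion
        if score > best_score:
            best_k, best_score = k, score
    return best_k                           # the estimate S-hat
```

Chaining the two sketches, `count_sources(msc_features(x1, x2))` would return the estimated speaker count for a two-channel recording.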
V. RESULTS
In this section, random scenarios (in terms of the source and microphone locations) are simulated and the proposed source counting method is applied to investigate the effect of the diffuse noise, reverberation level and inter-channel distance ($d$) on the source counting performance.
A. Experimental Evaluation
Speech utterances from the TIMIT database [21] are used to simulate meeting scenarios with 2 to 6 participants for up to 6 minutes with no participants talking at the same time. Participants are seated randomly around a rectangular 1 m by 3 m table at the centre of the room. Two microphones are also located at random locations with an inter-channel distance of $d$. The Calinski-Harabasz (CH) criterion (also known as the Variance Ratio Criterion (VRC)) [22] is used for the cluster evaluation, where the maximum score indicates the optimal $K$. The experimental configurations and parameters are presented in Table II.
The Success Rate (SR) [9] is applied for the performance measurement. Assuming that $C$ is the number of scenarios in which the number of sources is estimated correctly (i.e. $\hat{S} = S$) and $Z$ is the total number of test scenarios, the Success Rate (SR) evaluation measurement is defined as

$$\mathrm{SR} = \frac{C}{Z} \times 100. \qquad (8)$$

An SR of 100% means that the number of speakers in all the experimental setups is counted correctly. The sampling frequency, $f_s$, $N$, $S_{\max}$ and the frame overlap are fixed for all the experiments to the values in Table II.
The performance of the proposed method is compared
against a competing method which clusters the Time
Difference of Arrival (TDOA) estimates from a Generalised
Cross-Correlation with Phase Transform (GCC-PHAT) [23],
[24]. The performance of TDOA-based methods fundamentally degrades with low SNR, small $d$ and/or when sources exist on mirroring sides of a linear array.
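For reference, the GCC-PHAT TDOA estimate at the core of this baseline [23], [24] is commonly implemented as in the following sketch (a generic implementation; the baseline's clustering of the TDOA estimates is not shown).

```python
# Sketch of a GCC-PHAT TDOA estimator, as used by the baseline [23], [24].
import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the time difference of arrival (seconds) between x1 and x2."""
    n = len(x1) + len(x2)                  # zero-pad to avoid circular wrap
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    # PHAT weighting: whiten the cross-spectrum, keep only the phase.
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n)
    max_shift = n // 2 if max_tau is None else min(int(fs * max_tau), n // 2)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs
```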
B. Proposed Method Success Rates
Using the experimental configuration of Table II, Fig. 3
shows 312 data points computed from 3.12 million iterations.
The data points in Fig. 3 show that the proposed method results
in significantly higher success rates across most scenarios
compared to the TDOA method. All reverberant cases result in
an SR above 75% with correct source counts occurring more
than 80% of the time for 2, 3 and 4 sources. The proposed
method results in an average SR of 86.7% as opposed to the TDOA method with an average SR of 44.9%.
Reduction in SNR has the most impact on SR where
performance below an SNR of 20dB approaches that of TDOA
and, theoretically, that of random guesses. However, the
proposed method still obtains an SR greater than 95% for 2
sources with an SNR of 5dB or higher and SR greater than 58%
for up to 6 sources with an SNR of 25dB or higher.
Fig. 3 shows that the proposed method is highly robust to variation in inter-channel distance up to 1 m. Variation of the inter-channel distance, $d$, within this range does not affect the performance of the proposed method as the applied segment length of 2 seconds is larger than the corresponding TDOA for these microphones. The proposed method sustains an SR greater than 92% for all source counts less than or equal to 6. In contrast, the TDOA method obtains an average SR of 65.6% compared to the proposed method which obtains an average SR of 97.6%.

Fig. 3: Results for the proposed MSC-based source counting method (dotted blue lines) compared against a TDOA-based GCC-PHAT method (dashed red lines). For RT60, SNR = 40 dB and $d$ = 0.1 m; for SNR, RT60 = 400 ms and $d$ = 0.1 m; and for inter-channel distance, RT60 = 200 ms and SNR = 40 dB.

TABLE II
EXPERIMENTAL CONFIGURATION
Sampling frequency ($f_s$): 16 kHz
$N$: 256
Segment length: 2 seconds
Frame overlap: 50%
Signal-to-Noise Ratio (SNR): 0, 5, ..., 40 dB
Reverberation time (RT60): 200, 300, ..., 800 ms
Clustering algorithm: k-means
$d$ (inter-channel distance): 0.1, 0.2, ..., 1.0 m
$S_{\max}$: 6
Room dimensions (x, y, z): (7 m, 4 m, 3 m)
$R$ (clustering repetitions): 100
$Z$ (test scenarios): 100
VI. CONCLUSIONS
In this paper, a novel speaker counting method, based on the
MSC of short-time segment speech signals, is proposed and
evaluated. It is shown that the proposed MSC feature effectively discriminates the speakers attending an ad-hoc meeting scenario with no training or prior knowledge of the speakers' locations or the microphone locations. It is also concluded that the proposed source counting method is robust to reverberation and microphone spacing. An average source counting success rate of 83.1% is obtained under highly reverberant conditions (RT60 ≥ 500 ms). Source code is
available from https://doi.org/10.5281/zenodo.879279. Future
work could focus on speaker movement tracking and speaker
diarisation for moving sources where multi-talk may occur.
REFERENCES
[1] E. Vincent, N. Bertin, R. Gribonval and F. Bimbot, “From Blind to Guided Audio Source Separation: How models and side information can improve the separation of sound,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 107-115, May 2014.
[2] C. Rohlfing, J. M. Becker and M. Wien, “NMF-based informed source separation,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016.
[3] L. Wang, T. K. Hon, J. D. Reiss and A. Cavallaro, “An Iterative Approach to Source Counting and Localization Using Two Distant Microphones,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 6, pp. 1079-1093, June 2016.
[4] E. Zwyssig, S. Renals and M. Lincoln, “Determining the number of speakers in a meeting using microphone array features,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, 2012.
[5] D. Pavlidi, A. Griffin, M. Puigt and A. Mouchtaris, “Real-Time Multiple Sound Source Localization and Counting Using a Circular Microphone Array,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2193-2206, Oct. 2013.
[6] L. Drude, A. Chinaev, D. H. T. Vu and R. Haeb-Umbach, “Source counting in speech mixtures using a variational EM approach for complex Watson mixture models,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 2014.
[7] A. Bertrand and M. Moonen, “Energy-based multi-speaker voice activity detection with an ad hoc microphone array,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, TX, 2010.
[8] Y. Shiiki and K. Suyama, “Omnidirectional sound source tracking based on sequential updating histogram,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Hong Kong, 2015.
[9] S. Pasha, J. Donley, C. Ritz and Y. X. Zou, “Towards real-time source counting by estimation of coherent-to-diffuse ratios from ad-hoc microphone array recordings,” in Hands-free Speech Communication and Microphone Arrays (HSCMA), San Francisco, 2017.
[10] S. Pasha, C. Ritz and Y. X. Zou, “Detecting multiple, simultaneous talkers through localising speech recorded by ad-hoc microphone arrays,” in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Jeju, 2016.
[11] M. Markaki and Y. Stylianou, “Evaluation of modulation frequency features for speaker verification and identification,” in 17th European Signal Processing Conference, Glasgow, 2009.
[12] S. H. Chen, Y. R. Luo and R. C. Guido, “Speaker Verification Using Line Spectrum Frequency, Formant, and Support Vector Machine,” in 11th IEEE International Symposium on Multimedia, San Diego, CA, 2009.
[13] W. N. Chan, N. Zheng and T. Lee, “Discrimination Power of Vocal Source and Vocal Tract Related Features for Speaker Segmentation,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 6, pp. 1884-1892, Aug. 2007.
[14] R. J. Mammone, X. Zhang and R. P. Ramachandran, “Robust speaker recognition: a feature-based approach,” IEEE Signal Processing Magazine, vol. 13, no. 5, p. 58, Sept. 1996.
[15] M. Souden, K. Kinoshita and T. Nakatani, “An integration of source location cues for speech clustering in distributed microphone arrays,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, 2013.
[16] M. Souden, K. Kinoshita, M. Delcroix and T. Nakatani, “Location Feature Integration for Clustering-Based Speech Separation in Distributed Microphone Arrays,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 354-367, 2014.
[17] A. Ferreira, “On the possibility of speaker discrimination using a glottal pulse phase-related feature,” in International Symposium on Signal Processing and Information Technology (ISSPIT), Noida, 2014.
[18] Y. Ji, Y. Baek and Y. Park, “A priori SAP estimator based on the magnitude square coherence for dual-channel microphone system,” in International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brisbane, 2015.
[19] A. Schwarz and W. Kellermann, “Coherent-to-Diffuse Power Ratio Estimation for Dereverberation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 6, pp. 1006-1018, June 2015.
[20] S. Vesa, “Binaural Sound Source Distance Learning in Rooms,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 8, pp. 1498-1507, 2009.
[21] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus and D. S. Pallett, “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1,” Philadelphia: Linguistic Data Consortium, 1993.
[22] U. Maulik and S. Bandyopadhyay, “Performance evaluation of some clustering algorithms and validity indices,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 12, pp. 1650-1654, Dec. 2002.
[23] C. H. Knapp and G. C. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320-327, 1976.
[24] M. S. Brandstein and H. F. Silverman, “A robust method for speech signal time-delay estimation in reverberant rooms,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Munich, Germany, 1997.
Conference Paper
In this contribution we derive a variational EM (VEM) algorithm for model selection in complex Watson mixture models, which have been recently proposed as a model of the distribution of normalized microphone array signals in the short-time Fourier transform domain. The VEM algorithm is applied to count the number of active sources in a speech mixture by iteratively estimating the mode vectors of the Watson distributions and suppressing the signals from the corresponding directions. A key theoretical contribution is the derivation of the MMSE estimate of a quadratic form involving the mode vector of the Watson distribution. The experimental results demonstrate the effectiveness of the source counting approach at moderately low SNR. It is further shown that the VEM algorithm is more robust with respect to used threshold values.