
Towards Real-Time Source Counting by Estimation of Coherent-to-Diffuse Ratios from Ad-Hoc Microphone Array Recordings



© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future
media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or
redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Shahab Pasha1, Jacob Donley1, Christian Ritz1 and Yue Xian Zou2
1 School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, NSW, Australia
2 ADSPLAB/ELIP, School of ECE, Peking University, Shenzhen, China
Coherent-to-diffuse ratio (CDR) estimates over short time frames
are utilised for source counting using ad-hoc microphone arrays to
record speech from multiple participants in scenarios such as a
meeting. It is shown that the CDR estimates obtained at ad-hoc dual
(two channel) microphone nodes, located at unknown locations
within an unknown reverberant room, can detect time frames with
more than one active source and are informative for source counting
applications. Results show that interfering sources can be detected
with accuracies ranging from 69% to 89% for delays ranging from
20 ms to 300 ms, with source counting accuracies ranging from 61%
to 81% for two sources and the same range of delays.
Index Terms: coherent-to-diffuse ratio, ad-hoc microphone
arrays, source counting, overlap detection
The diarisation of meeting recordings [1] and audio signal
classification [2] suffer significant errors when multiple participants
speak at the same time. When recording such meetings using ad-hoc
arrays of microphones, such as those formed from participants' mobile
devices, the microphone locations, as well as the number and
locations of sources, are typically unknown. Existing approaches
proposed to enhance the recording in the presence of interfering
sources with more than one microphone [2] suffer from limitations
such as: requiring the prior knowledge of the number of sources [2];
the derivation of a predefined threshold [3]; and training using a
suitable pre-recorded dataset [2].
Errors caused by speech overlaps and the baseline features for
overlap detection are discussed in [4, 5] and it is suggested that
conversational features such as speaker change statistics, can help
speaker diarisation methods over long-term segments with a length
of approximately 5 seconds [4]. Detecting regions of overlapping
speech can improve the accuracy of clustering-based diarisation
methods by 15% [6].
In a randomly distributed set of microphones that form an ad-
hoc array, diffuseness and the level of reverberation [7] can be used
as cues indicating the relative distance between sources and
microphones and subsequently used to count the number of
sources [4]. It has also been suggested that such cues can be used as
features for detecting interfering talkers [8] in a mobile speech
communication application.
The novel features investigated in [8] represent the ratio of
coherent to diffuse sources in a sound scene, referred to as the
Coherent-to-Diffuse Ratio (CDR), and are robust to additive noise.
This method extracts CDR features from short (typically 20 ms)
speech frames and does not require a training phase. The proposed
method is designed for dual-microphone systems. The advantage of
the CDR features, compared with other source localisation cues such
as signal power and time delays, is that they are independent of the
sources' energy levels and do not require time alignment of the
recorded signals. The authors have previously [9] utilised the
Magnitude Squared Coherence (MSC) feature for source
localisation which only takes into account the direct path component
and does not benefit from information in the ratio of the coherent
speech to the diffuse component. This contribution, compared to
the previous work, takes into account the coherent-to-diffuse ratio
information which can be estimated over short time frames and
applied as distance and interference cues.
This research aims to utilise the estimated CDR features [8] for
real-time interfering talker detection and source counting by ad-hoc
arrays of dual (two channel) microphone nodes located within a
certain distance from each participant. This contribution overcomes
the limitations of similar real-time methods, such as requiring
knowledge of the microphone array structure [10], and is suitable for
applications where participants meet using their personal mobile
devices as recording nodes. Similar to state-of-the-art source
counting methods, herein it is assumed that sources may overlap in
some time-frequency zones [10].
The remainder of the paper is organised as follows: Section 2
explains the recording process with dual microphone nodes in noisy
reverberant environments. In Section 3 the channels within each
node are jointly analysed and the multi-talk detection discriminative
features are derived from each node. Section 4 is dedicated to the
process of source counting by using the derived cues and the
effectiveness of the proposed method and features are examined in
Section 5. Conclusions are made in Section 6.
Assuming an ad-hoc microphone array consists of nodes of dual
(two-channel) omni-directional microphones with identical inter-channel
distances, $d$, each channel at each node receives a unique
reverberated version of the source signal due to its spatial location
and Room Impulse Response (RIR) as

$$x_{i,m}(t) = s(t) * h_{i,m}(t) + n_{i,m}(t), \quad (1)$$

where $x_{i,m}(t)$ are the signals recorded by the two channels,
$m \in \{1, 2\}$, at node $i \in \{1, \dots, I\}$, $s(t)$ is the clean, anechoic source
signal (assuming there is only one active source) and $h_{i,m}(t)$ are the
Room Impulse Responses (RIRs) at the $i$-th node location. $n_{i,m}(t)$
represents the diffuse noise and reverberation is modelled by
$h_{i,m}(t)$. The RIR between each node and the active source
is dependent on the source-node distances, room geometry and
characteristics such as the reverberation time. If there is more than one
simultaneously active source in the room (cross-talk, overlap) the
recorded signals can be represented as

$$x_{i,m}(t) = \sum_{c=1}^{C} s_c(t) * h_{i,m,c}(t) + n_{i,m}(t), \quad (2)$$

where $x_{i,m}(t)$, $s_c(t)$ and $h_{i,m,c}(t)$ are the recorded signals, clean
source signals and RIRs for multiple sources, respectively. $C$ is the
total number of simultaneously active sources at time $t$, and
$c \in \{1, \dots, C\}$ is the source index.
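As an illustration, the mixture model of (2) can be simulated by convolving each source with a per-channel RIR and adding a diffuse noise term. The sketch below is an assumption-laden stand-in, not the paper's setup: it uses white-noise "speech" sources and synthetic exponentially decaying RIRs.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000

def synthetic_rir(n_taps=2048, decay=0.005):
    """Hypothetical RIR: noise shaped by an exponential decay envelope."""
    t = np.arange(n_taps)
    return rng.standard_normal(n_taps) * np.exp(-decay * t)

C = 2  # number of simultaneously active sources
sources = [rng.standard_normal(fs) for _ in range(C)]  # 1 s each

# One dual-channel node: h[c][m] is the RIR from source c to channel m
h = [[synthetic_rir(), synthetic_rir()] for _ in range(C)]

x = np.zeros((2, fs))
for c in range(C):
    for m in range(2):
        x[m] += np.convolve(sources[c], h[c][m])[:fs]
x += 0.01 * rng.standard_normal(x.shape)  # diffuse noise term n_m(t)

print(x.shape)  # (2, 16000): two reverberant channel signals
```

Each channel receives a different reverberated mixture, which is what makes the inter-channel coherence informative.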
Figure 1 depicts an example scenario recording of two talkers
(sources) using an ad-hoc microphone array in a reverberant room.
The two source signals are labelled $s_1(t)$ and $s_2(t)$, with the three
node signals labelled according to (2) and all corresponding RIRs
labelled $h_{i,c}(t)$ for $i \in \{1, 2, 3\}$ and $c \in \{1, 2\}$.
It is shown that the coherence between the two channel
signals, $x_{i,1}(t)$ and $x_{i,2}(t)$, at each node is a function of source-
node distance, frequency, noise, interference and reverberation level
[11, 12, 13]. It is also shown that there is no need to calculate this
measure using the full-length signals; it can be accurately
estimated from 20 ms frames of the noisy speech signals [8].
In this research, coherence features estimated over short time
frames, averaged across the frequency band, are applied as
distance and interference features to discriminate the microphone
nodes located close to an active source (inspired by [14], where MSC
is applied to localise one source).
The coherence between the two noisy and reverberant channel
signals at node $i$ is higher when the active source is closer to the
node and is lower when the active source is located far from the
node [8, 10-12]. For instance, in Fig. 1, Node 1 is dominated by
Source 2 and Node 2 is dominated by Source 1. Hence these two
nodes have higher coherence features even in cross talk situations
whereas Node 3 is not close to any source and in the case of cross-
talk it receives a mixture of Source 1 and Source 2 signals equally,
which has a low Signal-to-Interference Ratio (SIR) and coherence
feature [11, 12]. The MSC is defined as

$$\Gamma_{x_1 x_2}(f) = \frac{|\Phi_{x_1 x_2}(f)|^2}{\Phi_{x_1 x_1}(f)\, \Phi_{x_2 x_2}(f)}, \quad (3)$$

where $f$ is frequency and $\Phi_{xy}(f)$ represents the cross-power
spectral density function of some $x$ and $y$.
The spatial locations of all sources and nodes and, therefore, the Directions
of Arrival (DOAs) are unknown. In this work we make use of the
fact that the MSC varies depending on the different number of
spatial locations of the sources and nodes.
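The Welch-averaged MSC of (3) is available directly in SciPy; the following sketch (not from the paper, with arbitrary test signals) illustrates the property exploited here: a coherent pair of channels yields a much higher MSC than an incoherent, diffuse-like pair.

```python
import numpy as np
from scipy.signal import coherence

rng = np.random.default_rng(1)
fs = 16000
s = rng.standard_normal(fs)

# Coherent pair: the same source on both channels plus weak sensor noise
x1 = s + 0.1 * rng.standard_normal(fs)
x2 = s + 0.1 * rng.standard_normal(fs)
# Incoherent pair: independent signals (crude stand-in for a diffuse field)
y1 = rng.standard_normal(fs)
y2 = rng.standard_normal(fs)

f, msc_coh = coherence(x1, x2, fs=fs, nperseg=512)  # MSC per frequency bin
_, msc_dif = coherence(y1, y2, fs=fs, nperseg=512)

print(msc_coh.mean() > msc_dif.mean())  # True: coherent pair has higher MSC
```

`scipy.signal.coherence` returns values in [0, 1] per frequency bin, matching the band-averaging used by the proposed features.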
The long-term noisy reverberation, modelled by $n_{i,m}(t)$ and the
late part of $h_{i,m}(t)$, is diffuse: it has no specific angle of arrival
and arrives at each node from all directions, under the assumption
that reverberant sound can be modelled as a mixture of a direct
component and a perfectly diffuse reverberation component which
are mutually uncorrelated. The only coherent component in (2) is
the direct path signal from the dominant source to the node, which
can be modelled mathematically as

$$x_{i,m}^{\mathrm{coh}}(t) = a_{i,m,c}\, s_c(t - \tau_{i,m,c}), \quad (4)$$

where $\tau_{i,m,c}$ is the time-delay between source $c$, node $i$ and
channel $m$. However, the time delays and the direct path signal
amplitude $a_{i,m,c}$ are not required for CDR estimation [8] and are only
introduced to distinguish the coherent signal from the diffuse noise
and reverberation. The CDR is based on the time-frequency domain
representation of the MSC of (3), which can be written as

$$\Gamma_{x}(l, f) = \frac{|\Phi_{x_1 x_2}(l, f)|^2}{\Phi_{x_1 x_1}(l, f)\, \Phi_{x_2 x_2}(l, f)}, \quad (5)$$

where $l$ is the discrete-time frame index. Just as $\Gamma_{x}(l, f)$ is
obtained from $x_{i,m}(t)$, analogously we obtain the time-invariant
noise coherence, $\Gamma_{n}(f)$, from $n_{i,m}(t)$, measured over the first few
hundred milliseconds of the recording, and the dominant source
coherence, $\Gamma_{s}(f)$, from the direct-path signal of (4). The CDR is [15]

$$CDR(l, f) = \frac{\Gamma_{n}(f) - \Gamma_{x}(l, f)}{\Gamma_{x}(l, f) - \Gamma_{s}(f)}, \quad (6)$$

from which we propose the use of the average CDR over the entire
frequency band and frames, given by

$$\overline{CDR} = \frac{1}{L B} \sum_{l=1}^{L} \int_{f_{\min}}^{f_{\max}} CDR(l, f)\, \mathrm{d}f, \quad (7)$$

where $B = f_{\max} - f_{\min}$ is the band-width of the signal, $f_{\min}$ is the lowest
frequency, $f_{\max}$ is the highest frequency and $L$ is the number of frames [8, 15].
Equation (7) gives an estimate of the CDR since the dominant
direct speech can be considered as the coherent signal whereas the
diffuse noise and the reverberant-interfering speech form the diffuse,
or non-coherent, component. This fact is utilised in this research to
distinguish the nodes with higher CDR values (more likely located
close to an active source e.g. Node 1 and Node 2 in Fig. 1) from
nodes with lower CDR values (likely located far from active sources
e.g. Node 3 in Fig. 1).
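A minimal sketch of the coherence-based CDR estimate follows, assuming the mixture coherence obeys the mixing relation between a fully coherent source (coherence 1) and the noise coherence; the per-frequency values below are hypothetical, and real estimators such as [15] work with the complex coherence rather than the MSC alone.

```python
import numpy as np

def cdr_estimate(gamma_x, gamma_n, gamma_s=1.0):
    """Solve the mixing relation
    Gamma_x = (CDR * Gamma_s + Gamma_n) / (CDR + 1) for CDR,
    clipping negative estimates to zero. gamma_* may be arrays over
    frequency; gamma_s = 1 assumes a fully coherent direct path."""
    cdr = (gamma_n - gamma_x) / (gamma_x - gamma_s)
    return np.maximum(cdr, 0.0)

# Hypothetical per-frequency mixture coherences: a near-coherent frame
# (node close to the active talker) versus a near-diffuse frame.
gamma_close = np.array([0.95, 0.90, 0.92])
gamma_far = np.array([0.30, 0.25, 0.35])
gamma_noise = np.array([0.05, 0.05, 0.05])

cdr_close = cdr_estimate(gamma_close, gamma_noise).mean()
cdr_far = cdr_estimate(gamma_far, gamma_noise).mean()
print(cdr_close > cdr_far)  # True: the closer node has the higher CDR
```

This reproduces the behaviour the method relies on: the band-averaged CDR is large at nodes dominated by a nearby talker and small at nodes receiving a diffuse mixture.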
In order to investigate the effect of interference on the CDR
feature in a scenario with one dual-channel node, located at an equal
distance (2 m) from four participants in a meeting, the CDR values
are estimated over 20 ms speech frames. The CDR values are
averaged across the frequency band with (7) when $C$, from (2),
varies from one to four, which respectively means one, two, three or
all four participants are talking simultaneously. The effect of
Fig. 1. Ad-hoc nodes recording two talkers (Sources 1 and 2; Nodes 1-3 within the room walls).
interference on estimated CDR is shown in Fig. 2. It is concluded
that the estimated CDR drops with the number of interfering sources
when there is no dominant source (the node is not located close to
any specific speaker). As opposed to similar state-of-the-art
systems [16], the target scenario is a meeting room where all the
participants are present in the same room.
The same experiment (same speech frames and speaker
locations) is repeated, this time with four nodes collocated (at less
than 30 cm distance) with the participants, and it is concluded that the
nodes close to the simultaneously active sources (i.e. Node 1 and
Node 3 in Fig. 3) have higher CDR estimates than nodes
close to silent participants during the time frame. The inverse
relationship between the CDR and the active source-node distance
has been previously shown [8]. A source counting method is
proposed in Section 4 based on the relationship, seen in Fig. 2 and
Fig. 3, between the CDR, the number of active sources ($C$) and the
node locations, where increased interference reduces the CDR.
The target scenario for the active source counting method is a
spontaneous meeting where each participant is located close (less
than 30 cm) to a recording device (only one dual node) and the
distance between two adjacent nodes is not less than one meter.
Under these assumptions, nodes with higher CDR values are more
likely to be located close to an active source, hence it is possible to
count the nodes with higher CDRs in order to find out the number
of simultaneously active sources. The proposed algorithm is
summarised in Table I. The CDR values are estimated for all the
nodes as

$$\mathcal{D} = \{\overline{CDR}_1, \overline{CDR}_2, \dots, \overline{CDR}_I\}, \quad (8)$$

and we define a new set whose cardinality $\hat{C}$ (denoted by $|\cdot|$) is an
estimate for the number of simultaneously active sources, $C$, as

$$\hat{C} = \left|\left\{\overline{CDR}_i \in \mathcal{D} : \overline{CDR}_i > \overline{CDR}_{\min} + \alpha\left(\overline{CDR}_{\max} - \overline{CDR}_{\min}\right)\right\}\right|, \quad (9)$$

where $\overline{CDR}_{\max} = \max_i \overline{CDR}_i$, $\overline{CDR}_{\min} = \min_i \overline{CDR}_i$ and $\alpha$ is a
parameter to set the threshold of maintained CDR values.
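The counting rule referenced as (9) can be sketched as follows. This is an illustrative reconstruction under the stated assumption that nodes are kept when their average CDR exceeds a threshold placed a fraction $\alpha$ of the way between the minimum and maximum CDR across nodes; the exact threshold form in the source text is partly illegible, so treat the details as assumptions.

```python
import numpy as np

def count_sources(node_cdrs_db, alpha=0.5):
    """Estimate the number of active sources as the number of nodes
    whose band-averaged CDR exceeds an adaptive threshold between the
    minimum and maximum CDR across nodes (alpha is a tuning parameter)."""
    cdrs = np.asarray(node_cdrs_db, dtype=float)
    thr = cdrs.min() + alpha * (cdrs.max() - cdrs.min())
    return int(np.sum(cdrs > thr))

# Hypothetical per-node average CDRs (dB): nodes 1 and 3 sit next to
# the two active talkers, nodes 2 and 4 are next to silent participants.
print(count_sources([9.0, 2.0, 8.5, 1.5]))  # 2
```

The estimate is implicitly limited by the number of nodes, consistent with the one-device-per-participant scenario.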
The number of simultaneously active sources in this research is
estimated by the joint analysis of an ad-hoc set of dual microphone
nodes (similar to [17]) (Fig. 1) of the same structure over short time
frames. CDR, as a ratio of the coherent source signal to the diffuse
reverberated signals, is independent of the sources' energy levels and
can be applied when louder and quieter sources are simultaneously
active; however, it is assumed that all the nodes are of the same
structure.
This section evaluates the performance for the two target
applications: detecting multi-talk (time segments when there is more
than one active source) and source counting (counting the total
number of simultaneously active sources). CDR values at each node
location are calculated over short time intervals of 20 ms, which
corresponds to 320 samples at a 16 kHz sampling frequency and
averaged across all frames in a time segment and all the frequencies.
This is the typical time duration for which a speech segment is
assumed to be stationary. However, better performance can be
obtained when a larger value is chosen for the frame length [18] or
the averaged CDR value across adjacent frames is applied as the
discriminative feature (7).
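The frame segmentation stated in the text (20 ms frames, i.e. 320 samples at 16 kHz) can be sketched as below; the helper name is illustrative.

```python
import numpy as np

fs = 16000
frame_ms = 20
frame_len = fs * frame_ms // 1000  # 320 samples, as stated in the text

def frames(x, frame_len):
    """Split a signal into non-overlapping short-time frames
    (any trailing partial frame is dropped)."""
    n = len(x) // frame_len
    return x[:n * frame_len].reshape(n, frame_len)

x = np.zeros(fs)        # 1 s placeholder signal
F = frames(x, frame_len)
print(F.shape)          # (50, 320): fifty 20 ms frames
```

Averaging the CDR over several such frames corresponds to increasing $L$ in (7).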
Ten speech sentences from the NOIZEUS database [19] are
used to generate a test database of noisy mixtures of speech signals
with an arbitrary number of simultaneously active sources, per (2). Time
frames of length 20 ms are randomly chosen from a simulated
meeting of four participants ($I = 4$), where $C \in \{1, \dots, 4\}$,
randomly located around a round table of radius 2 m in a room of
10 m × 8 m × 3 m. Each participant has one dual-channel recording
device located within a distance of 30 cm.
One hundred time segments, each consisting of one or more frames
(i.e. $L \geq 1$), are used in order to investigate the effect
of $L$ on multi-talk detection (Section 5.1) and source counting
Fig. 2. Effect of interference on CDR estimates (estimated CDR in dB versus the number of sources, 1 to 4).
Fig. 3. CDR values (dB) at each node (Nodes 1 to 4) when two sources are active.
(Section 5.2). Table II summarises the experimental settings where
informal experiments found that the chosen value of $\alpha$ gave on
average the most reliable results.
5.1. Multi-talk detection
The algorithm proposed in Table I is applied to the time frames of
the created database, and the ground truth (the number of speech sources,
$C$, applied to create the mixture, from (2)) is
compared with the output of the proposed method. The True Positive
Rate (TPR) for cross-talk detection, without focusing on the number
of simultaneously active sources, is defined as

$$TPR = \frac{N_{TP}}{N_{TP} + N_{FN}} \times 100\%, \quad (10)$$

where $N_{TP}$ is the number of segments with more than one active
source correctly labelled as cross-talk, $N_{FP}$ and $N_{FN}$ are incorrectly
labelled segments and $N_{TN}$ is the number of single-talk segments labelled
correctly as single-talk segments. $N_{TP} + N_{FP} + N_{FN} + N_{TN}$ equals the
number of segments in the test set (i.e. 100). The effect of $L$ on
multi-talk detection is investigated in Fig. 4, which shows that the TPR
changes rapidly at smaller numbers of frames and then
saturates. Future work could investigate the impact of intermediate
values of $L$.
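The TPR defined above can be computed from segment-level labels as in this sketch; the helper name and the example labels are hypothetical.

```python
def true_positive_rate(truth_counts, pred_multitalk):
    """A segment is 'multi-talk' when its true source count exceeds one;
    TPR is the percentage of multi-talk segments flagged as such.
    Inputs are parallel lists: true counts and boolean detections."""
    tp = sum(1 for c, p in zip(truth_counts, pred_multitalk) if c > 1 and p)
    fn = sum(1 for c, p in zip(truth_counts, pred_multitalk) if c > 1 and not p)
    return 100.0 * tp / (tp + fn)

# 4 multi-talk segments in the ground truth, 3 of them detected
print(true_positive_rate([2, 3, 1, 2, 1, 2],
                         [True, True, False, True, False, False]))  # 75.0
```
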
5.2. Source counting evaluation
After the detection of multi-talk segments, the proposed method in
Table I is applied to determine the number of simultaneously active
sources by analysing and comparing the CDR values at all the nodes.
A more detailed source counting evaluation is presented in
Table III where the source counting Success Rate (SR) for 100 test
segments with one to three simultaneously active sources ($C = 1$ to
$3$) is calculated. The SR for source counting is defined as

$$SR = \frac{N_c}{N_t} \times 100\%, \quad (11)$$

where $N_c$ is the number of length-$L$ segments with $C$ active
sources correctly labelled as having $C$ active sources and $N_t$ is the
total number of length-$L$ segments. It is worth noting that the number
of simultaneously active sources counted using (9) is limited to the
number of nodes, $I$, by counting the top CDR values. It is
suggested that a single threshold value is used to replace the
adaptive threshold from (9), which can be
estimated based on the setup. It is concluded that the proposed
method is competitive with similar methods [4, 5] without requiring
statistical modelling of the speech and much longer time segments.
It is concluded that interfering talker(s) detection without counting
the number of simultaneously active sources (Fig. 4) is slightly more
successful than source counting (Table III) and the highest overlap
detection is achieved when $L = 15$, which translates to 300 ms at a
16 kHz sampling frequency.
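The per-class success rate used in the source counting evaluation can be sketched as follows; the helper name and example labels are hypothetical, and the per-class bookkeeping follows the text's definition (correct estimates per true count, over the total per class).

```python
from collections import Counter

def success_rate(true_counts, est_counts):
    """Per-class source counting success rate: for each true count C,
    the percentage of segments whose estimated count equals C."""
    n_c, n_t = Counter(), Counter()
    for c, c_hat in zip(true_counts, est_counts):
        n_t[c] += 1
        if c == c_hat:
            n_c[c] += 1
    return {c: 100.0 * n_c[c] / n_t[c] for c in n_t}

print(success_rate([1, 1, 2, 2, 3],
                   [1, 2, 2, 2, 3]))  # {1: 50.0, 2: 100.0, 3: 100.0}
```
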
The CDR estimation method of [15] is applied here for all the
experiments as it does not require the coherent signal direction of
arrival and it is shown that Direction of Arrival (DOA) based
methods do not yield successful source counting results (48.6%
accuracy) [5]. Compared with the previous work [9] the source
counting success rate for multi-talk time segments is increased up to
30% but the multi-talk detection success rate does not show any
improvement when CDR is applied. When RIRs are available,
200 ms (i.e. $L = 10$) length recordings of the RIRs can yield a
slightly higher multi-talk detection rate by using the clarity feature,
as shown in [9].
This paper proposed a new indicator for cross talk detection in multi-
party meeting scenarios based on real-time estimated CDR cues. By
estimating CDR features over short time frames and averaging the
values over frames in a time segment and all the frequencies, it is
possible to detect interfering sources recorded by an ad-hoc array of
an unknown geometry. The proposed feature is also applied for
source counting during the multi-talk frames. It is shown that
increasing the time frame lengths can improve the multi-talk
detection and source counting results. The proposed method in this
paper is applicable to real time scenarios without offline training or
the knowledge of the sources' DOAs, and yields 80% correct cross-
talk detection and an average of 75% success in source counting, which
is competitive to offline methods. Future work could focus on the
optimisation of the segment length and investigation of the effect of
the participant and node distances on multi-talk detection and source
counting.
Fig. 4. Interfering talker(s) detection as measured by the True
Positive Rate (TPR) versus the number of frames, $L$, used to obtain
average CDR values.
Table II. Experimental settings: sampling frequency 16 kHz; frame length 20 ms; FFT length 512 (zero-padded); inter-channel distance 15 cm; SNR 10 dB; reverberation time 400 ms.
[1] S. H. Yella and H. Bourlard, Improved overlap speech
diarization of meeting recordings using long-term
conversational features, in Int. Conf. on Acoust., Speech and
Signal Process. (ICASSP), IEEE, 2013, pp. 7746-7750.
[2] S. Gergen, A. Nagathil and R. Martin, Audio signal
classification in reverberant environments based on fuzzy-
clustered ad-hoc microphone arrays, in Int. Conf. on Acoust.,
Speech and Signal Process. (ICASSP), IEEE, 2013, pp. 3692-
[3] K. Hayashida, M. Nakayama, T. Nishiura, Y. Yamashita, T.
Horiuchi and T. Kato, Close/distant talker discrimination
based on kurtosis of linear prediction residual signals, in Int.
Conf. on Acoust., Speech and Signal Process. (ICASSP), IEEE,
2014, pp. 2327-2331.
[4] O. Walter, L. Drude and R. Haeb-Umbach, Source counting
in speech mixtures by nonparametric Bayesian estimation of an
infinite Gaussian mixture model, in Int. Conf. on Acoust.,
Speech and Signal Process. (ICASSP), IEEE, 2015, pp. 459-
[5] L. Drude, A. Chinaev, D. H. Tran Vu and R. Haeb-Umbach,
Towards online source counting in speech mixtures applying
a variational EM for complex Watson mixture models, in 14th
Int. Workshop on Acoustic Signal Enhancement (IWAENC),
IEEE, 2014, pp. 213-217.
[6] S. Otterson and M. Ostendorf, Efficient use of overlap
information in speaker diarization, in Workshop on Automat.
Speech Recognition and Understanding (ASRU), IEEE, 2007,
pp. 683-686.
[7] P. P. Parada, D. Sharma and P. A. Naylor, Non-intrusive
estimation of the level of reverberation in speech, in Int. Conf.
on Acoust., Speech and Signal Process. (ICASSP), IEEE, 2014,
pp. 4718-4722.
[8] M. Jeub, C. Nelke, C. Beaugeant and P. Vary, "Blind
estimation of the coherent-to-diffuse energy ratio from noisy
speech signals," 19th European Signal Processing Conference,
Barcelona, 2011, pp. 1347-1351.
[9] S. Pasha, C. Ritz and Y. X. Zou, Detecting multiple,
simultaneous talkers through localising speech recorded by ad-
hoc microphone arrays, Asia-Pacific Signal and Inform.
Process. Assoc. Annual. Summit and Conf. (APSIPA ASC),
IEEE, 2016.
[10] D. Pavlidi, A. Griffin, M. Puigt and A. Mouchtaris, Source
counting in real-time sound source localization using a circular
microphone array, in 7th Sensor Array and Multichannel
Signal Process. Workshop (SAM), IEEE, 2012, pp. 521-524.
[11] V. M. Tavakoli, J. R. Jensen, M. G. Christenseny and J.
Benesty, Pseudo-coherence-based MVDR beamformer for
speech enhancement with ad hoc microphone arrays, in Int.
Conf. on Acoust., Speech and Signal Process. (ICASSP), IEEE,
2015, pp. 2659-2663.
[12] Baek and Y.-C. Park, A priori SAP estimator based on the
magnitude square coherence for dual-channel microphone
system, in Int. Conf. on Acoust., Speech and Signal Process.
(ICASSP), IEEE, 2015, pp. 4415-4419.
[13] N. Yousefian and P. C. Loizou, A Dual-Microphone Speech
Enhancement Algorithm Based on the Coherence Function,
IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 2, pp.
599-609, 2012.
[14] S. Vesa, Binaural Sound Source Distance Learning in Rooms,
IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 8, pp.
1498-1507, 2009.
[15] A. Schwarz and W. Kellermann, “Coherent-to-Diffuse Power
Ratio Estimation for Dereverberation,” IEEE/ACM Trans.
Audio, Speech, Lang. Process., vol. 23, no. 6, pp. 1006-1018, 2015.
[16] M. A. Iqbal, J. Stokes, J. C. Platt, A. Surendran and S. L. Grant,
Doubletalk detection using real time recurrent learning, Proc.
International Workshop on Acoustic Echo and Noise Control
(IWAENC), 2006.
[17] M. Souden, K. Kinoshita and T. Nakatani, An integration of
source location cues for speech clustering in distributed
microphone arrays, in Int. Conf. on Acoust., Speech and
Signal Process. (ICASSP), IEEE, 2013, pp. 111-115.
[18] A. Bertrand and M. Moonen, Energy-based multi-speaker
voice activity detection with an ad hoc microphone array, in
Int. Conf. on Acoust., Speech and Signal Process. (ICASSP),
IEEE, 2010, pp. 85-88.
[19] Hu, Y. and Loizou, P., “Subjective evaluation and comparison
of speech enhancement algorithms,” Speech Commun., vol. 49,
pp. 588-601, 2007.
... The authors have previously applied the Coherent to Diffuse Ratio (CDR) feature [9] and Magnitude Squared Coherence (MSC) [10] for multi-talk detection in the context of ad-hoc array 'nodes'. Although accurate multi-talk detection results are obtained, the proposed method in [9] requires all the 'nodes' to be of the same structure and the accuracy of the results highly depends on an adaptively chosen threshold value for multi-talk detection (classification). ...
... The authors have previously applied the Coherent to Diffuse Ratio (CDR) feature [9] and Magnitude Squared Coherence (MSC) [10] for multi-talk detection in the context of ad-hoc array 'nodes'. Although accurate multi-talk detection results are obtained, the proposed method in [9] requires all the 'nodes' to be of the same structure and the accuracy of the results highly depends on an adaptively chosen threshold value for multi-talk detection (classification). In this contribution, one two-channel (dual) microphone array of an unknown interchannel distance, located arbitrarily within an acoustic scene is used to discriminate speakers and estimate the number of sources. ...
... Assuming a dual-channel node [9], with omni-directional microphones and an unknown inter-channel spacing, , is located within a reverberant and noisy environment, the recorded signal by such an array is modelled mathematically as * , , ...
Conference Paper
Full-text available
This paper proposes the use of the frequency domain Magnitude Squared Coherence (MSC) between two ad-hoc recordings of speech as a reliable speaker discrimination feature for source counting applications in highly reverberant environments. The proposed source counting method does not require knowledge of the microphone spacing and does not assume any relative distance between the sources and the microphones. Source counting is based on clustering the frequency domain MSC of the speech signals derived from short time segments. Experiments show that the frequency domain MSC is speaker-dependent and the method was successfully used to obtain highly accurate source counting results for up to six active speakers for varying levels of reverberation and microphone spacing.
... They concluded that by using short-term power measurements at different microphone locations, the multi-speaker VAD problem can be converted into a non-negative blind source separation (NBSS) problem. Other than power, coherent-todiffuse ratio (CDR) values (32) calculated or estimated at dual microphone node locations are also applied for source counting and multi-talk detection [45]. ...
... The main limitation of the method proposed by Pasha et al. [45] is that all the nodes must be of the same structure (i.e., the distances between the microphones at all nodes must be the same), which limits the method's applicability. The MSC is found using the cross-power spectral density (CPSD) as presented by Pasha et al. [87] (Figure 4): ...
Full-text available
Given ubiquitous digital devices with recording capability, distributed microphone arrays are emerging recording tools for hands-free communications and spontaneous tele-conferencings. However, the analysis of signals recorded with diverse sampling rates, time delays, and qualities by distributed microphone arrays is not straightforward and entails important considerations. The crucial challenges include the unknown/changeable geometry of distributed arrays, asynchronous recording, sampling rate mismatch, and gain inconsistency. Researchers have recently proposed solutions to these problems for applications such as source localization and dereverberation, though there is less literature on real-time practical issues. This article reviews recent research on distributed signal processing techniques and applications. New applications benefitting from the wide coverage of distributed microphones are reviewed and their limitations are discussed. This survey does not cover partially or fully connected wireless acoustic sensor networks.
... The authors related these statistics directly with the number of speakers via a polynomial function. In [Pas+17], a parametric method has been derived for speaker counting, which relies on coherent-todiffuse ratio estimation over several time frames. The maximum number of speakers J is then estimated by thresholding this computed parameter. ...
Full-text available
Sound source localization (SSL) is a subtask of audio scene analysis that has challenged researchers for more than four decades. Traditional methods (e.g., MUSIC or GCC-PHAT) impose strong assumptions on the sound propagation, number of active sources and/or signal content, which makes them vulnerable to adverse acoustic phenomena, such as reverberation and noise. Recently, data-driven models -- and particularly deep neural networks – have shown increased robustness in noisy and reverberant environments. However, their performance is still seriously degraded in the presence of multiple sound sources, especially when their number is unknown. Moreover, source detection and localization in real-life use-cases, where the latency is an important criterion, is still an open research problem. In this thesis, we focus on speaker detection and localisation in office/domestic indoor environments, using multichannel Ambisonics recordings, with the emphasis on low-latency performance. First, we propose to use deep neural networks (DNNs) to estimate the number of speakers (NoS) in a multichannel mixture. We propose a model that is capable to count up to five speakers, with a relatively high accuracy, at the short-term-frame resolution. We also provide a performance analysis of this model depending on several hyperparameters, which gives interesting insights on its behavior. Second, we explore the capabilities of a multichannel audio signal representation called time-domain velocity vector (TDVV), akin to relative impulse response in the present spherical harmonics domain, as a novel type of input features of DNNs for detection/localization tasks. Next, we address multi-speaker localization, by first improving upon a state-of-the-art convolutional recurrent neural network (CRNN) with a substantial gain in accuracy. 
We also examine the potential of self-attention-based neural networks for multi-speaker localization, as these models are known to be suitable for other audio processing tasks due to their capability to capture both short- and long-term dependencies in the input signal. Furthermore, we investigate the use of the estimated NoS, provided by our speaker counting neural network, to improve our speaker localization CRNN. We show experimentally that using the estimated NoS leads to more robust multi-speaker localization than the classical threshold-based direction of arrival (DoA) estimation. Moreover, we show the interest of injecting the NoS information as an additional input feature for the localization neural network. Finally, we explore multi-task neural architectures to estimate both the NoS and speaker DoAs at the same time.
... Success Rate [25] was applied for speaker counting performance in this paper. Assuming that Nc(k) is the number of scenarios that the estimated speaker countĈ equals the true speaker count C and Nt(k) is the total number per class k. ...
... signal it does not vary with the source energy level and is robust against the inconsistency between the sources energy levels [142]. It is observed that the CDR estimate drops with the interference and source to microphone distance. ...
Ad-hoc microphone arrays formed from the microphones of mobile devices such as smart phones, tablets and notebooks are emerging recording platforms for meetings, press conferences and other sound scenes. As opposed to the Wireless Acoustic Sensor Networks (WASN), ad-hoc microphones do not communicate within the array and location of each microphone is unknown. Analysing speech signals and the acoustic scene in the context of ad-hoc microphones is the goal of this thesis. Despite conventional known geometry microphone arrays (e.g. a Uniform Linear array), ad-hoc arrays do not have fixed geometries and structures and therefore standard speech processing techniques such as beamforming and dereverbearion techniques cannot be directly applied to these. The main reasons for this include unknown distances between microphones an hence unknown relative time delays and the changeable array topology. This thesis focuses on utilising the side information obtained by the acoustic scene analysis to improve the speech enhancement by ad-hoc microphone arrays randomly distributed within a reverberant environment. New discriminative features are proposed, applied and tested for various signal and audio processing applications such as microphone clustering, source localisation, multi-channel dereverberation, source counting and multi-talk detection. The main contributions of this thesis fall into two categories: 1) Novel spatial features extracted from Room Impulse Responses (RIRs) and speech signals 2) Speech enhancement and acoustic scene analysis methods specifically designed for the ad-hoc arrays. Microphone clustering, source localisation, speech enhancement, source counting and multi-talk detection in the context of ad-hoc arrays are investigated in this thesis and novel methods are proposed and tested. 
A clustered speech enhancement and dereverberation method tailored for ad-hoc microphones is proposed, and it is concluded that exclusively using a cluster of microphones located closer to the source improves the dereverberation performance. Also proposed is a multi-channel speech dereverberation method based on a novel spatial multi-channel linear prediction analysis approach for ad-hoc microphones. The spatially modified multi-channel linear prediction approach takes into account the estimated relative distances between the source and the microphones and improves the dereverberation performance. The coherence-based features are applied to multi-talk detection and source counting in highly reverberant environments, and it is shown that the proposed features are reliable source counting features in the context of ad-hoc microphones. Highly accurate offline source counting and pseudo real-time multi-talk detection results are achieved by the proposed methods.
Conference Paper
In an era of ubiquitous digital devices with built-in microphones and recording capability, distributed microphone arrays of a few digital recording devices are the emerging recording tool in hands-free speech communications and immersive meetings. Such so-called ad hoc microphone arrays can facilitate high-quality spontaneous recording experiences for a wide range of applications and scenarios, though critical challenges have limited their applications. These challenges include unknown and changeable positions of the recording devices and sound sources, resulting in varying time delays of arrival between microphones in the ad hoc array as well as varying recorded sound power levels. This paper reviews state-of-the-art techniques to overcome these issues and provides insight into possible ways to make existing methods more effective and flexible. The focus of this paper is on scenarios in which the microphones are arbitrarily located in an acoustic scene and do not communicate directly or through a fusion centre.
Conference Paper
Full-text available
This paper proposes a novel approach to detecting multiple, simultaneous talkers in multi-party meetings using localisation of active speech sources recorded with an ad-hoc microphone array. Cues indicating the relative distance between sources and microphones are derived from speech signals and room impulse responses recorded by each of the microphones distributed at unknown locations within a room. Multiple active sources are localised by analysing a surface formed from these cues derived at different locations within the room. The number of localised active sources in each frame or utterance is then counted to estimate when multiple sources are active. The proposed approach does not require prior information about the number and locations of sources or microphones. Synchronisation between microphones is also not required. A meeting scenario with competing speakers is simulated, and results show that simultaneously active sources can be detected with an average accuracy of 75% and the number of active sources counted accurately 65% of the time.
Full-text available
The estimation of the time- and frequency-dependent coherent-to-diffuse power ratio (CDR) from the measured spatial coherence between two omnidirectional microphones is investigated. Known CDR estimators are formulated in a common framework, illustrated using a geometric interpretation in the complex plane, and investigated with respect to bias and robustness towards model errors. Several novel unbiased CDR estimators are proposed, and it is shown that knowledge of either the direction of arrival (DOA) of the target source or the coherence of the noise field is sufficient for unbiased CDR estimation. The validity of the model for the application of CDR estimates to dereverberation is investigated using measured and simulated impulse responses. A CDR-based dereverberation system is presented and evaluated using signal-based quality measures as well as automatic speech recognition accuracy. The results show that the proposed unbiased estimators have a practical advantage over existing estimators, and that the proposed DOA-independent estimator can be used for effective blind dereverberation.
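To make the coherence-based principle concrete, the following is a minimal sketch (not one of the unbiased estimators proposed above): it estimates the measured coherence with Welch averaging, models the diffuse field with the standard sinc coherence for omnidirectional microphones, and applies the simple estimator that assumes a broadside target with coherence equal to one. The function name, microphone spacing `d` and regularisation constants are illustrative assumptions.

```python
import numpy as np
from scipy.signal import csd, welch

def cdr_broadside(x1, x2, fs, d, c=343.0, nperseg=512):
    """Sketch of a simple CDR estimate from two omnidirectional microphones,
    assuming a broadside target (target coherence = 1) and an ideally
    diffuse noise field with sinc spatial coherence."""
    f, Px1x2 = csd(x1, x2, fs=fs, nperseg=nperseg)        # cross-spectrum
    _, Px1 = welch(x1, fs=fs, nperseg=nperseg)            # auto-spectra
    _, Px2 = welch(x2, fs=fs, nperseg=nperseg)
    gamma_x = Px1x2 / np.sqrt(Px1 * Px2 + 1e-12)          # measured coherence
    gamma_diff = np.sinc(2.0 * f * d / c)                 # diffuse-field model
    cdr = np.real((gamma_diff - gamma_x) / (gamma_x - 1.0 + 1e-6))
    return f, np.clip(cdr, 0.0, None)                     # CDR is non-negative
```

On a frame dominated by a single nearby talker the measured coherence approaches one and the estimate becomes large; for diffuse or uncorrelated inputs it collapses towards zero, which is the behaviour coherence-based source-counting features rely on.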
Conference Paper
Full-text available
Overlapping speech is a source of significant errors in speaker diarization of spontaneous meeting recordings. Recent works on speaker diarization have attempted to solve the problem of overlap detection using classifiers trained on acoustic and spatial features. This paper proposes a method to improve the short-term spectral feature based overlap detector by incorporating information from long-term conversational features in the form of speaker change statistics. The statistics are obtained at the segment level (around a few seconds) from the output of a diarization system. The approach is motivated by the observation that segments containing more speaker changes are more likely to contain overlaps. Experiments on the AMI meeting corpus reveal that the number of overlaps in a segment follows a Poisson distribution whose rate is directly proportional to the number of speaker changes in the segment. When this information is combined with acoustic information in an HMM/GMM overlap detector, improvements are verified in terms of F-measure and, consequently, the diarization error rate (DER) is reduced by 5% relative to the baseline overlap detector.
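The reported Poisson relationship suggests a simple way to fuse the conversational prior with an acoustic overlap score. The log-linear combination and the proportionality constant `alpha` below are illustrative assumptions, not the paper's HMM/GMM integration.

```python
import math

def overlap_score(acoustic_loglik, n_speaker_changes, alpha=0.5, weight=1.0):
    """Combine an acoustic overlap log-likelihood with a Poisson prior whose
    rate is proportional to the number of speaker changes in the segment.
    The prior term is the log-probability of at least one overlap,
    log(1 - exp(-lambda))."""
    lam = alpha * n_speaker_changes
    log_prior = math.log(1.0 - math.exp(-lam) + 1e-12)
    return acoustic_loglik + weight * log_prior
```

With such a score, a segment containing many speaker changes needs weaker acoustic evidence before it is flagged as overlapped.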
Conference Paper
This contribution describes a step-wise source counting algorithm to determine the number of speakers in an offline scenario. Each speaker is identified by a variational expectation maximization (VEM) algorithm for complex Watson mixture models and therefore directly yields beamforming vectors for a subsequent speech separation process. An observation selection criterion is proposed which improves the robustness of the source counting in noise. The algorithm is compared to an alternative VEM approach with Gaussian mixture models based on directions of arrival and shown to deliver improved source counting accuracy. The article concludes by extending the offline algorithm towards a low-latency online estimation of the number of active sources from the streaming input data.
Conference Paper
Speech enhancement with distributed arrays has been addressed with a variety of methods. On the one hand, data-independent methods require information about the positions of the sensors, so they are not suitable for dynamic geometries. On the other hand, Wiener-based methods cannot assure a distortionless output. This paper proposes minimum variance distortionless response (MVDR) filtering based on multichannel pseudo-coherence for speech enhancement with ad hoc microphone arrays. This method requires neither position information nor control of the trade-off used in the distortion-weighted methods. Furthermore, certain performance criteria are derived in terms of the pseudo-coherence vector, and the method is compared with the multichannel Wiener filter. Evaluation shows the suitability of the proposed method in terms of noise reduction with minimum distortion in ad hoc scenarios.
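The core filter can be sketched compactly: given a noise covariance matrix and a steering vector per frequency bin (here, the pseudo-coherence vector), the MVDR weights follow the textbook closed form. How the pseudo-coherence vector is built from the array's cross-spectra follows the paper and is not reproduced here; the function name is an assumption.

```python
import numpy as np

def mvdr_weights(R_noise, d):
    """Textbook MVDR solution w = R^{-1} d / (d^H R^{-1} d) for one frequency
    bin, with d the steering (here: pseudo-coherence) vector.  The constraint
    w^H d = 1 keeps the target component undistorted while the quadratic
    objective minimises residual noise power."""
    Rinv_d = np.linalg.solve(R_noise, d)    # R^{-1} d without an explicit inverse
    return Rinv_d / (d.conj() @ Rinv_d)
```

The distortionless property is what separates this approach from Wiener-based filters: the constraint holds exactly regardless of how the noise covariance is estimated.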
Conference Paper
We show corroborating evidence that, among a set of common acoustic parameters, the clarity index C50 provides a measure of reverberation that is well correlated with speech recognition accuracy. We also present a data driven method for non-intrusive C50 parameter estimation from a single channel speech signal. The method extracts a number of features from the speech signal and uses a binary regression tree, trained on appropriate training data, to estimate the C50. Evaluation is carried out using speech utterances convolved with real and simulated room impulse responses, and additive babble noise. The new method outperforms a baseline approach in our evaluation.
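For reference, the ground-truth C50 that such a non-intrusive estimator is trained to predict can be computed directly when the room impulse response is available. The sketch below uses the standard early/late energy-ratio definition (energy in the first 50 ms after the direct sound versus everything later); names and constants are illustrative.

```python
import numpy as np

def clarity_c50(rir, fs):
    """Clarity index C50 in dB from a measured room impulse response:
    10*log10 of early energy (first 50 ms after the direct sound) over
    late energy."""
    onset = int(np.argmax(np.abs(rir)))    # direct-path sample
    split = onset + int(0.05 * fs)         # 50 ms after the direct sound
    early = np.sum(rir[onset:split] ** 2)
    late = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(early / (late + 1e-12))
```

A short reverberation tail concentrates energy before the 50 ms boundary and yields a high C50, consistent with the correlation between C50 and recognition accuracy reported above.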
Conference Paper
Desired/undesired speech discrimination is as important as speech/non-speech discrimination to achieve useful applications such as speech interfaces and teleconferencing systems. Conventional methods of voice activity detection (VAD) utilize the directional information of sound sources to distinguish desired from undesired speech. However, these methods have to utilize multiple microphones to estimate the directions of sound sources. Here, we propose a new method to discriminate desired from undesired speech with a single microphone. We assumed that the desired talkers would be close to the microphone, and the proposed method could distinguish close/distant-talking speech from observed signals based on the kurtosis of the linear prediction (LP) residual signals. The experimental results revealed that the proposed method could distinguish close-talking speech from distant-talking speech within a 10% equal error rate (EER) in ordinary reverberant environments with less processing time.
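The discriminative cue can be sketched as follows: estimate linear-prediction coefficients by the autocorrelation method, inverse-filter to obtain the residual, and measure its kurtosis. Close-talking speech retains sharp excitation pulses (high kurtosis), while reverberation smears the residual towards Gaussian. The prediction order and the synthetic signals in the usage example are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter
from scipy.stats import kurtosis

def lp_residual_kurtosis(x, order=12):
    """Excess kurtosis of the linear-prediction residual of a signal.
    High values indicate a spiky, close-talking-like excitation; reverberation
    drives the residual towards Gaussian and the kurtosis towards zero."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]  # lags 0..order
    a = solve_toeplitz(r[:order], r[1:order + 1])  # autocorrelation-method LP coefficients
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], x)       # inverse filter
    return kurtosis(residual)
```

Thresholding this statistic gives a single-microphone close/distant decision without any direction-of-arrival estimation, which is the paper's central point.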
Conference Paper
Audio signal classification suffers from the mismatch of environmental conditions when training data is based on clean and anechoic signals and test data is distorted by reverberation and signals from other sources. In this contribution we analyze the classification performance for such a scenario with two concurrently active sources in a simulated reverberant environment. To obtain robust classification results, we exploit the spatial distribution of ad-hoc microphone arrays to capture the signals and extract cepstral features. Based on these features only, we use unsupervised fuzzy clustering to estimate clusters of microphones which are dominated by one of the sources. Finally, signal classification based on clean and anechoic training data is performed for each of the clusters. The probability of cluster membership for each microphone is provided by the fuzzy clustering algorithm and is used to compute a weighted average of the feature vectors. It is shown that the proposed method exceeds the performance of classification based on single microphones.
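A minimal sketch of the clustering stage: standard fuzzy c-means over per-microphone feature vectors, returning soft memberships that can then weight the feature average within each cluster. The implementation below is a generic fuzzy c-means, not the authors' exact configuration; the fuzzifier `m` and iteration count are illustrative.

```python
import numpy as np

def fuzzy_cmeans(X, k=2, m=2.0, iters=100, seed=0):
    """Standard fuzzy c-means.  X: (n_samples, n_features) microphone feature
    vectors.  Returns an (n_samples, k) membership matrix whose rows sum to 1;
    memberships can weight feature vectors when averaging within a cluster."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(k), size=len(X))        # random soft memberships
    for _ in range(iters):
        W = U ** m                                    # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted cluster centres
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-9
        U = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
    return U
```

Each microphone's membership then serves as the weight when averaging cepstral feature vectors inside a source-dominated cluster, as described in the abstract.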