Enhancing Direct-Path Relative Transfer Function Using Deep Neural Network for Robust Sound Source Localization

Abstract

This paper proposes a deep neural network (DNN) based direct-path relative transfer function (DP-RTF) enhancement method for robust direction of arrival (DOA) estimation in noisy and reverberant environments. The DP-RTF refers to the ratio between the direct-path acoustic transfer functions of the two microphone channels. First, the complex-valued DP-RTF is decomposed into the inter-channel intensity difference and sinusoidal functions of the inter-channel phase difference in the time-frequency domain. Then, the decomposed DP-RTF features from a series of temporal context frames are used to train a DNN model, which maps the DP-RTF features contaminated by noise and reverberation to the clean ones, and meanwhile provides a time-frequency (TF) weight to indicate the reliability of the mapping. The DP-RTF enhancement network thereby enhances the DP-RTF against noise and reverberation. Finally, the DOA of a sound source is estimated by integrating the weighted matching between the enhanced DP-RTF features and the DP-RTF templates. Experimental results on simulated data show the superiority of the proposed DP-RTF enhancement network for estimating the DOA of a sound source in environments with various levels of noise and reverberation.
Received: 18 November 2020 | Accepted: 21 December 2020
CAAI Transactions on Intelligence Technology
DOI: 10.1049/cit2.12024
ORIGINAL RESEARCH PAPER
Bing Yang¹,² | Runwei Ding¹ | Yutong Ban³,⁴ | Xiaofei Li² | Hong Liu¹

¹ Key Laboratory of Machine Perception, Shenzhen Graduate School, Peking University, Beijing, China
² Westlake University & Westlake Institute for Advanced Study, Hangzhou, China
³ SAIIL, Massachusetts General Hospital, Boston, Massachusetts, USA
⁴ CSAIL, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts, USA

Correspondence
Runwei Ding, Shenzhen Graduate School, Peking University, China.
Email: dingrunwei@pku.edu.cn

Funding information
National Natural Science Foundation of China, Grant/Award Numbers: No. 61673030, U1613209; National Natural Science Foundation of Shenzhen, Grant/Award Number: No. JCYJ20200109140410340
1 | INTRODUCTION
Sound source localization has a wide range of applications such as teleconferencing, robot audition and hearing aids. With the development of deep learning techniques, many data-driven sound source localization methods have been built in a supervised manner [1]. According to the role that the deep learning model plays, these methods are classified into four categories, namely signal-to-location [2], feature-to-location [3,4], spatial spectrum-to-location [5], and feature-to-feature [6,7] based methods. Among these, the feature-to-feature based method is simple and effective for improving the performance of sound source localization in noisy and reverberant environments, as it is data-driven and the extracted features can adapt to various acoustic conditions.
The spatial features utilized for localization include the time and intensity differences between dual-microphone signals. The inter-channel time difference (ITD) is commonly estimated by searching for the maximum of the generalized cross-correlation (GCC) function [8]. The inter-channel phase difference (IPD) is another time difference feature and is approximately linear with respect to frequency [9]. Moreover, the inter-channel intensity difference (IID) is computed as the energy ratio of the signals captured by two microphones. The relative transfer function (RTF) [10,11], which is the ratio between the acoustic transfer functions of the two channels, encodes time and intensity information in its argument and magnitude, respectively. Other high-level localization features include the cross-correlation function (CCF) [3], the eigenvectors of the spatial correlation matrix associated with the signal subspace [12], and so forth. Overall, the sound source can be easily localized with the aforementioned localization features under a noise-free and anechoic condition. However, in practical acoustic scenes, noise and reverberation often contaminate the direct-path propagated source signal and degrade the accuracy of localization feature estimation, which in turn leads to a significant drop in localization performance.
Many methods aim to remove the effect of acoustic interferences on the direct-path localization feature extraction. Some research works give high attention values to the direct sound dominant time-frequency (TF) regions using a TF weighting scheme; these can be classified into unsupervised methods [13–16] and supervised methods [17,18]. However, these methods do not refine the values of the localization features. Although the IPD enhancement method [6] fine-tunes the localization features using a deep neural network (DNN), it only considers the time difference information of the current time frame, whereas intensity difference and temporal context information are also important for localization. We aim to investigate how to make full use of the time and intensity difference information of both the historical and current time frames, in order to recover the clean localization features from the contaminated ones, so that the sound source can be robustly localized.

This is an open access article under the terms of the Creative Commons Attribution-NonCommercial-NoDerivs License, which permits use and distribution in any medium, provided the original work is properly cited, the use is non-commercial and no modifications or adaptations are made.
© 2021 The Authors. CAAI Transactions on Intelligence Technology published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology and Chongqing University of Technology.
CAAI Trans. Intell. Technol. 2021;1–9. wileyonlinelibrary.com/journal/cit2
This article designs a direct-path relative transfer function (DP-RTF) enhancement network to preserve the time and intensity difference information of the direct-path signal and suppress the contamination of noise and reverberation. The complex DP-RTF is decomposed into the IID and the sinusoidal functions of the IPD, in order to fit the real-valued DNN framework and to explicitly present the localization cues. Then the DP-RTF is enhanced using a DNN which non-linearly maps the contaminated DP-RTF features of multiple temporal context frames to one single-frame clean feature. The DP-RTF enhancement network jointly predicts the clean DP-RTF and the TF reliability weight, by adopting a weighted mean square error (MSE) with a weight-maximization regularization term. The trained DP-RTF enhancement network can significantly suppress the effect of noise and reverberation on the DP-RTF estimation. For each TF bin, the enhanced DP-RTF is matched with the templates of the candidate directions. The direction of arrival (DOA) of the sound source is determined by integrating the matching functions, combined with the predicted TF weight, over multiple TF bins. Experiments using simulated data demonstrate the effectiveness of our method under noisy and reverberant acoustic conditions. The main contributions of this paper are summarized as follows:
(1) We design a DP-RTF enhancement network to recover the DP-RTF features from the contaminated localization features for robust sound source localization. Different from the method in [6], which recovers the frame-wise time difference information using the current time frame, the proposed model considers the short-term temporal context information, and jointly recovers both time and intensity difference information.
(2) We use a weighted MSE with a weight-maximization regularization term to guarantee that the TF-wise DP-RTF features are selectively recovered. Using the predicted TF weight, the enhancement network concentrates on the features with high weights, so that the high-weight features dominate the DOA estimation.
The rest of this paper is organized as follows. Section 2 introduces the DP-RTF based DOA estimation method. Section 3 details the proposed DNN based DP-RTF enhancement network. Experiments and discussions with simulated data are presented in Section 4, and conclusions are drawn in Section 5.
2 | DP-RTF BASED DOA ESTIMATION
In an enclosed environment with additive ambient noise, a single source is observed by a pair of microphones. The signal received by the $m$th microphone is formulated as
$$x_m(t) = h_m(t) * s(t) + v_m(t), \tag{1}$$
where $m \in [1, 2]$ is the microphone index, $s(t)$ denotes the source signal, $v_m(t)$ denotes the noise signal received at the $m$th microphone, and $h_m(t)$ is the acoustic impulse response (AIR) from the source to the $m$th microphone. Here, $*$ denotes the convolution operation. Applying the short-time Fourier transform (STFT) to Equation (1), the signal at the $m$th microphone can be rewritten in the TF domain as
$$X_m(n,f) = H_m(f,\theta)\, S(n,f) + V_m(n,f), \tag{2}$$
where $n \in [1, N]$ denotes the time frame index, $f \in [1, F]$ denotes the frequency index, and $\theta$ denotes the horizontal DOA of the source. $X_m(n,f)$, $S(n,f)$ and $V_m(n,f)$ represent the STFT coefficients of $x_m(t)$, $s(t)$ and $v_m(t)$, respectively. The acoustic transfer function (ATF) $H_m(f,\theta)$ is the Fourier transform of $h_m(t)$. The ATF contains the direct and reflected propagation paths of the sound source, that is,
$$H_m(f,\theta) = H_m^d(f,\theta) + H_m^r(f,\theta), \tag{3}$$
where $H_m^d(f,\theta)$ and $H_m^r(f,\theta)$ denote the ATFs of the direct-path and reflected propagations, respectively. The DP-RTF [19] is defined as the ratio between the two direct-path ATFs, namely
$$R^d(f,\theta) = \frac{H_2^d(f,\theta)}{H_1^d(f,\theta)}. \tag{4}$$
Under the anechoic and noise-free condition, $H_m^r(f,\theta)$ and $V_m(n,f)$ are equal to zero. According to Equations (2) and (3), the microphone signal is simplified as
$$X_m(n,f) = H_m^d(f,\theta)\, S(n,f). \tag{5}$$
Using this simplification, the DP-RTF can be estimated by
$$\hat{R}^d(n,f) = \frac{X_2(n,f)}{X_1(n,f)}. \tag{6}$$
The estimated DP-RTF is decomposed and rewritten in a vector form, namely
$$\hat{\mathbf{r}}(n,f) = \left[ \frac{20 \log_{10} \left| \hat{R}^d(n,f) \right|}{\Delta I_{\max}},\ \sin \angle \hat{R}^d(n,f),\ \cos \angle \hat{R}^d(n,f) \right]^T. \tag{7}$$
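As an illustration of Equations (6) and (7), the following minimal Python sketch computes the feature vector of a single TF bin from the two STFT coefficients. The function name `dp_rtf_feature` and the sample values are hypothetical; the default normalisation constant of 20 follows the setting reported in Section 4.1.2. The sketch assumes both coefficients are non-zero and does not clip the normalised IID.

```python
import cmath
import math


def dp_rtf_feature(x1, x2, delta_i_max=20.0):
    """Decompose the DP-RTF estimate of one TF bin into a real 3-vector.

    x1, x2: complex STFT coefficients of microphones 1 and 2 (both assumed
    non-zero, otherwise the ratio or the logarithm is undefined).
    delta_i_max: IID normalisation constant.
    """
    ratio = x2 / x1                           # Equation (6)
    iid_db = 20.0 * math.log10(abs(ratio))    # level difference in dB
    ipd = cmath.phase(ratio)                  # phase difference in [-pi, pi]
    return [iid_db / delta_i_max, math.sin(ipd), math.cos(ipd)]


# Channel 2 has half the magnitude of channel 1 and a phase shift of pi/2.
feature = dp_rtf_feature(1.0 + 0.0j, 0.5j)
```

In this toy bin the first element is negative because channel 2 is weaker, and the sine and cosine elements encode the quarter-cycle phase shift without reference to the wrapped phase value itself.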
The DOA of the sound source is estimated by matching the estimated DP-RTF vectors from reliable TF bins with the template, namely
$$\hat{\theta} = \operatorname*{argmin}_{\theta \in S} \sum_{n=1}^{N} \sum_{f=1}^{F} \left\| \hat{w}(n,f) \left( \hat{\mathbf{r}}(n,f) - \mathbf{r}(f,\theta) \right) \right\|^2, \tag{8}$$
where $\|\cdot\|$ denotes the Euclidean norm, $S$ is the set of candidate directions, and $\hat{w}(n,f)$ denotes the TF weight that indicates the reliability of $\hat{\mathbf{r}}(n,f)$. Here, the ground-truth DP-RTF vector $\mathbf{r}(f,\theta)$ is defined in Equation (13) (see Section 3.1), and the vectors $\mathbf{r}(f,\theta)$ of all candidate directions are used as the matching template.
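The matching rule of Equation (8) can be sketched as an exhaustive search over the candidate directions. The code below is a toy illustration, not the authors' implementation: the function name `estimate_doa`, the nested-list data layout and all numerical values are assumptions made for the example.

```python
def estimate_doa(features, weights, templates):
    """Return the candidate direction with the smallest weighted mismatch.

    features[n][f]  : estimated 3-element DP-RTF vector of TF bin (n, f)
    weights[n][f]   : reliability weight of TF bin (n, f)
    templates[theta]: list over f of ground-truth 3-element vectors
    """
    best_theta, best_cost = None, float("inf")
    for theta, template in templates.items():
        cost = 0.0
        for frame_features, frame_weights in zip(features, weights):
            for r_hat, w, r_ref in zip(frame_features, frame_weights, template):
                # Squared Euclidean norm of the weighted difference vector.
                cost += sum((w * (a - b)) ** 2 for a, b in zip(r_hat, r_ref))
        if cost < best_cost:
            best_theta, best_cost = theta, cost
    return best_theta


# Toy setting: one frequency band, three candidate directions (degrees).
templates = {
    -30: [[-0.2, -0.479, 0.878]],
    0: [[0.0, 0.0, 1.0]],
    30: [[0.2, 0.479, 0.878]],
}
features = [[[0.15, 0.45, 0.90]], [[-0.30, -0.60, 0.80]]]  # two frames
weights = [[1.0], [0.1]]  # the second frame is marked as unreliable
doa = estimate_doa(features, weights, templates)
```

With the second frame down-weighted, the search selects 30°; with both weights set to one, the outlier frame pulls the decision to 0° in this toy example, which is the kind of error the TF weight is meant to prevent.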
3 | DP-RTF ENHANCEMENT NETWORK
The above-mentioned framework provides a solution to single sound source localization under the anechoic and noise-free assumption (see Figure 1a). However, noise and reverberation are inevitable in real-world scenarios. Under noisy and reverberant conditions, the DP-RTF estimate of Equation (6) is a biased approximation, which can introduce an obvious deviation in the DOA estimate. Hence, we add a DP-RTF enhancement network to the above-mentioned framework, in order to recover the clean DP-RTF features from the contaminated ones and thus suit reverberant and noisy environments (see Figure 1b).
3.1 | Input and target
As the complex DP-RTF estimates cannot be directly processed by a real-valued DNN, the input and target complex DP-RTF values are transformed into a real-valued form.
The theoretical direct-path ATF, namely the head-related transfer function (HRTF) in the binaural localization case, is formulated as
$$H_m^d(f,\theta) = \alpha_m(f,\theta)\, e^{-j \omega_f \tau_m(\theta)}, \tag{9}$$
where $\omega_f$ is the angular frequency of the $f$th frequency bin, and $\alpha_m(f,\theta)$ and $\tau_m(\theta)$ denote the propagation attenuation factor and the time of arrival from the source to the $m$th microphone, respectively. Substituting Equation (9) into Equation (4), the DP-RTF is rewritten as
$$R^d(f,\theta) = \frac{\alpha_2(f,\theta)}{\alpha_1(f,\theta)}\, e^{-j \omega_f \left( \tau_2(\theta) - \tau_1(\theta) \right)}. \tag{10}$$
It can be seen that the DP-RTF encodes the IID and IPD information in its magnitude and argument, respectively. We extract these two localization cues using a disjoint decomposition, formally written as
$$\Delta I(f,\theta) = 20 \log_{10} \left| R^d(f,\theta) \right|, \tag{11}$$
$$\Delta P(f,\theta) = \angle R^d(f,\theta), \tag{12}$$
where $\angle$ is the phase operator of complex numbers. With these two real-valued localization cues, the complex DP-RTF value can be recovered.
FIGURE 1 Pipeline for the DP-RTF based DOA estimation method. (a) Without DP-RTF enhancement network. (b) With DP-RTF enhancement network

However, the IPD lies in the range $[-\pi, \pi]$ and may be periodically wrapped as the frequency or time difference increases. Hence, the mean squared IPD error cannot be directly used to reflect the DOA difference, and DOA estimation using such an IPD will fail as well due to the phase wrapping ambiguity. To avoid this ambiguity, the sinusoidal functions of the IPD are used instead. Accordingly, the complex DP-RTF is decomposed into the IID and the sine and cosine functions of the IPD. The three decomposed parts are concatenated to form the ground-truth DP-RTF vector associated with the direction $\theta$, that is,
$$\mathbf{r}(f,\theta) = \left[ \frac{\Delta I(f,\theta)}{\Delta I_{\max}},\ \sin \Delta P(f,\theta),\ \cos \Delta P(f,\theta) \right]^T, \tag{13}$$
where $(\cdot)^T$ denotes vector transpose, and $\Delta I_{\max}$ is an empirically set maximum value of the IID used for normalization. As the original IID has a relatively wider range than the sinusoidal functions, the IID is scaled to $[-1, 1]$ to balance the contributions of the IID and IPD information. The dimension of $\mathbf{r}(f,\theta)$ is $3 \times 1$, and each element lies in the range from $-1$ to $1$. This transformation does not lose any information about the sound source location. Therefore, the DP-RTF feature can be recovered from the IID and IPD information contained in the DP-RTF vector.
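The phase wrapping argument can be checked numerically. The short sketch below, with a hypothetical helper `wrap` and arbitrarily chosen phase values, compares two physically close phases that fall on opposite sides of the wrapping boundary: their raw difference is almost $2\pi$, whereas their distance in the (sine, cosine) representation stays small.

```python
import math


def wrap(phase):
    """Wrap a phase in radians to the interval [-pi, pi)."""
    return (phase + math.pi) % (2.0 * math.pi) - math.pi


a = wrap(math.pi - 0.05)   # stays just below +pi
b = wrap(math.pi + 0.05)   # wraps around to just above -pi

# Difference of the wrapped phases: close to 2*pi although a and b are neighbours.
raw_gap = abs(a - b)

# Euclidean distance between the (sin, cos) pairs: close to the true gap of 0.1 rad.
sincos_gap = math.hypot(math.sin(a) - math.sin(b), math.cos(a) - math.cos(b))
```

This is why a squared error on the wrapped IPD would heavily penalise an estimate that is in fact nearly correct, while a squared error on the sinusoidal functions does not.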
Accordingly, for frame $n$, the input vector is a concatenation of the contaminated DP-RTF vectors from a series of (context) time frames and frequency bands, namely
$$\mathbf{C}_I(n) = \left[ \hat{\mathbf{r}}(n-C+1,1)^T, \ldots, \hat{\mathbf{r}}(n-C+1,F)^T, \ldots, \hat{\mathbf{r}}(n,1)^T, \ldots, \hat{\mathbf{r}}(n,F)^T \right]^T, \tag{14}$$
and the learning target vector is a concatenation of the clean DP-RTF vectors from multiple frequency bands, namely
$$\mathbf{C}_T^{\mathrm{RTF}}(n) = \left[ \mathbf{r}(1,\theta)^T, \ldots, \mathbf{r}(F,\theta)^T \right]^T, \tag{15}$$
where $C$ denotes the number of context time frames and $F$ refers to the number of utilized frequencies. The dimensions of $\mathbf{C}_I(n)$ and $\mathbf{C}_T^{\mathrm{RTF}}(n)$ are $3CF \times 1$ and $3F \times 1$, respectively. By concatenating the DP-RTF feature vectors of $CF$ TF bins into a long input feature vector, the acoustic context information along the time and frequency axes can be captured. The full band is taken into account due to the mutual dependencies of the localization cues at different frequencies.
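The concatenation in Equation (14) can be sketched as follows. The function name `build_input` and the nested-list layout are assumptions for illustration; frames are indexed from zero here, and the handling of the first $C-1$ frames (for which no full context exists) is not specified in the text, so the sketch simply rejects them.

```python
def build_input(features, n, C):
    """Concatenate the DP-RTF vectors of frames n-C+1 .. n into one flat list.

    features[k][f] is the 3-element feature vector of frame k and band f.
    Frames are ordered from the oldest context frame to the current one,
    and bands from the first to the last within each frame.
    """
    first = n - C + 1
    if first < 0:
        raise ValueError("not enough past frames for the requested context")
    vector = []
    for k in range(first, n + 1):
        for r_hat in features[k]:
            vector.extend(r_hat)
    return vector


# Toy example: 3 frames, F = 2 bands, placeholder feature values.
features = [
    [[0.1, 0.0, 1.0], [0.2, 0.1, 0.9]],
    [[0.3, 0.2, 0.8], [0.4, 0.3, 0.7]],
    [[0.5, 0.4, 0.6], [0.6, 0.5, 0.5]],
]
x = build_input(features, n=2, C=2)   # length 3 * C * F = 12
```

The resulting length is $3CF$, matching the dimension stated after Equation (15).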
3.2 | Network architecture
Considering the complex process by which noise and reverberation are generated and mixed, the mapping from contaminated DP-RTFs to clean DP-RTFs is a complicated non-linear operation. Since neural networks provide many levels of non-linearity, we design a DNN model to approximate this highly non-linear relationship. The architecture of the designed DNN model for DP-RTF enhancement is illustrated in Figure 2. The DNN model employs the contaminated DP-RTF vectors of $C$ frames to predict the clean DP-RTF vector and the TF reliability weight of one frame. The input vector $\mathbf{C}_I(n)$ is first fed into three fully connected (FC) layers to obtain the latent features. Each of these FC layers has 2048 units and is activated by a rectified linear unit (ReLU). Then the latent features are passed to two separate FC layers. Accordingly, the output contains two parts, one for the DP-RTF vector and the other for the TF weight, which are activated by a tanh unit and a sigmoid unit, respectively. The output for the DP-RTF vector has a dimension of $3F \times 1$, namely
$$\mathbf{C}_O^{\mathrm{RTF}}(n) = \left[ \tilde{\mathbf{r}}(n,1)^T, \ldots, \tilde{\mathbf{r}}(n,F)^T \right]^T. \tag{16}$$
The output for the TF weight has a dimension of $F \times 1$, namely
$$\mathbf{C}_O^{W}(n) = \left[ \tilde{w}(n,1), \ldots, \tilde{w}(n,F) \right]^T, \tag{17}$$
where $\tilde{\mathbf{r}}(n,f)$ and $\tilde{w}(n,f)$ are the DP-RTF vector and the TF weight predicted by the DNN, respectively.
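To make the layer dimensions concrete, the sketch below lists the fully connected layers described above and counts their parameters for $C = 7$ and $F = 128$, the values used later in Section 4. The function and layer names are hypothetical, and the presence of a bias in every layer is an assumption that the text does not state.

```python
def layer_shapes(C, F, hidden=2048, num_hidden=3):
    """List (name, fan_in, fan_out) for every fully connected layer."""
    shapes = []
    fan_in = 3 * C * F                      # length of the input vector
    for i in range(num_hidden):
        shapes.append((f"hidden_{i + 1} (ReLU)", fan_in, hidden))
        fan_in = hidden
    # Two output heads that share the last hidden layer.
    shapes.append(("dp_rtf_head (tanh)", hidden, 3 * F))
    shapes.append(("weight_head (sigmoid)", hidden, F))
    return shapes


def count_parameters(shapes):
    """Weights plus one bias per output unit (biases are an assumption)."""
    return sum(fan_in * fan_out + fan_out for _, fan_in, fan_out in shapes)


shapes = layer_shapes(C=7, F=128)
total = count_parameters(shapes)
for name, fan_in, fan_out in shapes:
    print(f"{name}: {fan_in} -> {fan_out}")
```

The output activations match the ranges of their targets: tanh produces values in $[-1, 1]$, the range of the normalised DP-RTF vector, and the sigmoid produces values in $[0, 1]$ for the TF weight.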
For network training, one commonly used loss function is the MSE between the output and the target. This loss function treats all frequencies equally, which is not suitable for the present DP-RTF enhancement problem. The TF bins where the source signal is silent or greatly contaminated by noise or reverberation provide unreliable localization cues, and hence should be disregarded in an indirect manner. To tackle this problem, we add an adaptive weighting scheme to the loss function, so that the enhanced DP-RTF vector and the TF reliability weight can be jointly learned. The loss function is then defined as
$$L(n) = \frac{1}{3F} \sum_{f=1}^{F} \left\| \tilde{w}(n,f) \left( \tilde{\mathbf{r}}(n,f) - \mathbf{r}(f,\theta) \right) \right\|^2 + \lambda \frac{1}{F} \sum_{f=1}^{F} \left| 1 - \tilde{w}(n,f) \right|, \tag{18}$$
where $\mathbf{r}(f,\theta)$ is the ground-truth DP-RTF vector, which is free of the effect of acoustic interferences and is hence utilized as the training target of $\tilde{\mathbf{r}}(n,f)$. Here, $\lambda$ denotes the regularization factor. The regularization term is included to avoid the trivial solution of zero weights, and to guarantee that a sufficient number of TF bins are employed for training. Using this weighted MSE, the DP-RTF learning disregards those TF bins that cannot provide accurate DP-RTF estimates.
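A single-frame evaluation of the loss in Equation (18) can be sketched in plain Python as below. The function name `weighted_loss` and the toy values are illustrative assumptions; an actual training setup would compute the same expression on tensors inside a deep learning framework.

```python
def weighted_loss(pred, weights, target, lam=0.01):
    """Weighted MSE plus weight regularisation for one frame.

    pred[f], target[f]: 3-element DP-RTF vectors of band f
    weights[f]        : predicted TF weight of band f, in [0, 1]
    lam               : regularisation factor (lambda)
    """
    F = len(target)
    mse = sum(
        (w * (p - t)) ** 2
        for r_pred, w, r_true in zip(pred, weights, target)
        for p, t in zip(r_pred, r_true)
    ) / (3 * F)
    reg = lam * sum(abs(1.0 - w) for w in weights) / F
    return mse + reg


target = [[0.0, 0.0, 1.0]]
poor_prediction = [[0.5, 0.0, 1.0]]

trusting = weighted_loss(poor_prediction, [1.0], target)    # full error, no penalty
ignoring = weighted_loss(poor_prediction, [0.0], target)    # no error, penalty lam
```

The two calls show the trade-off the network faces for each band: keeping the weight at one pays the full reconstruction error, while lowering it to zero removes that error at a cost of $\lambda$, so bands with large unavoidable errors are the ones that end up down-weighted.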
In the test stage, the trained DNN predicts a single-frame full-band estimate of $\tilde{\mathbf{r}}(n,f)$ and $\tilde{w}(n,f)$. The DOA of the sound source is estimated according to Equation (8), using the enhanced DP-RTFs and TF weights of $N$ time frames (and $F$ frequency bands for each frame) in place of $\hat{\mathbf{r}}(n,f)$ and $\hat{w}(n,f)$.
4 | EXPERIMENTS AND DISCUSSIONS
In this section, the performance of the proposed method is evaluated. We first give the details of the experimental setup, and then present the experimental results and discussions.
4.1 | Experimental setup
4.1.1 | Simulated data
We simulate different room configurations using the image method [20], which is implemented by the Roomsim toolbox [21]. Four acoustic configurations are generated for training and three for test, as shown in Table 1. All experiments are carried out with binaural microphones whose shadow effect is significant. As illustrated in Figure 3, the speech sound source is located in the same horizontal plane as the binaural microphones, and the candidate source directions range from −90° to 90° with an interval of 5°. The binaural room impulse response (BRIR) is generated using the Roomsim toolbox [21] and the head-related impulse response of the KEMAR dummy head [22]. Speech recordings from the TIMIT dataset (see footnote 1) are truncated to form speech segments of 0.5 s duration. These segments are used as source signals and are divided into training, validation and test sets. White, Babble and Factory noise files from the NOISEX-92 database [24] are employed as noise signals. The noise segments of each type are likewise divided into training, validation and test sets. A diffuse noise field [23] is generated using these noise segments. The sensor signals are created by first convolving the source signals with the BRIRs, and then adding the scaled diffuse noise to the convolved signals according to a given signal-to-noise ratio (SNR).
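The noise scaling step can be sketched as follows. This is a single-channel illustration under the assumption that the SNR is defined as the ratio of the mean powers of the reverberant speech and the scaled noise; the text does not state how the SNR is measured across the two channels, and the function names are hypothetical.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)


def noise_gain(speech, noise, snr_db):
    """Gain g such that power(speech) / power(g * noise) equals the target SNR."""
    target_ratio = 10.0 ** (snr_db / 10.0)
    return math.sqrt(mean_power(speech) / (mean_power(noise) * target_ratio))


def mix(speech, noise, snr_db):
    """Add the scaled noise to the (already convolved) speech signal."""
    g = noise_gain(speech, noise, snr_db)
    return [s + g * v for s, v in zip(speech, noise)]


# Toy signals: speech power 1, noise power 4, target SNR 20 dB.
speech = [1.0, -1.0, 1.0, -1.0]
noise = [2.0, 2.0, -2.0, -2.0]
g = noise_gain(speech, noise, 20.0)
mixture = mix(speech, noise, 20.0)
```

In the toy example the noise must be attenuated so that its power is one hundredth of the speech power, which gives a gain of 0.05.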
4.1.2 | Parameter settings and evaluation metrics
The sampling rate of the binaural signals used for localization is 16 kHz. The binaural signals are divided into frames using a window of 32 ms with a frame shift of 16 ms. The frequency range from 0 to 4 kHz is used for localization ($F = 128$). The maximum value of the IID, $\Delta I_{\max}$, is set to 20. The DNN model is trained using the Adam optimizer with a learning rate of 0.001. The accuracy of DOA estimation is assessed by the mean absolute error (MAE) and the localization accuracy. The MAE is defined as the average absolute error between the estimated and the ground-truth DOAs over different test instances. The localization accuracy considers a prediction to be correct if the difference between the DOA estimate and the true DOA is less than or equal to 5°.
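The arithmetic behind these settings and the two metrics can be sketched as below. The function names are hypothetical, and whether the 128 bins include the DC bin is not stated in the text, so the sketch only counts the bins spanning 0 to 4 kHz.

```python
def stft_settings(fs_hz=16000, window_ms=32, shift_ms=16, f_max_hz=4000):
    """Derive frame length, hop size, bin spacing and number of used bins."""
    window = fs_hz * window_ms // 1000        # samples per frame
    shift = fs_hz * shift_ms // 1000          # samples per frame shift
    bin_hz = fs_hz / window                   # frequency resolution in Hz
    bins = int(f_max_hz / bin_hz)             # bins spanning 0 to f_max_hz
    return window, shift, bin_hz, bins


def mae_and_accuracy(estimates_deg, truths_deg, tolerance_deg=5.0):
    """Mean absolute error (degrees) and accuracy (%) within the tolerance."""
    errors = [abs(e - t) for e, t in zip(estimates_deg, truths_deg)]
    mae = sum(errors) / len(errors)
    acc = 100.0 * sum(err <= tolerance_deg for err in errors) / len(errors)
    return mae, acc


window, shift, bin_hz, bins = stft_settings()
mae, acc = mae_and_accuracy([0.0, 10.0, -30.0], [0.0, 5.0, -10.0])
```

With a 512-sample window at 16 kHz the bin spacing is 31.25 Hz, so 128 bins cover the band up to 4 kHz, which is consistent with $F = 128$. In the toy metric example, two of the three estimates fall within the 5° tolerance.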
4.2 | Experimental results
4.2.1 | Influence of the IID information
To investigate the influence of exploiting the IID, we compare the MAE obtained using only the enhanced IPD with that obtained using both the enhanced IID and IPD in rooms of different sizes. The experiments are carried out in rooms with different levels of reverberation and noise: the RT60 is 0.2, 0.4, 0.6 or 0.8 s with the SNR set to 5 dB, and the SNR is −5, 0, 10 or 15 dB with the RT60 set to 0.6 s. The experimental results presented in Table 2 are averaged over these acoustic conditions. It can be seen that the IPD + IID method performs better than the IPD method in Room 5 and Room 6, while it is slightly worse in Room 7. Overall, with the IID information, the average MAE of DOA estimation is slightly reduced (from 17.66° to 17.51°), which supports incorporating the IID information in the present framework for binaural sound source localization.
4.2.2 | Influence of the temporal context information
We set the number of context time frames $C$ to different values and give the MAE of DOA estimation in rooms of different sizes in Table 3. The experimental data are the same as those used in Section 4.2.1. It can be seen that the MAE of DOA estimation decreases as $C$ increases from 1 to 7, and is higher for all tested values of $C$ larger than 7. Hence, $C$ is set to 7 in the following experiments, as it gives the lowest MAE in all three test rooms.
FIGURE 2 Architecture for the DP-RTF enhancement network. The input is the DP-RTF vector contaminated by noise and reverberation, and the output is the enhanced DP-RTF vector and the TF weight

Footnote 1: https://catalog.ldc.upenn.edu/ldc93s1
4.2.3 | Influence of the TF weighting scheme
The DOA estimation results without and with the TF weighting scheme are presented in Table 4. The experimental data are the same as those used in Section 4.2.1. For DOA estimation without the TF weighting scheme, the TF weight $\hat{w}(n,f)$ is set to one. For the loss with the weighting scheme, the regularization factor $\lambda$ is set to 0.01 and 0.5, respectively. The results show that, compared with DOA estimation without the TF weighting scheme, using this weight achieves a smaller MAE, which verifies the effectiveness of the weighting scheme. Besides, with the larger regularization factor, the average MAE of DOA estimation becomes slightly higher (8.75° versus 8.67°). This is because a stronger regularization penalty pushes the TF weight closer to one, which yields a performance similar to that without the TF weighting scheme. In the following experiments, $\lambda$ is set to 0.01.
4.2.4 | Robustness evaluation
To illustrate the effectiveness of the proposed DP-RTF enhancement network, we estimate the DP-RTF without and with the enhancement network, and plot the phase and amplitude of the DP-RTF estimates as a function of frequency bin in Figure 4. Each plotted phase or amplitude corresponds to one DP-RTF estimate in a certain TF bin. We use the DP-RTF estimates of 31 frames for evaluation. For the method without the enhancement network, the TF weight is set to one. It can be observed that the phase and amplitude of the DP-RTF estimated without the enhancement network are scattered, while those obtained with the enhancement network are clustered around the ground-truth lines. The proposed enhancement network provides more accurate DP-RTF estimates, which shows its ability to preserve the direct-path localization cues while reducing the effect of noise and reverberation.
The proposed method is compared with three other methods: IPDNN [6], RTF and RTFCT [13]. The IPDNN method uses a four-layer DNN to map the contaminated IPD features to the corresponding clean ones. The RTF method directly uses the contaminated DP-RTF features for DOA estimation (namely the pipeline in Figure 1a), with the TF weight set to one. The RTFCT method also follows the pipeline in Figure 1a, but sets the TF weight by a coherence test, which is used to select direct-path dominated TF bins. For a fair comparison, all the compared methods estimate the DOA by finding the optimal match between the enhanced feature and the template features of all candidate directions, following the principle of the proposed method. The comparison is carried out in environments with different levels of noise and reverberation. Tables 5 and 6 show the MAEs and localization accuracies of the four methods in Room 6 under various SNR and RT60 conditions, respectively. Each test signal segment used for DOA estimation has a duration of 0.5 s. It can be seen that the proposed method with $C = 7$ outperforms RTF and RTFCT in all cases, and that IPDNN does so in all cases except for the MAE at the lowest SNR, which demonstrates the superiority of DNN based methods for enhancing the localization feature. The proposed method performs better than the IPDNN method. This is because the proposed method incorporates the IID and temporal context information into the feature estimation, which helps to improve the robustness of DOA estimation. Compared with the RTF method, both RTFCT and our method add a DP-RTF enhancement process, but our method achieves a lower MAE than RTFCT. It can be concluded that employing all data to enhance the DP-RTF is more beneficial than employing only the data selected by the coherence test. Besides, the RTFCT method applies a hard selection of TF bins, while the proposed method applies a soft weight, which is a better TF weighting scheme.

FIGURE 3 Illustration for the candidate directions of sound sources

TABLE 1 Room configuration for training and test data (ranges are given as start: step: end)

| Dataset | Room label | Room size (m³) | Array centre (m) | Distance (m) | RT60 (s) | SNR (dB) |
|---|---|---|---|---|---|---|
| Training | 1 | 7.0 × 8.0 × 5.0 | (3.00, 3.50, 1.70) | 1.50: 0.50: 3.00, 3.40 | 0: 0.17: 0.85 | −5: 5: 20 |
| Training | 2 | 6.0 × 6.0 × 3.5 | (3.50, 3.00, 1.65) | 1.75, 2.25 | 0: 0.22: 0.88 | −5: 5: 20 |
| Training | 3 | 4.0 × 5.5 × 3.0 | (2.50, 2.50, 1.40) | 0.50, 1.00 | 0: 0.25: 0.75 | −5: 5: 20 |
| Training | 4 | 3.8 × 3.0 × 2.5 | (1.20, 1.45, 1.55) | 0.75, 1.25 | 0: 0.3: 0.9 | −5: 5: 20 |
| Test | 5 | 6.0 × 8.0 × 3.8 | (2.00, 4.00, 1.65) | 0.60: 0.90: 3.30 | 0.2: 0.2: 0.8 | −5: 5: 20 |
| Test | 6 | 5.0 × 7.0 × 3.0 | (2.50, 3.00, 1.50) | 0.70, 1.40, 2.10 | 0.2: 0.2: 0.8 | −5: 5: 20 |
| Test | 7 | 4.0 × 4.0 × 2.7 | (1.80, 1.70, 1.60) | 0.80, 1.30 | 0.2: 0.2: 0.8 | −5: 5: 20 |

TABLE 2 Influence of the IID information under different rooms (C = 1); MAE in degrees

| Method | Room 5 | Room 6 | Room 7 | AVG. |
|---|---|---|---|---|
| IPD [6] | 14.28 | 18.47 | 20.23 | 17.66 |
| DP-RTF (IPD + IID) | 13.93 | 17.97 | 20.64 | 17.51 |

TABLE 3 Influence of the temporal context information under different rooms; MAE in degrees

| Temporal context | Room 5 | Room 6 | Room 7 | AVG. |
|---|---|---|---|---|
| C = 1 | 13.93 | 17.97 | 20.64 | 17.51 |
| C = 3 | 9.18 | 11.99 | 13.45 | 11.54 |
| C = 5 | 8.43 | 10.78 | 11.87 | 10.36 |
| C = 7 | 7.35 | 9.40 | 10.38 | 9.04 |
| C = 9 | 7.66 | 9.95 | 11.16 | 9.59 |
| C = 11 | 7.64 | 9.88 | 11.20 | 9.57 |
| C = 13 | 7.86 | 10.02 | 10.64 | 9.51 |
| C = 15 | 8.42 | 10.64 | 11.99 | 10.35 |

TABLE 4 Influence of the TF weighting scheme under different rooms; MAE in degrees

| Method | Room 5 | Room 6 | Room 7 | AVG. |
|---|---|---|---|---|
| w/o weighting | 7.35 | 9.40 | 10.38 | 9.04 |
| w/ weighting (λ = 0.01) | 7.05 | 8.98 | 9.99 | 8.67 |
| w/ weighting (λ = 0.5) | 7.04 | 9.01 | 10.2 | 8.75 |
FIGURE 4 Phase and amplitude of the DP-RTF as a function of frequency bins under a typical acoustic condition with RT60 = 600 ms and SNR = 5 dB (babble noise) in Room 6. The DP-RTF is estimated without the DP-RTF enhancement network in (a) and (c), and with the DP-RTF enhancement network in (b) and (d). The sound source is located at 0° in (a) and (b), and at 30° in (c) and (d). The distance from the sound source to the centre of the microphone array is 2.1 m
5 | CONCLUSION
This article proposes a DP-RTF enhancement network for sound source localization under adverse acoustic conditions. Considering the complex non-linear process by which noise and reverberation are generated and mixed, we utilize a DNN to model the non-linear regression that discriminates the clean DP-RTFs from the contaminated ones. For training, a novel loss function composed of a weighted MSE loss for the DP-RTF and a TF-weight regularization term is proposed, to account for the fact that only part of the TF bins contain reliable localization information due to the TF sparsity of the (speech) source signal. Experiments with binaural microphones verify the robustness of our method for DOA estimation, especially in scenarios with high levels of noise and reverberation. In this work, we focus on the concept of DP-RTF enhancement and TF weight estimation, and adopt a generic DNN model. As future work, this model can be replaced with a more advanced network structure, such as a recurrent neural network.
ACKNOWLEDGEMENTS
This work is supported by the National Natural Science Foundation of China (No. 61673030, U1613209) and the Science and Technology Plan Project of Shenzhen (No. JCYJ20200109140410340).
ORCID
Bing Yang https://orcid.org/0000-0002-8978-2322
REFERENCES
1. Talmon, R., Cohen, I., Gannot, S.: Supervised source localization using diffusion kernels. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 245–248. (2011)
2. Vecchiotti, P. et al.: End-to-end binaural sound localisation from the raw waveform. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 451–455. (2019)
3. Ma, N., May, T., Brown, G.J.: Exploiting deep neural networks and head movements for robust binaural localization of multiple sources in reverberant environments. IEEE Trans. Audio Speech Lang. Process. 25(12), 2444–2453 (2017)
4. Chakrabarty, S., Habets, E.A.P.: Multi-speaker DOA estimation using deep convolutional networks trained with noise signals. IEEE J. Sel. Top. Sig. Process. 13(1), 8–21 (2019)
5. Nguyen, T.N.T. et al.: Robust source counting and DOA estimation using spatial pseudo-spectrum and convolutional neural network. IEEE Trans. Audio Speech Lang. Process. 28, 2626–2637 (2020)
6. Pak, J., Shin, J.W.: Sound localization based on phase difference enhancement using deep neural networks. IEEE Trans. Audio Speech Lang. Process. 27(8), 1335–1345 (2019)
7. Tang, D., Taseska, M., van Waterschoot, T.: Supervised contrastive embeddings for binaural source localization. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 358–362. (2019)
8. Knapp, C.H., Carter, G.C.: The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 24(4), 320–327 (1976)
9. Zhang, W., Rao, B.D.: A two microphone-based approach for source localization of multiple speech sources. IEEE Trans. Audio Speech Lang. Process. 18(8), 1913–1928 (2010)
10. Braun, S., Zhou, W., Habets, E.A.P.: Narrowband direction-of-arrival estimation for binaural hearing aids using relative transfer functions. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), pp. 1–5. (2015)
TABLE 6  Performance of different methods under various reverberation conditions (Room 6, SNR = 5 dB)
                                MAE (degrees)                      ACC (%)
Method                          0.2 s  0.4 s  0.6 s  0.8 s   AVG.  0.2 s  0.4 s  0.6 s  0.8 s   AVG.
IPDNN [6]                        4.79  11.43  17.64  21.79  13.91  89.73  76.58  66.44  60.68  73.36
RTF                             22.91  30.87  34.30  35.66  30.94  55.21  40.03  33.36  30.26  39.72
RTF-CT [13]                     20.20  28.98  32.35  34.02  28.89  60.42  44.05  36.71  33.42  43.65
Proposed (C=1, w/o weighting)    6.00  11.73  17.23  20.63  13.89  89.13  77.84  67.61  61.23  73.95
Proposed (C=7, w/o weighting)    2.85   5.92   8.78   9.82   6.84  95.03  87.54  81.55  78.20  85.58
Proposed (C=7, w/ weighting)     2.86   5.76   8.25   9.36   6.56  94.89  88.47  82.34  78.81  86.13
TABLE 5  Performance of different methods under various noise conditions (Room 6, RT60 = 0.6 s)
                                MAE (degrees)                      ACC (%)
Method                          15 dB  10 dB   0 dB  −5 dB   AVG.  15 dB  10 dB   0 dB  −5 dB   AVG.
IPDNN [6]                        6.61  10.61  29.27  45.61  23.03  85.08  78.35  50.93  33.05  61.85
RTF                             23.69  28.49  38.42  41.14  32.94  51.17  42.73  25.96  20.63  35.12
RTF-CT [13]                     20.96  26.30  37.12  40.32  31.18  57.70  48.02  28.14  21.88  38.94
Proposed (C=1, w/o weighting)    5.78   9.78  28.90  43.70  22.04  87.67  79.37  50.48  32.99  62.63
Proposed (C=7, w/o weighting)    3.47   4.98  14.33  25.07  11.96  91.85  88.30  70.75  53.59  76.12
Proposed (C=7, w/ weighting)     3.38   4.72  13.68  23.82  11.40  92.15  88.99  71.43  54.32  76.72
11. Wang, Z. et al.: Semi-supervised learning with deep neural networks for
relative transfer function inverse regression. In: IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 191–195. (2018)
12. Yang, B., Liu, H., Pang, C.: Multiple sound source counting and
localization based on spatial principal eigenvector. In: Annual Conference of
the International Speech Communication Association (INTERSPEECH),
pp. 1924–1928. (2017)
13. Mohan, S. et al.: Localization of multiple acoustic sources with small
arrays using a coherence test. J. Acoust. Soc. Am. 123(4), 2136–2147
(2008)
14. Pavlidi, D. et al.: Real-time multiple sound source localization and
counting using a circular microphone array. IEEE Trans. Audio Speech
Lang. Process. 21(10), 2193–2206 (2013)
15. Nadiri, O., Rafaely, B.: Localization of multiple speakers under high
reverberation using a spherical microphone array and the direct-path
dominance test. IEEE Trans. Audio Speech Lang. Process. 22(10),
1494–1505 (2014)
16. Madmoni, L., Rafaely, B.: Direction of arrival estimation for reverberant
speech based on enhanced decomposition of the direct sound. IEEE
Trans. Audio Speech Lang. Process. 13(1), 131–142 (2019)
17. Pertila, P., Cakir, E.: Robust direction estimation with convolutional
neural networks based steered response power. In: IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP),
pp. 6125–6129. (2017)
18. Wang, Z.Q., Zhang, X., Wang, D.: Robust speaker localization guided by
deep learning-based time-frequency masking. IEEE Trans. Audio Speech
Lang. Process. 27(1), 178–188 (2019)
19. Li, X. et al.: Estimation of the direct-path relative transfer function for
supervised sound-source localization. IEEE Trans. Audio Speech Lang.
Process. 24(11), 2171–2186 (2016)
20. Allen, J.B., Berkley, D.A.: Image method for efficiently simulating small
room acoustics. J. Acoust. Soc. Am. 65(4), 943–950 (1979)
21. Campbell, D.R., Palomaki, K.J., Brown, G.J.: A MATLAB simulation of
shoebox room acoustics for use in research and teaching. Comput.
Inform. Syst. J. 9(3), 48–51 (2005)
22. Gardner, W.G., Martin, K.D.: HRTF measurements of a KEMAR.
J. Acoust. Soc. Am. 97(6), 3907–3908 (1995)
23. Habets, E.A.P., Gannot, S.: Generating sensor signals in isotropic noise
fields. J. Acoust. Soc. Am. 122(6), 3464–3470 (2007)
How to cite this article: Yang B, Ding R, Ban Y, Li X,
Liu H. Enhancing direct-path relative transfer function
using deep neural network for robust sound source
localization. CAAI Trans. Intell. Technol. 2021;1–9.
https://doi.org/10.1049/cit2.12024