Synthesis of Device-Independent Noise Corpora for Realistic ASR Evaluation

Hannes Gamper1, Mark R. P. Thomas1, Lyle Corbin2, Ivan Tashev1
1Microsoft Research Redmond
2Microsoft Corporation
{hagamper,lylec,ivantash}@microsoft.com, mark.r.thomas@ieee.org
Abstract
In order to effectively evaluate the accuracy of automatic speech
recognition (ASR) with a novel capture device, it is important to
create a realistic test data corpus that is representative of real-
world noise conditions. Typically, this involves either record-
ing the output of a device under test (DUT) in a noisy environ-
ment, or synthesizing an environment over loudspeakers in a
way that simulates realistic signal-to-noise ratios (SNRs), rever-
beration times, and spatial noise distributions. Here we propose
a method that aims at combining the realism of in-situ record-
ings with the convenience and repeatability of synthetic cor-
pora. A device-independent spatial recording containing noise
and speech is combined with the measured directivity pattern
of a DUT to generate a synthetic test corpus for evaluating the
performance of an ASR system. This is achieved by a spherical
harmonic decomposition of both the sound field and the DUT’s
directivity patterns. Experimental results suggest that the pro-
posed method can be a viable alternative to costly and cumber-
some device-dependent measurements. The proposed simula-
tion method predicted the SNR of the DUT response to within
about 3 dB and the word error rate (WER) to within about 20%,
across a range of test SNRs, target source directions, and noise
types.
Index Terms: automatic speech recognition, device characteri-
zation, device-related transfer function, spherical harmonics
1. Introduction
Automatic speech recognition (ASR) is an integral part of many
hardware devices, including mobile phones, game consoles and
smart televisions, to enable hands-free operation and voice con-
trol. When evaluating how robust a device’s ASR engine is to
noise and reverberation in real world settings, one must account
for the device hardware characteristics as well as typical usage
scenarios. This is normally achieved by exposing the device un-
der test (DUT) to realistic conditions in terms of environmen-
tal noise and reverberation while evaluating the performance
of the ASR engine. Such in-situ tests are extremely valuable
when tuning hardware and software parameters to maximize
ASR performance, especially if the DUT has multiple micro-
phones and the effect of microphone placement and (spatial)
speech enhancement algorithms needs to be assessed.
However, in-situ tests are lengthy and cumbersome, requir-
ing hours of recordings made on the DUT that have to be redone
whenever hardware changes are made to the DUT. Furthermore,
the exact test conditions are difficult or impossible to recreate
when attempting to evaluate the effect of a hardware change or
comparing the performance of an ASR engine across devices.
In order to overcome the limitations of in-situ tests, a pre-
recorded test corpus can be used. An ASR test corpus typi-
cally consists of a variety of scenarios–differing in the level,
type and spatial quality of the background noise–that are repre-
sentative of the conditions to which a DUT might be subjected
in real-world use. During testing, the corpus is rendered over
a loudspeaker setup and recorded through the DUT. Two ex-
amples of pre-recorded test corpus methodologies are encoded
as part of specifications by the European Telecommunications
Standards Institute (ETSI) [1, 2]. Both techniques utilize multi-
channel recordings that are played back over rigorously cali-
brated speaker systems. The systems attempt to recreate the
original sound field of the real world environment for a device
placed at the center of the playback system.
Song et al. describe a method for simulating realistic back-
ground noise to test telecommunication devices based on spatial
sound field recordings from a spherical microphone array [3].
The authors compare various methods to derive the input sig-
nals to a circular loudspeaker array delivering the spatial noise
recording to the DUT. While using a pre-recorded test corpus
has the advantage of repeatability and simpler logistics com-
pared to in-situ tests, it requires a highly specialized test en-
vironment and hardware setup for playback and recording. In
addition, emulating the complexity of a real, noisy environment,
with potentially hundreds of spatially distributed noise sources,
can be challenging.
Here we propose a method that combines the realism of
in-situ tests with the convenience and repeatability of a pre-
recorded corpus without the requirement of a specialized play-
back setup. The approach is based on a device-independent
spatial in-situ recording that is combined with the directivity
characteristics of a DUT to create a synthetic test corpus for
ASR performance and robustness evaluation. The DUT direc-
tivity characteristics can be obtained through measurements or
via acoustic simulation. In a similar fashion to the work by Song
et al. [3], the proposed method is based on a spherical harmon-
ics decomposition of a spatial noise recording obtained using a
spherical microphone array. However, by also using a spherical
harmonic decomposition of the DUT’s directivity pattern, we
show that it is possible to simulate the DUT response directly
without performing actual recordings on the DUT.
2. Proposed approach
The goal of the proposed method for generating a synthetic
test corpus is to simulate the response of a DUT to a pre-
recorded noisy environment. Section 2.1 describes the device-
independent capture and representation of a sound field, while
Section 2.2 discusses obtaining the device-dependent directivity
characteristics of the DUT. The proposed approach for combin-
ing device-independent recordings with device-dependent di-
rectivity characteristics in the spherical harmonics domain is
presented in Section 2.3.
Copyright © 2016 ISCA
INTERSPEECH 2016
September 8–12, 2016, San Francisco, USA
http://dx.doi.org/10.21437/Interspeech.2016-978
Figure 1: 64-channel spherical microphone array, allowing a
7-th order spherical harmonic approximation to the recorded
sound field.
2.1. Sound field capture and decomposition
Real noisy environments typically contain a multitude of spa-
tially distributed noise sources. In order to evaluate the ASR
performance under realistic conditions it is important to subject
the DUT to spatially diverse noise environments. Therefore, the
spatial quality of the noise environment is preserved in record-
ings used for ASR evaluation. A common way to capture a
sound field spatially is to use an array of microphones placed
on the surface of a sphere. A 64-channel example with radius
100 mm is shown in Figure 1.
Spherical harmonics provide a convenient way to describe
a sound field captured using a spherical microphone array. By
removing the scattering effect of the microphone baffle, the
free-field decomposition of the recorded sound field can be
estimated. Given the microphone signals $P(r_0, \theta, \phi, \omega)$,
where $r_0$ is the array radius, $\theta$ and $\phi$ are the microphone
colatitude and azimuth angles, respectively, and $\omega$ is the angular
frequency, the plane wave decomposition of the sound field captured
with a spherical array of $M$ microphones, distributed uniformly on
the surface of the sphere [4], is given in the spherical harmonics
domain by [5, 6]
$$\breve{S}_{nm}(\omega) = \frac{1}{b_n(kr_0)} \frac{4\pi}{M} \sum_{i=1}^{M} P(r_0, \theta_i, \phi_i, \omega)\, Y_n^{-m}(\theta_i, \phi_i), \tag{1}$$
where $k = \omega/c$ and $c$ is the speed of sound. The spherical
harmonic of order $n$ and degree $m$ is defined as
$$Y_n^m(\theta, \phi) = (-1)^m \sqrt{\frac{2n+1}{4\pi} \frac{(n-|m|)!}{(n+|m|)!}}\; P_n^{|m|}(\cos\theta)\, e^{im\phi}, \tag{2}$$
where the associated Legendre function $P_n^m$ represents standing
waves in $\theta$ and $e^{im\phi}$ represents travelling waves in $\phi$.
Note that the Condon-Shortley phase convention is used such that
$Y_n^{m*}(\theta, \phi) = Y_n^{-m}(\theta, \phi)$ [7].
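As an illustration of (1) and (2), the following Python sketch evaluates the spherical harmonics and applies the equal-weight sum of (1) to a small test layout. It rests on several assumptions that are not part of the paper: the associated Legendre function is taken without the Condon-Shortley factor (since (2) carries the factor (-1)^m explicitly), the twelve vertices of an icosahedron stand in for the uniform microphone layout of [4], and the mode strength is passed in as a callable that defaults to 1, so only the quadrature sum is exercised. All function names are hypothetical.

```python
import cmath
import math


def assoc_legendre(n, m, x):
    """P_n^m(x) for 0 <= m <= n, without the Condon-Shortley factor (assumption)."""
    pmm = 1.0
    root = math.sqrt(max(0.0, 1.0 - x * x))
    for k in range(1, m + 1):
        pmm *= (2 * k - 1) * root            # P_m^m = (2m-1)!! (1-x^2)^(m/2)
    if n == m:
        return pmm
    prev, cur = pmm, x * (2 * m + 1) * pmm   # P_m^m and P_{m+1}^m
    for l in range(m + 2, n + 1):            # three-term recurrence in the order
        prev, cur = cur, (x * (2 * l - 1) * cur - (l + m - 1) * prev) / (l - m)
    return cur


def sph_harm(n, m, theta, phi):
    """Y_n^m as in (2); theta is the colatitude and phi the azimuth, in radians."""
    am = abs(m)
    norm = math.sqrt((2 * n + 1) / (4 * math.pi)
                     * math.factorial(n - am) / math.factorial(n + am))
    legendre = assoc_legendre(n, am, math.cos(theta))
    return (-1) ** am * norm * legendre * cmath.exp(1j * m * phi)


def icosahedron_directions():
    """Twelve (colatitude, azimuth) pairs; a stand-in for a uniform array layout."""
    g = (1 + math.sqrt(5)) / 2
    r = math.sqrt(1 + g * g)
    points = []
    for a in (-1, 1):
        for b in (-g, g):
            points += [(0, a, b), (a, b, 0), (b, 0, a)]
    return [(math.acos(z / r), math.atan2(y, x)) for x, y, z in points]


def decompose(pressures, directions, order, mode_strength=lambda n: 1.0):
    """Equal-weight sum of (1) for one frequency bin; returns {(n, m): coefficient}.
    A real array would pass b_n(k r0) from (3) as mode_strength."""
    weight = 4 * math.pi / len(directions)
    coeffs = {}
    for n in range(order + 1):
        for m in range(-n, n + 1):
            # conj(Y_n^m) equals Y_n^{-m} under the phase convention noted after (2)
            acc = sum(p * sph_harm(n, m, th, ph).conjugate()
                      for p, (th, ph) in zip(pressures, directions))
            coeffs[(n, m)] = weight * acc / mode_strength(n)
    return coeffs


# Round trip: synthesize an order-2 field from known coefficients and recover them.
true_coeffs = {(0, 0): 1.0 + 0.5j, (1, -1): 0.3j, (1, 1): -0.7,
               (2, 0): 0.2, (2, 2): 0.4 - 0.1j}
directions = icosahedron_directions()
pressures = [sum(c * sph_harm(n, m, th, ph) for (n, m), c in true_coeffs.items())
             for th, ph in directions]
recovered = decompose(pressures, directions, order=2)
```

With only twelve sampling points the round trip is exact up to order 2; a denser layout, such as the 64-channel array of Figure 1, is needed for higher orders.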
Figure 2: DUT directivity patterns: measured (top) and simulated
through 7th-order spherical microphone array (middle) and ideal
7th-order rigid spherical scatterer (bottom).
In the case of a spherical scatterer, the mode strength $b_n(kr_0)$
is defined for an incident plane wave as
$$b_n(kr_0) = 4\pi i^n \left( j_n(kr_0) - \frac{j_n'(kr_0)}{h_n^{(2)\prime}(kr_0)}\, h_n^{(2)}(kr_0) \right), \tag{3}$$
where $j_n(kr_0)$ is the spherical Bessel function of degree $n$,
$h_n^{(2)}(kr_0)$ is the spherical Hankel function of the second kind
of degree $n$, and $(\cdot)'$ denotes differentiation with respect to
the argument. The mode strength term in (1) is necessary to account
for the scattering effect of the spherical baffle in order to
obtain a plane-wave decomposition of the sound field.
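A minimal numerical sketch of (3) follows, assuming a rigid sphere and using only the Python standard library. The spherical Bessel functions are generated by upward recurrence, an implementation shortcut that loses accuracy when the argument is much smaller than the order. The function names and the example value of kr0 are hypothetical.

```python
import math


def spherical_jn_yn(n_max, x):
    """Spherical Bessel functions j_n(x) and y_n(x) for n = 0..n_max, n_max >= 1.
    Upward recurrence: adequate when x is not much smaller than n_max."""
    j = [math.sin(x) / x, math.sin(x) / x**2 - math.cos(x) / x]
    y = [-math.cos(x) / x, -math.cos(x) / x**2 - math.sin(x) / x]
    for n in range(1, n_max):
        j.append((2 * n + 1) / x * j[n] - j[n - 1])
        y.append((2 * n + 1) / x * y[n] - y[n - 1])
    return j[: n_max + 1], y[: n_max + 1]


def mode_strength_rigid(n, kr):
    """Mode strength b_n(kr) of (3) for a plane wave incident on a rigid sphere."""
    j, y = spherical_jn_yn(n + 1, kr)
    h = [complex(a, -b) for a, b in zip(j, y)]   # h_n^(2) = j_n - i y_n
    dj = n / kr * j[n] - j[n + 1]                # f_n'(x) = (n/x) f_n(x) - f_{n+1}(x)
    dh = n / kr * h[n] - h[n + 1]
    return 4 * math.pi * (1j ** n) * (j[n] - dj / dh * h[n])


# Mode strength of orders 0..3 at kr0 = 2 (an arbitrary example value).
example = [mode_strength_rigid(n, 2.0) for n in range(4)]
```

At low frequencies the order-0 term approaches 4π, while the higher-order terms become small, so the division by the mode strength in (1) amplifies those orders strongly.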
2.2. Characterising the device under test (DUT)
Under the assumption of linearity and time invariance, the re-
sponse of the DUT to an input signal is given by a transfer func-
tion describing the acoustic path from the sound source to the
microphone. In far field conditions, that is, when the source is
further than approximately one meter from the DUT, this trans-
fer function varies spectrally with source azimuth and eleva-
tion, whereas the effect of source distance is mostly limited to
the signal gain. Therefore, the directivity characteristics of the
DUT can be approximated through transfer function measure-
ments at a single distance, spanning the range of azimuth and
elevation angles of interest. Figure 2 (top) shows the directivity
patterns of one microphone of a Kinect device [8]. Alterna-
tively, acoustic simulation can be used to estimate these trans-
fer functions. Due to the similarity of these direction-dependent
transfer functions to the concept of head-related transfer
functions (HRTFs) in the field of spatial audio rendering [9], we
refer to the direction-dependent DUT transfer functions as the
device-related transfer functions (DRTFs).
Figure 3: Experimental setup. [Diagram labels: Kinect; Spherical
decomp. (1); DUT DRTF application (7); "Simulation"; "Reference".]
In analogy to spherical microphone array recordings,
DRTFs measured at points uniformly distributed over the sphere
can be decomposed using spherical harmonics:
$$\breve{D}_{nm}(\omega) = \frac{4\pi}{N} \sum_{i=1}^{N} D(\theta_i, \phi_i, \omega)\, Y_n^{-m}(\theta_i, \phi_i), \tag{4}$$
where $N$ is the number of DRTFs and $D(\theta_i, \phi_i, \omega)$ are
the DRTFs as a function of the measurement colatitude and azimuth
angles of arrival, $\theta_i$ and $\phi_i$. In cases where the DRTF
measurement points do not cover the whole sphere or are not uniformly
distributed, a least-squares decomposition can be used [10].
2.3. Combining sound field recordings and DUT directivity
To simulate the DUT behavior in the recorded noise environ-
ment, the device-related transfer functions (DRTFs) of the DUT
are applied to the spherical array recording. This can be conve-
niently performed in the spherical harmonics domain, in anal-
ogy to applying head-related transfer functions to a recording
for binaural rendering [11]. An aperture weighting function de-
rived from the DRTFs is applied to the estimated free-field de-
composition of the recorded sound field. The sound pressure at
each microphone of the DUT is then found by integrating the
DRTF-weighted pressure over the sphere:
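\begin{align}
P(\omega) &= \int_{\Omega \in S^2} S(\Omega, \omega)\, D(\Omega, \omega)\, d\Omega \tag{5} \\
&= \sum_{n=0}^{\infty} \sum_{n'=0}^{\infty} \sum_{m=-n}^{n} \sum_{m'=-n'}^{n'} \breve{S}_{nm}(\omega)\, \breve{D}_{n'm'}(\omega) \int_{\Omega \in S^2} Y_n^m(\Omega)\, Y_{n'}^{m'}(\Omega)\, d\Omega \tag{6} \\
&= \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \breve{S}_{nm}(\omega)\, \breve{D}_{n,-m}(\omega). \tag{7}
\end{align}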
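The step from (5) to (7) can be checked numerically. The sketch below, restricted to order 1 for brevity, contracts two hypothetical coefficient sets as in (7), pairing each sound field coefficient of degree m with the DRTF coefficient of degree -m, and compares the result with a direct quadrature of (5). The closed-form harmonics follow the convention of (2); the coefficient values, function names, and quadrature rule are illustrative assumptions.

```python
import cmath
import math


def simulate_dut_pressure(S, D):
    """Contraction of (7) for one frequency bin: sum of S[(n, m)] * D[(n, -m)]."""
    return sum(s * D.get((n, -m), 0.0) for (n, m), s in S.items())


def sh_low_order(n, m, theta, phi):
    """Closed-form Y_n^m for n <= 1 in the convention of (2)."""
    if n == 0:
        return complex(1 / math.sqrt(4 * math.pi))
    if m == 0:
        return complex(math.sqrt(3 / (4 * math.pi)) * math.cos(theta))
    return -math.sqrt(3 / (8 * math.pi)) * math.sin(theta) * cmath.exp(1j * m * phi)


def expand(coeffs, theta, phi):
    """Directional function represented by spherical harmonic coefficients."""
    return sum(c * sh_low_order(n, m, theta, phi) for (n, m), c in coeffs.items())


def integrate_product(S, D, n_phi=8):
    """Quadrature of (5): 2-point Gauss-Legendre in cos(theta), equal steps in phi.
    This rule is exact for products of two order-1 expansions."""
    total = 0.0
    for x in (-1 / math.sqrt(3), 1 / math.sqrt(3)):   # Gauss-Legendre nodes, weight 1
        theta = math.acos(x)
        for k in range(n_phi):
            phi = 2 * math.pi * k / n_phi
            total += expand(S, theta, phi) * expand(D, theta, phi) * (2 * math.pi / n_phi)
    return total


# Hypothetical coefficients for one frequency bin.
S = {(0, 0): 1 + 2j, (1, -1): 0.5j, (1, 0): -1.0, (1, 1): 2 - 1j}
D = {(0, 0): 0.3, (1, -1): 1j, (1, 0): 0.7, (1, 1): -0.2 + 0.1j}
p_sh = simulate_dut_pressure(S, D)     # spherical harmonics domain, as in (7)
p_quad = integrate_product(S, D)       # direct integration over the sphere, as in (5)
```

The two values agree to rounding error, which reflects the orthogonality of the spherical harmonics that collapses the quadruple sum of (6) into the double sum of (7).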
3. Experimental evaluation
The experimental setup (see Figure 3) consisted of a spheri-
cal microphone array (see Figure 1) and a Kinect device as
the DUT. Impulse responses of both the spherical array and the
DUT were measured in an anechoic chamber for 400 directions
at a radius of one meter, using a setup described by Bilinski
et al. [12]. For the resulting DUT DRTFs, extrapolation was
used to cover the whole sphere [10]. Figure 2 (top) illustrates
the directivity patterns of one microphone of the DUT. Figure 2
(middle) depicts the directivity patterns equivalent to applying
the DUT DRTFs to a spherically isotropic sound field recorded
via the spherical array. Each point in the directivity patterns in
Figure 2 (middle) is obtained by decomposing the $N$ spherical
array impulse responses for that direction using (1) and applying
the DUT DRTF via (7). As the spherical array shown in Figure 1
does not behave like an ideal scatterer with ideal microphones,
the sound field decomposition is imperfect and the resulting
equivalent DUT directivity is slightly distorted compared
to the actual, measured DUT DRTF. As shown in Figure 2 (bottom),
this distortion is largely corrected when replacing the real
array impulse responses in the simulation with those of an ideal
scatterer [6]. A follow-up study [15] addresses the discrepancy
between the real and ideal scatterers through calibration of the
array and by deriving optimal scatterer removal functions [13, 14].
Figure 4: Geometric layout of noise sources (black dots) and
speech sources (red dots) at 5.6 degrees azimuth and 0 degrees
elevation (a), 63.7 degrees azimuth and -10.4 degrees elevation
(b), -84.4 degrees azimuth and 0 degrees elevation (c), and
172.1 degrees azimuth and 44.7 degrees elevation (d).
Figure 5: SNR difference between simulation and reference, for
brown noise (left) and market noise (right). Labels a–d indicate
the speech source locations labelled a–d in Figure 4. [Plot: SNR
difference [dB] versus target SNR [dB], one curve per speech
source location a–d.]
Simulations were performed combining speech recordings
with simulated and recorded spatial noise. The speech corpus
consisted of 2000 utterances containing 14146 words recorded
by 50 male and female speakers in a quiet environment, for a to-
tal duration of 2.5 hours. Speech recognition was performed us-
ing a DNN based ASR engine [16] with acoustic models trained
on the clean speech corpus. Two types of noise were used: 60
seconds of random Gaussian noise with a frequency roll-off of
6 dB per octave (i.e., brown noise), and a 60-second recording
of ambient noise in a busy outdoor market place, obtained
with the spherical microphone array. For the experiments, the
spherical harmonics decomposition of the recorded sound field
was evaluated at 16 directions, as shown in Figure 4, to emulate
playback over 16 spatially distributed virtual loudspeakers. The
number of virtual speakers was chosen as a trade-off between
spatial fidelity and computational complexity.
Figure 6: Word error rates (WERs) for brown noise. Labels a–d
indicate the speech source locations labelled a–d in Figure 4.
[Plot: WER [%] versus target SNR [dB] in panels a) to d), with
curves for Simulation, Reference, and Difference.]
The output of both the spherical array and the DUT was
simulated by convolving virtual source signals with the corre-
sponding measured impulse responses. The virtual source sig-
nals were derived by extracting a pseudo-random segment of
the noise data and mapping it to a virtual source direction. Sim-
ilarly, the speech recordings were mapped to one of four virtual
source directions, to simulate a speaker embedded in noise. The
noise and speaker locations used are shown in Figure 4. Note
that the setup includes locations off the horizontal plane to em-
ulate the spatial diversity found in real environments.
Tests were performed by combining the simulated noise and
speech responses at a target signal-to-noise ratio (SNR). The
SNR was calculated in the frequency band 100–2000 Hz, as the
energy ratio between the microphone response during speech
activity (signal) and the response in absence of speech (noise).
For each target SNR, appropriate noise and speech gains were
derived by simulating the DUT response via (7), i.e., the pro-
posed method (“simulation” in Figure 3). Those same gains
were then applied to the noise and speech samples convolved
directly with the DUT DRTFs (“reference” in Figure 3). This
experiment evaluates how closely the SNR estimated from the
simulated DUT response matches the SNR of the reference re-
sponse. As shown in Figure 5, the mismatch between simulated
and reference SNRs is within ±5 dB across the tested target
SNRs, source directions, and noise types, with lower errors for
low target SNRs and speech directions closer to the front (a and
b). For target SNRs below 10 dB, the predicted SNRs are
within ±3 dB. Above 10 dB SNR, background noise in the
raw speech recordings may start to affect results.
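The SNR definition and gain derivation described above might be sketched as follows. The sketch assumes that speech-only and noise-only DUT responses are available as separate sample lists (as they are in a simulation), measures band power with a naive discrete Fourier transform restricted to 100–2000 Hz, and scales only the noise to reach a target SNR. The paper derives both noise and speech gains and does not give its exact procedure, so the function names, the single-gain simplification, and the test tones are hypothetical.

```python
import cmath
import math


def band_power(x, fs, f_lo=100.0, f_hi=2000.0):
    """One-sided power of x within [f_lo, f_hi] Hz, up to a constant factor.
    Uses a naive DFT, which is only practical for short clips."""
    n = len(x)
    total = 0.0
    for k in range(n // 2 + 1):
        if f_lo <= k * fs / n <= f_hi:
            X = sum(v * cmath.exp(-2j * math.pi * k * i / n) for i, v in enumerate(x))
            total += abs(X) ** 2
    return total / n ** 2


def snr_db(speech, noise, fs):
    """In-band energy ratio between the speech and noise responses, in dB."""
    return 10 * math.log10(band_power(speech, fs) / band_power(noise, fs))


def noise_gain_for_target(speech, noise, fs, target_db):
    """Linear gain to apply to the noise so that the in-band SNR equals target_db."""
    return 10 ** ((snr_db(speech, noise, fs) - target_db) / 20)


# Test tones standing in for DUT responses (hypothetical data).
fs, n = 8000, 160
speech = [math.sin(2 * math.pi * 500 * i / fs) for i in range(n)]          # in band
noise = [0.5 * math.sin(2 * math.pi * 1000 * i / fs)                       # in band
         + 3.0 * math.sin(2 * math.pi * 3000 * i / fs) for i in range(n)]  # out of band
gain = noise_gain_for_target(speech, noise, fs, target_db=0.0)
scaled_noise = [gain * v for v in noise]
```

The strong 3 kHz component of the test noise lies outside the evaluation band and therefore does not influence the derived gain.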
Figure 7: Word error rates (WERs) for market noise. Labels a–d
indicate the speech source locations labelled a–d in Figure 4.
[Plot: WER [%] versus target SNR [dB] in panels a) to d), with
curves for Simulation, Reference, and Difference.]
The simulation and reference responses of the DUT to the
noisy speech samples generated for the SNR experiment described
above were fed to the ASR engine. Figures 6 and 7 show the
resulting average word error rates (WERs). The
simulated corpus predicts the reference WERs fairly accurately
across SNRs and source directions, except around -5 dB where
the WER is most sensitive to the SNR. For brown noise, the
simulation underestimates the WER for source directions a and
c, whereas for market noise, the simulation overestimates the
WER for directions a, b and d. This estimation bias could be
explained by SNR mismatches between simulated and reference
responses, as illustrated in Figure 5. However, the WER change
as a function of SNR is predicted fairly well by the simulation.
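The paper reports WERs without spelling out the computation. A common definition, assumed here, is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name and the example sentences below are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    prev = list(range(len(hyp) + 1))          # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]                             # deleting the first i reference words
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# One deletion ("the") and one insertion ("please") against four reference words.
example_wer = word_error_rate("turn the volume up", "turn volume up please")
```

Because insertions count as errors, a WER computed this way can exceed 100% when the recognizer outputs many spurious words.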
4. Summary and conclusion
The proposed method allows the use of a device-independent
spatial noise recording to generate a device-specific synthetic
speech corpus for automatic speech recognition performance
evaluation under realistic conditions. Experimental results in-
dicate that the proposed method allows predicting the expected
signal-to-noise ratio (SNR) of a device under test (DUT) ex-
posed to spatial noise to within about ±3 dB. The mismatch
between simulation and reference SNRs may be reduced by ap-
plying a calibration and appropriate optimal scatterer removal
functions to the spherical microphone array used for the spatial
noise recordings. The prediction of average word error rates
(WERs) was accurate to within about 20%. While estimation
bias may have affected absolute WER prediction, the proposed
method predicted WER change as a function of SNR fairly well.
This indicates that the method may be well suited to evaluate the
relative effect of hardware changes on ASR performance.
One limitation of the method is the assumption of far-field
conditions, i.e., that all sound sources are further than approxi-
mately one meter from the DUT. However, in a close-talk situa-
tion with a target source in the vicinity of the DUT, the method
may still prove useful for evaluating the effect of ambient noise
on ASR performance. A major advantage of the proposed
method is that it allows running completely simulated experi-
ments. Here, speech recognition was performed on 2.5 hours
of speech data for two noise types, four speech directions, and
over 40 target SNRs. Collecting this data using live recordings
would have taken 2.5×2×4×40 = 800 hours. Future work
includes verification of the method in live noise environments.
5. References
[1] Speech and multimedia Transmission Quality (STQ); A sound field
reproduction method for terminal testing including a background
noise database, ETSI EG 202 396-1 Std., 2015.
[2] Speech and multimedia Transmission Quality (STQ); Speech
quality performance in the presence of background noise; Part
1: Background noise simulation technique and background noise
database, ETSI TS 103 224 Std., 2011.
[3] W. Song, M. Marschall, and J. D. G. Corrales, “Simulation of
realistic background noise using multiple loudspeakers,” in Proc.
Int. Conf. on Spatial Audio (ICSA), Graz, Austria, 2015.
[4] J. Fliege and U. Maier, “A two-stage approach for computing
cubature formulae for the sphere,” in Mathematik 139T, Universität
Dortmund, Fachbereich Mathematik, 44221, 1996.
[5] E. G. Williams, Fourier Acoustics: Sound Radiation and
Nearfield Acoustical Holography, 1st ed. London: Academic
Press, 1999.
[6] B. Rafaely, “Analysis and design of spherical microphone arrays,”
IEEE Trans. Speech and Audio Processing, vol. 13, no. 1,
pp. 135–143, 2005.
[7] N. A. Gumerov and R. Duraiswami, Fast Multipole Methods for
the Helmholtz Equation in Three Dimensions. Elsevier, 2004.
[8] “Kinect for Xbox 360,”
http://www.xbox.com/en-US/xbox-360/accessories/kinect.
[9] C. I. Cheng and G. H. Wakefield, “Introduction to head-related
transfer functions (HRTFs): Representations of HRTFs in time,
frequency, and space,” in Proc. Audio Engineering Society Con-
vention, New York, NY, USA, 1999.
[10] J. Ahrens, M. R. Thomas, and I. Tashev, “HRTF magnitude mod-
eling using a non-regularized least-squares fit of spherical har-
monics coefficients on incomplete data,” in Proc. APSIPA Annual
Summit and Conference, Hollywood, CA, USA, 2012.
[11] L. S. Davis, R. Duraiswami, E. Grassi, N. A. Gumerov, Z. Li,
and D. N. Zotkin, “High order spatial audio capture and its bin-
aural head-tracked playback over headphones with HRTF cues,”
in Proc. Audio Engineering Society Convention, New York, NY,
USA, 2005.
[12] P. Bilinski, J. Ahrens, M. R. P. Thomas, I. J. Tashev, and J. C. Platt,
“HRTF magnitude synthesis via sparse representation of
anthropometric features,” in Proc. IEEE Int. Conf. on Acoustics,
Speech and Signal Processing (ICASSP), Florence, Italy, 2014,
pp. 4501–4505.
[13] S. Moreau, J. Daniel, and S. Bertet, “3D sound field recording
with higher order ambisonics - objective measurements and vali-
dation of spherical microphone,” in Proc. Audio Engineering So-
ciety Convention 120, Paris, France, 2006.
[14] C. T. Jin, N. Epain, and A. Parthy, “Design, optimization
and evaluation of a dual-radius spherical microphone array,”
IEEE/ACM Transactions on Audio, Speech, and Language Pro-
cessing, vol. 22, no. 1, pp. 193–204, 2014.
[15] H. Gamper, L. Corbin, D. Johnston, and I. J. Tashev, “Synthe-
sis of device-independent noise corpora for speech quality assess-
ment,” in Proc. Int. Workshop on Acoustic Signal Enhancement
(IWAENC), Xi’an, China, 2016.
[16] F. Seide, G. Li, and D. Yu, “Conversational speech transcription
using context-dependent deep neural networks,” in Proc. Inter-
speech, Florence, Italy, 2011, pp. 437–440.
2795
... For completeness, we also provide synthesized, full-bandwidth first-order ambisonics [6], which are aliasing free. Either of these formats may be used to simulate a wide variety of arrays [7]. We describe the dataset generation process in more detail in Section 3.1. ...
... Plotting the representations from our models trained on Spatial LibriSpeech further illustrates the transferability to real-world test data seen in Section 4.1. Figure 1 shows UMAP [42] plots of embeddings extracted from representations before the MLP block of the network 7 We see that the model trained for 3D 6 For ACE Challenge, we obtained first-order ambisonics from EM32 audio samples [41]. 7 ...
... Figure 1 shows UMAP [42] plots of embeddings extracted from representations before the MLP block of the network 7 We see that the model trained for 3D 6 For ACE Challenge, we obtained first-order ambisonics from EM32 audio samples [41]. 7 ...
... For completeness, we also provide synthesized, full-bandwidth first-order ambisonics [6], which are aliasing free. Either of these formats may be used to simulate a wide variety of arrays [7]. We describe the dataset generation process in more detail in Section 3.1. ...
... Plotting the representations from our models trained on Spatial LibriSpeech further illustrates the transferability to real-world test data seen in Section 4.1. Figure 1 shows UMAP [42] plots of embeddings extracted from representations before the MLP block of the network 7 We see that the model trained for 3D 6 For ACE Challenge, we obtained first-order ambisonics from EM32 audio samples [41]. 7 ...
... Figure 1 shows UMAP [42] plots of embeddings extracted from representations before the MLP block of the network 7 We see that the model trained for 3D 6 For ACE Challenge, we obtained first-order ambisonics from EM32 audio samples [41]. 7 ...
Preprint
We present Spatial LibriSpeech, a spatial audio dataset with over 650 hours of 19-channel audio, first-order ambisonics, and optional distractor noise. Spatial LibriSpeech is designed for machine learning model training, and it includes labels for source position, speaking direction, room acoustics and geometry. Spatial LibriSpeech is generated by augmenting LibriSpeech samples with 200k+ simulated acoustic conditions across 8k+ synthetic rooms. To demonstrate the utility of our dataset, we train models on four spatial audio tasks, resulting in a median absolute error of 6.60{\deg} on 3D source localization, 0.43m on distance, 90.66ms on T30, and 2.74dB on DRR estimation. We show that the same models generalize well to widely-used evaluation datasets, e.g., obtaining a median absolute error of 12.43{\deg} on 3D source localization on TUT Sound Events 2018, and 157.32ms on T30 estimation on ACE Challenge.
... Similar to WS in (13), the D × D matrix WD contains appropriate sampling weights for the measurement grid directions. Alternatively, measurement-based filters can be computed with (19) by estimating the SH array matrixG in the least-squares sense from the array response measurements via: ...
... For equiangular measurement grids in azimuth and elevation that order is N ≈ √ D/2 − 1 [16]. Since (21) and (19) give equivalent results, we evaluate only the SH domain-based inversion of (19), which applies to both the model-and measurement-based approach. ...
... For equiangular measurement grids in azimuth and elevation that order is N ≈ √ D/2 − 1 [16]. Since (21) and (19) give equivalent results, we evaluate only the SH domain-based inversion of (19), which applies to both the model-and measurement-based approach. ...
Conference Paper
Full-text available
Spherical microphone array processing is commonly performed in a spatial transform domain, due to theoretical and practical advantages related to sound field capture and beamformer design and control. Multichannel encoding filters are required to implement a discrete spherical harmonic transform and extrapolate the captured sound field coefficients from the array radius to the far field. These spherical harmonic encoding filters can be designed based on a theoretical array model or on measured array responses. Various methods for both design approaches are presented and compared, and differences between modeled and measurement-based filters are investigated. Furthermore, a flexible filter design approach is presented that combines the benefits of previous methods and is suitable for deriving both modeled and measurement-based filters.
... In previous work, the generation of a device independent noise corpus using a spherical microphone array (see Figure 1) for evaluating the performance of automatic speech recognition (ASR) on a DUT was introduced [4]. The approach aims at combining the realism of in-situ recordings with the convenience and controllability of a synthetic noise corpus. ...
... Given a sound field recording from a spherical microphone array in the time domain, the estimated free-field decomposition, S nm , is obtained via fast convolution in the frequency domain with the decomposition filters described in Section 2.3. The DUT response is simulated by applying the DUT directivity via the DRTF,D n,−m , and integrating over the sphere [4]:P ...
Conference Paper
Full-text available
The perceived quality of speech captured in the presence of background noise is an important performance metric for communication devices, including portable computers and mobile phones. For a realistic evaluation of speech quality, a device under test (DUT) needs to be exposed to a variety of noise conditions either in real noise environments or via noise recordings, typically delivered over a loudspeaker system. However, the test data obtained this way is specific to the DUT and needs to be re-recorded every time the DUT hardware changes. Here we propose an approach that uses device-independent spatial noise recordings to generate device-specific synthetic test data that simulate in-situ recordings. Noise captured using a spherical microphone array is combined with the directivity patterns of the DUT, referred to here as device-related transfer functions (DRTFs), in the spherical harmonics domain. The performance of the proposed method is evaluated in terms of the predicted signal-to-noise ratio (SNR) and the predicted mean opinion score (PMOS) of the DUT under various noise conditions. The root-mean-squared errors (RMSEs) of the predicted SNR and PMOS are on average below 4 dB and 0.28, respectively, across the range of tested SNRs, target source directions, noise types, and spherical harmonics decomposition methods. These experimental results indicate that the proposed method may be suitable for generating device-specific synthetic corpora from device-independent in-situ recordings.
... Spherical microphone arrays are widely used for 3D soundfield capture [11,12,13,14,15,16]. Equipment calibration is an important step in soundfield capture, reproduction, and validation of soundfield duplication for consumer device testing [17,18,19]. A variant of the proposed calibration method was used in [17] to achieve a broad spectral alignment within 1 dB. ...
Preprint
We propose a straightforward and cost-effective method to perform diffuse soundfield measurements for calibrating the magnitude response of a microphone array. Typically, such calibration is performed in a diffuse soundfield created in reverberation chambers, an expensive and time-consuming process. A method is proposed for obtaining diffuse field measurements in untreated environments. First, a closed-form expression for the spatial correlation of a wideband signal in a diffuse field is derived. Next, we describe a practical procedure for obtaining the diffuse field response of a microphone array in the presence of a non-diffuse soundfield by the introduction of random perturbations in the microphone location. Experimental spatial correlation data obtained is compared with the theoretical model, confirming that it is possible to obtain diffuse field measurements in untreated environments with relatively few loudspeakers. A 30 second test signal played from 4-8 loudspeakers is shown to be sufficient in obtaining a diffuse field measurement using the proposed method. An Eigenmike is then successfully calibrated at two different geographical locations.
... Spherical microphone arrays are widely used for 3D soundfield capture [11,12,13,14,15,16]. Equipment calibration is an important step in soundfield capture, reproduction, and validation of soundfield duplication for consumer device testing [17,18,19]. A variant of the proposed calibration method was used in [17] to achieve a broad spectral alignment within 1 dB. ...
... The behaviour of a linear and time-invariant system can be conveniently described by its transfer function in the frequency domain or its impulse response (IR) in the time domain. A measured IR is useful for analysing and simulating the acoustic properties of a room or a hardware device [1,2]. While an IR can theoretically be measured by recording the response to a Dirac delta function, in practice this approach may suffer from poor signal-to-noise ratio and reproducibility [3], in particular if an acoustic IR is measured using an impulsive excitation source, for example a pistol shot or balloon pop [4]. ...
Conference Paper
Full-text available
(Matlab code available: https://github.com/microsoft/Asynchronous_impulse_response_measurement) The impulse response (IR) of an acoustic environment or audio device can be measured by recording its response to a known test signal. Ideally, the same digital clock should be used for playback and recording to ensure synchronous digital-to-analog and analog-to-digital conversion. When measuring the acoustic performance of a hardware device, be it for audio input to a device microphone or audio output from a device speaker, it is often difficult to access the device's audio signal path electronically. Therefore, the device-under-test (DUT) has to act either as a playback or recording device for the IR measurement. However, it may be impossible to synchronise the internal clock of the DUT with the reference clock of the measurement system. As a result, the recorded DUT response may be subject to unknown clock drift which may lead to undesired artefacts in the measured IR. Here, a method is proposed for estimating the drift between a play-back and recording clock directly from the recorded response to obtain a drift-compensated IR. Experimental results from IR measurements of a DUT subject to clock drift indicate that the proposed method successfully estimates the drift rate and yields an accurate IR estimate in magnitude and phase.
Conference Paper
The perceived quality of speech captured in the presence of background noise is an important performance metric for communication devices, including portable computers and mobile phones. For a realistic evaluation of speech quality, a device under test (DUT) needs to be exposed to a variety of noise conditions either in real noise environments or via noise recordings, typically delivered over a loudspeaker system. However, the test data obtained this way is specific to the DUT and needs to be re-recorded every time the DUT hardware changes. Here we propose an approach that uses device-independent spatial noise recordings to generate device-specific synthetic test data that simulate in-situ recordings. Noise captured using a spherical microphone array is combined with the directivity patterns of the DUT, referred to here as device-related transfer functions (DRTFs), in the spherical harmonics domain. The performance of the proposed method is evaluated in terms of the predicted signal-to-noise ratio (SNR) and the predicted mean opinion score (PMOS) of the DUT under various noise conditions. The root-mean-squared errors (RMSEs) of the predicted SNR and PMOS are on average below 4 dB and 0.28, respectively, across the range of tested SNRs, target source directions, noise types, and spherical harmonics decomposition methods. These experimental results indicate that the proposed method may be suitable for generating device-specific synthetic corpora from device-independent in-situ recordings.
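As a small worked example of the error metric mentioned above, the RMSE between predicted and measured values can be computed as follows. The SNR values are made up for illustration and are not taken from the paper; the function name `rmse` is likewise a hypothetical helper.

```python
import math


def rmse(predicted, measured):
    """Root-mean-squared error between two equal-length sequences."""
    if len(predicted) != len(measured) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical SNR values in dB (illustrative only).
predicted_snr_db = [10.0, 15.0, 20.0]
measured_snr_db = [13.0, 15.0, 16.0]

error_db = rmse(predicted_snr_db, measured_snr_db)
```

With these illustrative values the individual errors are -3, 0, and 4 dB, so the RMSE is the square root of 25/3, slightly below 3 dB.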
Conference Paper
We propose a method for the synthesis of the magnitudes of Head-related Transfer Functions (HRTFs) using a sparse representation of anthropometric features. Our approach treats the HRTF synthesis problem as finding a sparse representation of the subject’s anthropometric features w.r.t. the anthropometric features in the training set. The fundamental assumption is that the magnitudes of a given HRTF set can be described by the same sparse combination as the anthropometric data. Thus, we learn a sparse vector that represents the subject’s anthropometric features as a linear superposition of the anthropometric features of a small subset of subjects from the training data. Then, we apply the same sparse vector directly to the HRTF tensor data. For evaluation purposes we use a new dataset containing both anthropometric features and HRTFs. We compare the proposed sparse-representation-based approach with ridge regression and with the data of a manikin (which was designed based on average anthropometric data), and we simulate the best and the worst possible classifiers to select one of the HRTFs from the dataset. For instrumental evaluation we use log-spectral distortion. Experiments show that our sparse representation outperforms all other evaluated techniques, and that the synthesized HRTFs are almost as good as the best possible HRTF classifier.
Article
This volume in the Elsevier Series in Electromagnetism presents a detailed, in-depth and self-contained treatment of the Fast Multipole Method and its applications to the solution of the Helmholtz equation in three dimensions. The Fast Multipole Method was pioneered by Rokhlin and Greengard in 1987 and has enjoyed a dramatic development and recognition during the past two decades. This method has been described as one of the best 10 algorithms of the 20th century. Thus, it is becoming increasingly important to give a detailed exposition of the Fast Multipole Method that will be accessible to a broad audience of researchers. This is exactly what the authors of this book have accomplished. For this reason, it will be a valuable reference for a broad audience of engineers, physicists and applied mathematicians. The only book that provides comprehensive coverage of this topic in one location. Presents a review of the basic theory of expansions of the Helmholtz equation solutions. Comprehensive description of both mathematical and practical aspects of the fast multipole method and its applications to issues described by the Helmholtz equation.
Conference Paper
Head-related transfer functions (HRTFs) represent the acoustic transfer function from a sound source at a given location to the ear drums of a human. They are typically measured from discrete source positions at a constant distance. Spherical harmonics decompositions have been shown to provide a flexible representation of HRTFs. Practical constraints often prevent the retrieval of measurement data from certain directions, a circumstance that complicates the decomposition of the measured data into spherical harmonics. A least-squares fit of coefficients is a potential approach to determining the coefficients of incomplete data. However, a straightforward non-regularized fit tends to give unrealistic estimates for the region where no measurement data is available. Recently, a regularized least-squares fit was proposed, which yields well-behaved results for the unknown region at the expense of reducing the accuracy of the data representation in the known region. In this paper, we propose using a lower-order non-regularized least-squares fit to achieve a well-behaved estimation of the unknown data. This data then allows for a high-order non-regularized least-squares fit over the entire sphere. We compare the properties of all three approaches applied to modeling the magnitudes of the HRTFs measured from a manikin. The proposed approach reduces the normalized mean-square error by approximately 7 dB in the known region
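The plain (non-regularized) least-squares fit of spherical harmonics coefficients referred to above can be sketched as follows. This is an illustrative example only, not the two-stage method of the paper: it assumes real-valued, orthonormal spherical harmonics truncated at order 1, directions given as azimuth and elevation in radians, and synthetic data. The helper names `real_sh_order1` and `fit_sh_coefficients` are hypothetical.

```python
import numpy as np


def real_sh_order1(azimuth, elevation):
    """Real orthonormal spherical harmonics up to order 1.

    Returns an array of shape (num_directions, 4) whose columns are
    [Y_0^0, Y_1^-1, Y_1^0, Y_1^1] evaluated at the given directions.
    """
    azimuth = np.asarray(azimuth, dtype=float)
    elevation = np.asarray(elevation, dtype=float)
    # Unit vectors pointing in the given directions.
    x = np.cos(elevation) * np.cos(azimuth)
    y = np.cos(elevation) * np.sin(azimuth)
    z = np.sin(elevation)
    c0 = np.sqrt(1.0 / (4.0 * np.pi))
    c1 = np.sqrt(3.0 / (4.0 * np.pi))
    return np.column_stack([c0 * np.ones_like(x), c1 * y, c1 * z, c1 * x])


def fit_sh_coefficients(values, azimuth, elevation):
    """Non-regularized least-squares fit of order-1 SH coefficients."""
    basis = real_sh_order1(azimuth, elevation)
    coeffs, *_ = np.linalg.lstsq(basis, values, rcond=None)
    return coeffs


# Synthetic data: 50 random directions, uniformly distributed on the sphere.
rng = np.random.default_rng(0)
azimuth = rng.uniform(0.0, 2.0 * np.pi, 50)
elevation = np.arcsin(rng.uniform(-1.0, 1.0, 50))

true_coeffs = np.array([1.0, 0.5, -0.25, 0.75])
values = real_sh_order1(azimuth, elevation) @ true_coeffs

estimated = fit_sh_coefficients(values, azimuth, elevation)
```

With full coverage of the sphere and noise-free data, the fit recovers the coefficients exactly. The difficulty discussed in the abstract arises when a region of the sphere has no samples, in which case a higher-order basis is poorly constrained there.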
Conference Paper
Higher Order Ambisonics (HOA) is a flexible approach for representing and rendering 3D sound fields. Nevertheless, a lack of effective microphone systems limited its use until recently. As a result of the authors' previous work on the theory and design of spherical microphone arrays, a 4th-order HOA microphone has been built, measured and used for natural recording. The present paper first discusses theoretical aspects and physical limitations inherent to discrete, relatively small arrays (spatial aliasing, low-frequency estimation). Then it focuses on the objective validation of such microphones. HOA directivities reconstructed from simulated and measured 3D responses are compared to the expected spherical harmonics. Criteria such as spatial correlation help characterize the encoding artifacts due to the model limitations and the prototype imperfections. Impacts on localisation criteria are evaluated.
Article
A theory and a system for capturing an audio scene and then rendering it remotely are developed and presented. The sound capture is performed with a spherical microphone array. The sound field at the location of the array is deduced from the captured sound and is represented using either spherical wave-functions or plane-wave expansions. The sound field representation is then transmitted to a remote location for immediate rendering or stored for later use. The sound renderer, coupled with the head tracker, reconstructs the acoustic field using individualized head-related transfer functions to preserve the perceptual spatial structure of the audio scene. Rigorous error bounds and a Nyquist-like sampling criterion for the representation of the sound field are presented and verified.
Article
In this tutorial, head-related transfer functions (HRTFs) are introduced and treated with respect to their role in the synthesis of spatial sound over headphones. HRTFs are formally defined, and are shown to be important in reducing the ambiguity with which the classical duplex theory decodes a free-field sound's spatial location. Typical HRTF measurement strategies are described, and simple applications of HRTFs to headphone-based spatialized sound synthesis are given. By comparing and contrasting representations of HRTFs in the time, frequency, and spatial domains, different analytic and signal processing techniques used to investigate the structure of HRTFs are highlighted.
Article
Spherical Microphone Arrays (SMAs) constitute a powerful tool for analyzing the spatial properties of sound fields. However, the performance of SMA-based signal processing algorithms ultimately depends on the physical characteristics of the array. In particular, the range of frequencies over which an SMA provides rich spatial information is conditioned by the size of the array, the angular position of the sensors and other factors. In this work, we investigate the design of SMAs offering a wider frequency range of operation than that offered by conventional designs. To achieve this goal, microphones are distributed both on and at a distance from the surface of a rigid spherical baffle. The contributions of the paper are as follows. First, we present a general framework for modeling SMAs whose sensors are located at different distances from the array center and calculating optimal filters for the decomposition of the sound field into spherical harmonic modes. Second, we present an optimization method to design multi-radius SMAs with an optimally wide frequency range of operation given the total number of sensors available and target spatial resolution. Lastly, based on the optimization results, we built a prototype dual-radius SMA with 64 microphones. We present measurement results for the prototype microphone array and compare these results with theory.
Conference Paper
We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the RT03S Fisher portion of phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 27.4%, obtained by discriminatively trained Gaussian-mixture HMMs, to 18.5%, a 33% relative improvement. CD-DNN-HMMs combine classic artificial-neural-network HMMs with traditional tied-state triphones and deep-belief-network pre-training. They had previously been shown to reduce errors by 16% relatively when trained on tens of hours of data using hundreds of tied states. This paper takes CD-DNN-HMMs further and applies them to transcription using over 300 hours of training data, over 9000 tied states, and up to 9 hidden layers, and demonstrates how sparseness can be exploited. On four less well-matched transcription tasks, we observe relative error reductions of 22–28%.
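The relative improvement quoted above follows from the two word-error rates by simple arithmetic. The snippet below is a minimal illustration; the function name `relative_reduction` is a hypothetical helper.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Word-error rates in percent, as given in the abstract above.
baseline_wer = 27.4
improved_wer = 18.5

reduction = relative_reduction(baseline_wer, improved_wer)
```

Here the absolute reduction is 8.9 percentage points, and 8.9 / 27.4 is approximately 0.325, close to the roughly one-third relative improvement quoted.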
Article
Spherical microphone arrays have recently been studied for sound-field recordings, beamforming, and sound-field analysis, which use spherical harmonics in the design. Although the microphone arrays and the associated algorithms were presented, no comprehensive theoretical analysis of performance was provided. This work presents a spherical-harmonics-based design and analysis framework for spherical microphone arrays. In particular, alternative spatial sampling schemes for the positioning of microphones on a sphere are presented, and the errors introduced by a finite number of microphones, spatial aliasing, inaccuracies in microphone positioning, and measurement noise are investigated both theoretically and by using simulations. The analysis framework can also provide a useful guide for the design and analysis of more general spherical microphone arrays which do not use spherical harmonics explicitly.