Synthesis of Device-Independent Noise Corpora for Realistic ASR Evaluation
Hannes Gamper1, Mark R. P. Thomas1, Lyle Corbin2, Ivan Tashev1
1Microsoft Research Redmond
2Microsoft Corporation
{hagamper,lylec,ivantash}@microsoft.com, mark.r.thomas@ieee.org
Abstract
In order to effectively evaluate the accuracy of automatic speech recognition (ASR) with a novel capture device, it is important to create a realistic test data corpus that is representative of real-world noise conditions. Typically, this involves either recording the output of a device under test (DUT) in a noisy environment, or synthesizing an environment over loudspeakers in a way that simulates realistic signal-to-noise ratios (SNRs), reverberation times, and spatial noise distributions. Here we propose a method that aims at combining the realism of in-situ recordings with the convenience and repeatability of synthetic corpora. A device-independent spatial recording containing noise and speech is combined with the measured directivity pattern of a DUT to generate a synthetic test corpus for evaluating the performance of an ASR system. This is achieved by a spherical harmonic decomposition of both the sound field and the DUT's directivity patterns. Experimental results suggest that the proposed method can be a viable alternative to costly and cumbersome device-dependent measurements. The proposed simulation method predicted the SNR of the DUT response to within about 3 dB and the word error rate (WER) to within about 20%, across a range of test SNRs, target source directions, and noise types.
Index Terms: automatic speech recognition, device characterization, device-related transfer function, spherical harmonics
1. Introduction
Automatic speech recognition (ASR) is an integral part of many hardware devices, including mobile phones, game consoles, and smart televisions, to enable hands-free operation and voice control. When evaluating how robust a device's ASR engine is to noise and reverberation in real-world settings, one must account for the device hardware characteristics as well as typical usage scenarios. This is normally achieved by exposing the device under test (DUT) to realistic conditions in terms of environmental noise and reverberation while evaluating the performance of the ASR engine. Such in-situ tests are extremely valuable when tuning hardware and software parameters to maximize ASR performance, especially if the DUT has multiple microphones and the effect of microphone placement and (spatial) speech enhancement algorithms needs to be assessed.
However, in-situ tests are lengthy and cumbersome, requiring hours of recordings made on the DUT that have to be redone whenever its hardware changes. Furthermore, the exact test conditions are difficult or impossible to recreate when attempting to evaluate the effect of a hardware change or to compare the performance of an ASR engine across devices.
In order to overcome the limitations of in-situ tests, a pre-recorded test corpus can be used. An ASR test corpus typically consists of a variety of scenarios, differing in the level, type, and spatial quality of the background noise, that are representative of the conditions to which a DUT might be subjected in real-world use. During testing, the corpus is rendered over a loudspeaker setup and recorded through the DUT. Two examples of pre-recorded test corpus methodologies are encoded as part of specifications by the European Telecommunications Standards Institute (ETSI) [1, 2]. Both techniques utilize multichannel recordings that are played back over rigorously calibrated speaker systems. The systems attempt to recreate the original sound field of the real-world environment for a device placed at the center of the playback system.
Song et al. describe a method for simulating realistic background noise to test telecommunication devices, based on spatial sound field recordings from a spherical microphone array [3]. The authors compare various methods to derive the input signals for a circular loudspeaker array delivering the spatial noise recording to the DUT. While using a pre-recorded test corpus has the advantage of repeatability and simpler logistics compared to in-situ tests, it requires a highly specialized test environment and hardware setup for playback and recording. In addition, emulating the complexity of a real, noisy environment, with potentially hundreds of spatially distributed noise sources, can be challenging.
Here we propose a method that combines the realism of in-situ tests with the convenience and repeatability of a pre-recorded corpus, without the requirement of a specialized playback setup. The approach is based on a device-independent spatial in-situ recording that is combined with the directivity characteristics of a DUT to create a synthetic test corpus for ASR performance and robustness evaluation. The DUT directivity characteristics can be obtained through measurements or via acoustic simulation. In a similar fashion to the work by Song et al. [3], the proposed method is based on a spherical harmonics decomposition of a spatial noise recording obtained using a spherical microphone array. However, by also using a spherical harmonic decomposition of the DUT's directivity pattern, we show that it is possible to simulate the DUT response directly, without performing actual recordings on the DUT.
2. Proposed approach
The goal of the proposed method for generating a synthetic test corpus is to simulate the response of a DUT to a pre-recorded noisy environment. Section 2.1 describes the device-independent capture and representation of a sound field, while Section 2.2 discusses obtaining the device-dependent directivity characteristics of the DUT. The proposed approach for combining device-independent recordings with device-dependent directivity characteristics in the spherical harmonics domain is presented in Section 2.3.
Figure 1: 64-channel spherical microphone array, allowing a 7th-order spherical harmonic approximation to the recorded sound field.
2.1. Sound field capture and decomposition
Real noisy environments typically contain a multitude of spatially distributed noise sources. In order to evaluate ASR performance under realistic conditions, it is important to subject the DUT to spatially diverse noise environments. Therefore, the spatial quality of the noise environment must be preserved in the recordings used for ASR evaluation. A common way to capture a sound field spatially is to use an array of microphones placed on the surface of a sphere. A 64-channel example with a radius of 100 mm is shown in Figure 1.
Spherical harmonics provide a convenient way to describe a sound field captured using a spherical microphone array. By removing the scattering effect of the microphone baffle, the free-field decomposition of the recorded sound field can be estimated. Given the microphone signals $P(r_0, \theta, \phi, \omega)$, where $r_0$ is the array radius, $\theta$ and $\phi$ are the microphone colatitude and azimuth angles, respectively, and $\omega$ is the angular frequency, the plane wave decomposition of the sound field captured with a spherical array of $M$ microphones, distributed uniformly on the surface of the sphere [4], is given in the spherical harmonics domain by [5, 6]

$$\breve{S}_{nm}(\omega) = \frac{1}{b_n(kr_0)}\,\frac{4\pi}{M} \sum_{i=1}^{M} P(r_0, \theta_i, \phi_i, \omega)\, Y_n^{-m}(\theta_i, \phi_i), \qquad (1)$$
where $k = \omega/c$ and $c$ is the speed of sound. The spherical harmonic of order $n$ and degree $m$ is defined as

$$Y_n^m(\theta, \phi) = (-1)^m \sqrt{\frac{2n+1}{4\pi}\,\frac{(n-|m|)!}{(n+|m|)!}}\; P_n^{|m|}(\cos\theta)\, e^{im\phi}, \qquad (2)$$

where the associated Legendre function $P_n^m$ represents standing waves in $\theta$ and $e^{im\phi}$ represents travelling waves in $\phi$. Note that the Condon-Shortley phase convention is used, such that $Y_n^m(\theta, \phi)^* = Y_n^{-m}(\theta, \phi)$ [7].
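For illustration, Eq. (2) can be implemented directly on top of SciPy's associated Legendre function. The following is a minimal Python sketch, not part of the original evaluation pipeline; the function name sph_harm_paper is ours, and the final assertion simply checks the conjugation property stated above.

```python
import numpy as np
from scipy.special import lpmv, factorial

def sph_harm_paper(n, m, theta, phi):
    """Spherical harmonic Y_n^m of Eq. (2), in the convention where
    conj(Y_n^m) == Y_n^{-m}. theta: colatitude, phi: azimuth."""
    norm = np.sqrt((2 * n + 1) / (4 * np.pi)
                   * factorial(n - abs(m)) / factorial(n + abs(m)))
    # lpmv evaluates the associated Legendre function P_n^{|m|};
    # the explicit (-1)^m prefactor follows Eq. (2)
    return ((-1) ** m * norm * lpmv(abs(m), n, np.cos(theta))
            * np.exp(1j * m * phi))

# Sanity check of the stated conjugation property
theta, phi = 0.7, 1.3
assert np.allclose(np.conj(sph_harm_paper(3, 2, theta, phi)),
                   sph_harm_paper(3, -2, theta, phi))
```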
Figure 2: DUT directivity patterns: measured (top), simulated through the 7th-order spherical microphone array (middle), and simulated through an ideal 7th-order rigid spherical scatterer (bottom).

In the case of a spherical scatterer, the mode strength $b_n(kr_0)$ is defined for an incident plane wave as

$$b_n(kr_0) = 4\pi i^n \left( j_n(kr_0) - \frac{j_n'(kr_0)}{h_n^{(2)\prime}(kr_0)}\, h_n^{(2)}(kr_0) \right), \qquad (3)$$

where $j_n(kr_0)$ is the spherical Bessel function of degree $n$, $h_n^{(2)}(kr_0)$ is the spherical Hankel function of the second kind of degree $n$, and $(\cdot)'$ denotes differentiation with respect to the argument. The mode strength term in (1) is necessary to account for the scattering effect of the spherical baffle in order to obtain a plane-wave decomposition of the sound field.
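A sketch of Eqs. (1) and (3) is shown below, reusing sph_harm_paper from the previous snippet. It operates on a single frequency bin; the per-bin processing loop, microphone layout, and truncation order are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np
from scipy.special import spherical_jn, spherical_yn

def mode_strength(n, kr0):
    """Rigid-sphere mode strength b_n(kr0) of Eq. (3)."""
    jn = spherical_jn(n, kr0)
    jnp = spherical_jn(n, kr0, derivative=True)
    # Spherical Hankel function of the second kind: h2 = j - i*y
    h2 = jn - 1j * spherical_yn(n, kr0)
    h2p = jnp - 1j * spherical_yn(n, kr0, derivative=True)
    return 4 * np.pi * 1j ** n * (jn - (jnp / h2p) * h2)

def sound_field_coeffs(P, theta, phi, kr0, order):
    """Plane-wave decomposition coefficients of Eq. (1) at one
    frequency bin. P: (M,) complex microphone spectra; theta, phi:
    (M,) colatitude/azimuth of a uniform spherical layout."""
    M = len(P)
    S = {}
    for n in range(order + 1):
        bn = mode_strength(n, kr0)
        for m in range(-n, n + 1):
            Ynm = sph_harm_paper(n, -m, theta, phi)  # Y_n^{-m} per mic
            S[(n, m)] = 4 * np.pi / (M * bn) * np.sum(P * Ynm)
    return S
```

Note that $b_n$ becomes very small for orders well above $kr_0$, so the truncation order (7 for the 64-channel array of Figure 1) limits the noise amplification incurred by the division.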
2.2. Characterising the device under test (DUT)
Under the assumption of linearity and time invariance, the response of the DUT to an input signal is given by a transfer function describing the acoustic path from the sound source to the microphone. In far-field conditions, that is, when the source is further than approximately one meter from the DUT, this transfer function varies spectrally with source azimuth and elevation, whereas the effect of source distance is mostly limited to the signal gain. Therefore, the directivity characteristics of the DUT can be approximated through transfer function measurements at a single distance, spanning the range of azimuth and elevation angles of interest. Figure 2 (top) shows the directivity patterns of one microphone of a Kinect device [8]. Alternatively, acoustic simulation can be used to estimate these transfer functions. Due to the similarity of these direction-dependent transfer functions to the concept of head-related transfer functions (HRTFs) in the field of spatial audio rendering [9], we refer to the direction-dependent DUT transfer functions as the device-related transfer functions (DRTFs).

Figure 3: Experimental setup: the DUT response is obtained either via spherical decomposition (1) followed by DUT DRTF application (7) ("Simulation"), or via direct convolution with the measured responses of the Kinect DUT ("Reference").
In analogy to spherical microphone array recordings, DRTFs measured at points uniformly distributed over the sphere can be decomposed using spherical harmonics:

$$\breve{D}_{nm}(\omega) = \frac{4\pi}{N} \sum_{i=1}^{N} D(\theta_i, \phi_i, \omega)\, Y_n^{-m}(\theta_i, \phi_i), \qquad (4)$$

where $N$ is the number of DRTFs and $D(\theta_i, \phi_i, \omega)$ are the DRTFs as a function of the measurement colatitude and azimuth angles of arrival, $\theta_i$ and $\phi_i$. In cases where the DRTF measurement points do not cover the whole sphere or are not uniformly distributed, a least-squares decomposition can be used [10].
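As one possible realization of the least-squares alternative mentioned above (the quadrature of Eq. (4) applies when the points are uniform), the sketch below fits spherical harmonic coefficients to DRTF measurements at one frequency bin; the function name and coefficient ordering are illustrative choices, not taken from [10].

```python
import numpy as np

def drtf_coeffs_lstsq(D, theta, phi, order):
    """Least-squares spherical harmonic fit of measured DRTFs at one
    frequency bin, usable when the measurement points are non-uniform
    or incomplete (cf. [10]). D: (N,) complex DRTFs; theta, phi: (N,)
    measurement angles."""
    # Build the (N x (order+1)^2) spherical harmonic basis matrix
    basis = [sph_harm_paper(n, m, theta, phi)
             for n in range(order + 1) for m in range(-n, n + 1)]
    Y = np.stack(basis, axis=1)
    coeffs, *_ = np.linalg.lstsq(Y, D, rcond=None)
    # Coefficients ordered (0,0), (1,-1), (1,0), (1,1), ...;
    # under the paper's conventions these approximate D_nm of Eq. (4)
    return coeffs
```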
2.3. Combining sound field recordings and DUT directivity
To simulate the DUT behavior in the recorded noise environment, the device-related transfer functions (DRTFs) of the DUT are applied to the spherical array recording. This can be conveniently performed in the spherical harmonics domain, in analogy to applying head-related transfer functions to a recording for binaural rendering [11]. An aperture weighting function derived from the DRTFs is applied to the estimated free-field decomposition of the recorded sound field. The sound pressure at each microphone of the DUT is then found by integrating the DRTF-weighted pressure over the sphere:
$$P(\omega) = \int_{\Omega \in S^2} S(\Omega, \omega)\, D(\Omega, \omega)\, d\Omega \qquad (5)$$

$$= \sum_{n=0}^{\infty} \sum_{n'=0}^{\infty} \sum_{m=-n}^{n} \sum_{m'=-n'}^{n'} \breve{S}_{nm}(\omega)\, \breve{D}_{n'm'}(\omega) \int_{S^2} Y_n^m(\Omega)\, Y_{n'}^{m'}(\Omega)\, d\Omega \qquad (6)$$

$$= \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \breve{S}_{nm}(\omega)\, \breve{D}_{n,-m}(\omega). \qquad (7)$$
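Given coefficient sets from Eqs. (1) and (4), truncated at the same order, Eq. (7) reduces to a sum of products. A minimal sketch, again per frequency bin and per DUT microphone:

```python
def dut_response(S, D, order):
    """Simulated DUT microphone spectrum P(omega) via Eq. (7):
    the sum over n, m of S_nm * D_{n,-m}. S and D are dicts keyed by
    (n, m), holding the coefficients of Eqs. (1) and (4) at a single
    frequency bin; both must be truncated at the same order."""
    return sum(S[(n, m)] * D[(n, -m)]
               for n in range(order + 1) for m in range(-n, n + 1))
```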
3. Experimental evaluation
The experimental setup (see Figure 3) consisted of a spherical microphone array (see Figure 1) and a Kinect device as the DUT. Impulse responses of both the spherical array and the DUT were measured in an anechoic chamber for 400 directions at a radius of one meter, using a setup described by Bilinski et al. [12]. For the resulting DUT DRTFs, extrapolation was used to cover the whole sphere [10].

Figure 4: Geometric layout of noise sources (black dots) and speech sources (red dots) at 5.6 degrees azimuth and 0 degrees elevation (a), 63.7 degrees azimuth and -10.4 degrees elevation (b), -84.4 degrees azimuth and 0 degrees elevation (c), and 172.1 degrees azimuth and 44.7 degrees elevation (d).

Figure 5: SNR difference [dB] between simulation and reference as a function of target SNR [dB], for brown noise (left) and market noise (right). Labels a–d indicate the speech source locations labelled a–d in Figure 4.
Figure 2 (top) illustrates the directivity patterns of one microphone of the DUT. Figure 2 (middle) depicts the directivity patterns equivalent to applying the DUT DRTFs to a spherically isotropic sound field recorded via the spherical array. Each point in the directivity patterns in Figure 2 (middle) is obtained by decomposing the $M$ spherical array impulse responses for that direction using (1) and applying the DUT DRTF via (7). As the spherical array shown in Figure 1 does not behave like an ideal scatterer with ideal microphones, the sound field decomposition is imperfect, and the resulting equivalent DUT directivity is slightly distorted compared to the actual, measured DUT DRTF. As shown in Figure 2 (bottom), this distortion is largely corrected when replacing the real array impulse responses in the simulation with those of an ideal scatterer [6]. A follow-up study [15] addresses the discrepancy between real and ideal scatterers through calibration of the array and by deriving optimal scatterer removal functions [13, 14].
Simulations were performed combining speech recordings with simulated and recorded spatial noise. The speech corpus consisted of 2000 utterances containing 14146 words, recorded by 50 male and female speakers in a quiet environment, for a total duration of 2.5 hours. Speech recognition was performed using a DNN-based ASR engine [16] with acoustic models trained on the clean speech corpus.

Figure 6: Word error rates (WERs) [%] as a function of target SNR [dB] for brown noise, showing simulation, reference, and their difference. Panels a)–d) correspond to the speech source locations labelled a–d in Figure 4.
Two types of noise were used: 60 seconds of random Gaussian noise with a frequency roll-off of 6 dB per octave (i.e., brown noise), and a 60-second recording of ambient noise in a busy outdoor market place, obtained with the spherical microphone array. For the experiments, the spherical harmonics decomposition of the recorded sound field was evaluated at 16 directions, as shown in Figure 4, to emulate playback over 16 spatially distributed virtual loudspeakers. The number of virtual speakers was chosen as a trade-off between spatial fidelity and computational complexity.
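One plausible reading of this step is sketched below: the plane-wave decomposition is evaluated at the 16 virtual loudspeaker directions by summing the coefficients against the spherical harmonics. The function name and direction format are assumptions of this sketch.

```python
import numpy as np

def evaluate_at_directions(S, dirs, order):
    """Evaluate the plane-wave decomposition at virtual loudspeaker
    directions: s(Omega_j) = sum_{n,m} S_nm * Y_n^m(Omega_j).
    S: dict from Eq. (1); dirs: list of (colatitude, azimuth) pairs."""
    return np.array([
        sum(S[(n, m)] * sph_harm_paper(n, m, th, ph)
            for n in range(order + 1) for m in range(-n, n + 1))
        for th, ph in dirs])
```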
The output of both the spherical array and the DUT was simulated by convolving virtual source signals with the corresponding measured impulse responses. The virtual source signals were derived by extracting a pseudo-random segment of the noise data and mapping it to a virtual source direction. Similarly, the speech recordings were mapped to one of four virtual source directions, to simulate a speaker embedded in noise. The noise and speaker locations used are shown in Figure 4. Note that the setup includes locations off the horizontal plane, to emulate the spatial diversity found in real environments.
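The convolve-and-sum simulation described above might look as follows; the data layout (one impulse response per direction and microphone) is an assumption of this sketch.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate_capture(sources, irs):
    """Simulate a multichannel capture. Each virtual source signal is
    convolved with the measured impulse response from its direction to
    each microphone, and the contributions are summed.
    sources: list of (signal, direction_index) pairs;
    irs: array of shape (num_directions, num_mics, ir_length)."""
    num_mics, ir_len = irs.shape[1], irs.shape[2]
    out_len = max(len(sig) for sig, _ in sources) + ir_len - 1
    out = np.zeros((num_mics, out_len))
    for sig, d in sources:
        for mic in range(num_mics):
            y = fftconvolve(sig, irs[d, mic])
            out[mic, :len(y)] += y  # superimpose this source's contribution
    return out
```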
Tests were performed by combining the simulated noise and speech responses at a target signal-to-noise ratio (SNR). The SNR was calculated in the frequency band 100–2000 Hz, as the energy ratio between the microphone response during speech activity (signal) and the response in the absence of speech (noise). For each target SNR, appropriate noise and speech gains were derived by simulating the DUT response via (7), i.e., the proposed method ("simulation" in Figure 3). Those same gains were then applied to the noise and speech samples convolved directly with the DUT DRTFs ("reference" in Figure 3). This experiment evaluates how closely the SNR estimated from the simulated DUT response matches the SNR of the reference response. As shown in Figure 5, the mismatch between simulated and reference SNRs is within ±5 dB across the tested target SNRs, source directions, and noise types, with lower errors for low target SNRs and for speech directions closer to the front (a and b). For target SNRs below 10 dB, the predicted SNRs are within ±3 dB. Above 10 dB SNR, background noise present in the raw speech recordings may start to affect the results.
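A minimal sketch of the band-limited SNR computation, assuming a known voice-activity mask and treating the active segments as signal-plus-noise, as in the definition above; the filter design is our choice and is not specified in the paper.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_snr(x, vad, fs, band=(100.0, 2000.0)):
    """SNR in a frequency band: energy ratio between the microphone
    response during speech activity and in the absence of speech.
    x: mono response; vad: boolean mask (True where speech is active);
    fs: sample rate in Hz."""
    sos = butter(4, band, btype='bandpass', fs=fs, output='sos')
    xb = sosfiltfilt(sos, x)          # zero-phase band-limiting
    p_active = np.mean(xb[vad] ** 2)  # speech (+ noise) power
    p_noise = np.mean(xb[~vad] ** 2)  # noise-only power
    return 10.0 * np.log10(p_active / p_noise)
```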
Figure 7: Word error rates (WERs) [%] as a function of target SNR [dB] for market noise, showing simulation, reference, and their difference. Panels a)–d) correspond to the speech source locations labelled a–d in Figure 4.

The simulation and reference responses of the DUT to the noisy speech samples generated for the SNR experiment described above were fed to the ASR engine. Figures 6 and 7 show the resulting average word error rates (WERs).
The simulated corpus predicts the reference WERs fairly accurately across SNRs and source directions, except around -5 dB, where the WER is most sensitive to the SNR. For brown noise, the simulation underestimates the WER for source directions a and c, whereas for market noise, the simulation overestimates the WER for directions a, b, and d. This estimation bias could be explained by SNR mismatches between simulated and reference responses, as illustrated in Figure 5. However, the WER change as a function of SNR is predicted fairly well by the simulation.
4. Summary and conclusion
The proposed method allows the use of a device-independent spatial noise recording to generate a device-specific synthetic speech corpus for automatic speech recognition performance evaluation under realistic conditions. Experimental results indicate that the proposed method allows predicting the expected signal-to-noise ratio (SNR) of a device under test (DUT) exposed to spatial noise to within about ±3 dB. The mismatch between simulation and reference SNRs may be reduced by applying a calibration and appropriate optimal scatterer removal functions to the spherical microphone array used for the spatial noise recordings. The prediction of average word error rates (WERs) was accurate to within about 20%. While estimation bias may have affected absolute WER prediction, the proposed method predicted the WER change as a function of SNR fairly well. This indicates that the method may be well suited to evaluating the relative effect of hardware changes on ASR performance.
One limitation of the method is the assumption of far-field conditions, i.e., that all sound sources are further than approximately one meter from the DUT. However, in a close-talk situation with a target source in the vicinity of the DUT, the method may still prove useful for evaluating the effect of ambient noise on ASR performance. A major advantage of the proposed method is that it allows running completely simulated experiments. Here, speech recognition was performed on 2.5 hours of speech data for two noise types, four speech directions, and over 40 target SNRs. Collecting this data using live recordings would have taken 2.5 × 2 × 4 × 40 = 800 hours. Future work includes verification of the method in live noise environments.
5. References
[1] Speech and multimedia Transmission Quality (STQ); A sound field reproduction method for terminal testing including a background noise database, ETSI EG 202 396-1 Std., 2015.
[2] Speech and multimedia Transmission Quality (STQ); Speech quality performance in the presence of background noise; Part 1: Background noise simulation technique and background noise database, ETSI TS 103 224 Std., 2011.
[3] W. Song, M. Marschall, and J. D. G. Corrales, "Simulation of realistic background noise using multiple loudspeakers," in Proc. Int. Conf. on Spatial Audio (ICSA), Graz, Austria, 2015.
[4] J. Fliege and U. Maier, "A two-stage approach for computing cubature formulae for the sphere," Mathematik 139T, Universität Dortmund, Fachbereich Mathematik, 44221, 1996.
[5] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography, 1st ed. London: Academic Press, 1999.
[6] B. Rafaely, "Analysis and design of spherical microphone arrays," IEEE Trans. Speech and Audio Processing, vol. 13, no. 1, pp. 135–143, 2005.
[7] N. A. Gumerov and R. Duraiswami, Fast Multipole Methods for the Helmholtz Equation in Three Dimensions. Elsevier, 2004.
[8] "Kinect for Xbox 360," http://www.xbox.com/en-US/xbox-360/accessories/kinect.
[9] C. I. Cheng and G. H. Wakefield, "Introduction to head-related transfer functions (HRTFs): Representations of HRTFs in time, frequency, and space," in Proc. Audio Engineering Society Convention, New York, NY, USA, 1999.
[10] J. Ahrens, M. R. Thomas, and I. Tashev, "HRTF magnitude modeling using a non-regularized least-squares fit of spherical harmonics coefficients on incomplete data," in Proc. APSIPA Annual Summit and Conference, Hollywood, CA, USA, 2012.
[11] L. S. Davis, R. Duraiswami, E. Grassi, N. A. Gumerov, Z. Li, and D. N. Zotkin, "High order spatial audio capture and its binaural head-tracked playback over headphones with HRTF cues," in Proc. Audio Engineering Society Convention, New York, NY, USA, 2005.
[12] P. Bilinski, J. Ahrens, M. R. P. Thomas, I. J. Tashev, and J. C. Platt, "HRTF magnitude synthesis via sparse representation of anthropometric features," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014, pp. 4501–4505.
[13] S. Moreau, J. Daniel, and S. Bertet, "3D sound field recording with higher order ambisonics - objective measurements and validation of spherical microphone," in Proc. Audio Engineering Society Convention 120, Paris, France, 2006.
[14] C. T. Jin, N. Epain, and A. Parthy, "Design, optimization and evaluation of a dual-radius spherical microphone array," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 193–204, 2014.
[15] H. Gamper, L. Corbin, D. Johnston, and I. J. Tashev, "Synthesis of device-independent noise corpora for speech quality assessment," in Proc. Int. Workshop on Acoustic Signal Enhancement (IWAENC), Xi'an, China, 2016.
[16] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," in Proc. Interspeech, Florence, Italy, 2011, pp. 437–440.