© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or
reuse of any copyrighted component of this work in other works.
AN EFFICIENT APPROACH TO DYNAMICALLY WEIGHTED MULTIZONE
WIDEBAND REPRODUCTION OF SPEECH SOUNDFIELDS
Jacob Donley, Christian Ritz
School of Electrical, Computer and Telecommunications Engineering, University of Wollongong,
Wollongong, NSW, Australia, 2522, jrd089@uowmail.edu.au, critz@uow.edu.au
ABSTRACT
This paper proposes and evaluates an efficient approach for practical
reproduction of multizone soundfields for speech sources. The
reproduction method, based on a previously proposed approach,
utilises weighting parameters to control the soundfield reproduced
in each zone whilst minimising the number of loudspeakers
required. Proposed here is an interpolation scheme for predicting the
weighting parameter values of the multizone soundfield model that
otherwise requires significant computational effort. It is shown that
initial computation time can be reduced by a factor of 1024 with an error of only −85 dB in the reproduced soundfield relative to reproduction without interpolated weighting parameters. The
perceptual impact on the quality of the speech reproduced using the
method is also shown to be negligible. By using pre-saved
soundfields determined using the proposed approach, practical
reproduction of dynamically weighted multizone soundfields of
wideband speech could be achieved in real-time.
Index Terms— multizone soundfield reproduction, wideband
multizone soundfield, weighted multizone soundfield, look-up
tables (LUT), interpolation, sound field synthesis (SFS)
1. INTRODUCTION
The reproduction of audio in spatially separated regions is of high
interest for applications such as the creation of personal listening
zones in entertainment/cinema, multi-participant teleconferencing
and vehicle cabins. This can be achieved using multizone soundfield
reproduction and was originally proposed in [1] to reproduce
multiple independent zones of active and quiet 2D soundfield
regions. In [1] a least squares pressure matching technique was used,
where estimated acoustic pressures reproduced by a set of
loudspeakers are matched to sample values within the desired
soundfield zone. This technique is also used in Sound Field
Reconstruction (SFR) [2].
Following [1], a multizone approach using cylindrical
harmonic expansion with coefficient translation and angular
windowing was proposed [3]–[5]. This approach, however, attempts
to completely suppress any interzone interference which can result
in impractically large loudspeaker signal amplitudes or
impractically low levels in zones. A method better suited for
implementation and controllability was introduced in [6]. The
approach uses orthogonal basis expansion which reduces the
problem to the reconstruction of a set of basis wavefields and allows
each zone to be weighted according to the importance of its
reproduction. This weighting improves the practicality of the system
by relaxing the ideal requirement of completely quiet zones outside
the target zone. The theory in [5] was extended in [7] to include a similar weighting criterion to that used in [6].
While originally focusing on single frequency soundfields,
more recent work attempted to create multizone soundfields [8], [9]
with frequency bandwidths equivalent to narrowband and wideband
speech. This paper investigates the use of weighted multizone
soundfield reproduction for wideband speech using the orthogonal
basis expansion of [6]. The orthogonal basis expansion approach
assumes the soundfield is reproduced as a sum of independent
planewaves reproduced by an array of loudspeakers. Each
planewave corresponds to an individual frequency and direction.
The number of chosen planewaves is governed by the desired
reproduction accuracy and varies depending on frequency. Due to
this, a formula is proposed in this work which determines the number of planewaves to be used for a given frequency so that neither spatial aliasing nor ill-conditioned matrices become an inherent problem. This is necessary when dealing with the reproduction of soundfields containing multiple frequencies.
While [6] assumes the same weight for each frequency,
dynamically deriving the weights can be used to control the
reproduction accuracy of individual frequency components within
the bright and quiet zones. For example, the weightings can be based
on the perceptual importance of particular frequencies in the zones
in an effort to improve the overall perceived sound quality.
However, this results in increases in computational complexity. To
reduce this complexity and create a more practical solution, this
paper proposes the interpolation of spatial components of the
reproduction along different domains, such as the weighting domain
and frequency domain.
A system is synthesised with varying linear interpolation
distances by using different resolution lookup tables (LUTs) for
storing pre-computed loudspeaker weights and soundfield values.
The synthesis comprises reproducing wideband zones where the frequency domain content is weighted uniformly with weights that lie in the centre of the interpolation regions. The approach is validated
by comparing the reproduced zone signals from the interpolation
method with signals reproduced without interpolation using Mean
Squared Error (MSE) and Perceptual Evaluation of Speech Quality
(PESQ) [10] measures.
Section 2 describes the orthogonal basis expansion approach
to multizone soundfield reproduction. The proposed dynamically
weighted multizone approach is described in Section 3 and a novel
approach to selecting the number of orthogonal planewaves is
presented in Section 4. Section 5 describes the interpolation method.
Evaluation results are given in Section 6 and conclusions outlined in
Section 7.
2. MONOFREQUENT MULTIZONE SOUNDFIELD
REPRODUCTION
A multizone soundfield reproduction layout is depicted in Figure 1. The reproduction region, $\mathcal{D}$, of radius $R$, contains three sub-regions called the bright zone, quiet zone and unattended zone, denoted by $\mathcal{D}_b$, $\mathcal{D}_q$ and $\mathcal{D} \setminus (\mathcal{D}_b \cup \mathcal{D}_q)$, respectively. The radius of $\mathcal{D}_b$ and $\mathcal{D}_q$ is $r$ and their centres are located on a circle of radius $r_z$. The angle of the desired planewave in $\mathcal{D}_b$ is $\theta$, and the soundfield is reproduced by loudspeakers positioned on an arc of angle $\phi_L$ and radius $R_l$, with the first loudspeaker starting at angle $\phi$.
In the orthogonal basis expansion approach to multizone soundfield reproduction [11], a soundfield function $S(\mathbf{x}, k)$ that fulfils the wave equation, where $\mathbf{x}$ is an arbitrary spatial sampling point and $k$ is the wavenumber of the soundfield, is written as a summation of an orthogonal set of the Helmholtz equation solutions [12] as
$$S(\mathbf{x}, k) = \sum_{n} C_n(k, w)\, G_n(\mathbf{x}, k), \quad (1)$$
where $\{G_n\}$ is a series of weighted basis functions and, given a desired soundfield, $S^d(\mathbf{x}, k)$, the expansion coefficients are derived as
$$C_n(k, w) = \int_{\mathcal{D}} S^d(\mathbf{x}, k)\, G_n^{*}(\mathbf{x}, k)\, w(\mathbf{x})\, d\mathbf{x}$$
using a weighted inner-product, where $w(\mathbf{x})$ represents the weighting function. The weighted QR factorisation may also use a weighted inner-product. Here, the weighting function, $w(\mathbf{x})$, can be written as $w(\mathbf{x}) = w_b$ for $\mathbf{x} \in \mathcal{D}_b$, $w(\mathbf{x}) = w_q$ for $\mathbf{x} \in \mathcal{D}_q$ and $w(\mathbf{x}) = w_u$ for $\mathbf{x} \in \mathcal{D} \setminus (\mathcal{D}_b \cup \mathcal{D}_q)$, indicating different values for the weights within each of the bright ($\mathcal{D}_b$), quiet ($\mathcal{D}_q$) and unattended zones, respectively, as illustrated in Figure 1.
Following a QR factorisation on a set of $N$ planewaves, $F(\mathbf{x}, k\hat{\phi}_n)$, arriving from angles $\hat{\phi}_n \in [0, 2\pi)$, (1) becomes
$$S(\mathbf{x}, k, w) = \sum_{j} P_j(k, w)\, \tilde{F}_j(\mathbf{x}, k, w), \quad (2)$$
where the coefficients for the wavefields are $P_j(k, w) = \sum_{n} [\mathbf{R}^{-1}]_{jn}\, C_n(k, w)$, $j \in \{1, \dots, N\}$, $n \in \{1, \dots, N\}$ and $\mathbf{R}$ is the upper triangular matrix from the QR factorisation [6].
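To make (1)-(2) concrete, the following sketch (NumPy only, hypothetical helper names, an equiangular choice of planewave directions and a discretised weighted inner product) orthogonalises a set of planewaves under the zone weighting and computes the expansion and wavefield coefficients. It is an illustrative sketch of the orthogonal basis expansion idea, not the authors' implementation.

```python
import numpy as np

def weighted_orthogonal_basis(points, w, k, N):
    """Sketch of (1)-(2): QR-orthogonalise N planewaves under the spatial weighting w(x).

    points : (P, 2) spatial samples x over the reproduction region
    w      : (P,) zone weighting w(x) at each sample
    k      : wavenumber
    N      : number of planewave basis functions
    """
    phi = 2 * np.pi * np.arange(N) / N                   # arrival angles in [0, 2*pi)
    dirs = np.stack([np.cos(phi), np.sin(phi)], axis=1)  # unit propagation directions
    F = np.exp(1j * k * points @ dirs.T)                 # (P, N) planewave functions
    sw = np.sqrt(w)[:, None]
    Q, R = np.linalg.qr(sw * F)                          # weighted QR factorisation
    G = Q / sw                                           # basis functions G_n(x, k), orthogonal w.r.t. w(x)
    return F, G, R

def wavefield_coefficients(S_d, G, w, R, dA):
    """C_n(k, w) from the discretised weighted inner product, then P = R^{-1} C as in (2)."""
    C = (np.conj(G) * (w * dA)[:, None]).T @ S_d         # expansion coefficients C_n(k, w)
    return np.linalg.solve(R, C)                         # wavefield coefficients P_j(k, w)
```

The reproduced field at the sample points is then `F @ wavefield_coefficients(...)`, i.e. a weighted sum of the planewaves, consistent with (2).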
The reproduction of the soundfield in (2) at a particular location can then be expressed as a summation of the discontinuously located loudspeaker signals [3],
$$S^{a}_{disc}(\mathbf{x}, k, w) = \frac{i}{4} \sum_{l=1}^{L} \rho_l(k, w)\, H_0^{(1)}(k \|\mathbf{x} - \mathbf{x}_l\|), \quad (3)$$
where $\rho_l(k, w)$ is the $l$-th complex loudspeaker weight, $L$ is the number of loudspeakers, $i = \sqrt{-1}$, $H_0^{(1)}(\cdot)$ is a zeroth-order Hankel function of the first kind [12] and $\mathbf{x}_l$ is the position of the $l$-th loudspeaker. The complex loudspeaker weights are defined as,
$$\rho_l(k, w) = \frac{2 \Delta\phi_s}{i \pi} \sum_{j} \sum_{m=-M}^{M} \frac{P_j(k, w)\, i^{m}\, e^{-i m \phi^{pw}_j}\, e^{i m \phi_l}}{H_m^{(1)}(k R_l)}, \quad (4)$$
where $M = \lceil kR \rceil$ is the truncation length [3], $R$ and $R_l$ are from Figure 1, $\phi^{pw}_j = 2\pi(j-1)/J$ are the wavefield angles, $\phi_l$ is the angle of the $l$-th loudspeaker from the x-axis and $\Delta\phi_s$ is the angular spacing of the loudspeakers. The minimum number of loudspeakers to avoid aliasing is given by,
$$L = 2 \left\lceil \frac{e R k_{max}}{2} \right\rceil + 1, \quad (5)$$
where $L$ is the minimum number of required loudspeakers, $e$ is Euler's number and $k_{max}$ is the wavenumber of the highest frequency (where $k = 2\pi f / c$ [12] and $c = 343\ \mathrm{m\,s^{-1}}$ is the speed of sound). See [3] for further details.
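The sketch below prototypes (4) and (5) as reconstructed above, assuming a free field; it uses SciPy's Hankel function of the first kind, and the function and argument names are hypothetical.

```python
import numpy as np
from scipy.special import hankel1

C_SOUND = 343.0  # speed of sound (m/s)

def min_loudspeakers(f_max, R):
    """Eq. (5) as reconstructed: minimum loudspeaker count to avoid aliasing up to f_max
    for a region of radius R."""
    k_max = 2 * np.pi * f_max / C_SOUND
    return int(2 * np.ceil(np.e * R * k_max / 2) + 1)

def loudspeaker_weights(P, k, R, R_l, phi_pw, phi_l, dphi_s):
    """Eq. (4) as reconstructed: complex weight rho_l(k, w) for each loudspeaker.

    P      : (J,) wavefield coefficients P_j(k, w)
    phi_pw : (J,) wavefield (planewave) angles
    phi_l  : (L,) loudspeaker angles from the x-axis
    dphi_s : angular spacing of the loudspeakers
    """
    M = int(np.ceil(k * R))                                         # modal truncation length
    m = np.arange(-M, M + 1)
    terms = (1j ** m)[None, :] * np.exp(-1j * np.outer(phi_pw, m))  # i^m e^{-im phi_j^pw}, shape (J, 2M+1)
    modes = (P[:, None] * terms).sum(axis=0) / hankel1(m, k * R_l)  # sum over j, divide by H_m^(1)(kR_l)
    return (2 * dphi_s / (1j * np.pi)) * (np.exp(1j * np.outer(phi_l, m)) @ modes)
```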
3. WEIGHTED MULTIZONE WIDEBAND SOUNDFIELDS
A wideband soundfield is described here as a linear combination of
planewaves corresponding to each source frequency, similar to a
Fourier series [12] and the approach of [8], [9]. The pressure
generated at any point in the reproduced soundfield is given by,
$$\hat{p}(\mathbf{x}, w) = \sum_{k}^{K} S^{a}_{disc}(\mathbf{x}, k, w), \quad (6)$$
where there are $K$ different sinusoidal components. Multiple nested summations are required to derive $\hat{p}(\mathbf{x}, w)$ as:
$$\hat{p}(\mathbf{x}, w) = \sum_{k}^{K} \sum_{l}^{L} \sum_{m=-M}^{M} \sum_{j}^{J} \sum_{n}^{N} \sum_{\mathbf{x}' \in \mathcal{D}} S^{d}(\mathbf{x}', k)\, G_n^{*}(\mathbf{x}', k)\, w(\mathbf{x}')\, \gamma_{jn}\, \mu_{m}\, \kappa_{klm}\, \lambda_{kl}(\mathbf{x}), \quad (7)$$
where $\gamma_{jn} = [\mathbf{R}^{-1}]_{jn}$, $\mu_{m} = i^{m} e^{-i m \phi^{pw}_j}$, $\kappa_{klm} = 2 e^{i m \phi_l} \Delta\phi_s / (i \pi H_m^{(1)}(k R_l))$ and $\lambda_{kl}(\mathbf{x}) = i H_0^{(1)}(k \|\mathbf{x} - \mathbf{x}_l\|) / 4$. Here, a summation occurs for every sample in $\mathcal{D}$, for $N$ planewaves, $J$ wavefields, $2M+1$ modes, $L$ loudspeakers and $K$ sinusoidal components.
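The nested summations make the cost of a single direct reproduction explicit; the rough count below (an illustrative assumption that (7) is evaluated term by term, without caching) is what motivates the look-up tables of Section 5.

```python
def direct_reproduction_terms(K, L, M, J, N, P):
    """Rough number of terms when (7) is evaluated directly: K frequency bins,
    L loudspeakers, 2M+1 modes, J wavefields, N planewaves and P spatial samples."""
    return K * L * (2 * M + 1) * J * N * P
```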
This paper extends the approach of [6] to allow for frequency-
dependent weighting functions such that the leaked frequency
spectrum can be controlled. This may benefit in cases where
occlusion is a problem [9] and may improve the perceivable error
compared to a single zone weight. To weight signals dynamically
for each frequency, during the soundfield construction in (3),
loudspeaker weights and soundfield samples are obtained per time-
frequency component for a free field reproduction scenario (future
work will look into reproduction in reverberant rooms). For example,
if a signal from an arbitrary location in the reproduction area,
defined in (3), was to be synthesised,
$$\hat{Y}_F(\mathbf{x}, k) = \left| S^{a}_{disc}(\mathbf{x}, k, w) \right| Y_F(k), \quad (8)$$
where $\hat{Y}_F(\mathbf{x}, k)$ is the frequency domain signal at an arbitrary location, $\mathbf{x}$, in the reproduction region, $\mathcal{D}$, $Y_F(k)$ is obtained from the short-time Fourier transform of the windowed frame of input $y(n)$ and $|\cdot|$ denotes the absolute value. If loudspeaker signals are preferred to be synthesised then $|S^{a}_{disc}(\mathbf{x}, k, w)|$ can be replaced with $\rho_l(k, w)$ from (4) and $\hat{Y}_F(\mathbf{x}, k)$ becomes $\hat{Y}_{Fl}(k)$. Time domain loudspeaker signals, $\hat{y}_l(n)$, and/or soundfields, $\hat{y}(\mathbf{x}, n)$, are derived from (8) via overlap-add reconstruction [13].

Figure 1 – Weighted multizone soundfield reproduction layout.
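A minimal sketch of the per-frame synthesis in (8) is given below, assuming a Hamming-windowed STFT with 50% overlap (the configuration of Section 6.1) and a precomputed per-bin magnitude response at the point $\mathbf{x}$; the function name and the plain overlap-add are illustrative choices, not the authors' exact pipeline.

```python
import numpy as np

def synthesise_at_point(y, S_mag, n_fft=1024, hop=512):
    """Eq. (8) sketch: scale each STFT bin of the input by |S_disc^a(x, k, w)| and
    resynthesise the time-domain signal at x by overlap-add.

    y     : input speech signal, 1-D array
    S_mag : (n_fft // 2 + 1,) magnitudes of the reproduced soundfield at x per bin
    """
    win = np.hamming(n_fft)
    out = np.zeros(len(y) + n_fft)
    for start in range(0, len(y) - n_fft + 1, hop):
        Y_F = np.fft.rfft(y[start:start + n_fft] * win)          # STFT of the windowed frame
        Y_hat = S_mag * Y_F                                      # apply the weighted soundfield response
        out[start:start + n_fft] += np.fft.irfft(Y_hat, n_fft)   # overlap-add reconstruction
    return out[:len(y)]
```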
4. ORTHOGONAL PLANEWAVE SELECTION
The wavefield coefficients depend on the expansion coefficients and the upper triangular matrix resulting from the QR factorisation computed on a set of $N$ planewave functions arriving from angles in $[0, 2\pi)$. Hence, the computation increases with both $N$ and the spatial sampling density of the reproduction zone. A larger $N$ results in a more accurate reproduction; however, in practice, it also results in poorly conditioned upper triangular matrices, while a lower $N$ causes spatial aliasing. The effect of $N$ is also frequency dependent. $N$ for the layout in Figure 1 is therefore chosen to balance the impact of spatial aliasing and ill-conditioning:
$$N(f) = \left\lceil 1.2\, N_c(f_{min}) + 1.2\, \frac{f - f_{min}}{f_{max} - f_{min}} \big( N_c(f_{max}) - N_c(f_{min}) \big) \right\rceil, \quad (9)$$
where $f_{min}$ and $f_{max}$ are the minimum and maximum frequencies, respectively, and $N_c(f)$ is the largest number of planewaves that keeps the condition number $\kappa(\mathbf{R}(f))$ within a set bound or minimises $M_q(f)$ for a given frequency, $f$. Here, $\kappa(\mathbf{R})$ is the condition number of the upper triangular matrix and $M_q$ is the mean squared error (MSE) in the quiet zone, which is limited to $M_q \leq -60$ dB. This work focuses on reproducing small-room-sized soundfields with a bandwidth of $150\ \mathrm{Hz} \leq f \leq 8\ \mathrm{kHz}$. Due to the size of the soundfields, zone weighting for $f < 150\ \mathrm{Hz}$ has a negligible effect, as can be seen to trend in Figure 2. Analysis shows that $N_c(150) \approx 30$ and $N_c(8000) \approx 300$ based on (9). This allows the wideband system to be synthesised using the orthogonal basis expansion method with minimal error caused by the selection of $N$.
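Under the linear-interpolation form of (9) reconstructed above, with $N_c(150) \approx 30$ and $N_c(8000) \approx 300$, the planewave count per frequency bin can be obtained as in this sketch (the linear form and the 20% headroom factor follow that reconstruction, so treat them as assumptions):

```python
import numpy as np

def num_planewaves(f, f_min=150.0, f_max=8000.0, Nc_min=30, Nc_max=300):
    """N(f) under the reconstructed form of (9): linear interpolation between
    N_c(f_min) and N_c(f_max) with a 1.2x headroom factor."""
    frac = (f - f_min) / (f_max - f_min)
    return int(np.ceil(1.2 * (Nc_min + frac * (Nc_max - Nc_min))))
```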
5. LOOK-UP TABLE BASED SYNTHESIS AND
WEIGHTING
It is computationally demanding to construct a weighted multizone
soundfield using the methods of Sections 2 and 3 due to the QR
factorisation involved for all time-frequency components (e.g. a three-second audio file sampled at 16 kHz requires approximately 48 × 10³ independent reproductions). In order to best make use of
these repeated reproductions, the loudspeaker weights and
soundfield pressure samples can be reproduced and stored for later
use. Once enough values have been stored, interpolation between
them can further reduce computation and error caused by quantised
values. In this paper we propose the use of Look-Up Tables (LUT)
to store pre-determined weighted soundfield values to be used for a
given setup or wideband reproduction, with an example shown in
Figure 2. A LUT may be described as a matrix of soundfield
reproduction values for a given frequency and weight range,
$$\mathbf{A}_{u \times v} = \begin{bmatrix} S^{a}_{disc}(\mathbf{x}, k_{min}, w_{min}) & \cdots & S^{a}_{disc}(\mathbf{x}, k_{min}, w_{max}) \\ \vdots & \ddots & \vdots \\ S^{a}_{disc}(\mathbf{x}, k_{max}, w_{min}) & \cdots & S^{a}_{disc}(\mathbf{x}, k_{max}, w_{max}) \end{bmatrix}, \quad (10)$$
where $S^{a}_{disc}(\mathbf{x}, k, w)$ from (3) is a soundfield reproduction value for wavenumber, $k$, and weighting, $w$, at $\mathbf{x}$, and $\mathbf{A}_{u \times v}$ is a LUT with $u$ frequencies and $v$ weights in the ranges $\{k_{min}, \dots, k_{max}\}$ and $\{w_{min}, \dots, w_{max}\}$, respectively. The set of frequencies is logarithmically spaced as it closely resembles the spacing of the Bark scale [13], and the set of weights is logarithmically spaced as it provides larger control ranges on the decibel scale. The table can be built for loudspeaker signals by replacing $S^{a}_{disc}(\mathbf{x}, k, w)$ with $\rho_l(k, w)$ from (4).
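A sketch of populating the LUT of (10) over logarithmically spaced frequencies and weights is shown below; `reproduce_soundfield` is a hypothetical stand-in for evaluating (3) (or (4) for loudspeaker signals) averaged over a zone, and the default ranges mirror those used in Section 6.

```python
import numpy as np

def build_lut(reproduce_soundfield, u, v, f_min=150.0, f_max=8000.0,
              w_min=1e-2, w_max=1e4, c=343.0):
    """Eq. (10) sketch: LUT of zone soundfield values over u frequencies and v weights.

    reproduce_soundfield(k, w_q) -> zone-averaged value of S_disc^a for wavenumber k
    and quiet-zone weight w_q (stands in for the full reproduction of (3))."""
    freqs = np.logspace(np.log10(f_min), np.log10(f_max), u)    # log spacing, Bark-like
    weights = np.logspace(np.log10(w_min), np.log10(w_max), v)  # log spacing for dB-scale control
    A = np.empty((u, v))
    for i, f in enumerate(freqs):
        for j, w_q in enumerate(weights):
            A[i, j] = reproduce_soundfield(2 * np.pi * f / c, w_q)
    return freqs, weights, A
```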
In order to evaluate the error and perceptual effects of
quantising and interpolating soundfield values, a comparison is
made between two LUTs (see Section 6). The MSE between an interpolated lower resolution LUT and the highest resolution LUT is evaluated as,
$$\epsilon_{LUT} = \frac{1}{\bar{u}\bar{v}} \sum_{\bar{u}, \bar{v}} \left( \tilde{A}_{\bar{u}\bar{v}} - \bar{A}_{\bar{u}\bar{v}} \right)^{2}, \quad (11)$$
where $\epsilon_{LUT}$ is the MSE for the given interpolated LUT, $\tilde{\mathbf{A}}_{\bar{u} \times \bar{v}}$, relative to the highest resolution LUT, $\bar{\mathbf{A}}_{\bar{u} \times \bar{v}}$, $\bar{u}$ is the highest frequency resolution, $\bar{v}$ is the highest weight resolution, and $u$ and $v$ are the sets of frequency and weight resolutions to evaluate, respectively. The interpolated LUT, $\tilde{\mathbf{A}}_{\bar{u} \times \bar{v}}$, is a matrix of size $\bar{u} \times \bar{v}$ obtained by interpolation of a smaller matrix, $\mathbf{A}_{u \times v}$. In this work bilinear interpolation is used.
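The comparison in (11) can be sketched as below, using bilinear interpolation over normalised grid indices (a simplifying assumption that both LUTs span the same logarithmically spaced ranges):

```python
import numpy as np
from scipy.interpolate import RegularGridInterpolator

def lut_mse_db(A_low, A_high):
    """Eq. (11) sketch: upsample a low-resolution LUT to the reference resolution by
    bilinear interpolation and return the MSE (in dB) against the reference LUT.
    A real-valued LUT (e.g. the SPL values of Figure 2) is assumed."""
    u, v = A_low.shape
    u_bar, v_bar = A_high.shape
    interp = RegularGridInterpolator(
        (np.linspace(0, 1, u), np.linspace(0, 1, v)), A_low, method="linear")
    grid = np.stack(np.meshgrid(np.linspace(0, 1, u_bar),
                                np.linspace(0, 1, v_bar), indexing="ij"), axis=-1)
    A_interp = interp(grid.reshape(-1, 2)).reshape(u_bar, v_bar)
    mse = np.mean((A_interp - A_high) ** 2)
    return 10 * np.log10(mse)
```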
6. RESULTS
This section describes simulations of reproduced soundfields using
the above approach evaluated using MSE and PESQ measures.
6.1. Evaluation Setup
The multizone soundfield layout of Figure 1 is evaluated, where $r = 0.3\ \mathrm{m}$, $r_z = 0.6\ \mathrm{m}$, $R = 1\ \mathrm{m}$, $R_l = 1.5\ \mathrm{m}$, $\theta = \sin^{-1}(r / 2r_z) \approx 14.5°$ and $\phi_c = \pi \approx 3.14159$. This setup is similar to [6] and $\theta$ is chosen such that an evanescent planewave with instant decay would interfere with half the quiet zone, resulting in a large range of weighting control with a slight occlusion problem. Signals sampled at 16 kHz are converted to the time-frequency domain using a Hamming window (50% overlap) and a Fast Fourier Transform (FFT) of length 1024. The tables are analysed for a reproduction that does and does not meet the aliasing criterion for the minimum number of reproduction loudspeakers given by (5). The LUTs are built for a reproduction where $L = 16$, $\phi = \pi/2$ and $\phi_L = \pi$, referred to from here on as the aliasing setup. They are also built where $L = 65$ and $\phi_L = 2\pi$, referred to as the non-aliasing setup. The aliasing setup aliases above 4 kHz and the non-aliasing setup aliases above 8 kHz. The aliasing setup is included to show that visualising the soundfield values can provide other benefits.
Figure 2 – Example high resolution table of SPL values in the quiet zone for various weights and frequencies.

The tables are built with the pressures for all $\mathbf{x} \in \mathcal{D}_b \cup \mathcal{D}_q$ and averaged across $\mathcal{D}_b$ and $\mathcal{D}_q$, from which the soundfield zones can be approximated. The zone weights are chosen as $w_b = 1$ and $w_u = 0.05$ following [6], and the variable weight is $w_q$. The effect of $w_q$ on the input signal is evaluated using (8), which can also be reversed to find the soundfield level that obtains a desired output. This soundfield level maps to a particular $w_q$ in the LUT and, if implemented, to the $\rho_l(k, w)$ that would reproduce the level.
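The reverse mapping from a desired quiet-zone level to a weight can be sketched as a lookup in a LUT like that of Figure 2 (hypothetical function; nearest-neighbour for simplicity, where bilinear interpolation as in Section 5 would refine it):

```python
import numpy as np

def weight_for_level(target_spl_db, freqs, weights, spl_lut, f):
    """Find the quiet-zone weight w_q whose LUT entry best matches a target SPL
    at frequency f (nearest-neighbour lookup over the tabulated grid)."""
    fi = np.argmin(np.abs(freqs - f))                        # nearest tabulated frequency
    wi = np.argmin(np.abs(spl_lut[fi, :] - target_spl_db))   # weight giving the closest SPL
    return weights[wi]
```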
Without interpolation or LUTs, the highest frequency resolution is 512 (based on the 1024-length FFT) with 256 different weight values, which results in negligible reconstruction error. Each table is built for resolutions consecutively halving down to 16 frequencies and 8 weights. In this work we have $w_q \in \{10^{-2}, \dots, 10^{4}\}$, which extends the range used in [6]. The error between the different LUTs is evaluated using (11), where the highest resolution for frequency is $\bar{u} = 512$ and for weight $\bar{v} = \bar{u}/2$, and the sets of frequency and weight resolutions to evaluate are $u \in \{16, 32, 64, 128, 256\}$ and $v = u/2$, respectively.
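The relative complexity labels of Figures 4 and 5 follow directly from the ratio of LUT entries; a quick check under the resolutions listed above:

```python
# Relative decrease in individual reproductions versus the 512 x 256 reference LUT.
u_bar, v_bar = 512, 256
for u in (16, 32, 64, 128, 256):
    v = u // 2
    print(f"{u} x {v}: {(u_bar * v_bar) // (u * v)}x fewer reproductions")
# The coarsest 16 x 8 LUT needs 1024x fewer reproductions (about 0.10% of the reference).
```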
The proposed approach is further evaluated using PESQ [10]
to estimate the perceptual quality of the reproduced soundfields.
Speech files for the evaluation are taken from the TIMIT corpus [14]
where 20 files are chosen randomly. The male to female speaker
ratio of these files is 50 : 50 . The reference signal for the PESQ
algorithm is the original speech signal. PESQ values are obtained
for the reproduced speech soundfields using the different resolution
LUTs and then mapped to the PESQ Mean Opinion Score (MOS)
[15]. These reproductions use $w_q \in \{10^{-0.5}, 10^{0.5}, 10^{1.5}, 10^{2.5}\}$ such that they lie primarily in the centre of the interpolation regions. This allows the highest resolution LUT to be evaluated; however, due to the computational complexity, the evaluation is limited to four different weights.
6.2. Evaluation Results
Figure 3 illustrates the aliasing threshold described by (5).
Here, it is expected that aliasing will occur above 4 kHz and a pattern is noticeable, as marked by the black and white line, where it can be seen that the amplitude in both the bright and quiet zones becomes discernibly larger. It can also be seen that at about 8 kHz in the quiet zone significant aliasing occurs and, when the weighting is increased, it occurs at lower frequencies.
Analysing the MSE results between the different interpolation distances (Figure 4) indicates that the lower resolution LUTs require significantly fewer computations than those of the higher resolutions. This can be observed from the labels that show the relative decrease in the number of reproduced soundfields, which measured up to 1024 times fewer, at just 0.10% of the number of computations of the highest resolution LUT, with an MSE of −85 dB, comparable to high-end audio systems. In general, an increase in the interpolation distance increases the MSE.
As can be seen in Figure 5, the increased MSE caused by larger interpolation distances appears to have no discernible impact on the perceptual quality, where the maximum mapped MOS is indicated by the red line. Figure 5 does show, however, a slight increase in the variability of the PESQ MOS, as indicated by the 95% confidence interval markers, where larger interpolation distances are required. This shows that interpolating the weighted soundfield values has an indiscernible perceptual effect on the reproduction and decreases the computational complexity of the problem, with 1024 times fewer individual soundfield reproductions.
7. CONCLUSIONS
This paper proposed a method for building multizone soundfields
for speech signals which allows dynamic control of the weighting
between zones. We have proposed a method for reducing the
computational effort involved when dynamically weighting zones
for speech signals. We have also proposed a method for determining
the number of planewaves required for multiple frequency systems
utilising the orthogonal basis expansion approach. The methods have been evaluated and show an indiscernible impact on the perceptual quality of reproductions together with decreased computational complexity. The evaluations show PESQ MOS values around 4.4, an MSE around −85 dB and 1024 times fewer individual reproductions. Also demonstrated in
this paper is the use of LUTs for visualising speech soundfields and
possible reproduction problems. For instance, media designers or
engineers may benefit from the visualisation of these LUTs to better
predict when aliasing or other phenomena occur in a system.
8. ACKNOWLEDGEMENTS
The authors would like to thank Wenyu Jin for his helpful insight
into the orthogonal basis expansion method.
Figure 5 – PESQ MOS between weighted speech files reproduced by different LUTs, with 95% confidence intervals (PESQ MOS vs. number of frequencies, for 8 to 128 weights). Labels show the relative complexity decrease from the highest resolution LUT. Red line indicates the maximum mapped PESQ MOS.
Figure 4 – MSE between different LUT resolutions (MSE in dB vs. number of frequencies, for 8 to 128 weights). Labels show the relative complexity decrease from the highest resolution LUT.
Figure 3 – LUT from the aliasing setup for the bright (A) and quiet (B) zones (zone sample amplitudes for various weights and frequencies). The black and white line marks 4 kHz aliasing.
9. REFERENCES
[1] M. Poletti, “An Investigation of 2-D Multizone Surround
Sound Systems,” presented at the Audio Engineering Society
Convention 125, 2008.
[2] M. Kolundzija, C. Faller, and M. Vetterli, “Reproducing
Sound Fields Using MIMO Acoustic Channel Inversion,”
JAES, vol. 59, no. 10, pp. 721–734, Nov. 2011.
[3] Y. J. Wu and T. D. Abhayapala, “Theory and design of
soundfield reproduction using continuous loudspeaker
concept,” Audio, Speech, and Language Processing, IEEE
Transactions on, vol. 17, no. 1, pp. 107–116, 2009.
[4] Y. J. Wu and T. D. Abhayapala, “Spatial multizone
soundfield reproduction,” in Acoustics, Speech and Signal
Processing, 2009. ICASSP 2009. IEEE International
Conference on, 2009, pp. 93–96.
[5] Y. J. Wu and T. D. Abhayapala, “Spatial multizone
soundfield reproduction: Theory and design,” Audio, Speech,
and Language Processing, IEEE Transactions on, vol. 19,
no. 6, pp. 1711–1720, 2011.
[6] W. Jin, W. B. Kleijn, and D. Virette, “Multizone soundfield
reproduction using orthogonal basis expansion,” presented at
the Acoustics, Speech and Signal Processing (ICASSP),
2013 IEEE International Conference on, 2013, pp. 311–315.
[7] H. Chen, T. D. Abhayapala, and W. Zhang, “Enhanced sound
field reproduction within prioritized control region,” in
INTER-NOISE and NOISE-CON Congress and Conference
Proceedings, 2014, vol. 249, pp. 4055–4064.
[8] N. Radmanesh and I. S. Burnett, “Generation of isolated
wideband sound fields using a combined two-stage lasso-ls
algorithm,” Audio, Speech, and Language Processing, IEEE
Transactions on, vol. 21, no. 2, pp. 378–387, 2013.
[9] N. Radmanesh and I. S. Burnett, “Reproduction of
independent narrowband soundfields in a multizone surround
system and its extension to speech signal sources,” in
Acoustics, Speech and Signal Processing (ICASSP), 2011
IEEE International Conference on, 2011, pp. 461–464.
[10] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra,
“Perceptual evaluation of speech quality (PESQ)-a new
method for speech quality assessment of telephone networks
and codecs,” in 2001 IEEE International Conference on
Acoustics, Speech, and Signal Processing, 2001.
Proceedings (ICASSP ’01), 2001, vol. 2, pp. 749–752.
[11] G. H. Golub and C. F. Van Loan, Matrix Computations (Johns Hopkins Studies in Mathematical Sciences). Johns Hopkins University Press, 1996.
[12] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography. Academic Press, 1999.
[13] M. Bosi and R. E. Goldberg, Introduction to digital audio
coding and standards. Springer, 2003.
[14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and
D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous
speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA
STI/Recon Technical Report N, vol. 93, p. 27403, 1993.
[15] ITU-T Rec. P.862.1, “Mapping function for transforming P.862 raw result scores to MOS-LQO,” International Telecommunication Union, Geneva, 2003.