
An Efficient Approach to Dynamically Weighted Multizone Wideband Reproduction of Speech Soundfields

Abstract

This paper proposes and evaluates an efficient approach for practical reproduction of multizone soundfields for speech sources. The reproduction method, based on a previously proposed approach, utilises weighting parameters to control the soundfield reproduced in each zone whilst minimising the number of loudspeakers required. Proposed here is an interpolation scheme for predicting the weighting parameter values of the multizone soundfield model that otherwise requires significant computational effort. It is shown that initial computation time can be reduced by a factor of 1024 with only -85dB of error in the reproduced soundfield relative to reproduction without interpolated weighting parameters. The perceptual impact on the quality of the speech reproduced using the method is also shown to be negligible. By using pre-saved soundfields determined using the proposed approach, practical reproduction of dynamically weighted multizone soundfields of wideband speech could be achieved in real-time. Index Terms— multizone soundfield reproduction, wideband multizone soundfield, weighted multizone soundfield, look-up tables (LUT), interpolation, sound field synthesis (SFS)
© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including
reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or
reuse of any copyrighted component of this work in other works.
Jacob Donley, Christian Ritz
School of Electrical Computer and Telecommunications Engineering, University of Wollongong,
Wollongong, NSW, Australia, 2522, jrd089@uowmail.edu.au, critz@uow.edu.au
1. INTRODUCTION
The reproduction of audio in spatially separated regions is of high
interest for applications such as the creation of personal listening
zones in entertainment/cinema, multi-participant teleconferencing
and vehicle cabins. This can be achieved using multizone soundfield
reproduction and was originally proposed in [1] to reproduce
multiple independent zones of active and quiet 2D soundfield
regions. In [1] a least squares pressure matching technique was used,
where estimated acoustic pressures reproduced by a set of
loudspeakers are matched to sample values within the desired
soundfield zone. This technique is also used in Sound Field
Reconstruction (SFR) [2].
Following [1], a multizone approach using cylindrical
harmonic expansion with coefficient translation and angular
windowing was proposed [3]–[5]. This approach, however, attempts
to completely suppress any interzone interference which can result
in impractically large loudspeaker signal amplitudes or
impractically low levels in zones. A method better suited for
implementation and controllability was introduced in [6]. The
approach uses orthogonal basis expansion which reduces the
problem to the reconstruction of a set of basis wavefields and allows
each zone to be weighted according to the importance of its
reproduction. This weighting improves the practicality of the system
by relaxing the ideal requirement of completely quiet zones outside
the target zone. The theory in [5] was extended in [7] to include
weighting criteria similar to those used in [6].
While originally focusing on single frequency soundfields,
more recent work attempted to create multizone soundfields [8], [9]
with frequency bandwidths equivalent to narrowband and wideband
speech. This paper investigates the use of weighted multizone
soundfield reproduction for wideband speech using the orthogonal
basis expansion of [6]. The orthogonal basis expansion approach
assumes the soundfield is reproduced as a sum of independent
planewaves reproduced by an array of loudspeakers. Each
planewave corresponds to an individual frequency and direction.
The number of chosen planewaves is governed by the desired
reproduction accuracy and varies depending on frequency. Due to
this, a formula is proposed in this work which determines the
number of planewaves to be used for a given frequency so that
neither spatial aliasing nor ill-conditioned matrices is an inherent
problem. This is necessary when dealing with the reproduction of
soundfields containing multiple frequencies.
While [6] assumes the same weight for each frequency,
dynamically deriving the weights can be used to control the
reproduction accuracy of individual frequency components within
the bright and quiet zones. For example, the weightings can be based
on the perceptual importance of particular frequencies in the zones
in an effort to improve the overall perceived sound quality.
However, this results in increases in computational complexity. To
reduce this complexity and create a more practical solution, this
paper proposes the interpolation of spatial components of the
reproduction along different domains, such as the weighting domain
and frequency domain.
A system is synthesised with varying linear interpolation
distances by using different resolution lookup tables (LUTs) for
storing pre-computed loudspeaker weights and soundfield values.
The synthesis comprises reproducing wideband zones where
frequency domain content is weighted uniformly with weights that
are in the centre of interpolation regions. The approach is validated
by comparing the reproduced zone signals from the interpolation
method with signals reproduced without interpolation using Mean
Squared Error (MSE) and Perceptual Evaluation of Speech Quality
(PESQ) [10] measures.
Section 2 describes the orthogonal basis expansion approach
to multizone soundfield reproduction. The proposed dynamically
weighted multizone approach is described in Section 3 and a novel
approach to selecting the number of orthogonal planewaves is
presented in Section 4. Section 5 describes the interpolation method.
Evaluation results are given in Section 6 and conclusions outlined in
Section 7.
2. MONOFREQUENT MULTIZONE SOUNDFIELD
REPRODUCTION
A multizone soundfield reproduction layout is depicted in Figure 1. The reproduction region, $\mathbb{D}$, of radius $R$, contains three sub-regions called the bright zone, quiet zone and unattended zone, denoted by $\mathbb{D}_b$, $\mathbb{D}_q$ and $\mathbb{D} \setminus (\mathbb{D}_b \cup \mathbb{D}_q)$, respectively. The radius of $\mathbb{D}_b$ and $\mathbb{D}_q$ is $r$ and their centres are located on a circle of radius $r_z$. The angle of the desired planewave in $\mathbb{D}_b$ is $\theta$ and is reproduced by loudspeakers positioned on an arc of angle $\phi_L$ and radius $R_l$ with the first loudspeaker starting at angle $\phi$.
In the orthogonal basis expansion approach to multizone soundfield reproduction [11], a soundfield function $S(\mathbf{x}, k)$ that fulfils the wave equation, where $\mathbf{x} \in \mathbb{D}$ is an arbitrary spatial sampling point and $k$ is the wavenumber of the soundfield, is written as a summation of an orthogonal set of the Helmholtz equation solutions [12] as

$$S(\mathbf{x}, k) = \sum_{n} C_n(k, w)\, G_n(\mathbf{x}, k) \qquad (1)$$

where $\{G_n\}$ is a series of weighted basis functions and given a desired soundfield, $S^d(\mathbf{x}, k)$, the expansion coefficients are derived as $C_n(k, w) = \int_{\mathbb{D}} S^d(\mathbf{x}, k)\, G_n^*(\mathbf{x}, k)\, w(\mathbf{x})\, d\mathbf{x}$ using a weighted inner-product, where $w(\mathbf{x})$ represents the weighting function. The weighted QR factorisation may also use a weighted inner-product. Here, the weighting function, $w(\mathbf{x})$, can be written as $w(\mathbf{x}_b) = w_b$, $w(\mathbf{x}_q) = w_q$ and $w(\mathbf{x}_u) = w_u$, where $\mathbf{x}_b \in \mathbb{D}_b$, $\mathbf{x}_q \in \mathbb{D}_q$ and $\mathbf{x}_u \in \mathbb{D} \setminus (\mathbb{D}_b \cup \mathbb{D}_q)$, indicating different values for the weights within each of the bright ($\mathbb{D}_b$), quiet ($\mathbb{D}_q$), and unattended zones, respectively as illustrated in Figure 1.
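As a minimal sketch of the weighted inner-product, its discrete analogue (a sum of $f(\mathbf{x})\, g^*(\mathbf{x})\, w(\mathbf{x})$ over sample points) can be used to orthonormalise a few sampled planewaves and to project a desired soundfield onto them. A Gram-Schmidt pass is used here as a simple stand-in for the weighted QR factorisation. The sample grid, zone centres, test frequency, quiet zone weight and all function names below are hypothetical choices for illustration, not values or code from the paper.

```python
import cmath
import math

# Hypothetical layout: zone radius and zone centres chosen only for illustration.
ZONE_RADIUS = 0.3
BRIGHT_CENTRE = (-0.6, 0.0)
QUIET_CENTRE = (0.6, 0.0)


def in_zone(x, y, centre):
    """True when the point (x, y) lies inside the circular zone at centre."""
    return math.hypot(x - centre[0], y - centre[1]) <= ZONE_RADIUS


def zone_weight(x, y, w_b=1.0, w_q=10.0, w_u=0.05):
    """Piecewise-constant weighting function over the three zones."""
    if in_zone(x, y, BRIGHT_CENTRE):
        return w_b
    if in_zone(x, y, QUIET_CENTRE):
        return w_q
    return w_u


def weighted_inner(f, g, w):
    """Discrete weighted inner product: sum of f(x) * conj(g(x)) * w(x)."""
    return sum(fi * gi.conjugate() * wi for fi, gi, wi in zip(f, g, w))


def planewave(points, k, angle):
    """Samples of a unit-amplitude planewave travelling at the given angle."""
    return [cmath.exp(1j * k * (x * math.cos(angle) + y * math.sin(angle)))
            for x, y in points]


def weighted_gram_schmidt(functions, w):
    """Orthonormalise sampled functions under the weighted inner product."""
    basis = []
    for f in functions:
        g = list(f)
        for b in basis:
            c = weighted_inner(g, b, w)  # projection of g onto b
            g = [gi - c * bi for gi, bi in zip(g, b)]
        norm = math.sqrt(weighted_inner(g, g, w).real)
        basis.append([gi / norm for gi in g])
    return basis


# Sample points on a coarse grid inside a reproduction region of radius 1 m.
points = [(0.1 * ix, 0.1 * iy)
          for ix in range(-10, 11) for iy in range(-10, 11)
          if math.hypot(0.1 * ix, 0.1 * iy) <= 1.0]
weights = [zone_weight(x, y) for x, y in points]

k = 2.0 * math.pi * 500.0 / 343.0  # wavenumber at an arbitrary 500 Hz
angles = [2.0 * math.pi * n / 8 for n in range(8)]
basis = weighted_gram_schmidt([planewave(points, k, a) for a in angles], weights)

# Desired soundfield: a planewave inside the bright zone, silence elsewhere.
desired = [s if in_zone(x, y, BRIGHT_CENTRE) else 0.0
           for s, (x, y) in zip(planewave(points, k, 0.0), points)]

# Expansion coefficients: weighted projection of the desired field on the basis.
coefficients = [weighted_inner(desired, g, weights) for g in basis]
```

Raising `w_q` makes errors in the quiet zone count more heavily in every inner product, which is how the weighting steers the reproduction towards a quieter quiet zone.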
Following a QR factorisation on a set of $N$ planewaves, $F_n(\mathbf{x}, k)$, arriving from angles in the range $0$ to $2\pi$, (1) becomes,

$$S(\mathbf{x}, k, w) = \sum_{j} P_j(k, w)\, F_j(\mathbf{x}, k) \qquad (2)$$

where the coefficients for the wavefields are $P_j(k, w) = \sum_{n} C_n(k, w)\, (\mathbf{R}^{-1})_{jn}$, $j \in \{1, \ldots, N\}$, $n \in \{1, \ldots, N\}$ and $\mathbf{R}$ is the upper triangular matrix from the QR factorisation [6].
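The step from (1) to (2) can be made explicit. As a sketch, assume the factorisation expresses each basis function as a combination of the planewaves, $G_n = \sum_j (\mathbf{R}^{-1})_{jn} F_j$. This relation is an assumption consistent with a QR factorisation of the planewave set, $F = GR$; it is not stated explicitly in the text. Substituting it into (1) and swapping the order of summation gives the wavefield coefficients:

```latex
\begin{aligned}
S(\mathbf{x}, k)
  &= \sum_{n} C_n(k, w)\, G_n(\mathbf{x}, k) \\
  &= \sum_{n} C_n(k, w) \sum_{j} (\mathbf{R}^{-1})_{jn}\, F_j(\mathbf{x}, k) \\
  &= \sum_{j} \Big[ \sum_{n} C_n(k, w)\, (\mathbf{R}^{-1})_{jn} \Big] F_j(\mathbf{x}, k)
   = \sum_{j} P_j(k, w)\, F_j(\mathbf{x}, k).
\end{aligned}
```

Because $\mathbf{R}^{-1}$ is upper triangular, only terms with $n \geq j$ contribute to each $P_j$.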
The reproduction of the soundfield in (2) at a particular location can then be expressed as a summation of the discontinuously located loudspeaker signals [3],

$$S^a_{disc}(\mathbf{x}, k, w) = \sum_{l=1}^{L} Q_l(k, w)\, \frac{i}{4} H_0^{(1)}\left(k \lVert \mathbf{x} - \mathbf{x}_l \rVert\right) \qquad (3)$$

where $Q_l(k, w)$ is the $l$th complex loudspeaker weight, $L$ is the number of loudspeakers, $i = \sqrt{-1}$, $H_0^{(1)}(k \lVert \cdot \rVert)$ is a zeroth-order Hankel function of the first kind [12] and $\mathbf{x}_l$ is the position of the $l$th loudspeaker. The complex loudspeaker weights are defined as,

$$Q_l(k, w) = \sum_{m=-M}^{M} \sum_{j=1}^{J} \frac{2\, e^{i m \phi_l}\, \Delta\phi_s}{i \pi H_m^{(1)}(k R_l)}\, i^m e^{-i m \phi_{pw}}\, P_j(k, w) \qquad (4)$$

where $M = \lceil kR \rceil$ is the truncation length [3], $R$ and $R_l$ are from Figure 1, $\phi_{pw} = (j - 1)\Delta\phi$ are the wavefield angles, $\Delta\phi = 2\pi / J$, $\phi_l$ is the angle of the $l$th loudspeaker from the x-axis and $\Delta\phi_s$ is the angular spacing of the loudspeakers. The minimum number of loudspeakers to avoid aliasing is given by,

$$L = 2 \left\lceil \frac{e R k_{max}}{2} \right\rceil + 1 \qquad (5)$$

where $L$ is the minimum number of required loudspeakers, $e$ is Euler's number and $k_{max}$ is the wavenumber of the highest frequency (where frequency, $f = kc / 2\pi$ [12] and $c = 343\,\mathrm{m\,s^{-1}}$ is the speed of sound). See [3] for further details.
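The relation between frequency and wavenumber, and its effect on the number of modes, can be sketched numerically. The snippet below uses $k = 2\pi f / c$ with $c = 343$ m/s and assumes the truncation length is the ceiling of $kR$; that rounding and the function names are assumptions made for illustration.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, as stated in the text


def wavenumber(frequency_hz, c=SPEED_OF_SOUND):
    """Wavenumber k = 2*pi*f / c in rad/m."""
    return 2.0 * math.pi * frequency_hz / c


def truncation_length(frequency_hz, region_radius_m):
    """Mode truncation length M, assumed here to be ceil(k * R)."""
    return math.ceil(wavenumber(frequency_hz) * region_radius_m)


# Print k, M and the number of modes (2M + 1) for a 1 m reproduction region.
for f in (150.0, 1000.0, 8000.0):
    k = wavenumber(f)
    m = truncation_length(f, 1.0)
    print(f"{f:7.0f} Hz  k = {k:8.3f} rad/m  M = {m:3d}  modes = {2 * m + 1}")
```

The number of modes grows linearly with frequency, so the highest frequency of a wideband signal dominates the cost of evaluating the loudspeaker weights.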
3. WEIGHTED MULTIZONE WIDEBAND SOUNDFIELDS
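A wideband soundfield is described here as a linear combination of planewaves corresponding to each source frequency, similar to a Fourier series [12] and the approach of [8], [9]. The pressure generated at any point in the reproduced soundfield is given by,

$$\hat{p}(\mathbf{x}, w) = \sum_{k}^{K} S^a_{disc}(\mathbf{x}, k, w) \qquad (6)$$

where there are $K$ different sinusoidal components. Multiple nested summations are required to derive $\hat{p}(\mathbf{x}, w)$ as:

$$\hat{p}(\mathbf{x}, w) = \sum_{k}^{K} \sum_{l}^{L} \sum_{m=-M}^{M} \sum_{j}^{J} \sum_{n}^{N} \delta_{kl}(\mathbf{x})\, \gamma_{klm}\, \beta_{m}\, \alpha_{jn} \int_{\mathbb{D}} S^d(\mathbf{x}, k)\, G_n^*(\mathbf{x}, k)\, w(\mathbf{x})\, d\mathbf{x} \qquad (7)$$

where $\alpha_{jn} = (\mathbf{R}^{-1})_{jn}$, $\beta_{m} = i^m e^{-i m \phi_{pw}}$, $\gamma_{klm} = 2\, e^{i m \phi_l}\, \Delta\phi_s / (i \pi H_m^{(1)}(k R_l))$ and $\delta_{kl}(\mathbf{x}) = i H_0^{(1)}(k \lVert \mathbf{x} - \mathbf{x}_l \rVert) / 4$. Here, a summation occurs for every sample in $\mathbb{D}$, for $N$ planewaves, $J$ wavefields, $2M + 1$ modes, $L$ loudspeakers and $K$ sinusoidal components.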
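To give a feel for the cost of the nested summation in (7), the number of summed terms per evaluation point can be counted. This is a rough sketch that assumes, for simplicity, the same truncation length, number of wavefields and number of planewaves at every frequency (in practice they vary with frequency), and it ignores the cost of the integral over the region. The function name and the example values are hypothetical.

```python
def nested_summation_terms(num_frequencies, num_loudspeakers, truncation_length,
                           num_wavefields, num_planewaves):
    """Number of terms in the K x L x (2M + 1) x J x N nested summation."""
    modes = 2 * truncation_length + 1
    return (num_frequencies * num_loudspeakers * modes
            * num_wavefields * num_planewaves)


# Hypothetical example: 512 frequencies, 65 loudspeakers, M = 20, J = N = 100.
terms = nested_summation_terms(512, 65, 20, 100, 100)
print(f"{terms:.3e} terms per evaluation point")
```

Even with these modest values the count is in the billions per point, which motivates storing and reusing computed soundfield values.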
This paper extends the approach of [6] to allow for frequency-dependent weighting functions such that the leaked frequency spectrum can be controlled. This may be beneficial in cases where occlusion is a problem [9] and may improve the perceivable error compared to a single zone weight. To weight signals dynamically for each frequency, during the soundfield construction in (3), loudspeaker weights and soundfield samples are obtained per time-frequency component for a free field reproduction scenario (future work will look into reproduction in reverberant rooms). For example, if a signal from an arbitrary location in the reproduction area, defined in (3), was to be synthesised,

$$\hat{Y}_F(\mathbf{x}, k) = \left| S^a_{disc}(\mathbf{x}, k, w) \right| Y_F(k) \qquad (8)$$

where $\hat{Y}_F(\mathbf{x}, k)$ is the frequency domain signal at an arbitrary location, $\mathbf{x}$, in the reproduction region, $\mathbb{D}$, $Y_F(k)$ is obtained from the short-time Fourier transform of the windowed frame of input $y(n)$ and $|\cdot|$ denotes the absolute value. If loudspeaker signals are preferred to be synthesised then $|S^a_{disc}(\mathbf{x}, k, w)|$ can be replaced with $Q_l(k, w)$ from (4) and $\hat{Y}_F(\mathbf{x}, k)$ becomes $\hat{Y}_{Fl}(k)$. Time domain loudspeaker signals, $\hat{y}_l(n)$, and/or soundfields, $\hat{y}(\mathbf{x}, n)$, are derived from (8) via overlap-add reconstruction [13].

Figure 1 – Weighted multizone soundfield reproduction layout
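A minimal sketch of the per-bin weighting in (8): each frequency bin of one STFT frame is scaled by the magnitude of the soundfield value for that bin. The function name and the example numbers are hypothetical, and the STFT and overlap-add stages are left out.

```python
def weight_frame_spectrum(frame_spectrum, soundfield_values):
    """Scale each frequency bin by the magnitude of its soundfield value."""
    if len(frame_spectrum) != len(soundfield_values):
        raise ValueError("one soundfield value is needed per frequency bin")
    return [abs(s) * y for s, y in zip(soundfield_values, frame_spectrum)]


# Hypothetical three-bin frame and per-bin (complex) soundfield values.
frame = [1 + 1j, 2 + 0j, -0.5j]
field = [0.5, 1j, 0.1 + 0j]
weighted = weight_frame_spectrum(frame, field)
```

Only the magnitude of the soundfield value is applied, so the phase of the input spectrum is preserved in each bin.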
4. ORTHOGONAL PLANEWAVE SELECTION
The wavefield coefficients depend on the expansion coefficients and the upper triangular matrix resulting from QR factorisation computed on a set of $N$ planewave functions arriving from angles in the range $0$ to $2\pi$. Hence, the computation increases with both $N$ and the spatial sampling density of the reproduction zone. A larger $N$ results in a more accurate reproduction; however, in practice, this also results in poorly conditioned upper triangular matrices. A lower $N$ causes spatial aliasing. The effect of $N$ is also frequency dependent. For the layout in Figure 1, $N$ is chosen to balance the impact of spatial aliasing and ill-conditioning:

$$N(f) = \frac{N_c(f_{max}) - N_c(f_{min})}{(f_{max} - f_{min})^{1.2}} (f - f_{min})^{1.2} + N_c(f_{min}) \qquad (9)$$

where $f_{min}$ and $f_{max}$ are the minimum and maximum frequencies, respectively, $N_c(f)$ is the largest number of planewaves that satisfies $\kappa(\mathbf{R}(f)) \leq 10^{10}$ or $\min(M_q(f))$ for a given frequency, $f$. Here, $\kappa(\mathbf{R})$ is the condition number of the upper triangular matrix, $M_q$ is the mean squared error (MSE) in the quiet zone and is limited to $-60\,\mathrm{dB}$. This work focuses on reproducing small room sized soundfields with a bandwidth $150\,\mathrm{Hz} \leq f \leq 8\,\mathrm{kHz}$. Due to the size of the soundfields, zone weighting for $f < 150\,\mathrm{Hz}$ has a negligible effect, as the trend in Figure 2 shows. Analysis shows that $N_c(150) = 30$ and $N_c(8000) = 300$ based on (9). This allows the wideband system to be synthesised using the orthogonal basis expansion method with minimal error caused by the selection of $N$.
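The planewave selection rule can be sketched in a few lines, reading (9) as a power-law interpolation with exponent 1.2 between the two analysed end points, 30 planewaves at 150 Hz and 300 planewaves at 8 kHz. Rounding up to an integer is an assumption added here, and the function and constant names are hypothetical.

```python
import math

F_MIN_HZ, F_MAX_HZ = 150.0, 8000.0  # bandwidth considered in the text
N_C_MIN, N_C_MAX = 30, 300          # analysed end points from the text


def planewave_count(frequency_hz, exponent=1.2):
    """Number of planewaves for a frequency, rounded up (assumed rounding)."""
    if not F_MIN_HZ <= frequency_hz <= F_MAX_HZ:
        raise ValueError("frequency is outside the supported bandwidth")
    ratio = (frequency_hz - F_MIN_HZ) / (F_MAX_HZ - F_MIN_HZ)
    return math.ceil((N_C_MAX - N_C_MIN) * ratio ** exponent + N_C_MIN)


for f in (150.0, 1000.0, 4000.0, 8000.0):
    print(f, planewave_count(f))
```

Because the exponent is greater than one, the count grows slowly at low frequencies and faster towards the top of the band.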
5. LOOK-UP TABLE BASED SYNTHESIS AND
WEIGHTING
It is computationally demanding to construct a weighted multizone soundfield using the methods of Sections 2 and 3 due to the QR factorisation involved for all time-frequency components (e.g. a three second audio file sampled at 16 kHz requires approximately $48 \times 10^3$ independent reproductions). In order to best make use of these repeated reproductions, the loudspeaker weights and soundfield pressure samples can be reproduced and stored for later use. Once enough values have been stored, interpolation between them can further reduce computation and error caused by quantised values. In this paper we propose the use of Look-Up Tables (LUT) to store pre-determined weighted soundfield values to be used for a given setup or wideband reproduction, with an example shown in Figure 2. A LUT may be described as a matrix of soundfield reproduction values for a given frequency and weight range,

$$\mathbf{A}_{u \times v} = \begin{bmatrix} S^a_{disc}(\mathbf{x}, k_{min}, w_{min}) & \cdots & S^a_{disc}(\mathbf{x}, k_{min}, w_{max}) \\ \vdots & \ddots & \vdots \\ S^a_{disc}(\mathbf{x}, k_{max}, w_{min}) & \cdots & S^a_{disc}(\mathbf{x}, k_{max}, w_{max}) \end{bmatrix} \qquad (10)$$

where $S^a_{disc}(\mathbf{x}, k, w)$ from (3) is a soundfield reproduction value for wavenumber, $k$, and weighting, $w$, at $\mathbf{x}$ and $\mathbf{A}_{u \times v}$ is a LUT with $u$ number of frequencies and $v$ number of weights in the range $\{k_{min}, \ldots, k_{max}\}$ and $\{w_{min}, \ldots, w_{max}\}$, respectively. The set of frequencies is logarithmically spaced as it closely resembles the spacing of the Bark scale [13] and the set of weights is logarithmically spaced as it provides larger control ranges in the decibel scale. The table can be built for loudspeaker signals by replacing $S^a_{disc}(\mathbf{x}, k, w)$ with $Q_l(k, w)$ from (4).
In order to evaluate the error and perceptual effects of quantising and interpolating soundfield values, a comparison is made between two LUTs (see Section 6). The MSE is evaluated as the difference between the interpolated values of lower and higher resolution LUT values as,

$$\epsilon_{LUT} = \frac{1}{\hat{u}\hat{v}} \left\lVert \frac{\tilde{\mathbf{A}}_{\hat{u} \times \hat{v}} - \mathbf{A}_{\hat{u} \times \hat{v}}}{\mathbf{A}_{\hat{u} \times \hat{v}}} \right\rVert^2 \qquad (11)$$

where $\epsilon_{LUT}$ is the MSE for the given interpolated LUT, $\tilde{\mathbf{A}}_{\hat{u} \times \hat{v}}$, relative to the highest resolution LUT, $\mathbf{A}_{\hat{u} \times \hat{v}}$, $\hat{u}$ is the highest frequency resolution, $\hat{v}$ is the highest weight resolution, $u$ and $v$ are the set of frequency and weight resolutions to evaluate, respectively. The interpolated LUT, $\tilde{\mathbf{A}}_{\hat{u} \times \hat{v}}$, is a matrix of size $\hat{u} \times \hat{v}$ obtained from interpolation of a smaller matrix, $\mathbf{A}_{u \times v}$. In this work bilinear interpolation is used.
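The look-up and interpolation idea can be sketched as follows. A coarse table is built on logarithmically spaced frequency and weight axes, values between nodes are obtained by bilinear interpolation, and the result is compared against a finer table. Several things here are assumptions for illustration: the interpolation is carried out in log-frequency and log-weight coordinates, the error is a plain mean squared difference expressed in dB, and `toy_level` is a hypothetical smooth function standing in for real soundfield values.

```python
import math


def log_space(lo, hi, count):
    """Return `count` logarithmically spaced values from lo to hi inclusive."""
    step = (math.log10(hi) - math.log10(lo)) / (count - 1)
    return [10.0 ** (math.log10(lo) + i * step) for i in range(count)]


def build_lut(func, freqs, weights):
    """Tabulate func(f, w): one row per frequency, one column per weight."""
    return [[func(f, w) for w in weights] for f in freqs]


def _locate(axis, value):
    """Lower node index and fractional position of value along a log axis."""
    log_value = math.log10(value)
    index = 0
    while index < len(axis) - 2 and log_value > math.log10(axis[index + 1]):
        index += 1
    lo, hi = math.log10(axis[index]), math.log10(axis[index + 1])
    fraction = (log_value - lo) / (hi - lo)
    return index, min(max(fraction, 0.0), 1.0)  # clamp to the table range


def bilinear_lookup(lut, freqs, weights, f, w):
    """Bilinearly interpolate the LUT in log-frequency / log-weight space."""
    i, tf = _locate(freqs, f)
    j, tw = _locate(weights, w)
    low = (1.0 - tw) * lut[i][j] + tw * lut[i][j + 1]
    high = (1.0 - tw) * lut[i + 1][j] + tw * lut[i + 1][j + 1]
    return (1.0 - tf) * low + tf * high


def mse_db(coarse, coarse_freqs, coarse_weights, fine, fine_freqs, fine_weights):
    """MSE in dB between a fine LUT and a coarse LUT interpolated onto it."""
    total = 0.0
    for a, f in enumerate(fine_freqs):
        for b, w in enumerate(fine_weights):
            estimate = bilinear_lookup(coarse, coarse_freqs, coarse_weights, f, w)
            total += (estimate - fine[a][b]) ** 2
    mean = total / (len(fine_freqs) * len(fine_weights))
    return 10.0 * math.log10(mean) if mean > 0.0 else float("-inf")


def toy_level(f, w):
    """Hypothetical smooth stand-in for a quiet zone level in dB."""
    return -10.0 * math.log10(1.0 + w) - 3.0 * math.log10(f / 150.0)


fine_f, fine_w = log_space(150.0, 8000.0, 64), log_space(1e-2, 1e4, 32)
coarse_f, coarse_w = log_space(150.0, 8000.0, 16), log_space(1e-2, 1e4, 8)
fine = build_lut(toy_level, fine_f, fine_w)
coarse = build_lut(toy_level, coarse_f, coarse_w)
error_db = mse_db(coarse, coarse_f, coarse_w, fine, fine_f, fine_w)
print(f"MSE of interpolated 16x8 table against 64x32 table: {error_db:.1f} dB")
```

The coarse table needs 16 times fewer entries than the fine one in this toy setup; the printed error indicates what that saving costs for the stand-in function only and says nothing about the paper's measured results.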
6. RESULTS
This section describes simulations of reproduced soundfields using
the above approach evaluated using MSE and PESQ measures.
6.1. Evaluation Setup
The multizone soundfield layout of Figure 1 is evaluated, where
$r = 0.3\,\mathrm{m}$, $r_z = 0.6\,\mathrm{m}$, $R = 1\,\mathrm{m}$, $R_l = 1.5\,\mathrm{m}$, $\theta = \sin^{-1}(r / 2 r_z) \approx 14.5^\circ$
 and 3.14159c
. This setup is similar
to [6] and the planewave angle, $\theta$, is chosen such that an evanescent planewave with
instant decay would interfere with half the quiet zone resulting in a
large range of weighting control with a slight occlusion problem.
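The planewave angle quoted for this setup can be checked numerically. One geometric reading, which is an assumption about Figure 1 rather than something the text states, is that the zone centres sit a distance $2 r_z$ apart, so a beam leaving the bright zone at an angle whose sine is $r / (2 r_z)$ has its centre line passing at a distance $r$ from the quiet zone centre and therefore overlaps half of the quiet zone. The function name below is hypothetical.

```python
import math


def planewave_angle_deg(zone_radius_m, zone_centre_radius_m):
    """Angle in degrees whose sine is r / (2 * r_z)."""
    return math.degrees(math.asin(zone_radius_m / (2.0 * zone_centre_radius_m)))


angle = planewave_angle_deg(0.3, 0.6)
print(f"{angle:.2f} degrees")
```

With $r = 0.3$ m and $r_z = 0.6$ m the sine is 0.25, which gives an angle of roughly 14.5 degrees.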
Signals sampled at 16 kHz are converted to the time-frequency domain using a Hamming window (50% overlap) and Fast Fourier transform (FFT) of length 1024. The tables are analysed for a reproduction that meets, and one that does not meet, the aliasing criteria for the minimum number of reproduction loudspeakers given by (5). The LUTs are built for a reproduction where $L = 16$, $\phi = \pi / 2$ and $\phi_L = \pi$, referred to hereafter as the aliasing setup. They are also built where $L = 65$ and $\phi_L = 2\pi$, referred to as the non-aliasing setup. The aliasing setup aliases above 4 kHz and the non-aliasing setup aliases above 8 kHz. The aliasing setup is included to show that visualising the soundfield values can provide other benefits.
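The figure of roughly $48 \times 10^3$ independent reproductions quoted in Section 5 for a three second file can be sanity-checked by counting time-frequency components, that is, frames multiplied by frequency bins per frame. This accounting is an assumption (the text does not spell out how the figure was obtained), as is the choice of 512 usable bins per 1024-point FFT frame and the handling of the final partial frame. The function name is hypothetical.

```python
def time_frequency_components(duration_s, sample_rate_hz=16000,
                              fft_length=1024, overlap=0.5):
    """Count frames x bins for an STFT with the given overlap."""
    hop = int(fft_length * (1.0 - overlap))        # 512 samples at 50% overlap
    num_samples = int(duration_s * sample_rate_hz)
    num_frames = max(0, (num_samples - fft_length) // hop + 1)
    bins_per_frame = fft_length // 2               # assumed 512 usable bins
    return num_frames * bins_per_frame


count = time_frequency_components(3.0)
print(count)
```

Under these assumptions a three second file gives 92 full frames of 512 bins, which is in line with the approximate figure in the text.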
The tables are built with the pressures for all $\mathbf{x} \in \mathbb{D}_b \cup \mathbb{D}_q$ and averaged across $\mathbb{D}_b$ and $\mathbb{D}_q$, from which the soundfield zones can be approximated. The zone weights are chosen as $w_b = 1$ and $w_u = 0.05$ following [6] and the variable weight is $w_q$. The effect of $w_q$ on the input signal is evaluated using (8), which can also be reversed to find a soundfield level to obtain a desired output. This soundfield level will map to a particular $w_q$ in the LUT and, if implemented, to $Q_l(k, w)$ that would reproduce the level.

Figure 2 – Example high resolution table of values for SPL.
[Plot: SPL in Quiet Zone for Various Weights and Frequencies. Axes: Frequency (Hz), Weight, SPL (dB).]
Without interpolation or LUTs, the highest frequency resolution is 512 (based on the 1024 length FFT) and 256 different weight values are used (which results in negligible reconstruction error). Each table is built for resolutions consecutively halving down to 16 frequencies and 8 weights. In this work we have $w_q \in \{10^{-2}, \ldots, 10^{4}\}$ which extends the range used in [6]. The error between the different LUTs is evaluated using (11) where the highest resolution for frequency is $\hat{u} = 512$ and for weight $\hat{v} = \hat{u} / 2$, and the set of frequency and weight resolutions to evaluate are $u \in \{16, 32, 64, 128, 256\}$ and $v = u / 2$, respectively.
The proposed approach is further evaluated using PESQ [10] to estimate the perceptual quality of the reproduced soundfields. Speech files for the evaluation are taken from the TIMIT corpus [14] where 20 files are chosen randomly. The male to female speaker ratio of these files is 50:50. The reference signal for the PESQ algorithm is the original speech signal. PESQ values are obtained for the reproduced speech soundfields using the different resolution LUTs and then mapped to the PESQ Mean Opinion Score (MOS) [15]. These reproductions use $w_q \in \{10^{-0.5}, 10^{0.5}, 10^{1.5}, 10^{2.5}\}$ such that they lie primarily in the centre of the interpolation regions. This allows the highest resolution LUT to be evaluated; however, due to the computational complexity, the evaluation is limited to four different weights.
6.2. Evaluation Results
Figure 3 illustrates the aliasing threshold described by (5). Here, it is expected that aliasing will occur above 4 kHz and a pattern is noticeable, as noted by the black and white line, where it can be seen that the amplitude in both the bright and quiet zone becomes discernibly larger. It can also be seen that at about 8 kHz in the quiet zone significant aliasing occurs and when the weighting is increased it occurs at lower frequencies.
Analysing the MSE results between the different interpolation distances (Figure 4) indicates that the lower resolution LUTs require significantly fewer computations than those of the higher resolution. This can be observed from the labels that show the relative decrease in the number of reproduced soundfields, which measured up to 1024 times fewer, at just 0.10% of the number of computations of the highest resolution LUT, and with an MSE of -85 dB, comparable to high end audio systems. In general, an increase in the interpolation distance increases the MSE.
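The relative complexity figures follow directly from the table sizes. As a small sketch, taking the highest resolution table as 512 frequencies by 256 weights (the resolutions given in Section 6.1), the reduction for a smaller table is the ratio of the entry counts. The function and constant names are hypothetical.

```python
HIGHEST_FREQS, HIGHEST_WEIGHTS = 512, 256  # highest resolution LUT


def reduction_factor(num_freqs, num_weights):
    """How many times fewer soundfield reproductions a smaller LUT needs."""
    return (HIGHEST_FREQS * HIGHEST_WEIGHTS) / (num_freqs * num_weights)


# Rows: number of weights; columns: number of frequencies.
for num_weights in (8, 16, 32, 64, 128):
    row = [reduction_factor(nf, num_weights) for nf in (16, 32, 64, 128, 256)]
    print(num_weights, ["%dx" % factor for factor in row])

smallest = reduction_factor(16, 8)
print(f"{100.0 / smallest:.2f}% of the highest resolution computations")
```

The smallest table (16 frequencies by 8 weights) gives the factor of 1024, or about 0.10% of the computations, and the largest interpolated table (256 by 128) gives a factor of 4.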
As can be seen in Figure 5, the increased MSE caused by higher interpolation distances appears to have no discernible impact on the perceptual quality (the maximum mapped MOS is indicated by the red line). Figure 5 does show, however, a slight increase in the variability of the PESQ MOS, as indicated by the 95% confidence interval markers, at higher interpolation distances. This shows that interpolating the weighted soundfield values has an indiscernible perceptual effect on the reproduction and decreases the computational complexity of the problem, with 1024 times fewer individual soundfield reproductions.
7. CONCLUSIONS
This paper proposed a method for building multizone soundfields for speech signals which allows dynamic control of the weighting between zones. We have proposed a method for reducing the computational effort involved when dynamically weighting zones for speech signals. We have also proposed a method for determining the number of planewaves required for multiple frequency systems utilising the orthogonal basis expansion approach. The method has been evaluated and shows indiscernible impact on perceptual quality of reproductions and decreased computational complexity. The evaluations show that PESQ MOS lie around 4.4, with MSE around -85 dB and 1024 times fewer individual reproductions. Also demonstrated in this paper is the use of LUTs for visualising speech soundfields and possible reproduction problems. For instance, media designers or engineers may benefit from the visualisation of these LUTs to better predict when aliasing or other phenomena occur in a system.
8. ACKNOWLEDGEMENTS
The authors would like to thank Wenyu Jin for his helpful insight
into the orthogonal basis expansion method.
Figure 5 – PESQ MOS between weighted speech files reproduced by different LUTs with 95% confidence intervals. Labels show the relative complexity decrease from the highest resolution LUT. Red line indicates maximum mapped PESQ MOS.
[Plot: Reproduction PESQ of LUTs. Axes: PESQ MOS versus No of Frequencies (16 to 256), with one series each for 8, 16, 32, 64 and 128 weights. Labels range from 1024x (16 frequencies, 8 weights) to 4x (256 frequencies, 128 weights).]
Figure 4 – MSE between different LUT resolutions. Labels show the relative complexity decrease from the highest resolution LUT.
[Plot: Mean Squared Error between LUTs. Axes: MSE (dB) versus No of Frequencies (16 to 256), with one series each for 8, 16, 32, 64 and 128 weights. Labels range from 1024x (16 frequencies, 8 weights) to 4x (256 frequencies, 128 weights).]
Figure 3 – LUT from the aliasing setup for the bright (A) and quiet (B) zones. The black and white line marks 4 kHz aliasing.
[Plot: Zone Samples for various Weights and Frequencies. Panels A and B; axes: Frequency (Hz), Weight, Amplitude.]
9. REFERENCES
[1] M. Poletti, “An Investigation of 2-D Multizone Surround
Sound Systems,” presented at the Audio Engineering Society
Convention 125, 2008.
[2] M. Kolundzija, C. Faller, and M. Vetterli, “Reproducing
Sound Fields Using MIMO Acoustic Channel Inversion,”
JAES, vol. 59, no. 10, pp. 721–734, Nov. 2011.
[3] Y. J. Wu and T. D. Abhayapala, “Theory and design of
soundfield reproduction using continuous loudspeaker
concept,” Audio, Speech, and Language Processing, IEEE
Transactions on, vol. 17, no. 1, pp. 107–116, 2009.
[4] Y. J. Wu and T. D. Abhayapala, “Spatial multizone
soundfield reproduction,” in Acoustics, Speech and Signal
Processing, 2009. ICASSP 2009. IEEE International
Conference on, 2009, pp. 93–96.
[5] Y. J. Wu and T. D. Abhayapala, “Spatial multizone
soundfield reproduction: Theory and design,” Audio, Speech,
and Language Processing, IEEE Transactions on, vol. 19,
no. 6, pp. 1711–1720, 2011.
[6] W. Jin, W. B. Kleijn, and D. Virette, “Multizone soundfield
reproduction using orthogonal basis expansion,” presented at
the Acoustics, Speech and Signal Processing (ICASSP),
2013 IEEE International Conference on, 2013, pp. 311–315.
[7] H. Chen, T. D. Abhayapala, and W. Zhang, “Enhanced sound
field reproduction within prioritized control region,” in
INTER-NOISE and NOISE-CON Congress and Conference
Proceedings, 2014, vol. 249, pp. 4055–4064.
[8] N. Radmanesh and I. S. Burnett, “Generation of isolated
wideband sound fields using a combined two-stage lasso-ls
algorithm,” Audio, Speech, and Language Processing, IEEE
Transactions on, vol. 21, no. 2, pp. 378–387, 2013.
[9] N. Radmanesh and I. S. Burnett, “Reproduction of
independent narrowband soundfields in a multizone surround
system and its extension to speech signal sources,” in
Acoustics, Speech and Signal Processing (ICASSP), 2011
IEEE International Conference on, 2011, pp. 461–464.
[10] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra,
“Perceptual evaluation of speech quality (PESQ)-a new
method for speech quality assessment of telephone networks
and codecs,” in 2001 IEEE International Conference on
Acoustics, Speech, and Signal Processing, 2001.
Proceedings. (ICASSP ’01), 2001, vol. 2, pp. 749–752 vol.2.
[11] G. H. Golub and C. F. Van Loan, “Matrix computations
(Johns Hopkins studies in mathematical sciences),” 1996.
[12] E. G. Williams, Fourier acoustics: sound radiation and
nearfield acoustical holography. Academic Press, 1999.
[13] M. Bosi and R. E. Goldberg, Introduction to digital audio
coding and standards. Springer, 2003.
[14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and
D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous
speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA
STI/Recon Technical Report N, vol. 93, p. 27403, 1993.
[15] ITU-T, “Rec. P.862.1: Mapping function for transforming P.862
raw result scores to MOS-LQO,” International
Telecommunication Union, Geneva, vol. 24, 2003.
... Multizone soundfield reproductions designed for mono-frequent soundfields have been extended to wideband soundfields including speech [8], [16]- [18]. Recent research has investigated the perceptual quality of multizone soundfields [2] and methods have been proposed to improve the quality using psychoacoustic models [16]. ...
... so that the masker source is collocated with the leakage of the target bright zone soundfield reproduction (see section V-B and Appendix A for definitions of Ĺ pq and p u o ), and a new weighting function, p w(x), is constrained to an importance of 0.05, 1 and 100 in D u , D q and D b , respectively [14], [18]. The remainder of the multizone reproduction is the same as used to generate Q l (a, k) for the speech signal. ...
... The methods are also applicable for cases where separate speech signals in each zone are desired, however, because the leaked speech between zones is not controlled, further reductions in quality may occur. Methods for controlling the leaked spectrum between zones, which may then improve quality, have been proposed in [16], [18]. ...
Article
Full-text available
Reproducing zones of personal sound is a challenging signal processing problem which has garnered considerable research interest in recent years. We introduce in this work an extended method to multizone soundfield reproduction which overcomes issues with speech privacy and quality. Measures of Speech Intelligibility Contrast (SIC) and speech quality are used as cost functions in an optimisation of speech privacy and quality. Novel spatial and (temporal) frequency domain speech masker filter designs are proposed to accompany the optimisation process. Spatial masking filters are designed using multizone soundfield algorithms which are dependent on the target speech multizone reproduction. Combinations of estimates of acoustic contrast and long term average speech spectra are proposed to provide equal masking influence on speech privacy and quality. Spatial aliasing specific to multizone soundfield reproduction geometry is further considered in analytically derived low-pass filters. Simulated and real-world experiments are conducted to verify the performance of the proposed method using semi-circular and linear loudspeaker arrays. Simulated implementations of the proposed method show that significant speech intelligibility contrast and speech quality is achievable between zones. A range of Perceptual Evaluation of Speech Quality (PESQ) Mean Opinion Scores (MOS) that indicate good quality are obtained while at the same time providing confidential privacy as indicated by SIC. The simulations also show that the method is robust to variations in the speech, virtual source location, array geometry and number of loudspeakers. Real-world experiments confirm the practicality of the proposed methods by showing that good quality and confidential privacy are achievable.
... Existing methods focus on single frequency soundfields, although there has been work attempting to create multizone soundfields for wideband speech [9]. More recently, work has been done [10] to extend a method [7] to the reproduction of weighted wideband speech soundfields whilst efficiently maintaining the weighting function in the spatial, time and frequency domain. This allows for dynamic weighting of the zones as well as individual frequency components in time thus allowing each zone's acoustic content to be controlled. ...
... In the method of weighting multizone soundfields [7], a spatial weighting filter as a function of space, w(x), is used to control the reproduction of sound within each of the zones. Subsequent work [10] extended this approach to allow for space-time-frequency dependent weighting functions, w(x, n, k), which allows for weighting functions to be adapted based on the signal characteristics of the target soundfield. We denote w b , w q and w u as the weighting functions for x b ∈ D b , x q ∈ D q and x u ∈ D∩(D b ∪D q ) , respectively. ...
... We denote w b , w q and w u as the weighting functions for x b ∈ D b , x q ∈ D q and x u ∈ D∩(D b ∪D q ) , respectively. The reproduced soundfield pressure at any point in the reproduction region is defined as the sum of space-time-frequency dependant weighted soundfield values [10], ...
Conference Paper
Full-text available
In this paper a method for the reproduction of multizone speech soundfields using perceptual weighting criteria is proposed. Psychoacoustic models are used to derive a space-time-frequency weighting function to control leakage of perceptually unimportant energy from the bright zone into the quiet zone. This is combined with a method for regulating the number of basis planewaves used in the reproduction to allow for an efficient implementation using a codebook of predetermined weights based on desired soundfield energy in the zones. The approach is capable of improving the mean squared error for reproduced speech in the bright zone by -10.5 decibels. Results also show that the approach leads to a significant reduction in the spatial error within the bright zone whilst requiring 65% less loudspeaker signal power for the case where the soundfield in this zone is in line with, and hence partially directed to, the quiet zone.
... Existing methods focus on single frequency soundfields, although there has been work attempting to create multizone soundfields for wideband speech [9]. More recently, work has been done [10] to extend a method [3] to the reproduction of weighted wideband speech soundfields by using the spatial weighting function. This is shown in [11] to allow each zone's acoustic content to be controlled by dynamic space-time-frequency weighting. ...
... The zone weights are constant and are chosen so that the bright zone weight is unity, the unattended zone weight is 0.05 times the reproduction importance of the bright zone, following [3,10], and the weight of the quiet zone is set to $10^4$. Frequency dependent zone weighting and signal filtering may give further improvements. ...
Conference Paper
This paper proposes two methods for providing speech privacy between spatial zones in anechoic and reverberant environments. The methods are based on masking the content leaked between regions. The masking is optimised to maximise the speech intelligibility contrast (SIC) between the zones. The first method uses a uniform masker signal that is combined with desired multizone loudspeaker signals and requires acoustic contrast between zones. The second method computes a space-time domain masker signal in parallel with the loudspeaker signals so that the combination of the two emphasises the spectral masking in the targeted quiet zone. Simulations show that it is possible to achieve a significant SIC in anechoic environments whilst maintaining speech quality in the bright zone.
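The constant zone weights quoted in the citation excerpt for this entry (bright zone 1, unattended zone 0.05, quiet zone $10^4$) can be pictured as a piecewise weighting function $w(x)$. The following is a minimal sketch only: the circular zone geometry, the zone positions and the function name `zone_weight` are assumptions made for illustration and are not taken from the cited papers.

```python
import math

# Constant zone weights quoted in the excerpt above.
W_BRIGHT = 1.0
W_UNATTENDED = 0.05
W_QUIET = 1e4


def zone_weight(x, y, bright, quiet):
    """Return w(x) for a point, given circular zones as (cx, cy, radius).

    The circular zone geometry is an assumption made for illustration.
    """
    def inside(zone):
        cx, cy, r = zone
        return math.hypot(x - cx, y - cy) <= r

    if inside(bright):
        return W_BRIGHT
    if inside(quiet):
        return W_QUIET
    # Everything else in the reproduction region is the unattended zone.
    return W_UNATTENDED


# Hypothetical layout: two 0.3 m zones centred 0.6 m either side of the origin.
BRIGHT_ZONE = (-0.6, 0.0, 0.3)
QUIET_ZONE = (0.6, 0.0, 0.3)

print(zone_weight(-0.6, 0.0, BRIGHT_ZONE, QUIET_ZONE))  # point in the bright zone
print(zone_weight(0.6, 0.1, BRIGHT_ZONE, QUIET_ZONE))   # point in the quiet zone
print(zone_weight(0.0, 0.0, BRIGHT_ZONE, QUIET_ZONE))   # point in the unattended zone
```

A space-time-frequency dependent weighting, as described in the excerpts, would replace the three constants with values that change per frame and per frequency bin; that extension is not shown here.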
... • An efficient interpolation scheme is proposed for dynamically weighting zone importance in personal sound zone reproductions [115]. ...
Thesis
The experience and utility of personal sound is a highly sought after characteristic of shared spaces. Personal sound allows individuals, or small groups of individuals, to listen to separate streams of audio content without external interruption from a third-party. The desired effects of personal acoustic environments can also be areas of minimal sound, where quiet spaces facilitate an effortless mode of communication. These characteristics have become exceedingly difficult to produce in busy environments such as cafes, restaurants, open plan offices and entertainment venues. The concept of, and the ability to provide, spaces of such nature has been of significant interest to researchers in the past two decades. This thesis answers open questions in the area of personal sound reproduction using loudspeaker arrays, which is the active reproduction of soundfields over extended spatial regions of interest. We first provide a review of the mathematical foundations of acoustics theory, single zone and multiple zone soundfield reproduction, as well as background on the human perception of sound. We then introduce novel approaches for the integration of psychoacoustic models in multizone soundfield reproductions and describe implementations that facilitate the efficient computation of complex soundfield synthesis. The psychoacoustic based zone weighting is shown to considerably improve soundfield accuracy, as measured by the soundfield error, and the proposed computational methods are shown capable of providing several orders of magnitude better performance with insignificant effects on synthesis quality. Consideration is then given to the enhancement of privacy and quality in personal sound zones and in particular on the effects of unwanted sound leaking between zones. Optimisation algorithms, along with a priori estimations of cascaded zone leakage filters, are then established so as to provide privacy between the sound zones without diminishing quality. 
Simulations and real-world experiments are performed, using linear and part-circle loudspeaker arrays, to confirm the practical feasibility of the proposed privacy and quality control techniques. The experiments show that good quality and confidential privacy are achievable simultaneously. The concept of personal sound is then extended to the active suppression of speech across loudspeaker boundaries. Novel suppression techniques are derived for linear and planar loudspeaker boundaries, which are then used to simulate the reduction of speech levels over open spaces and suppression of acoustic reflections from walls. The suppression is shown to be as effective as passive fibre panel absorbers. Finally, we propose a novel ultrasonic parametric and electrodynamic loudspeaker hybrid design for acoustic contrast enhancement in multizone reproduction scenarios and show that significant acoustic contrast can be achieved above the fundamental spatial aliasing frequency.
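The interpolation scheme for dynamically weighting zone importance, mentioned in the excerpt for this entry, rests on the general idea of precomputing values on a coarse grid and interpolating between them at run time. The sketch below shows that idea in its simplest form. It assumes a one-dimensional table indexed by a zone weight, linear interpolation on a logarithmic weight axis, and placeholder table values; none of these details are confirmed by the source, and the name `interpolate_lut` is hypothetical.

```python
import bisect
import math

# Hypothetical look-up table: parameter values precomputed at a few
# logarithmically spaced zone weights. The numbers are placeholders,
# not results from any of the cited works.
LUT_WEIGHTS = [1e-2, 1e0, 1e2, 1e4]
LUT_VALUES = [0.10, 0.40, 0.70, 0.95]


def interpolate_lut(weight, weights=LUT_WEIGHTS, values=LUT_VALUES):
    """Linearly interpolate a table value over log10(weight).

    Queries outside the table are clamped to the nearest end point.
    """
    if weight <= weights[0]:
        return values[0]
    if weight >= weights[-1]:
        return values[-1]
    hi = bisect.bisect_right(weights, weight)
    lo = hi - 1
    # Fractional position of the query between the two neighbouring
    # table entries, measured on a logarithmic axis.
    t = (math.log10(weight) - math.log10(weights[lo])) / (
        math.log10(weights[hi]) - math.log10(weights[lo])
    )
    return values[lo] + t * (values[hi] - values[lo])


print(interpolate_lut(10.0))  # halfway between the 1e0 and 1e2 entries
```

The design trade-off is the usual one for look-up tables: a coarser grid reduces the up-front computation and storage, at the cost of interpolation error between grid points.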
... A hybrid system utilising the better aspects of both MSRs and PLs would allow for high acoustic contrast at low and high frequencies. Reproduction of speech soundfields [11], [17], [18] would require low carrier SPL in PLs due to the low energy of high frequency components in speech [19], thus reducing related health risks. Further, frequency dependent PL distortions are less of a problem at higher frequencies [16]. ...
Conference Paper
This paper proposes a hybrid approach to personal sound zones utilising multizone soundfield reproduction techniques and parametric loudspeakers. Crossover filters are designed, to switch between reproduction methods, through analytical analysis of aliasing artifacts in multizone reproductions. By realising the designed crossover filters, wideband acoustic contrast between zones is significantly improved. The trade-off between acoustic contrast and the bandwidth of the reproduced soundfield is investigated. Results show that by incorporating the proposed hybrid model the whole wideband bandwidth is spatial-aliasing free with a mean acoustic contrast consistently above 54.2dB, an improvement of up to 24.2dB from a non-hybrid approach, with as few as 16 dynamic loudspeakers and one parametric loudspeaker.
Article
Sound fields are essentially band-limited phenomena, both temporally and spatially. This implies that a spatially sampled sound field respecting the Nyquist criterion is effectively equivalent to its continuous original. We describe Sound Field Reconstruction (SFR), a technique that uses the previously stated observation to express the reproduction of a continuous sound field as an inversion of the discrete acoustic channel from a loudspeaker array to a grid of control points. The acoustic channel is inverted using truncated singular value decomposition (SVD) in order to provide optimal sound field reproduction subject to a limited effort constraint. Additionally, a detailed procedure for obtaining loudspeaker driving signals that involves selection of active loudspeakers, coverage of the listening area with control points, and frequency-domain FIR filter design is described. Extensive simulations comparing SFR with Wave Field Synthesis show that on average, SFR provides higher sound field reproduction accuracy.
Article
The prohibitive number of speakers required for the reproduction of isolated soundfields is the major limitation preventing solution deployment. This paper addresses the provision of personal soundfields (zones) to multiple listeners using a limited number of speakers with an underlying assumption of fixed virtual sources. For such multizone systems, optimization of speaker positions and weightings is important to reduce the number of active speakers. Typically, single stage optimization is performed, but in this paper a new two-stage pressure matching optimization is proposed for wideband sound sources. In the first stage, the least-absolute shrinkage and selection operator (Lasso) is used to select the speakers' positions for all sources and frequency bands. A second stage then optimizes reproduction using all selected speakers on the basis of a regularized least-squares (LS) algorithm. The performance of the new, two-stage approach is investigated for different reproduction angles, frequency range and variable total speaker weight powers. The results demonstrate that using two-stage Lasso-LS optimization can give up to 69 dB improvement in the mean squared error (MSE) over a single-stage LS in the reproduction of two isolated audio signals within control zones using e.g. 84 speakers.
Article
The Texas Instruments/Massachusetts Institute of Technology (TIMIT) corpus of read speech has been designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT contains speech from 630 speakers representing 8 major dialect divisions of American English, each speaking 10 phonetically-rich sentences. The TIMIT corpus includes time-aligned orthographic, phonetic, and word transcriptions, as well as speech waveform data for each spoken sentence. The release of TIMIT contains several improvements over the Prototype CD-ROM released in December, 1988: (1) full 630-speaker corpus, (2) checked and corrected transcriptions, (3) word-alignment transcriptions, (4) NIST SPHERE-headered waveform files and header manipulation software, (5) phonemic dictionary, (6) new test and training subsets balanced for dialectal and phonetic coverage, and (7) more extensive documentation.
Conference Paper
While higher order ambisonic approaches can be used to generate multiple zone soundfields, this paper adopts a Least Squares matching approach which provides a more flexible formulation. The base approach, adopted from [1], computes speaker weights which allow for the placement of single sources in the soundfield. In this paper the approach is extended firstly to two multi-frequency sources and then to narrowband speech signals. The results for multi-frequency sources explore the zonal soundfield errors resulting from varied source positions. For speech signals, the approach provides a potential solution for multiple conversation reproduction in a multi-user environment. The paper results indicate that the approach is feasible for zones which do not suffer occlusion effects from other zones. However, for more versatile multizone soundfield reproduction a 3D approach is recommended.
Article
Surround sound systems can produce a desired sound field over an extended region of space by using higher order Ambisonics. One application of this capability is the production of multiple independent soundfields in separate zones. This paper investigates multi-zone surround systems for the case of two dimensional reproduction. A least squares approach is used for deriving the loudspeaker weights for producing a desired single frequency wave field in one of N zones, while producing silence in the other N-1 zones. It is shown that reproduction in the active zone is more difficult when an inactive zone is in-line with the virtual sound source and the active zone. Methods for controlling this problem are discussed.
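The least squares idea in the abstract above (match a desired field at points in the active zone while asking for silence at points in the inactive zones) can be sketched with a small pressure-matching example. This is an illustrative sketch under stated assumptions, not the formulation of the cited paper: the transfer matrix entries are made-up numbers, the optional Tikhonov term `lam` is an addition that the abstract does not mention, and the helper names `solve_linear` and `least_squares_weights` are hypothetical.

```python
def solve_linear(A, b):
    """Solve A x = b for complex A by Gaussian elimination with pivoting."""
    n = len(b)
    # Augmented matrix [A | b], copied so the inputs are left untouched.
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        acc = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = acc / M[i][i]
    return x


def least_squares_weights(H, p, lam=0.0):
    """Return g minimising ||H g - p||^2 + lam ||g||^2 (normal equations)."""
    rows, cols = len(H), len(H[0])
    # Gram matrix H^H H + lam I and right-hand side H^H p.
    gram = [
        [sum(H[m][i].conjugate() * H[m][j] for m in range(rows))
         + (lam if i == j else 0.0)
         for j in range(cols)]
        for i in range(cols)
    ]
    rhs = [sum(H[m][i].conjugate() * p[m] for m in range(rows))
           for i in range(cols)]
    return solve_linear(gram, rhs)


# Made-up transfer matrix: rows are control points, columns are loudspeakers.
# The first two rows stand for the active zone, the last for an inactive zone.
H = [[1 + 0j, 0.5j],
     [0.5j, 1 + 0j],
     [0.2 + 0j, 0.2 + 0j]]
p_desired = [1 + 0j, 1j, 0j]  # desired pressure; zero requests silence

g = least_squares_weights(H, p_desired)
reproduced = [sum(H[m][l] * g[l] for l in range(len(g))) for m in range(len(H))]
print([abs(v) for v in reproduced])  # pressure magnitudes at the control points
```

Solving the normal equations directly is acceptable for a toy system like this one. For larger or poorly conditioned transfer matrices, a QR or SVD based solver is generally the safer choice.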
Conference Paper
We introduce a method for 2-D spatial multizone soundfield reproduction based on describing the desired multizone soundfield as an orthogonal expansion of basis functions over the desired reproduction region. This approach finds the solution to the Helmholtz equation that is closest to the desired soundfield in a weighted least squares sense. The orthogonal basis set is formed by applying QR factorization to a suitable set of solutions of the Helmholtz equation. The coefficients of the Helmholtz solution wavefields can then be calculated, reducing the multizone sound reproduction problem to the reconstruction of a set of basis wavefields over the desired region. This makes the method easier to apply with more practical loudspeaker configurations. The approach is shown to be effective both for accurately reproducing sound in the selected bright zone and for minimizing sound leakage into the predefined quiet zone.
Article
Spatial multizone soundfield reproduction over an extended region of open space is a complex and challenging problem in acoustic signal processing. In this paper, we provide a framework to recreate 2-D spatial multizone soundfields using a single array of loudspeakers which encompasses all spatial regions of interest. The reproduction is based on the derivation of an equivalent global soundfield consisting of a number of individual multizone soundfields. This is achieved by using spatial harmonic coefficients translation between coordinate systems. A multizone soundfield reproduction problem is then reduced to the reproduction over the entire region. An important advantage of this approach is the full use of the available dimensionality of the soundfield. This paper provides quantitative performances of a 2-D multizone system and reveals some fundamental limits on 2-D multizone soundfield reproduction. The extensions of the multizone soundfield reproduction design in reverberant rooms are also included.
Book
Intended for use as both a textbook and a reference, "Fourier Acoustics" develops the theory of sound radiation uniquely from the viewpoint of Fourier Analysis. This powerful perspective of sound radiation provides the reader with a comprehensive and practical understanding which will enable him or her to diagnose and solve sound and vibration problems in the 21st Century. As a result of this perspective, "Fourier Acoustics" is able to present thoroughly and simply, for the first time in book form, the theory of nearfield acoustical holography, an important technique which has revolutionised the measurement of sound. Relying little on material outside the book, "Fourier Acoustics" will be invaluable as a graduate level text as well as a reference for researchers in academia and industry. It covers the physics of wave propagation and sound vibration in homogeneous media. Topics include acoustics, such as radiation of sound and radiation from vibrating surfaces; inverse problems, such as the theory of nearfield acoustical holography; and the mathematics of specialized functions, such as spherical harmonics.
Article
Reproduction of a soundfield is a fundamental problem in acoustic signal processing. A common approach is to use an array of loudspeakers to reproduce the desired field where the least-squares method is used to calculate the loudspeaker weights. However, the least-squares method involves matrix inversion which may lead to errors if the matrix is poorly conditioned. In this paper, we use the concept of theoretical continuous loudspeaker on a circle to derive the discrete loudspeaker aperture functions by avoiding matrix inversion. In addition, the aperture function obtained through continuous loudspeaker method reveals the underlying structure of the solution as a function of the desired soundfield, the loudspeaker positions, and the frequency. This concept can also be applied for the 3-D soundfield reproduction using spherical harmonics analysis with a spherical array. Results are verified through computer simulations.
Conference Paper
Spatial multizone soundfield reproduction is a difficult problem, which has many potential applications. This paper provides a framework to recreate 2D spatial multizone soundfields using an array of loudspeakers. We derive the desired global soundfield by translating individual desired soundfields to a single global co-ordinate system and applying appropriate angular window functions. We reveal some of the fundamental limits of 2D multizone soundfield reproduction. We show that the ability of multizone reproduction is dependent on (i) maximum radius of multizones, (ii) window length (size, and nature), and (iii) radial distance to the furthermost zone. We illustrate the framework by designing and simulating a two dimensional two zone soundfield.