© 2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including

reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or

reuse of any copyrighted component of this work in other works.

AN EFFICIENT APPROACH TO DYNAMICALLY WEIGHTED MULTIZONE

WIDEBAND REPRODUCTION OF SPEECH SOUNDFIELDS

Jacob Donley, Christian Ritz

School of Electrical, Computer and Telecommunications Engineering, University of Wollongong,

Wollongong, NSW, Australia, 2522, jrd089@uowmail.edu.au, critz@uow.edu.au

ABSTRACT

This paper proposes and evaluates an efficient approach for practical

reproduction of multizone soundfields for speech sources. The

reproduction method, based on a previously proposed approach,

utilises weighting parameters to control the soundfield reproduced

in each zone whilst minimising the number of loudspeakers

required. Proposed here is an interpolation scheme for predicting the

weighting parameter values of the multizone soundfield model that

otherwise requires significant computational effort. It is shown that

initial computation time can be reduced by a factor of 1024 with

only -85 dB of error in the reproduced soundfield relative to

reproduction without interpolated weighting parameters. The

perceptual impact on the quality of the speech reproduced using the

method is also shown to be negligible. By using pre-saved

soundfields determined using the proposed approach, practical

reproduction of dynamically weighted multizone soundfields of

wideband speech could be achieved in real-time.

Index Terms— multizone soundfield reproduction, wideband

multizone soundfield, weighted multizone soundfield, look-up

tables (LUT), interpolation, sound field synthesis (SFS)

1. INTRODUCTION

The reproduction of audio in spatially separated regions is of high

interest for applications such as the creation of personal listening

zones in entertainment/cinema, multi-participant teleconferencing

and vehicle cabins. This can be achieved using multizone soundfield

reproduction and was originally proposed in [1] to reproduce

multiple independent zones of active and quiet 2D soundfield

regions. In [1] a least squares pressure matching technique was used,

where estimated acoustic pressures reproduced by a set of

loudspeakers are matched to sample values within the desired

soundfield zone. This technique is also used in Sound Field

Reconstruction (SFR) [2].

Following [1], a multizone approach using cylindrical

harmonic expansion with coefficient translation and angular

windowing was proposed [3]–[5]. This approach, however, attempts

to completely suppress any interzone interference which can result

in impractically large loudspeaker signal amplitudes or

impractically low levels in zones. A method better suited for

implementation and controllability was introduced in [6]. The

approach uses orthogonal basis expansion which reduces the

problem to the reconstruction of a set of basis wavefields and allows

each zone to be weighted according to the importance of its

reproduction. This weighting improves the practicality of the system

by relaxing the ideal requirement of completely quiet zones outside

the target zone. The theory in [5] was extended in [7] to include a

weighting criterion similar to that used in [6].

While originally focusing on single frequency soundfields,

more recent work attempted to create multizone soundfields [8], [9]

with frequency bandwidths equivalent to narrowband and wideband

speech. This paper investigates the use of weighted multizone

soundfield reproduction for wideband speech using the orthogonal

basis expansion of [6]. The orthogonal basis expansion approach

assumes the soundfield is reproduced as a sum of independent

planewaves reproduced by an array of loudspeakers. Each

planewave corresponds to an individual frequency and direction.

The number of chosen planewaves is governed by the desired

reproduction accuracy and varies depending on frequency. Due to

this, a formula is proposed in this work which determines the

number of planewaves to be used for a given frequency so that

neither spatial aliasing nor ill-conditioned matrices is an inherent

problem. This is necessary when dealing with the reproduction of

soundfields containing multiple frequencies.

While [6] assumes the same weight for each frequency,

dynamically deriving the weights can be used to control the

reproduction accuracy of individual frequency components within

the bright and quiet zones. For example, the weightings can be based

on the perceptual importance of particular frequencies in the zones

in an effort to improve the overall perceived sound quality.

However, this results in increases in computational complexity. To

reduce this complexity and create a more practical solution, this

paper proposes the interpolation of spatial components of the

reproduction along different domains, such as the weighting domain

and frequency domain.

A system is synthesised with varying linear interpolation

distances by using different resolution lookup tables (LUTs) for

storing pre-computed loudspeaker weights and soundfield values.

The synthesis comprises reproducing wideband zones where

frequency domain content is weighted uniformly with weights that

are in the centre of interpolation regions. The approach is validated

by comparing the reproduced zone signals from the interpolation

method with signals reproduced without interpolation using Mean

Squared Error (MSE) and Perceptual Evaluation of Speech Quality

(PESQ) [10] measures.

Section 2 describes the orthogonal basis expansion approach

to multizone soundfield reproduction. The proposed dynamically

weighted multizone approach is described in Section 3 and a novel

approach to selecting the number of orthogonal planewaves is

presented in Section 4. Section 5 describes the interpolation method.

Evaluation results are given in Section 6 and conclusions outlined in

Section 7.

2. MONOFREQUENT MULTIZONE SOUNDFIELD

REPRODUCTION

A multizone soundfield reproduction layout is depicted in Figure 1.

The reproduction region, $\mathbb{D}$, of radius $R$, contains three sub-regions called the bright zone, quiet zone and unattended zone, denoted by $\mathbb{D}_b$, $\mathbb{D}_q$ and $\mathbb{D} \setminus (\mathbb{D}_b \cup \mathbb{D}_q)$, respectively. The radius of $\mathbb{D}_b$ and $\mathbb{D}_q$ is $r$ and their centres are located on a circle of radius $r_z$. The angle of the desired planewave in $\mathbb{D}_b$ is $\theta$ and is reproduced by loudspeakers positioned on an arc of angle $\phi_L$ and radius $R_l$ with the first loudspeaker starting at angle $\phi$.

In the orthogonal basis expansion approach to multizone soundfield reproduction [11], a soundfield function $S(\mathbf{x}, k)$ that fulfils the wave equation, where $\mathbf{x} \in \mathbb{D}$ is an arbitrary spatial sampling point and $k$ is the wavenumber of the soundfield, is written as a summation of an orthogonal set of the Helmholtz equation solutions [12] as

$$S(\mathbf{x}, k) = \sum_{n} C_n(k, w)\, G_n(\mathbf{x}, k) \qquad (1)$$

where $\{G_n\}$ is a series of weighted basis functions and, given a desired soundfield, $S^d(\mathbf{x}, k)$, the expansion coefficients are derived as $C_n(k, w) = \int_{\mathbb{D}} S^d(\mathbf{x}, k)\, G_n^*(\mathbf{x}, k)\, w(\mathbf{x})\, d\mathbf{x}$ using a weighted inner-product, where $w(\mathbf{x})$ represents the weighting function. The weighted QR factorisation may also use a weighted inner-product. Here, the weighting function, $w(\mathbf{x})$, can be written as $w(\mathbf{x}_b) = w_b$, $w(\mathbf{x}_q) = w_q$ and $w(\mathbf{x}_u) = w_u$, where $\mathbf{x}_b \in \mathbb{D}_b$, $\mathbf{x}_q \in \mathbb{D}_q$ and $\mathbf{x}_u \in \mathbb{D} \setminus (\mathbb{D}_b \cup \mathbb{D}_q)$, indicating different values for the weights within each of the bright ($\mathbb{D}_b$), quiet ($\mathbb{D}_q$) and unattended zones, respectively, as illustrated in Figure 1.

Following a QR factorisation on a set of $N$ planewaves, $F_n(\mathbf{x}, k)$, arriving from angles in $[0, 2\pi)$, (1) becomes,

$$S(\mathbf{x}, k, w) = \sum_{j} P_j(k, w)\, F_j(\mathbf{x}, k) \qquad (2)$$

where the coefficients for the wavefields are $P_j(k, w) = \sum_{n=j}^{N} C_n(k, w)\, (\mathbf{R}^{-1})_{jn}$, $j \in \{1, \dots, N\}$, $n \in \{1, \dots, N\}$ and $\mathbf{R}$ is the upper triangular matrix from the QR factorisation [6].
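The step from (1) to (2) can be checked numerically on a small discretised example. The following is a minimal sketch, not the authors' implementation: it assumes a sampled version of the weighted inner-product, uses modified Gram-Schmidt as the weighted QR factorisation, and obtains $P_j$ by back-substitution rather than by forming $\mathbf{R}^{-1}$. The zone positions, the sample grid, the 1 kHz test frequency, the quiet zone weight of 10 and all function names are hypothetical choices made for illustration.

```python
import cmath
import math

def inner(a, b, w):
    """Sampled weighted inner product <a, b> = sum_s a_s * conj(b_s) * w_s."""
    return sum(ai * bi.conjugate() * wi for ai, bi, wi in zip(a, b, w))

def weighted_qr(F, w):
    """Modified Gram-Schmidt: F[n] = sum_j R[j][n] * G[j], with G orthonormal under w."""
    N = len(F)
    G, R = [], [[0j] * N for _ in range(N)]
    for n in range(N):
        v = list(F[n])
        for j in range(n):
            R[j][n] = inner(v, G[j], w)
            v = [vi - R[j][n] * gi for vi, gi in zip(v, G[j])]
        R[n][n] = complex(math.sqrt(inner(v, v, w).real))
        G.append([vi / R[n][n] for vi in v])
    return G, R

def back_substitute(R, C):
    """Solve R P = C for upper triangular R, i.e. P_j = sum_n (R^-1)_{jn} C_n."""
    N = len(C)
    P = [0j] * N
    for j in reversed(range(N)):
        acc = C[j] - sum(R[j][n] * P[n] for n in range(j + 1, N))
        P[j] = acc / R[j][j]
    return P

def zone_samples(cx, cy, r, step=0.1):
    """Grid points inside a circular zone (hypothetical sampling scheme)."""
    half = int(round(r / step))
    return [(cx + a * step, cy + b * step)
            for a in range(-half, half + 1) for b in range(-half, half + 1)
            if math.hypot(a * step, b * step) <= r + 1e-9]

def planewave(k, angle, point):
    """Unit-amplitude planewave sampled at a 2D point."""
    x, y = point
    return cmath.exp(1j * k * (x * math.cos(angle) + y * math.sin(angle)))

k = 2 * math.pi * 1000.0 / 343.0           # wavenumber at an assumed 1 kHz
bright = zone_samples(-0.6, 0.0, 0.3)      # assumed zone centres on a circle of radius 0.6 m
quiet = zone_samples(0.6, 0.0, 0.3)
points = bright + quiet
w = [1.0] * len(bright) + [10.0] * len(quiet)   # w_b = 1, assumed w_q = 10

# Desired field: a planewave in the bright zone, silence in the quiet zone.
Sd = [planewave(k, math.radians(14.5), p) for p in bright] + [0j] * len(quiet)

N = 8
F = [[planewave(k, 2 * math.pi * n / N, p) for p in points] for n in range(N)]
G, R = weighted_qr(F, w)
C = [inner(Sd, g, w) for g in G]           # expansion coefficients of (1)
P = back_substitute(R, C)                  # wavefield coefficients of (2)

field_G = [sum(C[n] * G[n][s] for n in range(N)) for s in range(len(points))]
field_F = [sum(P[j] * F[j][s] for j in range(N)) for s in range(len(points))]
max_diff = max(abs(a - b) for a, b in zip(field_G, field_F))
print("largest difference between (1) and (2):", max_diff)
```

Back-substitution is used here because $\mathbf{R}$ is upper triangular, which avoids an explicit inverse and is less sensitive to the ill-conditioning discussed in Section 4.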

The reproduction of the soundfield in (2) at a particular location can then be expressed as a summation of the discretely located loudspeaker signals [3],

$$S^a_{disc}(\mathbf{x}, k, w) = \sum_{l=1}^{L} \rho_l(k, w)\, \frac{i}{4} H_0^{(1)}(k \lVert \mathbf{x} - \mathbf{x}_l \rVert) \qquad (3)$$

where $\rho_l(k, w)$ is the $l$th complex loudspeaker weight, $L$ is the number of loudspeakers, $i = \sqrt{-1}$, $H_0^{(1)}(\cdot)$ is a zeroth-order Hankel function of the first kind [12] and $\mathbf{x}_l$ is the position of the $l$th loudspeaker. The complex loudspeaker weights are defined as,

$$\rho_l(k, w) = \sum_{m=-M}^{M} \sum_{j=1}^{J} \frac{2\, e^{i m \phi_l}\, \Delta\phi_s}{i \pi H_m^{(1)}(k R_l)}\, i^m e^{-i m \phi^{pw}_j}\, P_j(k, w) \qquad (4)$$

where $M = \lceil kR \rceil$ is the truncation length [3], $R$ and $R_l$ are from Figure 1, $\phi^{pw}_j = (j - 1)\Delta\phi$ are the wavefield angles, $\Delta\phi = 2\pi / J$, $\phi_l$ is the angle of the $l$th loudspeaker from the x-axis and $\Delta\phi_s$ is the angular spacing of the loudspeakers. The minimum number of loudspeakers to avoid aliasing is given by,

$$L \geq 2 \lceil e R k_{max} / 2 \rceil + 1 \qquad (5)$$

where $L$ is the minimum number of required loudspeakers, $e$ is Euler's number and $k_{max}$ is the wavenumber of the highest frequency (where $k = 2\pi f / c$ for frequency $f$ [12] and $c = 343\,\mathrm{m\,s^{-1}}$ is the speed of sound). See [3] for further details.

3. WEIGHTED MULTIZONE WIDEBAND SOUNDFIELDS

A wideband soundfield is described here as a linear combination of

planewaves corresponding to each source frequency, similar to a

Fourier series [12] and the approach of [8], [9]. The pressure

generated at any point in the reproduced soundfield is given by,

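$$\hat{p}(\mathbf{x}, w) = \sum_{k}^{K} S^a_{disc}(\mathbf{x}, k, w) \qquad (6)$$

where there are $K$ different sinusoidal components. Multiple nested summations are required to derive $\hat{p}(\mathbf{x}, w)$ as:

$$\hat{p}(\mathbf{x}, w) = \sum_{k}^{K} \sum_{l}^{L} \sum_{m=-M}^{M} \sum_{j}^{J} \sum_{n}^{N} \tau_{kl}(\mathbf{x})\, \gamma_{km}\, \beta_m\, \alpha_{jn} \int_{\mathbb{D}} S^d(\mathbf{x}, k)\, G_n^*(\mathbf{x}, k)\, w(\mathbf{x})\, d\mathbf{x} \qquad (7)$$

where $\alpha_{jn} = (\mathbf{R}^{-1})_{jn}$, $\beta_m = i^m e^{-i m \phi^{pw}_j}$, $\gamma_{km} = 2 e^{i m \phi_l} \Delta\phi_s / (i \pi H_m^{(1)}(k R_l))$ and $\tau_{kl}(\mathbf{x}) = i H_0^{(1)}(k \lVert \mathbf{x} - \mathbf{x}_l \rVert) / 4$. Here, a summation occurs for every sample in $\mathbb{D}$, for $N$ planewaves, $J$ wavefields, $2M + 1$ modes, $L$ loudspeakers and $K$ sinusoidal components.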

This paper extends the approach of [6] to allow for frequency-

dependent weighting functions such that the leaked frequency

spectrum can be controlled. This may be beneficial in cases where

occlusion is a problem [9] and may improve the perceivable error

compared to a single zone weight. To weight signals dynamically

for each frequency, during the soundfield construction in (3),

loudspeaker weights and soundfield samples are obtained per time-

frequency component for a free field reproduction scenario (future

work will look into reproduction in reverberant rooms). For example,

if a signal from an arbitrary location in the reproduction area,

defined in (3), was to be synthesised,

$$\hat{Y}_F(\mathbf{x}, k) = \lvert S^a_{disc}(\mathbf{x}, k, w) \rvert\, Y_F(k) \qquad (8)$$

where $\hat{Y}_F(\mathbf{x}, k)$ is the frequency domain signal at an arbitrary location, $\mathbf{x}$, in the reproduction region, $\mathbb{D}$, $Y_F(k)$ is obtained from the short-time Fourier transform of the windowed frame of input $y(n)$ and $\lvert \cdot \rvert$ denotes the absolute value. If loudspeaker signals are preferred to be synthesised then $\lvert S^a_{disc}(\mathbf{x}, k, w) \rvert$ can be replaced with $\rho_l(k, w)$ from (4) and $\hat{Y}_F(\mathbf{x}, k)$ becomes $\hat{Y}_{Fl}(k)$. Time domain loudspeaker signals, $\hat{y}_l(n)$, and/or soundfields, $\hat{y}(\mathbf{x}, n)$, are derived from (8) via overlap-add reconstruction [13].

Figure 1 – Weighted multizone soundfield reproduction layout
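The per-bin scaling in (8) followed by overlap-add can be sketched as follows. This is an illustrative sketch only: it uses a naive DFT on short 16-sample frames instead of a 1024-point FFT, a periodic Hamming window with 50% overlap, and a list `gains` that stands in for the magnitudes $\lvert S^a_{disc}(\mathbf{x}, k, w) \rvert$ at one location. The function names and the test signal are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform (adequate for short frames)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * q * n / N) for n in range(N))
            for q in range(N)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    N = len(X)
    return [sum(X[q] * cmath.exp(2j * math.pi * q * n / N) for q in range(N)) / N
            for n in range(N)]

def apply_zone_gains(y, gains, frame_len):
    """Scale every STFT bin by gains[bin], as in (8), then overlap-add.

    gains must be real and symmetric (gains[q] == gains[frame_len - q])
    so that the synthesised signal stays real.
    """
    hop = frame_len // 2
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    out = [0.0] * len(y)
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = [y[start + n] * window[n] for n in range(frame_len)]
        Y_hat = [g * Yq for g, Yq in zip(gains, dft(frame))]
        frame_hat = idft(Y_hat)
        for n in range(frame_len):
            # A periodic Hamming window at 50% overlap sums to 1.08.
            out[start + n] += frame_hat[n].real / 1.08
    return out

frame_len = 16
y = [math.sin(2 * math.pi * 3 * n / 16) + 0.5 * math.sin(2 * math.pi * 5 * n / 16)
     for n in range(64)]

unity = apply_zone_gains(y, [1.0] * frame_len, frame_len)
halved = apply_zone_gains(y, [0.5] * frame_len, frame_len)
```

With unity gains the fully overlapped part of the signal (samples 8 to 55 here) is returned unchanged, which is a convenient check before substituting tabulated soundfield magnitudes.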

4. ORTHOGONAL PLANEWAVE SELECTION

The wavefield coefficients depend on the expansion coefficients and the upper triangular matrix resulting from QR factorisation computed on a set of $N$ planewave functions arriving from angles in $[0, 2\pi)$. Hence, the computation increases with both $N$ and the spatial sampling density of the reproduction zone. A larger $N$ results in a more accurate reproduction; however, in practice, this also results in poorly conditioned upper triangular matrices. A lower $N$ causes spatial aliasing. The effect of $N$ is also frequency dependent. For the layout in Figure 1, $N$ is chosen to balance the impact of spatial aliasing and ill-conditioning:

$$N(f) = \frac{N_c(f_{max}) - N_c(f_{min})}{(f_{max} - f_{min})^{1.2}} (f - f_{min})^{1.2} + N_c(f_{min}) \qquad (9)$$

where $f_{min}$ and $f_{max}$ are the minimum and maximum frequencies, respectively, $N_c(f)$ is the largest number of planewaves that satisfies $\kappa(\mathbf{R}(f)) \leq 10^{10}$ or $\min(M_q(f))$ for a given frequency, $f$. Here, $\kappa(\mathbf{R})$ is the condition number of the upper triangular matrix, $M_q$ is the mean squared error (MSE) in the quiet zone and is limited to $-60$ dB. This work focuses on reproducing small room sized soundfields with a bandwidth $150\,\mathrm{Hz} \leq f \leq 8\,\mathrm{kHz}$. Due to the size of the soundfields, zone weighting for $f < 150$ Hz has a negligible effect, as the trend in Figure 2 shows. Analysis shows that $N_c(150) = 30$ and $N_c(8000) = 300$ based on (9). This allows the wideband system to be synthesised using the orthogonal basis expansion method with minimal error caused by the selection of $N$.
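As a worked illustration of (9), the sketch below evaluates the planewave count between the two reported end points, $N_c(150) = 30$ and $N_c(8000) = 300$, with the exponent 1.2. Rounding up to an integer is an assumption made here so that the result can be used as a count; the function name and the sample frequencies are hypothetical.

```python
import math

def planewave_count(f, f_min=150.0, f_max=8000.0, n_min=30, n_max=300, exponent=1.2):
    """Evaluate (9) between the end points N_c(f_min) and N_c(f_max).

    The ceiling is an assumption of this sketch, not part of (9).
    """
    if not f_min <= f <= f_max:
        raise ValueError("frequency outside the supported bandwidth")
    frac = (f - f_min) / (f_max - f_min)
    return math.ceil(n_min + (n_max - n_min) * frac ** exponent)

for f in (150.0, 1000.0, 4000.0, 8000.0):
    print(f, planewave_count(f))
```

Because the exponent is larger than one, the count grows slowly at low frequencies and faster towards $f_{max}$.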

5. LOOK-UP TABLE BASED SYNTHESIS AND

WEIGHTING

It is computationally demanding to construct a weighted multizone

soundfield using the methods of Sections 2 and 3 due to the QR

factorisation involved for all time-frequency components (e.g. a

three second audio file sampled at 16 kHz requires approximately

$48 \times 10^3$ independent reproductions). In order to best make use of

these repeated reproductions, the loudspeaker weights and

soundfield pressure samples can be reproduced and stored for later

use. Once enough values have been stored, interpolation between

them can further reduce computation and error caused by quantised

values. In this paper we propose the use of Look-Up Tables (LUT)

to store pre-determined weighted soundfield values to be used for a

given setup or wideband reproduction, with an example shown in

Figure 2. A LUT may be described as a matrix of soundfield

reproduction values for a given frequency and weight range,

$$\mathbf{A}_{u \times v} = \begin{bmatrix} S^a_{disc}(\mathbf{x}, k_{min}, w_{min}) & \cdots & S^a_{disc}(\mathbf{x}, k_{min}, w_{max}) \\ \vdots & \ddots & \vdots \\ S^a_{disc}(\mathbf{x}, k_{max}, w_{min}) & \cdots & S^a_{disc}(\mathbf{x}, k_{max}, w_{max}) \end{bmatrix} \qquad (10)$$

where $S^a_{disc}(\mathbf{x}, k, w)$ from (3) is a soundfield reproduction value for wavenumber, $k$, and weighting, $w$, at $\mathbf{x}$ and $\mathbf{A}_{u \times v}$ is a LUT with $u$ number of frequencies and $v$ number of weights in the range $\{k_{min}, \dots, k_{max}\}$ and $\{w_{min}, \dots, w_{max}\}$, respectively. The set of frequencies is logarithmically spaced as it closely resembles the spacing of the Bark scale [13] and the set of weights is logarithmically spaced as it provides larger control ranges in the decibel scale. The table can be built for loudspeaker signals by replacing $S^a_{disc}(\mathbf{x}, k, w)$ with $\rho_l(k, w)$ from (4).
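The table layout of (10) with logarithmically spaced axes can be sketched as below. This is a structural illustration only: `toy_quiet_level` is a made-up placeholder for the soundfield value and is not the reproduction model of Sections 2 and 3, and the weight range of $10^{-2}$ to $10^{4}$ is an assumed example range.

```python
def log_spaced(lo, hi, count):
    """Return count values from lo to hi with a constant ratio between neighbours."""
    ratio = hi / lo
    return [lo * ratio ** (i / (count - 1)) for i in range(count)]

def build_lut(frequencies, weights, field_value):
    """Rows index frequency and columns index weight, as in (10)."""
    return [[field_value(f, w) for w in weights] for f in frequencies]

def toy_quiet_level(f, w):
    """Hypothetical placeholder: the level falls as the quiet zone weight grows."""
    return 1.0 / (1.0 + w)

freqs = log_spaced(150.0, 8000.0, 16)   # u = 16 frequencies
weights = log_spaced(1e-2, 1e4, 8)      # v = 8 weights (assumed range)
lut = build_lut(freqs, weights, toy_quiet_level)
print(len(lut), "x", len(lut[0]))
```

In practice `field_value` would wrap the full reproduction of Sections 2 and 3, which is the expensive step that the table is meant to amortise.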

In order to evaluate the error and perceptual effects of

quantising and interpolating soundfield values, a comparison is

made between two LUTs (see Section 6). The MSE is evaluated as

the difference between the interpolated values of lower and higher

resolution LUT values as,

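$$\epsilon_{LUT} = \frac{1}{\hat{u}\hat{v}} \left\lVert \frac{\mathbf{A}_{\hat{u} \times \hat{v}} - \tilde{\mathbf{A}}_{\hat{u} \times \hat{v}}}{\mathbf{A}_{\hat{u} \times \hat{v}}} \right\rVert^2 \qquad (11)$$

where $\epsilon_{LUT}$ is the MSE for the given interpolated LUT, $\tilde{\mathbf{A}}_{\hat{u} \times \hat{v}}$, relative to the highest resolution LUT, $\mathbf{A}_{\hat{u} \times \hat{v}}$, $\hat{u}$ is the highest frequency resolution, $\hat{v}$ is the highest weight resolution, and $u$ and $v$ are the sets of frequency and weight resolutions to evaluate, respectively. The interpolated LUT, $\tilde{\mathbf{A}}_{\hat{u} \times \hat{v}}$, is a matrix of size $\hat{u} \times \hat{v}$ obtained from interpolation of a smaller matrix, $\mathbf{A}_{u \times v}$. In this work bilinear interpolation is used.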
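A minimal sketch of the interpolation and error measurement follows. It makes two simplifying assumptions that should be read as such: interpolation is carried out over table indices (which corresponds to the logarithmically spaced axes), and the error is a plain mean squared difference in decibels with no normalisation. The function names and the small example tables are hypothetical.

```python
import math

def bilinear_upsample(table, rows_out, cols_out):
    """Bilinearly interpolate a coarse table onto a finer index grid.

    Both the input and the output need at least two rows and two columns.
    """
    rows_in, cols_in = len(table), len(table[0])
    out = []
    for i in range(rows_out):
        r = i * (rows_in - 1) / (rows_out - 1)
        r0 = min(int(r), rows_in - 2)
        tr = r - r0
        row = []
        for j in range(cols_out):
            c = j * (cols_in - 1) / (cols_out - 1)
            c0 = min(int(c), cols_in - 2)
            tc = c - c0
            top = table[r0][c0] * (1 - tc) + table[r0][c0 + 1] * tc
            bottom = table[r0 + 1][c0] * (1 - tc) + table[r0 + 1][c0 + 1] * tc
            row.append(top * (1 - tr) + bottom * tr)
        out.append(row)
    return out

def mse_db(reference, estimate):
    """Mean squared difference between two equally sized tables, in dB."""
    count = len(reference) * len(reference[0])
    err = sum(abs(a - b) ** 2
              for ra, rb in zip(reference, estimate)
              for a, b in zip(ra, rb)) / count
    return 10 * math.log10(err) if err > 0 else float("-inf")

coarse = [[0.0, 1.0],
          [2.0, 3.0]]
fine = bilinear_upsample(coarse, 3, 3)
print(fine)
```

A table that varies linearly along both axes is reproduced exactly, so any non-zero error measured this way reflects curvature of the soundfield values between stored points.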

6. RESULTS

This section describes simulations of reproduced soundfields using

the above approach evaluated using MSE and PESQ measures.

6.1. Evaluation Setup

The multizone soundfield layout of Figure 1 is evaluated, where

$r = 0.3$ m, $r_z = 0.6$ m, $R = 1$ m, $R_l = 1.5$ m, $\theta = \sin^{-1}(r / 2 r_z) = 14.5^\circ$ and $\phi_c = 3.14159$. This setup is similar to [6] and $\theta$ is chosen such that an evanescent planewave with

instant decay would interfere with half the quiet zone resulting in a

large range of weighting control with a slight occlusion problem.

Signals sampled at 16 kHz are converted to the time-frequency

domain using a Hamming window (50% overlap) and Fast Fourier

transform (FFT) of length 1024. The tables are analysed for one reproduction that meets and one that does not meet the aliasing criterion for the minimum number of reproduction loudspeakers given by (5). The LUTs are built for a reproduction where $L = 16$, $\phi = \pi / 2$ and $\phi_L = \pi$, referred to from here on as the aliasing setup. They are also built where $L = 65$ and $\phi_L = 2\pi$, referred to as the non-aliasing setup. The aliasing setup aliases above 4 kHz and the non-aliasing setup aliases above 8 kHz. The aliasing setup is included to show that visualising the soundfield values can provide other benefits.

The tables are built with the pressures for all $\mathbf{x} \in \mathbb{D}_b \cup \mathbb{D}_q$ and averaged across $\mathbb{D}_b$ and $\mathbb{D}_q$, from which the soundfield zones can be approximated. The zone weights are chosen as $w_b = 1$ and $w_u = 0.05$ following [6] and the variable weight is $w_q$. The effect of $w_q$ on the input signal is evaluated using (8), which can also be reversed to find a soundfield level to obtain a desired output. This soundfield level will map to a particular $w_q$ in the LUT and, if implemented, to $\rho_l(k, w)$ that would reproduce the level.

Figure 2 – Example high resolution table of values for SPL. Panel title: SPL in Quiet Zone for Various Weights and Frequencies. Axes: Frequency (Hz), Weight, SPL (dB).
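The reverse mapping from a desired level to a quiet zone weight can be sketched as a search along one frequency row of a LUT. The sketch assumes the tabulated level decreases monotonically as the weight increases and interpolates linearly in $\log_{10}$ of the weight; the three table entries are made-up illustrative numbers, not values from the paper, and the function name is hypothetical.

```python
import math

def weight_for_level(levels_db, weights, target_db):
    """Find the weight whose tabulated level matches target_db.

    levels_db must decrease monotonically as the weights increase.
    """
    for i in range(len(weights) - 1):
        hi, lo = levels_db[i], levels_db[i + 1]
        if lo <= target_db <= hi:
            t = 0.0 if hi == lo else (hi - target_db) / (hi - lo)
            log_lo, log_hi = math.log10(weights[i]), math.log10(weights[i + 1])
            return 10.0 ** (log_lo + t * (log_hi - log_lo))
    raise ValueError("target level is outside the tabulated range")

weights = [1.0, 10.0, 100.0]        # illustrative weights for one frequency row
levels_db = [-10.0, -20.0, -40.0]   # illustrative quiet zone levels in dB
w_q = weight_for_level(levels_db, weights, -30.0)
print(w_q)
```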

Without interpolation or LUTs, the highest frequency resolution is 512 (based on the 1024 length FFT) and the highest weight resolution is 256 different weight values (which results in negligible reconstruction error). Each table is built for resolutions consecutively halving down to 16 frequencies and 8 weights. In this work we have $w_q \in \{10^{-2}, \dots, 10^{4}\}$ which extends the range used in [6]. The error between the different LUTs is evaluated using (11) where the highest resolution for frequency is $\hat{u} = 512$ and for weight $\hat{v} = \hat{u} / 2$, and the set of frequency and weight resolutions to evaluate are $u \in \{16, 32, 64, 128, 256\}$ and $v = u / 2$, respectively.
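The relative saving of each LUT resolution follows directly from the table sizes. The short sketch below assumes the cost is proportional to the number of stored soundfield reproductions (frequencies times weights), with the 512 by 256 table as the reference; the function name is hypothetical.

```python
def reduction_factor(u, v, u_hat=512, v_hat=256):
    """How many times fewer reproductions a u-by-v LUT needs than the reference."""
    return (u_hat * v_hat) / (u * v)

for v in (8, 16, 32, 64, 128):
    row = [int(reduction_factor(u, v)) for u in (16, 32, 64, 128, 256)]
    print(v, "weights:", row)

coarsest_share = 100.0 / reduction_factor(16, 8)   # percentage of reference cost
```

For the coarsest table (16 frequencies, 8 weights) this gives a factor of 1024, or roughly 0.10% of the reference number of reproductions.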

The proposed approach is further evaluated using PESQ [10]

to estimate the perceptual quality of the reproduced soundfields.

Speech files for the evaluation are taken from the TIMIT corpus [14]

where 20 files are chosen randomly. The male to female speaker

ratio of these files is 50:50. The reference signal for the PESQ algorithm is the original speech signal. PESQ values are obtained for the reproduced speech soundfields using the different resolution LUTs and then mapped to the PESQ Mean Opinion Score (MOS) [15]. These reproductions use $w_q \in \{10^{-0.5}, 10^{0.5}, 10^{1.5}, 10^{2.5}\}$ such that they lie primarily in the centre of the interpolation regions. This allows the highest resolution LUT to be evaluated; however, due to the computational complexity, the evaluation is limited to four different weights.
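The mapping from raw PESQ scores to MOS referred to in [15] is the logistic function of ITU-T P.862.1, sketched below. The function name is hypothetical, and the raw score itself is assumed to come from a separate PESQ implementation that is not shown here.

```python
import math

def pesq_to_mos_lqo(raw_score):
    """Map a raw P.862 score (about -0.5 to 4.5) to MOS-LQO per ITU-T P.862.1."""
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * raw_score + 4.6607))

max_mapped = pesq_to_mos_lqo(4.5)
print(round(max_mapped, 2))
```

The largest raw score of 4.5 maps to a MOS of about 4.55, which is the ceiling against which reproduced speech can be compared.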

6.2. Evaluation Results

Figure 3 illustrates the aliasing threshold described by (5).

Here, it is expected that aliasing will occur above 4 kHz and a pattern is noticeable, as marked by the black and white line, where it can be seen that the amplitude in both the bright and quiet zones becomes discernibly larger. It can also be seen that at about 8 kHz in the quiet zone significant aliasing occurs and, when the weighting is increased, it occurs at lower frequencies.

Analysing the MSE results between the different interpolation distances (Figure 4) indicates that the lower resolution LUTs require significantly fewer computations than those of higher resolution. This can be observed from the labels that show the relative decrease in the number of reproduced soundfields, which reached a factor of 1024 (just 0.10% of the number of computations of the highest resolution LUT) with an MSE of -85 dB, comparable to high end audio systems. In general, an increase in the interpolation distance increases the MSE.

As can be seen in Figure 5, the increased MSE caused by

higher interpolation distances appears to have no discernible impact

on the perceptual quality where the maximum mapped MOS is

indicated by the red line. Figure 5 does show, however, a slight

increase in the variability of the PESQ MOS, as indicated by the

95% confidence interval markers, where higher interpolation

distances are required. This shows that interpolating the weighted

soundfield values has an indiscernible perceptual effect on the

reproduction and decreases the computational complexity of the

problem with 1024 times fewer individual soundfield reproductions.

7. CONCLUSIONS

This paper proposed a method for building multizone soundfields

for speech signals which allows dynamic control of the weighting

between zones. We have proposed a method for reducing the

computational effort involved when dynamically weighting zones

for speech signals. We have also proposed a method for determining

the number of planewaves required for multiple frequency systems

utilising the orthogonal basis expansion approach. The method has

been evaluated and shows indiscernible impact on perceptual quality

of reproductions and decreased computational complexity. The

evaluations show a PESQ MOS of around 4.4 and an MSE of around -85 dB

with 1024 times fewer individual reproductions. Also demonstrated in

this paper is the use of LUTs for visualising speech soundfields and

possible reproduction problems. For instance, media designers or

engineers may benefit from the visualisation of these LUTs to better

predict when aliasing or other phenomena occur in a system.

8. ACKNOWLEDGEMENTS

The authors would like to thank Wenyu Jin for his helpful insight

into the orthogonal basis expansion method.

Figure 5 – PESQ MOS between weighted speech files reproduced by different LUTs with 95% confidence intervals. Labels show the relative complexity decrease from the highest resolution LUT. Red line indicates maximum mapped PESQ MOS. Panel title: Reproduction PESQ of LUTs. Axes: No of Frequencies (16, 32, 64, 128, 256) against PESQ MOS (4.2 to 4.6); one series each for 8, 16, 32, 64 and 128 weights.

Complexity labels for 16, 32, 64, 128 and 256 frequencies:
8 weights: 1024x, 512x, 256x, 128x, 64x
16 weights: 512x, 256x, 128x, 64x, 32x
32 weights: 256x, 128x, 64x, 32x, 16x
64 weights: 128x, 64x, 32x, 16x, 8x
128 weights: 64x, 32x, 16x, 8x, 4x

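Figure 4 – MSE between different LUT resolutions. Labels show the relative complexity decrease from the highest resolution LUT. Panel title: Mean Squared Error between LUTs. Axes: No of Frequencies (16, 32, 64, 128, 256) against MSE (dB) (-150 to -90); one series each for 8, 16, 32, 64 and 128 weights.

Complexity labels for 16, 32, 64, 128 and 256 frequencies:
8 weights: 1024x, 512x, 256x, 128x, 64x
16 weights: 512x, 256x, 128x, 64x, 32x
32 weights: 256x, 128x, 64x, 32x, 16x
64 weights: 128x, 64x, 32x, 16x, 8x
128 weights: 64x, 32x, 16x, 8x, 4x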

Figure 3 – LUT from the aliasing setup for the bright (A) and quiet (B) zones. The black and white line marks 4 kHz aliasing. Panel title: Zone Samples for various Weights and Frequencies. Axes for both panels: Frequency (Hz), Weight, Amplitude.

9. REFERENCES

[1] M. Poletti, “An Investigation of 2-D Multizone Surround

Sound Systems,” presented at the Audio Engineering Society

Convention 125, 2008.

[2] M. Kolundzija, C. Faller, and M. Vetterli, “Reproducing

Sound Fields Using MIMO Acoustic Channel Inversion,”

JAES, vol. 59, no. 10, pp. 721–734, Nov. 2011.

[3] Y. J. Wu and T. D. Abhayapala, “Theory and design of

soundfield reproduction using continuous loudspeaker

concept,” Audio, Speech, and Language Processing, IEEE

Transactions on, vol. 17, no. 1, pp. 107–116, 2009.

[4] Y. J. Wu and T. D. Abhayapala, “Spatial multizone

soundfield reproduction,” in Acoustics, Speech and Signal

Processing, 2009. ICASSP 2009. IEEE International

Conference on, 2009, pp. 93–96.

[5] Y. J. Wu and T. D. Abhayapala, “Spatial multizone

soundfield reproduction: Theory and design,” Audio, Speech,

and Language Processing, IEEE Transactions on, vol. 19,

no. 6, pp. 1711–1720, 2011.

[6] W. Jin, W. B. Kleijn, and D. Virette, “Multizone soundfield

reproduction using orthogonal basis expansion,” presented at

the Acoustics, Speech and Signal Processing (ICASSP),

2013 IEEE International Conference on, 2013, pp. 311–315.

[7] H. Chen, T. D. Abhayapala, and W. Zhang, “Enhanced sound

field reproduction within prioritized control region,” in

INTER-NOISE and NOISE-CON Congress and Conference

Proceedings, 2014, vol. 249, pp. 4055–4064.

[8] N. Radmanesh and I. S. Burnett, “Generation of isolated

wideband sound fields using a combined two-stage lasso-ls

algorithm,” Audio, Speech, and Language Processing, IEEE

Transactions on, vol. 21, no. 2, pp. 378–387, 2013.

[9] N. Radmanesh and I. S. Burnett, “Reproduction of

independent narrowband soundfields in a multizone surround

system and its extension to speech signal sources,” in

Acoustics, Speech and Signal Processing (ICASSP), 2011

IEEE International Conference on, 2011, pp. 461–464.

[10] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra,

“Perceptual evaluation of speech quality (PESQ)-a new

method for speech quality assessment of telephone networks

and codecs,” in 2001 IEEE International Conference on

Acoustics, Speech, and Signal Processing, 2001.

Proceedings. (ICASSP ’01), 2001, vol. 2, pp. 749–752 vol.2.

[11] G. H. Golub and C. F. Van Loan, Matrix Computations

(Johns Hopkins Studies in Mathematical Sciences). 1996.

[12] E. G. Williams, Fourier acoustics: sound radiation and

nearfield acoustical holography. Academic Press, 1999.

[13] M. Bosi and R. E. Goldberg, Introduction to digital audio

coding and standards. Springer, 2003.

[14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and

D. S. Pallett, “DARPA TIMIT acoustic-phonetic continuous

speech corpus CD-ROM. NIST speech disc 1-1.1,” NASA

STI/Recon Technical Report N, vol. 93, p. 27403, 1993.

[15] ITU-T Rec. P.862.1, “Mapping function for transforming P.862

raw result scores to MOS-LQO,” International

Telecommunication Union, Geneva, 2003.