Accepted for publication in IEEE/ACM Transactions on Audio, Speech, and Language Processing
A Simulation Study of a 3D Sound Field
Reproduction System for Immersive Communication
Hanieh Khalilian, Member, IEEE, Ivan V. Bajić, Senior Member, IEEE, and Rodney G. Vaughan, Fellow, IEEE

The authors are with the School of Engineering Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada. Tel: 1-778-782-7159. Fax: 1-778-782-4951. E-mail: hkhalili@sfu.ca, ibajic@ensc.sfu.ca, rodney_vaughan@sfu.ca.
Abstract—Immersive communication systems promise a greatly improved user experience through the use of advanced technologies tailored to various human senses, such as sight, hearing, and touch. This paper focuses on a simulation study of 3D Sound Field Reproduction (SFR) for immersive communication. At the transmitting end, the incident sound field is captured via a microphone array, active talkers are detected, and clean versions of the signals corresponding to the active talkers are obtained by existing methods. The captured information is transmitted to the receiving end, where a 3D sound field from virtual sources corresponding to the active talkers is synthesized around the listeners' heads by the proposed SFR method. In our system, the radiation patterns of the higher-order (directive) loudspeakers are optimized by the Constrained Matching Pursuit algorithm. Implementation of the directive loudspeaker patterns is not addressed here; their deployment is assumed. Simulation results quantify the benefit of higher-order loudspeakers for speech sound field synthesis in reverberant rooms.
Index Terms—3D sound field reproduction, immersive communication, higher-order loudspeakers, pattern and placement optimization.
I. INTRODUCTION
SOUND Field Reproduction (SFR), also known as sound
field synthesis, is the process of reproducing a desired
sound field within a listening region of interest by an array of
loudspeakers. The desired field corresponds to a signal from a
virtual source called the primary source. SFR is an important
part of immersive communication. As 3D video provides
visual localization of sound sources in the (virtual) 3D space,
the role of SFR is to complement the visual information by
providing the correct auditory localization of sound sources.
In this paper, we design an SFR system by jointly optimizing
the pattern and placement of the loudspeakers.
An overview of the requirements and technical challenges
related to audio immersive systems can be found in [1]–[3].
In [4], a 3D sound field reproduction system was designed and
implemented based on the boundary surface control principle.
The loudspeaker and microphone arrays were in the shape
of a dome. In this method, the sound field of an environ-
ment (such as a jungle, orchestra, etc.) is recorded by the
microphone array and then reproduced in a room. Subjective
listening experiments showed that most of the subjects rated
the reproduced sound field as “very good,” which was the best
out of five ratings.
In [5], a sound field reproduction system for immersive
audio is suggested in which the loudspeakers are placed
around the audience. The loudspeakers’ excitations (driving
functions) were determined using Wave Field Synthesis [6]–
[8]. Here, again, subjective tests showed that the quality of the
reproduced field was “high.” In the immersive audio system
described in [9], loudspeakers are located around the desktop
and the sound field originating from a point source located
in the middle of the screen is recreated. While the above approaches focused on headphone-free systems, the authors of [10] proposed a low-cost sound field reproduction system with the aid of headsets.
Conventional approaches to SFR include Ambisonics [11]–[13] and the above-mentioned Wave Field Synthesis. While these methods admit closed-form solutions to the SFR problem for certain array geometries, such as circular and spherical, they are not well-suited to studying approximations of the desired field in the presence of power constraints. For this reason, our focus is on SFR by direct approximation (a.k.a. pressure matching) [14]–[18]. Although closed-form solutions are not always available with direct approximation-based SFR, the approach allows us to study, through simulations, arbitrary array geometries and various loudspeaker radiation patterns.
The contribution of this paper is a new design approach
for SFR in the context of immersive communications. Given
the possible positions of the talkers and listeners at the two
ends of the communication link, the placement and radiation
patterns of a number of higher-order loudspeakers are jointly
optimized in the design phase to enable the synthesis of the
desired sound fields. Since the SFR error function is not
convex in terms of the locations of the loudspeakers [16], a
Constrained Matching Pursuit (CMP) algorithm is employed
for joint optimization of loudspeaker patterns and locations.
During the operation phase, direct approximation is employed
to reproduce the desired sound field around listeners’ heads.
It is emphasized that this is a simulation study, and we use 3D loudspeaker patterns synthesized with up to 5th-order spherical harmonics. Such synthesis is not yet practical to deploy, although third-order synthesis in 2D has already been reported in [19], [20]; it is expected that in the near future such pattern synthesis will be possible in practice.
The reasons for employing higher-order loudspeakers are two-fold. First, the power constraint is an important factor in SFR, for reasons such as controlling the sound field outside of the listening area and improving system robustness [16], [21]; employing higher-order loudspeakers enhances the system performance under a power limitation by concentrating most of the loudspeaker output power towards the listening area. Second, in some SFR applications in a reverberant environment, the goal is to increase the Direct-to-Reverberant Ratio (DRR) in the listening area [22], [23]; employing higher-order loudspeakers helps to improve the DRR, due to the loudspeakers' directivity patterns.

Fig. 1. Illustration of an immersive communication system.
The paper is organized as follows. An overview of the
audio layer of an audiovisual immersive communication sys-
tem is presented in Section II, outlining its basic operation,
assumptions and requirements. The preliminaries for sound
field reproduction are presented in Section III. The proposed
SFR method is presented in two phases: the design phase in
Section IV and the operation phase in Section V. Simulations
evaluating the system performance are described in Section VI,
followed by conclusions in Section VII.
II. SYSTEM OVERVIEW
Fig. 1 depicts an immersive communication system. Two
rooms are equipped with the necessary hardware, such as
3D displays, texture and depth cameras, microphone and
loudspeaker arrays, etc., and connected to each other via a
link with sufficiently high rate and low latency to support two-
way real-time transfer of the necessary information. There are
several participants in each room.
The goal of the immersive communication system is to make
the participants feel as if the physical distance between the
rooms has vanished and the two screens shown in Fig. 1 have
merged. To the participants, it should appear as if the screen
that they are watching is an open window to the other room.
Conceptually, each room should be seen as a virtual extension
of the other room, as illustrated in Fig. 2. In this figure,
one “talker” in room 1 is addressing two listeners in room
2. The immersive communication system makes it appear to
the two listeners, whose head positions are indicated by the
right-most dots, that the talker’s head is located at the position
indicated by the left-most dot in the virtual Room 1. The sound field generated around the heads of the two listeners should be the same as the one generated by a source at the left-most dot.
A configuration for the loudspeaker and microphone arrays
is required. In this study, the microphones and loudspeakers
are arranged in concentric rectangular arrays around the video
screen, as shown in Fig. 3. Other setups can be deployed
without changing the methodology.
In our simulations, the talkers are modeled as omni-
directional sound sources, which is a reasonable approximation
for face-to-face conversation [24]. Through the use of texture
and depth cameras, the 3D positions of the talkers’ and
listeners’ heads can be estimated via existing algorithms for
face and lip detection [25], [26].
It is important to realize that transmitting only the pressures
detected by the microphones is insufficient to achieve an
Fig. 2. The concept of a virtual extension for immersive communication:
Virtual Room 1 becomes a virtual extension of Room 2.
Fig. 3. Microphone (red) and speaker (blue) arrays for SFR.
immersive effect illustrated in Fig. 2(a), since this would not
create the virtual source at the required 3D position in the
virtual room. The actual position(s) of active talkers must be
transmitted as well. Since all participants are being tracked
by face and lip detection, what remains to be determined is
which of them are active talkers at any given time, and what
is the signal corresponding to each active talker. However, this is not the focus of the present paper, so we assume that clean versions of the speech signals corresponding to the active talkers are obtained using existing methods, for example [27].
The detected locations of the active talkers are transmitted to
the other end, where they are used to synthesize the sound field
around listeners’ heads. In practice, the sound and position in-
formation must be compressed prior to transmission. There are
a number of methods for multichannel audio compression [28],
while 3D positions of the talkers’ heads can be encoded
as point clouds [29], or even transmitted losslessly if the
number of participants is small. Compression, transmission, and echo cancellation (e.g., [30]) are assumed to be provided by existing methods, while our focus is on SFR, as discussed in the remainder of the paper.
III. SOUND FIELD REPRODUCTION
Let the participants in Room 1 be talkers and those in Room
2 be listeners. Suppose that the active talker’s location(s) and
their speech signal(s) are sent to Room 2 for sound field
reproduction. SFR systems are characterized by two types of
degrees of freedom (DOFs). Static DOFs are those that can
be exploited in the system design stage, such as the position
of loudspeakers, but remain fixed during system operation.
Dynamic DOFs are those that can be updated during system
operation, such as the loudspeaker excitations or driving
signals. The computational complexity of an SFR algorithm
is mostly influenced by the dynamic DOFs, as these need to
be computed in real time, during system operation.
The SFR literature is not always consistent on what consti-
tutes static and dynamic DOFs. For example, in the methods
proposed in [12], [14], [16], as well as this paper, the lo-
cations and radiation patterns of loudspeakers are treated as
static DOFs, while loudspeaker excitations are dynamic DOFs.
However, in [31], the radiation patterns, or equivalently the
harmonic (expansion) coefficients of the higher-order loud-
speakers, are also treated as dynamic DOFs and are updated
with any change in the primary source. The added flexibility afforded by the higher number of dynamic DOFs means that systems such as the one in [31] can be expected to perform better, although such a strategy is more computationally expensive. The
SFR method presented in this paper will be compared against
a loudspeaker array with omni-directional patterns, which is
referred to as the “benchmark” system. The benchmark system
has the same number of loudspeakers, hence the same number
of dynamic DOFs, as the proposed system, while the number
of its static DOFs is smaller because its radiation patterns are
of lower order (omni-directional) and not optimized.
In the remainder of this section, the notation and terminol-
ogy employed in this paper are explained. The design phase
and the operation phase of the proposed SFR system are then
explained in detail in Sections IV and V, respectively.
Let N be the number of loudspeakers, M be the number of sampling points in the listening area, x_n be the location of the n-th loudspeaker in the array, and y_m be the location of the m-th sampling point in Room 2. Also, let V_n(f) be the input voltage of the loudspeaker in [V], L_n(f, θ, φ) be the dimensionless radiation pattern of the loudspeaker, E(f) in [Pa·m/V] be the transfer function of the loudspeaker (which converts voltage to the loudspeaker excitation in [Pa·m]), and g_{m,n}(f; ‖x_n − y_m‖_2) be the free-space Green's function in [m⁻¹]. The dynamic pressure at y_m resulting from the n-th loudspeaker is given by:

p(f, x_n, y_m) = V_n(f) · E(f) · L_n(f, θ^n_m, φ^n_m) · g_{m,n}(f; ‖x_n − y_m‖_2) · M_m(f, θ^m_n, φ^m_n)
             = s_n(f) · L_n(f, θ^n_m, φ^n_m) · g_{m,n}(f; ‖x_n − y_m‖_2) · M_m(f, θ^m_n, φ^m_n),   (1)

where M_m(·) is the dimensionless receiving pattern at the m-th sampling point, and L_n(f, θ^n_m, φ^n_m) · g_{m,n}(f; ‖x_n − y_m‖_2) · M_m(f, θ^m_n, φ^m_n) is the Acoustic Transfer Function (ATF) between the n-th loudspeaker and the m-th sampling point. In this equation, (θ^n_m, φ^n_m) are the spherical angles of the m-th sampling point with respect to the n-th loudspeaker, and (θ^m_n, φ^m_n) are the spherical angles of the n-th loudspeaker with respect to the m-th sampling point. Following the convention in the literature [14]–[16], we consider the complex amplitude as the input to the loudspeaker, so s_n(f) = V_n(f) · E(f) in [Pa·m] is the complex amplitude (known as the excitation or driving function) of the input to the n-th loudspeaker. Note that this excitation is not a voltage or a current.
The free-space Green's function between x_n and y_m is given by:

g_{m,n}(f; ‖x_n − y_m‖_2) = e^{ik‖x_n − y_m‖_2} / (4π‖x_n − y_m‖_2),   (2)

where i = √−1, k = 2π/λ = 2πf/C is the wave number, λ is the wavelength, and C is the speed of sound.
Let s = [s_1, ..., s_N]^T be the vector of complex amplitudes of the loudspeakers, and let G be the ATF matrix between the loudspeaker array and the sampling points, whose (m, n)-th element is equal to g_{m,n} · L_n · M_m. The pressure produced at the sampling points is expressed by:

p = G s,   (3)

where p is an M × 1 vector.
Let x_0 be the location of an active talker in Room 1, and s_0 be the corresponding complex amplitude at frequency f. The desired pressure at the m-th sampling point is calculated from Eq. (4), assuming that the talker is modeled as an omni-directional source (L = 1):

p^des(f, x_0, y_m) = s_0(f) · 1 · g_m(f; ‖x_0 − y_m‖_2) · M_m(f, θ^m_0, φ^m_0).   (4)

The number and arrangement of the sampling points in the design phase (Section IV) and the operation phase (Section V) are different. During the design phase, M = K sampling points are distributed uniformly throughout the volume covering the possible locations of the listeners. The desired vector is denoted by p^des_d, with the subscript d identifying the design phase. Each element of this vector represents the pressure at a sampling point, calculated from (4) with M_m = 1.
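As a short sketch of how the design-phase desired vector p^des_d could be assembled from (4) with M_m = 1, building on greens_function above; the sampling-point layout and the default s0 are hypothetical.

def desired_field(f, x0, sample_points, s0=1.0, c=343.0):
    """Desired design-phase vector p_d^des of Eq. (4): an omni-directional
    talker at x0 (L = 1) observed at K sampling points with M_m = 1."""
    return np.array([s0 * greens_function(f, x0, ym, c) for ym in sample_points])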
During the operation phase, two cubic listening areas are considered around each listener's ears, as shown in Fig. 4. M_v sampling points are distributed uniformly in each cubic region. Therefore, with N_2 listeners in Room 2, the total number of sampling points is M = M_s = 2 · N_2 · M_v. The desired field and the ATF matrix in this phase are denoted by p^des_o and G_o, respectively, with the subscript o denoting the operation phase.

Fig. 4. Illustration of the cubic listening areas around the listener's ears.

Each element of the desired vector is calculated from (4), assuming that the point is located on a rigid sphere, which is a model for the human head. Therefore, the pressure at this sampling point is influenced by the Head-Related Transfer Function (HRTF), and the receiving pattern of this point (M_m) is replaced by the HRTF, H_m, described in [32]. Specifically, to calculate H_m, it is assumed that the sampling point is located on a rigid sphere, with spherical coordinates (r, θ, φ) = (0.11, π/2, 0) for the right ear and (0.11, π/2, π) for the left ear, with respect to the center of the rigid sphere. For each sampling point in the 10 cm cubes around the ears in Fig. 4, the center of the rigid sphere is found first, and then the HRTF H_m corresponding to this rigid sphere is computed. This allows for some head movement and imprecision in the head position estimates. Similarly, the elements of the ATF matrix in this phase are g_{m,n} · L_n · H_m.
IV. DESIGN PHASE
In the design phase, the locations and radiation patterns of
higher-order loudspeakers are optimized for a given frequency
range and possible locations of talkers and listeners using a
combination of Matching Pursuit (MP) [33] and Constrained
Matching Pursuit (CMP) [34] algorithms. During the design
phase, the frequency range and the region of space in which the talkers may be located are sampled. The desired field corresponding to each sampled frequency and talker location is calculated. Then the locations and patterns of a number of loudspeakers are optimized for each frequency-location pair. As mentioned earlier, the SFR error function is not convex in terms of loudspeaker locations [16], and the MP and CMP algorithms are not guaranteed to find globally optimal solutions; however, they are able to find solutions that provide substantial gain over the benchmark array of omni-directional loudspeakers.
We first present the CMP algorithm [34], which is an exten-
sion of MP [33] that incorporates a constraint on the norm of
the vector of expansion coefficients. Then the joint optimiza-
tion of loudspeaker patterns and locations for a monochromatic
primary source with known location is presented. Finally, this
joint optimization is used as a building block to optimize the
loudspeaker patterns for a range of frequencies and primary
source locations.
A. Constrained Matching Pursuit

In MP [33], a desired vector is approximated as a linear combination of vectors from a collection called the dictionary. Let p^des_d be the desired vector and D = {d_1, d_2, ..., d_{N_v}} be the dictionary. The result of the MP algorithm is:

p^des_d = Σ_{n=1}^{N} α_n d^{(n)} + R^{N+1}(p^des_d),   (5)

where the first term is an approximation of p^des_d as a linear combination of dictionary members and the last term is the error vector after N iterations. The symbol d^{(n)} represents the dictionary member selected at the n-th iteration and α_n is its assigned coefficient:

α_n = (d^{(n)})^H R^n(p^des_d).   (6)

The CMP algorithm [34] is a version of MP where a desired vector is approximated in terms of the dictionary members while the squared ℓ2-norm of α = [α_1, α_2, ..., α_N] is kept below p_a, that is, ‖α‖²_2 ≤ p_a. CMP is summarized in Algorithm 1.
Algorithm 1 Constrained Matching Pursuit
Input: D ⊲ dictionary
Input: p^des_d ⊲ desired vector
Input: N ⊲ number of iterations
Input: p_a ⊲ max. squared ℓ2-norm of the coefficient vector
Output: {d^{(n)}} ⊲ selected dictionary members
Output: α ⊲ coefficient vector
1: Set R^1(p^des_d) = p^des_d.
2: for n = 1 to N do
3:   Select the dictionary member and its corresponding coefficient by solving the following optimization problem:
       (d^{(n)}, α_n) = arg min_{d∈D, |α|²≤p_n} ‖αd − R^n(p^des_d)‖²_2,   (7)
     where p_n = p_a/N is the energy limit for the n-th coefficient.
4:   Compute the new error vector R^{n+1}(p^des_d) = R^n(p^des_d) − α_n d^{(n)}.
5: end for
6: return {d^{(n)}}, α = [α_1, α_2, ..., α_N].

The main difference between MP and CMP is in Step 3, equation (7). If the magnitudes of the dictionary members are
equal, the one that is most correlated with the current error
vector will be selected, same as in MP, and the corresponding
coefficient will be:
α_n = √p_n · (d^{(n)})^H R^n(p^des_d) / |(d^{(n)})^H R^n(p^des_d)|   if √p_n ≤ |(d^{(n)})^H R^n(p^des_d)|,
α_n = (d^{(n)})^H R^n(p^des_d)   otherwise.   (8)
However, if the magnitudes of the dictionary members are not
equal, the selected vector will not necessarily be the one that
is most correlated with the current error vector, but the one
that leads to the minimum residual error under the constraint
on the magnitude of its coefficient |αn|2≤pn.
In [33], the convergence of MP is analyzed and it is shown
that the approximation error decreases exponentially with the
number of iterations. The convergence of CMP was studied
in [21], which shows that the CMP error also decreases
exponentially with the number of iterations, albeit at a slower
rate than MP.
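To make the selection rule concrete, here is a minimal Python sketch of Algorithm 1. It assumes the dictionary is stored column-wise and solves (7) by clipping the magnitude of the least-squares coefficient, which reduces to (8) when the members have unit norm.

import numpy as np

def cmp(D, p_des, N, pa):
    """Constrained Matching Pursuit (Algorithm 1), a minimal sketch.

    D     : (K, Nv) complex dictionary, one member per column.
    p_des : (K,) desired vector; N : iterations; pa : max squared l2-norm.
    Returns the selected column indices and their coefficients.
    """
    R = p_des.copy()                      # current residual R^n(p_des)
    pn = pa / N                           # per-coefficient energy limit p_n
    idx, alphas = [], []
    for _ in range(N):
        best = (None, None, np.inf)
        for j in range(D.shape[1]):
            d = D[:, j]
            alpha = np.vdot(d, R) / np.vdot(d, d)   # least-squares coefficient
            # clip so that |alpha|^2 <= p_n, keeping the phase (cf. Eq. (8))
            if np.abs(alpha) ** 2 > pn:
                alpha = np.sqrt(pn) * alpha / np.abs(alpha)
            err = np.linalg.norm(R - alpha * d) ** 2
            if err < best[2]:
                best = (j, alpha, err)    # Eq. (7): keep the smallest residual
        j, alpha, _ = best
        idx.append(j)
        alphas.append(alpha)
        R = R - alpha * D[:, j]           # Step 4: update the residual
    return idx, np.array(alphas)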
B. Monochromatic primary source with known location
First, we assume that the primary source is at a known
location and emits a monochromatic sound field. We will later
extend the optimization to the case where the primary source
covers a range of frequencies and possible locations.
To optimize the loudspeakers' patterns, each loudspeaker in the array is considered to be an L-th order loudspeaker. The pressure produced at a point r = (r, θ, φ) by an L-th order loudspeaker located at the origin is given by [35], [36]:

p(f; r, θ, φ) = s(f) · Σ_{l=0}^{L} Σ_{m_d=−l}^{l} C_{l,m_d} h_l(kr) Y^{m_d}_l(θ, φ),   (9)

where s(f) is the complex amplitude of the higher-order loudspeaker, h_l(kr) is the l-th order (spherical) Hankel function, k is the wave number, Y^{m_d}_l(θ, φ) is the spherical harmonic of order l and degree m_d, and the C_{l,m_d} are the harmonic coefficients. The term corresponding to l = 0 is omni-directional. In the far field (kr ≫ 1):

h_l(kr) = (e^{ikr}/(kr)) · (−i)^{l+1}.   (10)
To be consistent with the formulation and notation of Section III, we rewrite (9) for the far field as follows:

p(f; r, θ, φ) = s(f) · (e^{ikr}/(4πr)) · Σ_{l=0}^{L} Σ_{m_d=−l}^{l} C′_{l,m_d} Y′^{m_d}_l(θ, φ) = s(f) · g(f; r) · L(f, θ, φ),   (11)

where g(f; r) is the free-space Green's function and L(f, θ, φ) = Σ_{l=0}^{L} Σ_{m_d=−l}^{l} C′_{l,m_d} Y′^{m_d}_l(θ, φ) is the radiation pattern of the higher-order loudspeaker. In this equation, C′_{l,m_d} = √(4π) · C_{l,m_d} · (−i)^{l+1}/k are called the expansion coefficients of the higher-order loudspeaker, and Y′^{m_d}_l(θ, φ) = √(4π) · Y^{m_d}_l(θ, φ). For a fair comparison between a system employing higher-order loudspeakers and a benchmark system employing omni-directional loudspeakers, the two systems should use the same input power. The ATF of an omni-directional source is given by (2), which is equal to the l = 0 term in (11) with C′_{0,0} = 1. In order to compare the performance of the two systems under the same power constraint, the integrals over the unit sphere of the radiation patterns of an omni-directional source (L(f, θ, φ) = 1) and of a higher-order loudspeaker (L(f, θ, φ) = Σ_{l=0}^{L} Σ_{m_d=−l}^{l} C′_{l,m_d} Y′^{m_d}_l(θ, φ)) should be equal. This leads to ‖c‖²_2 ≤ 1, where c is an (L+1)² × 1 vector containing the expansion coefficients in increasing order of l and m_d. Under these conditions, the ATF corresponding to the term (l, m_d) is g(f; r) · Y′^{m_d}_l(θ, φ). Note that (r, θ, φ) is calculated with respect to the location of the loudspeaker.
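To make (11) concrete, a short sketch evaluating the radiation pattern L(f, θ, φ) from a coefficient vector c follows; note that SciPy's sph_harm takes the azimuthal angle before the polar angle, hence the argument swap below.

import numpy as np
from scipy.special import sph_harm

def radiation_pattern(c, theta, phi, L=5):
    """Evaluate L(f, theta, phi) of Eq. (11) for one higher-order loudspeaker.

    c     : ((L+1)**2,) expansion coefficients C'_{l,md}, in increasing
            order of l and md (the constraint ||c||_2^2 <= 1 is assumed).
    theta : polar angle; phi : azimuthal angle (paper's convention).
    """
    val, i = 0.0 + 0.0j, 0
    for l in range(L + 1):
        for md in range(-l, l + 1):
            # Y'_l^md = sqrt(4*pi) * Y_l^md; scipy wants (m, l, azimuth, polar)
            val += c[i] * np.sqrt(4.0 * np.pi) * sph_harm(md, l, phi, theta)
            i += 1
    return val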
It is worth mentioning that, in practice, the expansion coefficients C′_{l,m_d} change with frequency. However, in our simulation study, as in other theoretical models [14], [16], [22], [23], [31], [37], it is assumed that these expansion coefficients do not change with frequency, which makes the radiation pattern constant across the frequency range. For example, in all cited (theoretical) papers, the radiation pattern of an omni-directional loudspeaker is assumed to remain omni-directional across all frequencies.
Now, suppose a region containing the possible locations of the listeners is given. K virtual sampling points are uniformly distributed throughout this region. At these K points, we collect samples of a monochromatic sound field originating from an omni-directional source (a model of the talker) at the known location of the talker in Virtual Room 1 (which is the virtual extension of Room 2, Fig. 2). These samples form the desired free-space field vector p^des_d.

For each loudspeaker location there are (L+1)² pattern coefficients (up to order L). Let B be a K × (L+1)² matrix whose elements are the terms g(f; r) · Y′^{m_d}_l(θ, φ) evaluated at the K virtual sampling points, in increasing order of l and m_d. This matrix contains the ATF of each term: g(f; r) from a given loudspeaker location to each of the virtual sampling points, multiplied by the corresponding pattern component Y′^{m_d}_l(θ, φ).
Algorithm 2 Loudspeaker location and pattern optimization
Input: D ⊲ dictionary
Input: p^des_d ⊲ desired vector
Input: N ⊲ number of loudspeakers in the array
Input: N_s ⊲ number of loudspeakers to be optimized
Output: {c^n} ⊲ loudspeaker pattern coefficients
Output: A ⊲ loudspeaker locations
1: Set A = ∅, R^1(p^des_d) = p^des_d.
2: for n = 1 to N_s do
3:   Find d ∈ D that is most correlated with R^n(p^des_d).
4:   Find the loudspeaker location that d corresponds to, and save it in the set A = A ∪ x^{(n)}.
5:   Extract all members of D that correspond to this location and place them into the set B.
6:   Apply Algorithm 1 with R^n(p^des_d) as the desired vector, B as the dictionary, p_a = 1, and (L+1)² as the number of iterations.
7:   The output of Algorithm 1 is the coefficient vector c^n = [c^n_1, c^n_2, ..., c^n_{(L+1)²}], containing the pattern coefficients (C′_{l,m_d}) of the n-th selected loudspeaker, and the selected vectors {b^{(j)}} from B.
8:   Set D = D \ B.
9:   Compute the new error vector:
       R^{n+1}(p^des_d) = R^n(p^des_d) − ((p̂_n)^H R^n(p^des_d)) p̂_n,   (12)
     where p̂_n = Σ_{j=1}^{(L+1)²} c^n_j b^{(j)}.
10: end for
11: return {c^n} and A.
Algorithm 2 designs the loudspeaker radiation patterns and locations of N_s out of N loudspeakers for a monochromatic primary source. First, the matrix B is formed for each loudspeaker location in the array, and its columns are placed in the set D as dictionary members. Since each B has (L+1)² columns and there are N such matrices, the set D initially contains N(L+1)² members. Each dictionary member is identified with a possible higher-order pattern at a corresponding loudspeaker location. The algorithm seeks a combination of dictionary members that minimizes the ℓ2 error between the synthesized and the desired field.
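A rough Python sketch of Algorithm 2's control flow, building on the cmp sketch above; `members` maps each candidate location to its (L+1)² column indices in D, and normalizing the synthesized pattern p̂_n before the residual update in (12) is our assumption.

import numpy as np

def optimize_speakers(D, members, p_des, Ns, L=5):
    """Joint location/pattern selection (Algorithm 2), a minimal sketch."""
    R = p_des.copy()
    available = set(members)                  # locations still in D
    locations, patterns = [], []
    for _ in range(Ns):
        # Step 3: dictionary member most correlated with the residual
        cols = [j for n in available for j in members[n]]
        corr = np.abs(D[:, cols].conj().T @ R)
        j_best = cols[int(np.argmax(corr))]
        n_best = next(n for n in available if j_best in members[n])
        # Steps 5-7: run CMP over this location's members with pa = 1
        B = D[:, members[n_best]]
        idx, alphas = cmp(B, R, (L + 1) ** 2, pa=1.0)
        c_n = np.zeros((L + 1) ** 2, dtype=complex)
        for j, a in zip(idx, alphas):         # accumulate repeated picks
            c_n[j] += a
        # Step 9: deflate the residual along the synthesized pattern
        p_hat = B @ c_n
        p_hat = p_hat / np.linalg.norm(p_hat)
        R = R - np.vdot(p_hat, R) * p_hat
        locations.append(n_best)
        patterns.append(c_n)
        available.remove(n_best)              # Step 8: D = D \ B
    return locations, patterns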
Algorithm 2 is the basic building block for loudspeaker pattern design. It optimizes the locations and patterns of N_s loudspeakers out of the N loudspeakers in the array for a monochromatic primary source at a known, fixed location. More generally, there will be a certain volume of space in the rooms that provides comfortable viewing of the 3D display, and this entire volume should be taken into account, as possible locations of the talkers and listeners, when designing the loudspeaker patterns. An example is given in Fig. 5, where the volume of interest is illustrated as an inscribed rectangular parallelepiped. In addition, the sound fields of interest will contain a range of frequencies rather than a single tone. In the next subsection, we use Algorithm 2 as a building block for optimizing an array of loudspeakers for this more general case.
Fig. 5. Possible locations of listeners’ heads in room 2. The inscribed
parallelepiped is the possible volume of interest for sound field control.
C. Primary source with a range of frequencies and possible locations

In [16], two methods were proposed for optimizing the loudspeaker locations when the primary source has multiple frequency components: frequency-wise design (FWD) and joint frequency design (JFD). FWD optimizes the locations of the loudspeakers by considering one frequency at a time, whereas JFD considers all frequencies at once. We tested both strategies in our system, which, in addition to locations, also tries to optimize the loudspeaker patterns for various frequencies and source locations. We found that the FWD-like strategy results in better directivity of the designed patterns. Hence, our strategy for dealing with multiple source frequencies and locations is to partition the design space and optimize a group of loudspeakers for each frequency-location pair.
As before, to account for all possible listeners' positions in the volume of interest, we distribute K virtual sampling points across the entire volume of interest (the inscribed parallelepiped in Fig. 5): {y^l_1, y^l_2, ..., y^l_K}. Next, to account for various possible active talker locations, we distribute W points {x^t_1, x^t_2, ..., x^t_W} uniformly across the volume of interest as representative locations. Finally, to account for the frequency band of interest (typically taken as below 4000 Hz, where most of the energy of human speech is concentrated [38]), we distribute Y frequency points f_1 < f_2 < ... < f_Y across this range. The idea is simple: select a different group of loudspeakers for each pair (x^t_w, f_y) and optimize their locations and patterns for that position-frequency pair using Algorithm 2. In particular, let n_{w,y} be the number of loudspeakers allocated to the pair (x^t_w, f_y), so that Σ_{w=1}^{W} Σ_{y=1}^{Y} n_{w,y} = N. Pattern design is then performed via Algorithm 3.
It is also possible to extend the design to a region exterior to
the inscribed parallelepiped in Fig. 5 and force the sound field
to be reduced in the exterior in order to diminish undesired
reverberation, as discussed for the 2D case in [39].
In Algorithm 3, in the first iteration, the locations and patterns of n_{1,Y} out of N loudspeakers are optimized; in the second iteration, n_{2,Y} out of the remaining N − n_{1,Y} loudspeakers are optimized, and so on. Therefore, this algorithm jointly optimizes the placement and expansion coefficients of the loudspeakers at different iterations and for different frequency-position pairs of the primary source. The number of loudspeaker locations that are available to choose from decreases with increasing iterations. Hence, the number of static DOFs is largest for the first frequency-position pair and decreases for the subsequent pairs.

Algorithm 3 Extended loudspeaker pattern optimization
Input: {f_1, f_2, ..., f_Y} ⊲ frequencies of interest
Input: {x^t_1, x^t_2, ..., x^t_W} ⊲ possible talkers' locations
Input: {y^l_1, y^l_2, ..., y^l_K} ⊲ possible listeners' locations
Input: {n_{w,y}} ⊲ number of loudspeakers per frequency-location pair
Output: {c^n} ⊲ loudspeaker pattern coefficients
1: for y = Y to 1 do
2:   for w = 1 to W do
3:     Compute the sound pressure at frequency f_y from a source at x^t_w at each of the K sampling points y^l_k, and store the results in p^des_d.
4:     Compute the matrix B at frequency f_y from all remaining loudspeakers to all K sampling points y^l_k, and store its columns as dictionary members in D.
5:     Run Algorithm 2 with D as the dictionary, p^des_d as the desired vector, and n_{w,y} as the number of loudspeakers.
6:     Store the resulting pattern coefficients c^n for the selected loudspeakers.
7:     Remove the loudspeakers selected at this iteration from further consideration.
8:   end for
9: end for
10: return {c^n}

Our simulations confirm that high frequencies are the most challenging for SFR (as expected from the smaller wavelengths), so the algorithm starts with the highest frequency of interest (f_Y), assigns more static DOFs (placement and expansion coefficients) to these frequencies, and then moves towards lower frequencies. In this way, the most flexibility is afforded to the higher frequencies, while the lower frequencies end up with a limited selection of remaining loudspeakers.
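The nested loops of Algorithm 3 might be sketched as follows, reusing the earlier sketches; build_dictionary is a hypothetical helper that assembles D and the location-to-column map for the loudspeakers still available at frequency f, and n_alloc[(w, y)] supplies n_{w,y}.

def design_all_patterns(freqs, talker_locs, listen_pts, n_alloc,
                        build_dictionary, n_total=48):
    """Sketch of Algorithm 3's control flow (highest frequency first)."""
    patterns = {}
    remaining = set(range(n_total))           # loudspeakers not yet optimized
    for y, f in enumerate(sorted(freqs, reverse=True)):   # f_Y down to f_1
        for w, xt in enumerate(talker_locs):
            p_des = desired_field(f, xt, listen_pts)
            D, members = build_dictionary(f, remaining)
            locs, cs = optimize_speakers(D, members, p_des, n_alloc[(w, y)])
            patterns.update(dict(zip(locs, cs)))
            remaining -= set(locs)            # no reselection in later pairs
    return patterns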
It is worth noting that Algorithm 2 results in a sub-optimal
solution in general, because the reproduction error is not a
convex function of the loudspeakers’ locations [16]. Since
Algorithm 3 depends on Algorithm 2, this also means that
the result of Algorithm 3 is sub-optimal in general.
The designed radiation patterns can be implemented by combining simple (and sufficiently small) monopole loudspeakers in a specific array configuration [35, Ch. 6, p. 198]. The implementation can be viewed as a microelectronics design challenge, where printed loudspeakers and their array weights are integrated into a single device. Higher-order loudspeakers were used for 2D SFR in a reverberant room in [40], and the authors of [19], [20] have implemented and tested third-order microphones and loudspeakers in practice. Another way to implement higher-order loudspeakers is to approximate the desired pattern by off-the-shelf loudspeakers in a specific configuration.
V. OPERATION PHASE
In order to reproduce the sound field around the listeners’
heads, the following steps are performed:
Step 1: An omni-directional virtual primary source is placed in the virtual Room 1 (Fig. 2) at the location of the active talker, which is transmitted from Room 1.

Step 2: With the help of the monitoring system, the listeners' heads are detected, and two cubic regions around their ears (Fig. 4) are considered as listening areas. M_v sampling points are distributed uniformly in each cubic region, so for N_2 listeners in Room 2, the total number of sampling points is M_s = 2 · M_v · N_2.
Step 3: The desired field, p^des_o, is calculated as the pressure sensed around the listeners' ears from the virtual primary source. The pressure sensed by the listeners at the sampling points is computed from (4), with M_m replaced by the Head-Related Transfer Function (HRTF) of the listeners, H_m.

Step 4: The ATF matrix, G_o, from the loudspeaker array to the sampling points is calculated. The (m_s, n)-th element of this matrix is the ATF from the n-th loudspeaker to the m_s-th sampling point: L_n(f, θ^n_{m_s}, φ^n_{m_s}) · g_{m_s,n}(f; ‖x_n − y_{m_s}‖_2) · H_{m_s}(f, x_n, y_{m_s}).
Step 5: The complex excitations of the loudspeakers constitute the dynamic DOFs, which are updated as the desired sound field changes. The optimal excitation vector s_opt is found by solving

s_opt = arg min_{‖s‖_2 ≤ p_max} ‖p^des_o − G_o s‖²_2,   (13)

which results in the following solution:

s_opt = (G_o^H G_o + γI)^{−1} G_o^H p^des_o,   (14)

where γ > 0 is the regularization factor, I is the identity matrix, and p_max is the maximum normalized power allowed for the loudspeaker array. In our simulations, γ is calculated as a function of p_max as described in [15]. Specifically, γ is used to deal with ill-conditioning of the ATF matrix. At low frequencies, the sound pressure at the virtual sampling points in the cubic regions around the listeners' ears becomes very similar, which increases the condition number of the ATF matrix. Also, increasing the number of loudspeakers increases the condition number of the ATF matrix, because the distance between loudspeakers decreases and their ATFs become more similar. Therefore, the regularization factor γ is used in solving (13), as described in [14], [15].
Using the above steps, the complex amplitudes of the loudspeakers are found in the frequency domain for simulation purposes. Time-domain filters for updating the amplitude and phase (complex amplitude) of the loudspeakers are given in [41]; these would be more appropriate for practical deployment. Note that the locations and the expansion coefficients of the loudspeakers are selected in the design phase and do not change during the operation phase.
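A minimal sketch of Step 5, i.e., the regularized solution (14); in the paper γ is derived from p_max as in [15], whereas here it is simply an input.

import numpy as np

def optimal_excitations(G_o, p_des_o, gamma):
    """Regularized least-squares excitations s_opt of Eq. (14)."""
    N = G_o.shape[1]
    A = G_o.conj().T @ G_o + gamma * np.eye(N)   # G_o^H G_o + gamma*I
    return np.linalg.solve(A, G_o.conj().T @ p_des_o)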
VI. SIMULATIONS
A. Experimental setup
System parameters: In our simulations, the width × height × depth dimensions of Room 2 are 6.4 m × 3 m × 5 m. The video screen is assumed to fit within a 2 m × 2 m frame, and N = 48 loudspeakers are uniformly distributed on a larger 2.5 m × 2.5 m peripheral array, as illustrated in Fig. 3. The order of the loudspeakers is L = 5, and p_max = 10⁻⁴ in all simulations unless otherwise stated.
The screen and the loudspeaker array are placed on the x-y wall at a distance of 0.1 m from the wall. The origin is at the center of the x-y wall, so z > 0 represents points in Room 2, while z < 0 represents points from Room 1 mapped to the virtual Room 1, as shown in Fig. 2. During the operation phase, the listening volume consists of two 10 cm × 10 cm × 10 cm cubes located around the ears of each listener, as shown in Fig. 4. To model the reverberant room in our simulations, the image source model [42] is used. In our configuration, the microphone and loudspeaker arrays are installed on the wall, so that wall should be considered a rough surface. This implies that the reflections from that wall are not fully coherent, so they are modeled by the incoherent image source method from [43]. All image sources whose power is at least 1% of the power of the actual source are retained.
Evaluation metrics: The HRTF-based reproduction error, the ILD error, cross-talk cancellation, and the Perceptual Evaluation of Speech Quality (PESQ) are used to assess SFR system performance. During the operation phase, the HRTF-based error (in dB) is calculated as follows:

Error (dB) = 10 log10 ( ‖p^des_o − G_o s_opt‖²_2 / ‖p^des_o‖²_2 ).   (15)
In addition, in order to provide a quantitative measure of the sense of immersion, we also investigate how well listeners are able to localize the virtual sources of the reproduced sound fields by computing two important parameters: the Interaural Time Difference (ITD) and the Interaural Level Difference (ILD). For frequencies below 1500 Hz, ITD plays a more important role in sound localization than ILD, while at higher frequencies, ILD and the ITD of the signal envelope are more important [44], [45] than the ITD of the fine structure of the signals [14], [16].
In our simulations, the normalized ILD error is calculated as follows. The desired ILD is calculated at all pairs of virtual sampling points (each pair corresponding to the two ears of one listener) and arranged in an (M_s/2 = 2M_vN_2/2) × 1 vector ILD_des. Then, the ILD after sound field reproduction is calculated at the same pairs of sampling points and arranged in another vector ILD_rep. Finally, the ILD error (in dB) is calculated as:

ILD Error (dB) = 10 log10 ( ‖ILD_rep − ILD_des‖²_2 / ‖ILD_des‖²_2 ).   (16)
For an N-loudspeaker × M_s-sampling-point reproduction system, cross-talk cancellation is defined as follows. Let G_o be the ATF matrix from the N loudspeakers to the M_s sampling points, and let G_o^+ = (G_o^H G_o + γI)^{−1} G_o^H be a pseudo-inverse of this matrix. Let H = G_o G_o^+. The channel separation between the m-th and m′-th sampling points is defined as:

CS_{m,m′} = |h_{m,m}| / |h_{m,m′}|.   (17)

Ideally, H is equal to the identity matrix and CS is infinite for all pairs of sampling points. However, for an overdetermined system (N < M_s), H is not necessarily equal to the identity matrix, which means that the field reproduced at the m-th sampling point is influenced by the field reproduced at the m′-th sampling point. The cross-talk cancellation between these two sampling points is equal to CS^{−1}_{m,m′}. In our simulations, the cross-talk cancellation is calculated as the average CS^{−1}_{m,m′} in dB across all pairs of sampling points [46]:

CTC = (1/(M_s(M_s − 1))) Σ_{m=1}^{M_s} Σ_{m′=1, m′≠m}^{M_s} 20 log10 (|h_{m,m′}| / |h_{m,m}|).   (18)
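For reference, the per-frequency metrics (15), (16), and (18) could be computed as follows; a sketch, with matrix shapes following Section V.

import numpy as np

def hrtf_error_db(p_des, G_o, s_opt):
    """Eq. (15): normalized HRTF-based reproduction error in dB."""
    num = np.linalg.norm(p_des - G_o @ s_opt) ** 2
    return 10.0 * np.log10(num / np.linalg.norm(p_des) ** 2)

def ild_error_db(ild_rep, ild_des):
    """Eq. (16): normalized ILD error in dB."""
    num = np.linalg.norm(ild_rep - ild_des) ** 2
    return 10.0 * np.log10(num / np.linalg.norm(ild_des) ** 2)

def ctc_db(G_o, gamma):
    """Eq. (18): cross-talk cancellation averaged over sampling-point pairs."""
    Ms, N = G_o.shape
    G_pinv = np.linalg.solve(G_o.conj().T @ G_o + gamma * np.eye(N),
                             G_o.conj().T)              # regularized pseudo-inverse
    H = G_o @ G_pinv
    total = 0.0
    for m in range(Ms):
        for mp in range(Ms):
            if mp != m:
                total += 20.0 * np.log10(np.abs(H[m, mp]) / np.abs(H[m, m]))
    return total / (Ms * (Ms - 1))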
The above-mentioned metrics evaluate the system performance for a monochromatic primary source (per frequency bin). PESQ is recommended by ITU-T [47] for measuring the perceptual quality of a wide-band (16 kHz) speech signal. It takes into account human psycho-acoustic models, and it is believed to be more representative of subjective testing than other metrics. The obtained PESQ score is mapped to the range [1, 4.5], and this score is called the Mean Opinion Score PESQ (MOS-PESQ) or Predicted MOS (PMOS). The details of computing PMOS are explained in [48]. In our paper, the wide-band PMOS score is calculated for 16 kHz audio files¹. For this purpose, two sampling points, located at the centers of the cubes around the listener's head, are selected from the cubic listening areas. Then the desired audio files, as well as the audio files reproduced by the optimized patterns and by the benchmark configuration, are recorded at these points. Since this score evaluates the quality of a single-channel audio file, it is calculated for each of the two selected sampling points separately and then averaged into a single score.

¹http://www.voxforge.org/
In the following, Section VI-B presents the objective evaluation of the proposed algorithms for SFR, and Section VI-C compares our SFR system with three other SFR systems from the literature [22], [23], [49].
B. Objective evaluation
Design phase: Algorithm 3 is employed for pattern design. In this algorithm, the listeners' locations are not known exactly in advance. They are presumed to be somewhere in the inscribed parallelepiped in Fig. 5, whose volume is delimited by (±1.6, ±0.05, 2 ± 0.05). K = 200 virtual sampling points are distributed across the volume of possible listeners' locations (50 samples uniformly in the x-direction and 2 samples in the y- and z-directions) to optimize the expansion coefficients of the loudspeakers using Algorithm 3. The talkers' locations are assumed to be on two straight lines, one between points (−1, 0, −2) and (−1, 0, 0), and another between points (1, 0, −2) and (1, 0, 0). For pattern design, these two lines are sampled at W = 6 points x^t_w ∈ {(1, 0, 0), (1, 0, −1), (1, 0, −2), (−1, 0, 0), (−1, 0, −1), (−1, 0, −2)}. Eight frequencies of interest (Y = 8) are considered in the loudspeaker pattern design (f_y ∈ {500, 1000, 1500, 2000, 2500, 3000, 3500, 4000} Hz), and n_{w,y} = 1 loudspeaker is assigned to each location-frequency pair. It should be noted that, in the design phase, the walls are assumed to be fully absorbent; in other words, the free-space propagation model is used to calculate the desired vectors and ATF matrices. With this assumption, the designed patterns are independent of the room size and the materials used in the room.

Fig. 6. Far-field patterns of the loudspeakers at the corners of the loudspeaker array. The heavy lines on the left indicate the possible locations of the talkers, and the parallelepiped at the right is the possible region of interest.
The far-field radiation patterns of four loudspeakers at the corners of the array are shown in Fig. 6. The black rectangle represents the locations of the other loudspeakers in the array, whose patterns are not shown to keep the figure clear. In this figure, the heavy horizontal lines to the left show the possible locations of the talkers, and the parallelepiped shows the possible locations of the listeners. Since each loudspeaker is designed for one location-frequency pair using Algorithm 3, the patterns are not symmetric. This is because different loudspeakers get selected in different iterations of Algorithm 3, and loudspeakers that have been optimized cannot be selected again in subsequent iterations. So even when we consider two symmetric possible locations of a talker, these locations are considered in different iterations of Algorithm 3, with different numbers of available loudspeakers to optimize, so there is no guarantee that the resulting patterns will be symmetric. However, as indicated in the figure, a significant portion of the power is directed towards the region where the listeners are likely to be. Unless otherwise stated, these radiation patterns are used in the following simulations.
Operation phase - array pattern: In this simulation, it is assumed that all walls are fully absorbent (i.e., free space) and that there is only one listener in Room 2, located at (0, 0, 2). One active talker is assumed to be located at (−1, 0, −1/2), with an operating frequency of 1000 Hz. During the operation phase, the complex amplitudes of the loudspeakers are calculated from (14) for both the benchmark and the proposed system. The resulting 2D cuts of the far-field radiation patterns of the loudspeaker array are shown in Fig. 7 for the plane y = 0. The figure illustrates that, in the proposed system, a significant part of the power is focused towards the listening area, whereas the benchmark system directs much power in other directions, in particular away from the listening room. The performance of the proposed method is therefore expected to be better when the two systems use the same input power.

Fig. 7. Linear far-field radiation pattern cuts of (a) the benchmark configuration and (b) the proposed method. The relative locations of the talker and listener are shown by red and green squares.
Fig. 8 shows a snapshot of the desired field and of the fields reproduced by the proposed system and the benchmark system at f = 1000 Hz, on one of the cutting planes (y = 0) near the left and right cubic listening areas around the listener's ears. The red squares in these figures show the locations of the cubic listening areas. This figure confirms that the field produced by our system is more similar to the desired one than the field produced by the benchmark system, both inside and outside the listening areas.
Fig. 8. Real parts of the (a), (b) desired fields, (c), (d) fields produced by our system, and (e), (f) fields produced by the benchmark system, around the left ear (left column) and right ear (right column). The cubic listening areas are indicated with red squares.

Operation phase - metrics versus frequency: In this simulation, two listeners are located at (−1.5, 0, 2) and (1.5, 0, 2) in a reverberant room with a reverberation time of T60 = 0.2 sec. This value was selected because [50] reports the reverberation time of a conference room with a volume of 146 m³ to be between 0.2 sec and 0.4 sec over the frequency range up to 4 kHz; the lower bound was chosen because the volume of the simulated room (96 m³) is smaller. In this test, there is one active talker at (−1, 0, −1/2). The HRTF-based reproduction error and the ILD error are shown in Fig. 9(a) and (b) for the frequency range 100-20,000 Hz. According to Fig. 9(a), the optimized system outperforms the benchmark in terms of HRTF-based error by margins ranging from 40 dB down to 3 dB for frequencies below 4 kHz, the range for which the expansion coefficients of the higher-order loudspeakers were optimized. As seen from Fig. 9(b), the ILD error is smaller in the
optimized system at lower frequencies. Note that both systems
perform better at lower frequencies. The reason is as follows.
There are M_s = 2N_2M_v = 2 · 2 · 27 = 108 virtual sampling points (the size of p^des_o) and only N = 48 loudspeakers (the size of s_opt); hence, the system in (13) is over-determined. However, at low frequencies, the pressure values do not differ much at neighboring virtual sampling points, which leads to (almost) linearly dependent equations and consequently makes the number of linearly independent equations closer to the number of unknowns. At higher frequencies this is no longer the case, so the performance suffers.
ITD is important for sound localization at frequencies below 1500 Hz [44], [45]. This parameter is shown in Fig. 9(c) for the listener located at (−1.5, 0, 2), obtained by finding the time lag that maximizes the Interaural Cross-Correlation (IACC) coefficient between the signals at the two ears [51]. According to this figure, the optimized system allows the synthesized field to match the desired ITD, while the benchmark system does not perform well at low frequencies.
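A sketch of this ITD estimate (the lag maximizing the cross-correlation of the two ear signals); the function name and sampling-rate argument are our own.

import numpy as np

def itd_ms(left, right, fs):
    """ITD estimate in milliseconds: the lag maximizing the interaural
    cross-correlation between the two ear signals."""
    xc = np.correlate(left, right, mode='full')
    lag = int(np.argmax(xc)) - (len(right) - 1)   # lag 0 sits at index len-1
    return 1000.0 * lag / fs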
TABLE I
WIDE-BAND PMOS SCORES FOR 16 kHz SPEECH FILES.

Speech sample   Benchmark   Optimized
1               3.3400      4.2755
2               4.0935      4.4310
3               3.3980      4.3015
4               4.1270      4.4540
5               3.3930      4.1350
6               4.0310      4.4150
7               3.4290      4.1325
8               4.0655      4.4375
9               3.5990      4.2350
10              3.6100      4.2850
Average         3.7086      4.3102

The cross-talk cancellation, as defined in (18), is shown in Fig. 9(d). According to this figure, at lower frequencies the proposed system achieves better cross-talk cancellation. The reason is that the ATF matrix of the benchmark system has a larger condition number, so the matrix H in this case is less similar to the identity matrix, which leads to less cross-talk cancellation between the sampling points.
The relatively poor performance of the benchmark configuration at low frequencies is due to the regularization in (14). Recall that the regularization parameter γ in (14) is computed as suggested in [15] to limit the maximum normalized power to p_max = 10⁻⁴. At low frequencies, this causes relatively poorer sound field reproduction compared to the optimized system. The performance of the benchmark configuration improves with higher allowable power at these frequencies.
In some SFR systems in reverberant environments, the goal is to maximize the Direct-to-Reverberant Ratio (DRR) [22], [23]. It may be expected that higher-order directional loudspeakers improve the DRR, and indeed our simulations confirm this. In this experiment, it is assumed that the desired field is specified in free space, while all other parameters are the same as in the previous experiment. The frequency-dependent DRR is defined as [52] (cf. the Rice factor):

DRR(f) = 10 log10 ( |D_i(f)|² / |R_v(f)|² ),   (19)

where D_i(f) is the direct component of the reproduced field and R_v(f) represents the reverberant components. The average DRR across all sampling points for the optimized and benchmark systems is shown in Fig. 10, where the gains brought by higher-order loudspeakers exceed 20 dB at lower frequencies.
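A one-line sketch of (19) for a single frequency bin; the separation into direct and reverberant components is assumed to come from the simulation.

import numpy as np

def drr_db(direct, reverberant):
    """Eq. (19): direct-to-reverberant ratio at one frequency, in dB."""
    return 10.0 * np.log10(np.abs(direct) ** 2 / np.abs(reverberant) ** 2)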
Finally, the average PMOS over the left and right ears for the listener located at (1.5, 0, 2) is shown in Table I for 10 audio files, each 5 sec long. According to these results, the quality of the sound produced by the benchmark system is between 3 and 4 in most cases, which is interpreted as "Good" quality. Meanwhile, the proposed system achieves a PMOS above 4 in all cases, which is interpreted as "Excellent" quality.
Fig. 9. (a) HRTF-based reproduction error, (b) ILD error, (c) ITD for the listener at (−1.5, 0, 2), and (d) cross-talk cancellation, CTC, from Eq. (18).

Fig. 10. Direct-to-Reverberant Ratio (DRR) in dB versus frequency.

Operation phase - number of listeners: In the next simulation, the number of listeners varies between 3 and 12. As in the previous simulation, the active talker is at (−1, 0, −1/2). For the cases with 3-7 listeners, all listeners are inside the inscribed parallelepiped in Fig. 5; the remaining listeners (8th, 9th, ..., 12th) are placed outside it. The HRTF-based reproduction error, the ILD error, and
the cross-talk cancellation are shown in Fig. 11(a), (b), and (c) for a talker frequency of 800 Hz. Fig. 11(d) shows the average PMOS score over 10 audio files for a listener at a fixed location, (1.5, 0, 2). These results show that as the number of listeners increases, the performance of both systems degrades. The reason is that with more listeners, the number of sampling points increases, which means that the number of equations in (13) increases while the number of dynamic DOFs (unknowns) is fixed. Therefore, the system of equations in (13) becomes more over-determined, which results in performance degradation. Nonetheless, the proposed optimized system provides a substantial gain over the benchmark system.
Fig. 11. (a) HRTF-based reproduction error, (b) ILD error, (c) cross-talk cancellation at a frequency of 800 Hz, and (d) the average PMOS score over 10 audio files, versus the number of listeners.

Fig. 12. (a) HRTF-based error and (b) ILD error of the proposed method versus the length of the cubic listening area.

Operation phase - length of the cubic listening area: The next simulation, shown in Fig. 12, examines the reproduction error and ILD error of the optimized system as a function of the side length of the cubic listening area, across the frequency range 500-20,000 Hz, when two listeners are located at (−1.5, 0, 2) and (1.5, 0, 2) and an active talker is at (−1, 0, −1/2). In this test, the number of sampling points is fixed for all sizes of the cubic listening area. The results show that a more precise approximation of the ear locations results in lower reproduction and ILD errors across the whole frequency range. For example, when the side length of the listening area is 2 cm, the reproduction error is below −20 dB and the ILD error is below −10 dB for frequencies less than 10 kHz.
Operation phase - reverberation time: In this test, the HRTF-based reproduction error and ILD error are depicted in Fig. 13(a) and (b) for three different rooms with reverberation times of 1.19 sec, 0.32 sec, and 0.11 sec, over the frequency range 500-20,000 Hz. According to this figure, as the reverberation time increases, the performance of the proposed system degrades. The reason is two-fold: first, the radiation patterns are optimized in free space, so with increasing reverberation time the actual room deviates more from the model used in the design phase; second, with increasing reverberation time, the strength of the image sources in all directions becomes comparable to that of the direct path, so directivity does not help as much. In other words, in a highly reverberant room, the performance of the proposed system approaches that of the benchmark system, as shown in Fig. 13(c) and (d) for the frequency range 100-4000 Hz. In this figure, the HRTF-based error and ILD error of the benchmark system are shown by dashed lines. This quantifies how much better the loudspeaker pattern design performs if the walls of the conference room are insulated with an absorbent material, such as acoustic tiles.
Fig. 13. (a) HRTF-based error and (b) ILD error for three reverberant rooms. (c) and (d) Comparison between the optimized system (solid lines) and the benchmark system (dashed lines) over the frequency range 100 Hz to 4000 Hz for three reverberant rooms.

Design and operation phases - number of DOFs: Finally,
the effects of changing the number of static and dynamic DOFs in our method are investigated. First, the HRTF-based reproduction error is shown in Fig. 14(a) when there are N = 48 loudspeakers in the array and their orders are L = 2 and L = 5. For L = 2, the radiation patterns are designed for the same talker and listener positions as for L = 5. There is a factor-of-4 difference in the number of static DOFs ((L+1)²) between these two cases. In Fig. 14(b), the order is L = 5, while the number of loudspeakers is N = 12 and N = 48, so in this case there is a factor-of-4 difference in the number of dynamic DOFs. Again, for N = 12, the patterns are designed for the same talker and listener positions as for N = 48. These results show that increasing the number of dynamic DOFs (the number of loudspeakers, N) has a larger effect on system performance than increasing the number of static DOFs (the loudspeaker order, L). For example, increasing the number of static DOFs by a factor of 4 improves the error by 10 dB at lower frequencies, while a 4-fold increase in the dynamic DOFs improves the error by 30 dB at lower frequencies. Intuitively this makes sense, because the static DOFs are assigned based on global parameters, such as the range of locations of the listeners and talkers and the range of frequencies of interest, while the dynamic DOFs are updated based on the exact desired sound field.
Fig. 14. Performance comparison in terms of the HRTF-based reproduction error of our SFR method in free space: (a) the same number of dynamic DOFs (N = 48) with different numbers of static DOFs (L = 2 and L = 5); (b) the same number of static DOFs (L = 5) with different numbers of dynamic DOFs (N = 12 and N = 48).

C. Comparison with other SFR systems

In this section, the performance of our SFR system with the optimized radiation patterns is first compared against the SFR methods in [22], [23]. These two methods design the radiation patterns of the loudspeakers based on the Kirchhoff-Helmholtz (KH) integral. In these methods, monopole and radial dipole loudspeakers are located all around the listening area, on the surface of a sphere. In [22], as in our proposed method, the radiation patterns of the loudspeakers are
designed in advance, and during system operation only their excitation (weight) vectors change, based on Higher-Order Ambisonics (HOA). Therefore, in this method there is one static DOF per loudspeaker, and the number of dynamic DOFs equals the number of loudspeakers. In [23], the radiation patterns of all loudspeakers change during system operation. This implies that the method has no static DOFs, while its number of dynamic DOFs is twice the number of loudspeakers. To find the radiation patterns (or excitation vectors), the HOA technique is employed in [23] as well.
For comparison, we assume that the listening area is a sphere of radius 30 cm centered at (0, 0, 2.25), and that the primary source is at (−1, 0, −0.5). The number of loudspeakers in [22], [23] is 144, and they are located on a sphere of radius 1.5 m, concentric with the listening area. For our method, the loudspeakers are arranged in a rectangular array (Fig. 3) on the x-y plane, centered at the origin, so its distance to the center of the spheres is 2.25 m. To find s_opt in (14), the number of virtual sampling points (the dimension of p_o^des) in the spherical listening area is 125 in case (a) and 27 in cases (b) and (c). However, once the field is synthesized, the error is calculated at 1000 points in the listening area. To be consistent with the methods in [22], [23], the simulations are performed under free-space conditions, the HRTF is not considered for sound field reproduction, and no power constraint is taken into account. Fig. 15 shows the results of the comparison for three cases. In these tests, as in [22], [23], [49], the simulations are performed over the narrow-band speech frequency range (below 4 kHz).
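For orientation, the following Python sketch shows an operation-time solve in the spirit of (14): given the matrix of acoustic transfer functions from the N loudspeakers to the M_s virtual sampling points and the desired pressure vector, the excitation vector is obtained by regularized least squares. This is a minimal stand-in only; the exact cost function and constraints of (14) are not reproduced, and plain Tikhonov regularization substitutes for the power constraint:

import numpy as np

def driving_weights(G, p_des, reg=1e-6):
    """Least-squares excitation weights at one frequency.

    G     : (Ms, N) complex transfer matrix, loudspeakers -> sampling points
    p_des : (Ms,)   desired complex pressure at the sampling points
    reg   : Tikhonov regularization (stand-in for the power constraint)
    """
    A = G.conj().T @ G + reg * np.eye(G.shape[1])
    return np.linalg.solve(A, G.conj().T @ p_des)

# toy usage: random data stands in for the free-space transfer functions
rng = np.random.default_rng(0)
G = rng.standard_normal((27, 48)) + 1j * rng.standard_normal((27, 48))
p_des = rng.standard_normal(27) + 1j * rng.standard_normal(27)
s_opt = driving_weights(G, p_des)
print(s_opt.shape)  # (48,) -- one complex weight per loudspeaker

Forming and solving the normal equations in this way costs O(N^2 M_s + N^3) per frequency, consistent with the operation-time complexity listed later in Table II.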
Case (a): Fig. 15(a) is obtained using the default parameters employed in the simulations of each paper. For the methods from [22], [23], the number of loudspeakers is 144 and the HOA truncation order is 10. In our method, the number of loudspeakers is N = 48, their order is L = 5, and the number of sampling points specifying the desired field is 125. The radiation patterns for our system are the ones designed during the design phase (Section VI-B). Based on Fig. 15(a), the methods from [22], [23] perform better than our method, at the expense of higher complexity and three times as many loudspeakers (144 vs. 48 for the method in [22]). In [22] there are 3 static DOFs for each loudspeaker (2 for location and 1 for the pattern), so the total number of static DOFs is 576. In our proposed method the number of static DOFs is 38 per loudspeaker (2 for location and 36 for the pattern), so the total number of static DOFs is 1824. This means that the number of static DOFs in our method is about three times that of [22], but the number of dynamic DOFs in [22] is three times that of our proposed method.
Case (b): Since the comparison in case (a) can be argued to be unfair due to the different run-time complexities of the systems involved, for the simulations in Fig. 15(b) the number of dynamic DOFs in the three systems is made approximately equal. Specifically, the number of dynamic DOFs remains 48 in our method, as in case (a), with 48 loudspeakers. The number of loudspeakers for the system from [22] is reduced to 49, making its number of dynamic DOFs equal to 49. Finally, since the system in [23] has two dynamic DOFs per loudspeaker, we reduce its number of loudspeakers to 25, making its total number of dynamic DOFs equal to 50. Further, to (approximately) match the size of the matrices involved in the computation of the loudspeaker driving signals (eqn. (14) in our case, eqn. (29) in [22], and eqn. (23) in [23]), we set the number of sampling points specifying the desired field to 27 in our case, and set the HOA truncation order to 4 for [22], [23]. This makes the run-time complexity of the three systems almost the same. The reproduction error for this case is shown in Fig. 15(b): our system outperforms the other two, especially at low frequencies. As in case (a), the number of static DOFs is larger in our proposed method, which explains the performance gain.
Case (c): In this case, we attempt to match the overall (not just run-time) complexity of the systems involved by matching their total number of DOFs, static plus dynamic. For our system we set the number of loudspeakers to N = 48, and each loudspeaker is composed of one monopole and one dipole (as in [22], [23]), with the dipoles aligned with the z-direction. The gain coefficients of the monopoles and dipoles are found ahead of time through Algorithm 2, as in the design phase (Section VI-B). Therefore, there is 1 static DOF per loudspeaker, and the total number of static DOFs is 48. The number of dynamic DOFs is also 48 (the number of loudspeakers), so the total number of DOFs is 48 + 48 = 96. For the system from [22], the number of loudspeakers is set to 49. As mentioned above, for [22] both the number of static DOFs and the number of dynamic DOFs is 1 per loudspeaker, so the total number of DOFs is 49 + 49 = 98. In the system from [23], the number of static DOFs is zero, but the number of dynamic DOFs is 2 per loudspeaker. Hence, setting the number of loudspeakers to 49 makes the total number of DOFs equal to 98 in this system as well. With these parameters, the total number of DOFs (static plus dynamic) is approximately the same in all systems. However, the run-time complexity of [23] is higher, since it has (approximately) twice as many dynamic DOFs as the other two systems.
As in case (b), we set the number of sampling points specifying the desired field to 27 in our system, and set the HOA truncation order to 4 for [22], [23]. The results are shown in Fig. 15(c), from which we see that our system again outperforms [22], [23] across the frequency range, but with a smaller margin than in case (b).

Fig. 15. Performance comparison in terms of the reproduction error of our SFR method and the methods in [22] and [23] when (a) the default parameters from each paper are used, (b) the number of dynamic DOFs is matched, and (c) the total number of DOFs is matched. In (b) and (c), the curves for [22] and [23] lie on top of each other.

The performance of
all three systems is now closer to each other, which is not surprising, considering that the total number of DOFs has been approximately matched.
From this comparison, it can be concluded that the methods in [22], [23] outperform our approach at lower frequencies when they employ a larger number of loudspeakers (and hence static and/or dynamic DOFs). However, when the number of dynamic DOFs is matched, our approach works better. When the total complexity (static plus dynamic DOFs) is matched among the three systems, their performances become more similar, but our method retains some advantage at lower frequencies. In addition, one practical advantage of our SFR approach is that it uses a rectangular array of loudspeakers, as shown in Fig. 3, which is easier to install than the spherical arrays used in [22], [23].
In the next experiment, the performance of our proposed method is compared against the mode matching (Higher-Order Ambisonics) method [37]. For our proposed method, the system parameters are selected as in case (a) of the previous experiment. For the mode matching method, the loudspeakers are located on a sphere of radius 2 m, and the listening area is a smaller sphere of radius 30 cm inside the loudspeaker region. The expansion coefficients of the loudspeakers are calculated by the mode matching method for higher-order loudspeakers proposed in [37] for two different cases. In the first case, the loudspeaker order is set to L = 0 and the truncation order of the mode matching method is set to 4; in the second case, the loudspeaker order is set to L = 5 and the truncation order is set to 10. In both cases of the mode matching method, the number of static DOFs is 2 per loudspeaker (for the location), while it is 38 per loudspeaker (2 for the location and 36 for the expansion coefficients) in
our proposed method. The number of dynamic DOFs in our proposed method is 48; it is 49 for the first case of the mode matching method and 49 × 36 for the second case. The results of this test are shown in Fig. 16. According to this figure, at the same real-time system complexity, our proposed method outperforms mode matching with L = 0. The mode matching method with L = 5 performs significantly better than our proposed method, because its number of DOFs during system operation is almost 36 times that of our method.

Fig. 16. Performance comparison in terms of the reproduction error of our SFR method against the mode matching method from [37] for zeroth-order and fifth-order loudspeakers.
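For context, the Python sketch below shows the generic least-squares core shared by mode matching approaches: the desired field and the loudspeaker fields are expanded in spherical harmonics up to the truncation order about the listening-area center, and the loudspeaker coefficients are chosen so that the two sets of expansion coefficients agree. The translation matrix T, whose true entries involve spherical Hankel functions and harmonic translation coefficients, is left as a random placeholder here, so this illustrates the structure of the computation rather than the specific formulation of [37]:

import numpy as np

NI = 4                    # truncation order (first case above)
n_modes = (NI + 1) ** 2   # number of spherical-harmonic modes: 25
n_coeff = 49              # 49 zeroth-order loudspeakers -> 49 unknown coefficients

rng = np.random.default_rng(1)
# placeholder translation matrix: field modes produced per unit loudspeaker coefficient
T = rng.standard_normal((n_modes, n_coeff)) + 1j * rng.standard_normal((n_modes, n_coeff))
b_des = rng.standard_normal(n_modes) + 1j * rng.standard_normal(n_modes)  # desired modes

# least-squares mode matching: minimize ||T w - b_des||
w, *_ = np.linalg.lstsq(T, b_des, rcond=None)
print(np.linalg.norm(T @ w - b_des))  # residual of the mode match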
In the next experiment, the performance of our proposed structure is compared against a linear array of loudspeakers employing Wave Field Synthesis (WFS) to find the loudspeaker driving functions [49]. In the first case, a linear array of 48 dipole loudspeakers is located between (−1.25, 0, 0) and (+1.25, 0, 0). It is assumed that two listeners are located at (0, 0, 2) and (0, 0.5, 3), and that the active talker is at (+1, 0, −0.5). For our structure, 48 loudspeakers of order L = 5 (proposed method) and L = 0 (benchmark) are located around the screen, and the radiation patterns are the ones selected during the design phase (Section VI-B). Hence, the numbers of dynamic DOFs in the three methods (linear array, proposed, and benchmark) are equal. Fig. 17(a) shows the results of this comparison in free space, without taking the power limitation and HRTF into account. Based on this figure, the proposed method and the benchmark outperform the linear array across a range of frequencies because the two listeners are not in the same plane; hence, the field produced by the linear array does not provide a good approximation to the desired sound field.
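For reference, a stylized 2.5D WFS driving function for a monopole primary source behind a linear array has the form D ∝ sqrt(jk) · cosθ · exp(−jkr)/sqrt(r), where r is the distance from the primary source to the array element and θ is the angle to the array normal; the exact amplitude and reference-distance conventions vary between formulations (see, e.g., [49]). A minimal Python sketch under this stylized form, using the first-case geometry above:

import numpy as np

c = 343.0  # speed of sound (m/s)

def wfs_driving(x0, x_src, f, g_ref=1.0):
    """Stylized 2.5D WFS driving weight for a monopole primary source.

    x0    : (3,) secondary (array) loudspeaker position
    x_src : (3,) primary (virtual) source position, behind the array
    f     : frequency (Hz)
    g_ref : reference-distance amplitude factor (conventions vary; lumped here)
    """
    k = 2 * np.pi * f / c
    d = x0 - x_src
    r = np.linalg.norm(d)
    cos_theta = d[2] / r  # angle to the array normal (here: +z)
    return g_ref * np.sqrt(1j * k) * cos_theta * np.exp(-1j * k * r) / np.sqrt(r)

# 48-element linear array between (-1.25, 0, 0) and (+1.25, 0, 0); talker as above
xs = np.array([1.0, 0.0, -0.5])
array = [np.array([x, 0.0, 0.0]) for x in np.linspace(-1.25, 1.25, 48)]
drive = [wfs_driving(x0, xs, f=1000.0) for x0 in array]
print(len(drive))  # one complex driving weight per loudspeaker at 1 kHz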
In the second case, a linear array of 200 loudspeakers is located between (−3, 0, 0) and (+3, 0, 0), and again WFS is used to find the driving functions. The number of dynamic DOFs for this array is therefore 200. For the proposed structure, 20 loudspeakers of order 2 are used. The number of static DOFs is 9 per loudspeaker (180 for the whole array) and the number of dynamic DOFs is 20. Hence, while the number of dynamic DOFs in the linear array equals the total number of DOFs (static and dynamic) in our structure, it is 10 times the number of dynamic DOFs in our structure, resulting in 10 times the run-time complexity. The
results of this comparison are shown in Fig. 17(b). According to this figure, the proposed structure has better performance at frequencies up to about 2700 Hz. Again, the reason is that the linear array of loudspeakers cannot provide a good approximation to the 3D sound field.

Fig. 17. Performance comparison in terms of the reproduction error of our SFR method against a linear array + WFS when (a) the number of dynamic DOFs is matched, (b) the number of dynamic DOFs in the WFS method is matched to the total number of DOFs in our method.

TABLE II
COMPUTATIONAL COMPLEXITY AND EXECUTION TIME

                      Complexity                          Time (sec)
Algorithm 2           O(M_s L^4 N_s + M_s L^2 N N_s)      0.0083 (N_s = 1)
Algorithm 3           O(M_s L^4 N + M_s L^2 N^2)          0.4
Solving (14)          O(N^2 M_s + N^3)                    2.5 × 10^-4
Mode matching [37]    O(N_I^6 + N_I^4 N L^2)              0.01 (N_I = 10)
D. Complexity
The computational complexities of Algorithms 2 and 3, the SFR operation-time complexity (solving (14)), and the least-squares mode matching complexity [37] are listed in Table II. For mode matching, N_I is the truncation order. The corresponding execution times per frequency component, measured for an unoptimized MATLAB implementation on a 3-GHz Intel Core 2 Quad Q9650 processor, are also shown. SFR operation must run in real time, while Algorithms 2 and 3 are offline algorithms for loudspeaker pattern design. In Table II, the complexity and execution time are given per frequency component. If the speech signal is sampled at 8 kHz and the frame length of the Short-Time Fourier Transform (STFT) is 1024, there are 1024 frequency components for each 0.128-second segment. Hence, an unoptimized MATLAB implementation of SFR would not run in real time on a current commodity processor such as a single 3-GHz Intel Core 2 Quad Q9650, but real-time operation would be possible on parallel processors. Optimized, embedded implementations would be more efficient.
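As a sanity check on the real-time claim, the arithmetic below compares the per-bin time budget implied by an 8 kHz sampling rate and a 1024-point STFT frame with the measured per-bin solve time from Table II; the two-core estimate assumes a naive linear speedup:

fs = 8000          # sampling rate (Hz)
frame = 1024       # STFT frame length (samples)
seg = frame / fs   # segment duration: 0.128 s
bins = 1024        # frequency components per segment, as counted above

budget_per_bin = seg / bins   # 1.25e-4 s available per bin
solve_per_bin = 2.5e-4        # measured time for solving (14), Table II

print(f"budget {budget_per_bin:.2e} s vs. solve {solve_per_bin:.2e} s")
# solve time is 2x the budget -> not real time on one core,
# but two-way parallelism (naive linear speedup) would suffice
print(f"cores needed (naive): {solve_per_bin / budget_per_bin:.0f}")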
Another interesting observation from Table II is that least-squares mode matching [37], which optimizes both the patterns and the driving signals at operation time, has considerably higher run-time complexity than our method, which only computes the driving signals at operation time. For example, for N = 48, L = 5, M_s = 108, and N_I = 10 (the same parameters used for Fig. 16), the execution time of mode matching is about 40 times that of the proposed method on the platform described above.
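The asymptotic expressions in Table II predict a gap of the same order. The sketch below plugs the quoted parameters into the two operation-time complexity formulas; it counts raw operations and ignores constant factors, so agreement with the measured 40x gap only to within a small factor is expected:

N, L, Ms, NI = 48, 5, 108, 10

ours = N**2 * Ms + N**3                    # solving (14): O(N^2 Ms + N^3)
mode_matching = NI**6 + NI**4 * N * L**2   # [37]: O(N_I^6 + N_I^4 N L^2)

print(ours, mode_matching, mode_matching / ours)
# 359424, 13000000, ~36 -> consistent with the measured ~40x gap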
VII. CONCLUSION

The performance of the audio layer of an audiovisual immersive communication system was studied through simulations under a variety of conditions. At the transmitting end, active talker detection and speech signal extraction are assumed to be handled by existing methods. At the receiving end, the sound field from the active talker(s) was synthesized to match the virtual positions of the talker(s) in the 3D visual scene. A method to derive fixed 3D loudspeaker patterns that improve sound field reproduction was presented. The fidelity of the SFR system was measured by the HRTF-based reproduction error, the ILD error, crosstalk cancellation, and the wide-band PMOS score. Simulation results show that the fidelity of the sound field produced by the optimized loudspeakers is better than that produced by a benchmark system employing omnidirectional loudspeakers, with error reductions in the range of 2-20 dB observed in reverberant rooms. The proposed SFR system was also compared, through simulation, against several other representative systems from the literature. The results quantify its advantage in sound field reproduction, especially at lower frequencies, relative to these other SFR systems.
REFERENCES
[1] J. G. Apostolopoulos, P. A. Chou, B. Culbertson, T. Kalker, M. D. Trott,
and S. Wee, “The road to immersive communication,” Proceedings of
the IEEE, vol. 100, no. 4, pp. 974–990, Apr. 2012.
[2] Y. Huang, J. Chen, and J. Benesty, “Immersive audio schemes,” IEEE
Signal Processing Magazine, vol. 28, no. 1, pp. 20–32, Jan 2011.
[3] C. Kyriakakis, “Fundamental and technological limitations of immersive
audio systems,” Proceedings of the IEEE, vol. 86, no. 5, pp. 941–951,
May 1998.
[4] S. Enomoto, Y. Ikeda, S. Ise, and S. Nakamura, “3-D sound reproduction
system for immersive environments based on the boundary surface
control principle,” in Virtual and Mixed Reality-New Trends, pp. 174–
184. Springer, 2011.
[5] H. Teutsch, S. Spors, W. Herbordt, W. Kellermann, and R. Rabenstein,
“An integrated real-time system for immersive audio applications,” in
Proc. IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics (WASPAA’03), Oct 2003, pp. 67–70.
[6] J. Ahrens and S. Spors, “Sound field reproduction using planar and linear
arrays of loudspeakers,” IEEE Trans. Audio, Speech, and Language
Processing, vol. 18, pp. 2038–2050, Nov. 2010.
[7] J. Ahrens and S. Spors, “A comparison of wave field synthesis and
higher-order ambisonics with respect to physical properties and spatial
sampling,” in 125th Conv. of the AES, San Francisco, CA, Oct. 2008.
[8] P. Gauthier and A. Berry, “Adaptive wave field synthesis for sound field
reproduction: Theory, experiments, and future perspectives,” in 123rd
Conv. of the AES, New York, Oct. 2007.
[9] A. Sontacchi, M. Strauss, and R. Holdrich, “Audio interface for
immersive 3D-audio desktop applications,” in IEEE Intl. Symp. Virtual
Environments, Human-Computer Interfaces and Measurement Systems
(VECIMS’03), Jul. 2003, pp. 179–182.
[10] K. U. Doerr, H. Rademacher, S. Huesgen, and W. Kubbat, “Evaluation
of a low-cost 3D sound system for immersive virtual reality training
systems,” IEEE Trans. Visualization and Computer Graphics, vol. 13,
no. 2, pp. 204–212, Mar. 2007.
[11] D. Ward and T. Abhayapala, “Reproduction of a plane-wave sound
field using an array of loudspeakers,” IEEE Trans. Audio, Speech, and
Language Processing, vol. 9, pp. 697–707, Sep. 2001.
[12] A. Gupta and T. Abhayapala, “Three-dimensional sound field reproduc-
tion using multiple circular loudspeaker arrays,” IEEE Trans. Audio,
Speech, and Language Processing, vol. 19, pp. 1149–1159, July 2011.
[13] J. Daniel and S. Moreau, “Further study of sound field coding with
higher order ambisonics,” in AES 116th Convention Preprints, Berlin,
Germany, May 2004.
[14] P. Gauthier and A. Berry, “Sound-field reproduction in-room using
optimal control techniques: Simulations in the frequency domain,” J.
Acoust. Soc. Am., vol. 2, pp. 662–678, Feb. 2005.
[15] T. Betlehem and C. Withers, “Sound field reproduction with energy
constraint on loudspeaker weights,” IEEE Trans. Audio, Speech, and
Language Processing, vol. 19, pp. 2388–2392, Oct. 2012.
[16] G. N. Lilis, D. Angelosante, and G. B. Giannakis, “Sound field
reproduction using lasso,” IEEE Trans. Audio, Speech, and Language
Processing, vol. 18, pp. 1902–1921, Nov. 2010.
[17] H. Khalilian, I. V. Bajić, and R. G. Vaughan, “3D sound field reproduc-
tion using diverse loudspeaker patterns,” in Proc. IEEE ICME’13, San
Jose, CA, Jul. 2013.
[18] H. Khalilian, I. V. Bajić, and R. G. Vaughan, “Towards optimal
loudspeaker placement for sound field reproduction,” in Proc. IEEE
ICASSP’13, Vancouver, May 2013, pp. 321–325.
[19] M. Poletti, T. Betlehem, and T. Abhayapala, “Higher-order loudspeakers
and active compensation for improved 2D sound field reproduction in
rooms,” Journal of the Audio Engineering Society, vol. 63, no. 1/2, pp.
31–45, 2015.
[20] M. Poletti and T. Betlehem, “Design of a prototype variable directivity
loudspeaker for improved surround sound reproduction in rooms,” in
Audio Engineering Society Conference: 52nd International Conference:
Sound Field Control-Engineering and Perception. Audio Engineering
Society, 2013.
[21] H. Khalilian, I. V. Bajić, and R. G. Vaughan, “Comparison of loud-
speaker placement methods for sound field reproduction,” IEEE/ACM
Trans. Audio, Speech, and Language Processing, vol. 24, no. 8, pp.
1364–1379, Aug 2016.
[22] M. A. Poletti, F. M. Fazi, and P. A. Nelson, “Sound-field reproduction
systems using fixed-directivity loudspeakers,” The Journal of the
Acoustical Society of America, vol. 127, no. 6, pp. 3590–3601, 2010.
[23] M. A. Poletti, F. M. Fazi, and P. A. Nelson, “Sound reproduction systems
using variable-directivity loudspeakers,” The Journal of the Acoustical
Society of America, vol. 129, no. 3, pp. 1429–1438, 2011.
[24] W. T. Chu and A. Warnock, “Detailed directivity of sound fields around
human talkers,” Tech. Rep. IRC-RR-104, National Research Council
Canada, Dec. 2002.
[25] K. Kollreider, H. Fronthaler, M. I. Faraj, and J. Bigun, “Real-time face
detection and motion analysis with application in liveness assessment,”
IEEE Trans. Information Forensics and Security, vol. 2, no. 3, pp. 548–
558, Sep. 2007.
[26] D. Nguyen, D. Halupka, P. Aarabi, and A. Sheikholeslami, “Real-
time face detection and lip feature extraction using field-programmable
gate arrays,” IEEE Trans. Systems, Man, and Cybernetics, Part B:
Cybernetics, vol. 36, no. 4, pp. 902–912, Aug. 2006.
[27] D. Ba, D. Florêncio, and C. Zhang, “Enhanced MVDR beamforming
for arrays of directional microphones,” in Proc. IEEE ICME’07, 2007,
pp. 1307–1310.
[28] D. Yang, High Fidelity Multichannel Audio Compression, Ph.D. thesis,
University of Southern California, Aug. 2002.
[29] J. Kammerl, N. Blodow, R.B. Rusu, S. Gedikli, M. Beetz, and E. Stein-
bach, “Real-time compression of point cloud streams,” in Proc. IEEE
ICRA’12, May 2012, pp. 778–785.
[30] K. Murano, S. Unagami, and F. Amano, “Echo cancellation and
applications,” IEEE Communications Magazine, vol. 28, no. 1, pp. 49–
55, Jan. 1990.
[31] M. A. Poletti and T. D. Abhayapala, “Spatial sound reproduction systems
using higher order loudspeakers,” in Proc. IEEE ICASSP’11, Prague,
May 2011, pp. 57–60.
[32] R. O. Duda and W. L. Martens, “Range dependence of the response
of a spherical head model,” The Journal of the Acoustical Society of
America, vol. 104, no. 5, pp. 3048–3058, 1998.
[33] S. G. Mallat and Z. Zhang, “Matching pursuits with time-frequency
dictionaries,” IEEE Trans. Signal Processing, vol. 41, pp. 3397–3415,
Dec. 1993.
[34] H. Khalilian, I. V. Bajić, and R. G. Vaughan, “Loudspeaker placement
for sound field reproduction by constrained matching pursuit,” in Proc.
IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics (WASPAA’13), New Paltz, NY, Oct. 2013.
[35] E. G. Williams, Fourier Acoustics, Academic Press, 1999.
[36] J. L. Stratton, Electromagnetic Theory, McGraw-Hill, 1941.
[37] P. N. Samarasinghe, M. A. Poletti, S. M. Salehin, T. Abhayapala,
and F. M. Fazi, “3D soundfield reproduction using higher order
loudspeakers,” in Proc. IEEE ICASSP’13. IEEE, 2013, pp. 306–310.
[38] J. Eargle, Handbook of recording engineering, Springer, 2005.
[39] M. A. Poletti, T. D. Abhayapala, and P. Samarasinghe, “Interior and
exterior sound field control using two dimensional higher-order variable-
directivity sources,” The Journal of the Acoustical Society of America,
vol. 131, no. 5, pp. 3814–3823, 2012.
[40] M. Poletti, T. Betlehem, and T. D. Abhayapala, “Higher order
loudspeakers for improved surround sound reproduction in rooms,” in
Audio Engineering Society Convention 133. Audio Engineering Society,
2012.
[41] P. A. Nelson, “Active control of acoustic fields and the reproduction of
sound,” Journal of Sound and Vibration, vol. 177, no. 4, pp. 447–477,
1994.
[42] J. Allen and D. Berkley, “Image method for efficiently simulating small-
room acoustics,” The Journal of the Acoustical Society of America, vol.
65, pp. 943–950, 1979.
[43] S. Siltanen, T. Lokki, S. Tervo, and L. Savioja, “Modeling incoherent
reflections from rough room surfaces with image sources,” The Journal
of the Acoustical Society of America, vol. 131, no. 6, pp. 4606–4614,
2012.
[44] J. Blauert, Spatial hearing: the psychophysics of human sound localiza-
tion, MIT press, 1997.
[45] E. Goldstein, Sensation and perception, Cengage Learning, 2013.
[46] M. R. Bai and C. Lee, “Objective and subjective analysis of effects of
listening angle on crosstalk cancellation in spatial sound reproduction,”
The Journal of the Acoustical Society of America, vol. 120, no. 4, pp.
1976–1989, 2006.
[47] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual
evaluation of speech quality (PESQ), an objective method for end-to-
end speech quality assessment of narrowband telephone networks and
speech codecs,” ITU-T Recommendation P.862, 2001.
[48] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual
evaluation of speech quality (PESQ) - a new method for speech
quality assessment of telephone networks and codecs,” in Proc. IEEE
ICASSP’01, 2001, vol. 2, pp. 749–752.
[49] E. N. G. Verheijen, Sound reproduction by wave field synthesis, Ph.D.
thesis, Delft University of Technology, 1998.
[50] S. R. Atcherson, C. A. Franklin, and L. Smith-Olinde, Hearing assistive
and access technology, Plural Publishing, 2015.
[51] I. Choi, B. G. Shinn-Cunningham, S. B. Chon, and K.-M. Sung,
“Objective measurement of perceived auditory quality in multichannel
audio compression coding systems,” Journal of the Audio Engineering
Society, vol. 56, no. 1/2, pp. 3–17, 2008.
[52] J. Eaton, A. H. Moore, P. A. Naylor, and J. Skoglund, “Direct-
to-reverberant ratio estimation using a null-steered beamformer,” in
2015 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP). IEEE, 2015, pp. 46–50.