Audio Engineering Society
Conference Paper
Presented at the Conference on
Audio for Virtual and Augmented Reality
2018 August 20 – 22, Redmond, WA, USA
This paper was peer-reviewed as a complete manuscript for presentation at this conference. This paper is available in the AES
E-Library (http://www.aes.org/e-lib), all rights reserved. Reproduction of this paper, or any portion thereof, is not permitted
without direct permission from the Journal of the Audio Engineering Society.
Six-Degrees-of-Freedom Binaural Audio Reproduction of
First-Order Ambisonics with Distance Information
Axel Plinge¹, Sebastian J. Schlecht¹, Oliver Thiergart¹, Thomas Robotham¹, Olli Rummukainen¹, and Emanuël A. P. Habets¹

¹International Audio Laboratories Erlangen*, Germany
*A joint institution of the Friedrich-Alexander-University Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits (IIS)
Correspondence should be addressed to Axel Plinge (axel.plinge@iis.fraunhofer.de)
ABSTRACT
First-order Ambisonics (FOA) recordings can be processed and reproduced over headphones. They can be rotated
to account for the listener’s head orientation. However, virtual reality (VR) systems allow the listener to move in
six-degrees-of-freedom (6DoF), i.e., three rotational plus three translational degrees of freedom. Here, the apparent
angles and distances of the sound sources depend on the listener’s position. We propose a technique to facilitate
6DoF. In particular, a FOA recording is described using a parametric model, which is modified based on the
listener’s position and information about the distances to the sources. We evaluate our method by a listening test,
comparing different binaural renderings of a synthetic sound scene in which the listener can move freely.
1 Introduction
Reproduction of sound scenes has often focused on loudspeaker setups, as this was the typical reproduction setting in private contexts, e.g., the living room, and professional contexts, e.g., cinemas. Here, the relation of the scene to the reproduction geometry is static, as it accompanies a two-dimensional image that forces the listener to look in the front direction. Consequently, the spatial relation of the sound and visual objects is defined and fixed at production time.
In virtual reality (VR), the immersion is explicitly
achieved by allowing the user to move freely in the
scene. Therefore, it is necessary to track the user’s
movement and adjust the visual and auditory reproduc-
tion to the user’s position. Typically, the user is wearing
a head-mounted display (HMD) and headphones. For
an immersive experience with headphones, the audio
has to be binauralized. Binauralization is a simulation
of how the human head, ears, and upper torso change
the sound of a source depending on its direction and
distance. This is achieved by convolution of the signals
with head-related transfer functions (HRTFs) for their
relative direction [1, 2]. Binauralization also makes the
sound appear to be coming from the scene rather than
from inside the head [3].
Regarding the VR rendering, we can distinguish meth-
ods by the number of degrees of freedom. Early tech-
niques supported three-degrees-of-freedom (3DoF),
where the user has three movement degrees (pitch, yaw,
roll). This has been realized in 360° video reproduc-
tion [4, 5]. Here, the user is either wearing an HMD
or holding a tablet or phone in their hands. By moving
her/his head or the device, the user can look around in
any direction. Visually, this is realized by projecting
the video on a sphere around the user. Audio is often
recorded with a spatial microphone [6], e.g., first-order
Ambisonics (FOA), close to the video camera. In the
Ambisonics domain, the user’s head rotation is adapted
directly [7]. The audio is then, e.g., rendered to vir-
tual loudspeakers placed around the user. These virtual
loudspeaker signals are then binauralized [8].
Modern VR applications allow for six-degrees-of-freedom (6DoF). In addition to the head rotation, the user can move around, resulting in translation of his/her position in three spatial dimensions.
We may distinguish 6DoF rendering methods by the
type of information that is used. The first kind of 6DoF
rendering is synthetic, as is commonly encountered in
VR games. Here, the whole scene is synthetic with
computer-generated imagery (CGI). The audio is often
generated using object-based rendering, where each
audio object is rendered with a distance-dependent gain
and a relative direction from the user based on the
tracking data. Realism can be enhanced by artificial
reverberation and diffraction [9, 10].
The second kind is 6DoF reproduction of recorded con-
tent based on spatially distributed recording positions.
For video, arrays of cameras can be employed to gen-
erate light-field rendering [11]. For audio, spatially
distributed microphone arrays or Ambisonics micro-
phones are used. Different methods for interpolation of
their signals have been proposed [12, 13, 14, 15].
Thirdly, 6DoF reproduction can be realized from
a recording at a single spatial position. Ambison-
ics recordings can be reproduced binaurally for off-
center positions via simulating higher-order Ambison-
ics (HOA) playback and listener movement within a
virtual loudspeaker array [8], translating along plane-
waves [16], or re-expanding the sound field [17]. As
there is no information on the distance of the sources
from the microphone, it is often assumed that all sound
sources are beyond the range of the listener.
In order to realize spatial sound modifications in a tech-
nically convenient way, parametric sound processing
or coding techniques can be employed (cf. [18] for
an overview). Directional audio coding (DirAC) [19]
is a popular method to transform the recording into a
representation that consists of an audio spectrum and
parametric side information on the sound direction and
diffuseness.
An example of parametric spatial sound manipulation
from a single spot of recording is that of ‘acoustic
zoom’ techniques [20, 21]. Here, the listener posi-
tion is virtually moved into the recorded scene, similar
to zooming into an image. The user chooses one di-
rection or image portion and can then listen to this
from a translated point. This entails that all the DoAs
are changing relative to the original, non-zoomed re-
production. An interesting application for the second
kind of 6DoF reproduction is using recordings of dis-
tributed microphone arrays to generate the signal of a
‘virtual microphone’ placed at an arbitrary position in
the room [13].
The method proposed in this paper uses the parametric
representation of DirAC for 6DoF reproduction from
the recording of a single FOA microphone. In order
to correctly reproduce the sound of nearby objects to
the perspective of the listener, information about the
distance of the sound sources in the recording is incor-
porated.
For evaluation with a listening test, the multiple stim-
uli with hidden reference and anchor (MUSHRA)
paradigm [22] is adapted for VR. By using CGI and
synthetically generated sound, we can create an object-
based reference for comparison. A virtual FOA record-
ing takes place at the tracked position of the user, ren-
dering the 6DoF-adjusted signals. In addition to the
proposed method, reproduction without distance infor-
mation and reproduction without translation were pre-
sented as conditions in the listening test.
The remainder of this paper is organized in six parts: A
problem statement explaining the goal and introducing
the notation (Section 2), a detailed description of the
method itself (Section 3), a description of the experi-
ment and its methodology (Section 4), presentation of
the results (Section 5), a discussion thereof (Section 6),
and a short summary (Section 7).
2 Problem Statement
Our goal is to obtain a virtual binaural signal at the
listener’s position given a signal at the original record-
ing position and information about the distances of
sound sources from the recording position. The physical sources are assumed to be separable by their angle towards the recording position.

Fig. 1: The 6DoF reproduction of spatial audio. A sound source is recorded by a microphone with the direction of arrival (DoA) $\mathbf{r}_r$ at the distance $d_r$ relative to the microphone's position and orientation (black line and arc). It has to be reproduced relative to the moving listener with the DoA $\mathbf{r}_l$ and distance $d_l$ (red dashed). This has to consider the listener's translation $\mathbf{l}$ and rotation $\mathbf{o}$ (blue dotted).
The scene is recorded from the point of view (PoV) of
the microphone, the position of which is used as the
origin of the reference coordinate system. The scene
has to be reproduced from the PoV of the listener, who
is tracked in 6DoF, cf. Fig. 1. A single sound source is
shown here for illustration, the relation holds for each
time-frequency bin.
The sound source at the coordinates $\mathbf{d}_r \in \mathbb{R}^3$ is recorded from the direction of arrival (DoA) expressed by the unit vector $\mathbf{r}_r = \mathbf{d}_r / \|\mathbf{d}_r\|$. This DoA can be estimated from an analysis of the recording. The source is at the distance $d_r = \|\mathbf{d}_r\|$. We assume this information can be estimated automatically, e.g., using a time-of-flight camera, to obtain distance information in the form of a depth map $m(\mathbf{r})$, which maps each direction $\mathbf{r}$ from the recording position to the distance of the closest sound source in meters.
The listener is tracked in 6DoF. At a given time, the listener is at a position $\mathbf{l} \in \mathbb{R}^3$ relative to the microphone and has a rotation $\mathbf{o} \in \mathbb{R}^3$ relative to the microphone's coordinate system. We deliberately choose the recording position as the origin of our coordinate system to simplify the notation.
Thus, the sound has to be reproduced with a different distance $d_l$, leading to a changed volume, and a different DoA $\mathbf{r}_l$ that is the result of both translation and subsequent rotation.
We propose a method for obtaining a virtual signal from
the listener's perspective by dedicated transformations
based on a parametric representation, as explained in
the following section.
3 Method for 6DoF Reproduction
The proposed method is based on the basic DirAC ap-
proach for parametric spatial sound encoding, cf. [19].
The recording is transformed into a time-frequency rep-
resentation using the short-time Fourier transform (STFT).
We denote the time frame index with $n$ and the frequency
index with $k$. It is assumed that there is one dominant
direct source per time-frequency instance of the analyzed
spectrum and that the latter can be treated independently.
The transformed recording is then analyzed, estimating
directions $\mathbf{r}_r(k,n)$ and diffuseness $\psi(k,n)$ for each
time-frequency bin of the complex spectrum $P(k,n)$. In
the synthesis, the signal is divided
into a direct and diffuse part. Here, loudspeaker signals
are computed by panning the direct part depending on
the speaker positions and adding the diffuse part.
Fig. 2: Proposed method of 6DoF reproduction. The recorded FOA signal in B-format is processed by a DirAC encoder that computes direction and diffuseness values for each time-frequency bin of the complex spectrum. The direction vector is then transformed by the listener's tracked position and according to the distance information given in a distance map. The resulting direction vector is then rotated according to the head rotation. Finally, signals for 8+4+4 virtual loudspeaker channels are synthesized in the DirAC decoder. These are then binauralized.
The method for transforming an FOA signal according
to the listener's perspective in 6DoF can be divided into
five steps, cf. Fig. 2. The input signal is analyzed in
the DirAC encoder, the distance information is added
from the distance map $m(\mathbf{r})$, and then the listener's tracked
translation and rotation are applied in the novel trans-
forms. The DirAC decoder synthesizes signals for 8+4+4
virtual loudspeakers, which are in turn binauralized for
headphone playback. Note that as the rotation of the
sound scene after the translation is an independent op-
eration, it could be alternatively applied in the binaural
renderer. The only parameter transformed for 6DoF
is the direction vector. By the model definition, we
assume the diffuse part to be omnidirectional and thus
keep it unchanged.
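To make the five processing steps concrete, the following is a minimal Python/NumPy sketch of one processing pass; the helper functions (dirac_encode, translate_directions, rotate_directions, dirac_decode) are illustrative names sketched in the subsections below and are not part of the paper.

```python
import numpy as np

def render_6dof_frame(P, U, distance_map, l, o, spk_dirs, gamma=1.0):
    """High-level sketch of the five processing steps of Fig. 2.

    All names, shapes, and interfaces are assumptions for this example.
    """
    # 1. DirAC encoding: DoA and diffuseness per time-frequency bin
    r_r, psi = dirac_encode(P, U)

    # 2.-3. Distance map + listener translation (and volume scaling)
    d_v = translate_directions(r_r, distance_map, l)

    # 4. Head rotation
    r_p, d_p_len = rotate_directions(d_v, o)

    # 5. DirAC decoding to 8+4+4 virtual loudspeakers (binauralized afterwards)
    return dirac_decode(P, psi, r_p, d_p_len, spk_dirs, gamma=gamma)
```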
3.1 DirAC Encoding
The input to the DirAC encoder is an FOA sound
signal in B-format representation. It consists of four
channels, i.e., the omnidirectional sound pressure and
the three first-order gradients, which under certain
assumptions are proportional to the particle velocity.
This signal is encoded in a parametric way, cf. [23]. The parameters are derived from the complex sound pressure $P(k,n)$, which is the transformed omnidirectional signal, and the complex particle velocity vector $\mathbf{U}(k,n) = [U_X(k,n), U_Y(k,n), U_Z(k,n)]^{\mathrm{T}}$ corresponding to the transformed gradient signals.
The DirAC representation consists of the signal $P(k,n)$, the diffuseness $\psi(k,n)$, and the direction $\mathbf{r}(k,n)$ of the sound wave at each time-frequency bin. To derive the latter, first, the active sound intensity vector $\mathbf{I}_a(k,n)$ is computed as the real part (denoted by $\mathrm{Re}(\cdot)$) of the product of the pressure with the complex conjugate (denoted by $(\cdot)^*$) of the velocity vector [23]:

$$\mathbf{I}_a(k,n) = \tfrac{1}{2}\,\mathrm{Re}\!\left(P(k,n)\,\mathbf{U}^*(k,n)\right). \tag{1}$$
The diffuseness is estimated from the coefficient of variation of this vector [23]:

$$\psi(k,n) = \sqrt{1 - \frac{\|\mathrm{E}\{\mathbf{I}_a(k,n)\}\|}{\mathrm{E}\{\|\mathbf{I}_a(k,n)\|\}}}, \tag{2}$$

where $\mathrm{E}$ denotes the expectation operator along time frames, implemented as a moving average.
Since we intend to manipulate the sound using a direction-based distance map, the variance of the direction estimates should be low. As the frames are typically short, this is not always the case. Therefore, a moving average is applied to obtain a smoothed intensity estimate $\bar{\mathbf{I}}_a(k,n)$. The DoA of the direct part of the signal is then computed as the unit-length vector in the opposite direction:

$$\mathbf{r}_r(k,n) = -\frac{\bar{\mathbf{I}}_a(k,n)}{\|\bar{\mathbf{I}}_a(k,n)\|}. \tag{3}$$
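As an illustration of the analysis in (1)-(3), here is a minimal NumPy sketch; the array shapes, the moving-average length, and the function name are assumptions for this example, not taken from the paper.

```python
import numpy as np

def dirac_encode(P, U, avg_frames=8):
    """Minimal DirAC analysis sketch for eqs. (1)-(3).

    P: complex STFT of the omni channel, shape (K, N)
    U: complex STFT of the velocity (gradient) channels, shape (3, K, N)
    avg_frames: length of the moving average used as the expectation E{.}
    """
    # Eq. (1): active sound intensity vector I_a = 0.5 * Re(P * conj(U))
    I_a = 0.5 * np.real(P[None, :, :] * np.conj(U))            # (3, K, N)

    kernel = np.ones(avg_frames) / avg_frames

    def smooth(x):
        # moving average along the last (time-frame) axis
        return np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode='same'), -1, x)

    I_mean = smooth(I_a)                                        # E{I_a}
    norm_of_mean = np.linalg.norm(I_mean, axis=0)               # ||E{I_a}||
    mean_of_norm = smooth(np.linalg.norm(I_a, axis=0))          # E{||I_a||}

    # Eq. (2): diffuseness from the coefficient of variation
    eps = 1e-12
    psi = np.sqrt(np.clip(1.0 - norm_of_mean / (mean_of_norm + eps), 0.0, 1.0))

    # Eq. (3): DoA as unit vector opposite to the smoothed intensity
    r_r = -I_mean / (np.linalg.norm(I_mean, axis=0, keepdims=True) + eps)
    return r_r, psi
```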
3.2 Translation Transformation
As the direction is encoded as a three-dimensional vec-
tor of unit length for each time-frequency bin, it is
straightforward to integrate the distance information.
We multiply the direction vectors with their corresponding map entry such that the vector length represents the distance of the corresponding sound source $d_r(k,n)$:

$$\mathbf{d}_r(k,n) = \mathbf{r}_r(k,n)\, d_r(k,n) = \mathbf{r}_r(k,n)\, m(\mathbf{r}_r(k,n)), \tag{4}$$

where $\mathbf{d}_r(k,n)$ is a vector pointing from the recording position of the microphone to the sound source active at time $n$ and frequency bin $k$.
The listener position, measured with respect to the recording position, is given by the tracking system for the current processing frame as $\mathbf{l}(n)$. With the vector representation of source positions, we can subtract the tracking position vector $\mathbf{l}(n)$ to yield the new, translated direction vector $\mathbf{d}_l(k,n)$ with the length $d_l(k,n) = \|\mathbf{d}_l(k,n)\|$, cf. Fig. 1. The distances from the listener's PoV to the sound sources are derived, and the DoAs are adapted in a single step:

$$\mathbf{d}_l(k,n) = \mathbf{d}_r(k,n) - \mathbf{l}(n). \tag{5}$$
An important aspect of realistic reproduction is the distance attenuation or amplification. We assume the adjustment is a function of the distance between sound source and listener [24]. The length of the direction vectors is to encode this adjustment. We have the distance to the recording position encoded in $\mathbf{d}_r(k,n)$ according to the distance map, and the distance to be reproduced encoded in $\mathbf{d}_l(k,n)$. If we normalize the vectors to unit length and then multiply by the ratio of new and old distance, we see that the required length is given by dividing $\mathbf{d}_l(k,n)$ by the length of the original vector:

$$\mathbf{d}_v(k,n) = \frac{\mathbf{d}_l(k,n)}{\|\mathbf{d}_l(k,n)\|}\,\frac{\|\mathbf{d}_l(k,n)\|}{\|\mathbf{d}_r(k,n)\|} = \frac{\mathbf{d}_l(k,n)}{\|\mathbf{d}_r(k,n)\|}. \tag{6}$$
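A small sketch of the translation transform (4)-(6), assuming the DoAs from the analysis stage and a distance-map callable are available; the names, shapes, and the callable interface are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def translate_directions(r_r, distance_map, l):
    """Sketch of the translation transform, eqs. (4)-(6).

    r_r: unit DoA vectors from the DirAC analysis, shape (3, K, N)
    distance_map: callable m(r) returning source distances in meters,
                  shape (K, N) (hypothetical interface)
    l: listener position relative to the recording position, shape (3,)
    """
    # Eq. (4): scale each DoA by its mapped distance
    d_r = r_r * distance_map(r_r)                         # (3, K, N)

    # Eq. (5): shift the source vectors by the listener translation
    d_l = d_r - np.asarray(l, float)[:, None, None]

    # Eq. (6): keep the new direction, length = ratio of new to old distance
    d_v = d_l / (np.linalg.norm(d_r, axis=0, keepdims=True) + 1e-12)
    return d_v
```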
3.3 Rotation Transformation
We apply the changes for the listener's orientation in the following step. The orientation given by the tracking can be written as a vector composed of the pitch, yaw, and roll, $\mathbf{o}(n) = [o_X(n), o_Z(n), o_Y(n)]^{\mathrm{T}}$, relative to the recording position as the origin. The source direction is rotated according to the listener orientation, which is implemented using 2D rotation matrices, cf. eqn. (23) in [7]:

$$\mathbf{d}_p(k,n) = \mathbf{R}_Y(o_Y(n))\,\mathbf{R}_Z(o_Z(n))\,\mathbf{R}_X(o_X(n))\,\mathbf{d}_v(k,n). \tag{7}$$

The resulting DoA for the listener is then given by the vector normalized to unit length:

$$\mathbf{r}_p(k,n) = \frac{\mathbf{d}_p(k,n)}{\|\mathbf{d}_p(k,n)\|}. \tag{8}$$
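The rotation (7)-(8) can be sketched as follows; the elementary rotation matrices and the sign convention of the tracked angles are assumptions of this example (depending on the tracker convention, the angles may need to be negated to counter-rotate the scene against the head rotation).

```python
import numpy as np

def rotate_directions(d_v, o):
    """Sketch of the rotation transform, eqs. (7)-(8).

    d_v: translated direction vectors, shape (3, K, N)
    o: listener orientation (pitch o_X, yaw o_Z, roll o_Y) in radians
    """
    o_x, o_z, o_y = o

    def R_x(a):
        return np.array([[1, 0, 0],
                         [0, np.cos(a), -np.sin(a)],
                         [0, np.sin(a),  np.cos(a)]])

    def R_y(a):
        return np.array([[ np.cos(a), 0, np.sin(a)],
                         [0, 1, 0],
                         [-np.sin(a), 0, np.cos(a)]])

    def R_z(a):
        return np.array([[np.cos(a), -np.sin(a), 0],
                         [np.sin(a),  np.cos(a), 0],
                         [0, 0, 1]])

    # Eq. (7): apply the elementary rotations in the order R_Y R_Z R_X
    R = R_y(o_y) @ R_z(o_z) @ R_x(o_x)
    d_p = np.einsum('ij,jkn->ikn', R, d_v)

    # Eq. (8): normalize to obtain the reproduction DoA
    d_p_len = np.linalg.norm(d_p, axis=0)
    r_p = d_p / (d_p_len[None, :, :] + 1e-12)
    return r_p, d_p_len
```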
3.4 DirAC Decoding and Binauralization
The transformed direction vector, the diffuseness, and the complex spectrum are used to synthesize signals for a uniformly distributed 8+4+4 virtual loudspeaker setup. Eight virtual speakers are located in 45° azimuth steps on the listener plane (elevation 0°), and four each in a 90° cross formation above and below at ±45° elevation. The synthesis is split into a direct and a diffuse part for each loudspeaker channel $1 \leq i \leq I$, where $I = 16$ is the number of loudspeakers [19]:

$$Y_i(k,n) = Y_{i,S}(k,n) + Y_{i,D}(k,n). \tag{9}$$
For the direct part, edge fading amplitude panning (EFAP) is applied to reproduce the sound from the right direction given the virtual loudspeaker geometry [25]. Given the DoA vector $\mathbf{r}_p(k,n)$, this provides a panning gain $G_i(\mathbf{r})$ for each virtual loudspeaker channel $i$. The distance-dependent gain for each DoA is derived from the resulting length of the direction vector, $d_p(k,n)$. The direct synthesis for channel $i$ becomes:

$$Y_{i,S}(k,n) = \sqrt{1 - \psi(k,n)}\; P(k,n)\; G_i\!\left(\mathbf{r}_p(k,n)\right)\, \|\mathbf{d}_p(k,n)\|^{-\gamma}, \tag{10}$$

where the exponent $\gamma$ is a tuning factor that is typically set to about 1 [24]. Note that with $\gamma = 0$ the distance-dependent gain is turned off, which is equivalent to the non-VR version of the decoder [19].
The pressure $P(k,n)$ is used to generate $I$ decorrelated signals $\tilde{P}_i(k,n)$. These decorrelated signals are added to the individual loudspeaker channels as the diffuse component. This follows the standard method [19]:

$$Y_{i,D}(k,n) = \sqrt{\psi(k,n)\,\tfrac{1}{I}}\; \tilde{P}_i(k,n). \tag{11}$$
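A sketch of the synthesis (9)-(11) follows. The EFAP panning gains of [25] are replaced here by a simple cosine-shaped stand-in, the decorrelation is assumed to be provided externally, and the negative exponent on the vector length follows the distance-gain reasoning of Section 3.2; all of these are assumptions of this example.

```python
import numpy as np

def dirac_decode(P, psi, r_p, d_p_len, spk_dirs, gamma=1.0, decorrelate=None):
    """Sketch of the virtual-loudspeaker synthesis, eqs. (9)-(11).

    P: omni STFT, shape (K, N); psi: diffuseness, shape (K, N)
    r_p: reproduction DoAs, shape (3, K, N); d_p_len: vector lengths, (K, N)
    spk_dirs: unit vectors of the I virtual loudspeakers, shape (I, 3)
    decorrelate: callable producing I decorrelated copies of P (assumed given)
    """
    I = spk_dirs.shape[0]

    # Placeholder panning: cosine-shaped gains, standing in for EFAP [25]
    dots = np.einsum('is,skn->ikn', spk_dirs, r_p)          # (I, K, N)
    G = np.clip(dots, 0.0, None)
    G /= np.linalg.norm(G, axis=0, keepdims=True) + 1e-12   # power-normalize

    # Eq. (10): direct part with distance-dependent gain ||d_p||^(-gamma)
    direct = np.sqrt(1.0 - psi) * P * G * (d_p_len + 1e-12) ** (-gamma)

    # Eq. (11): diffuse part from decorrelated copies of P
    P_tilde = decorrelate(P, I) if decorrelate else np.repeat(P[None], I, axis=0)
    diffuse = np.sqrt(psi / I) * P_tilde

    # Eq. (9): sum per virtual loudspeaker channel
    return direct + diffuse
```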
The diffuse and direct part of each channel are added together, and the signals are transformed back into the time domain by an inverse STFT. These channel time-domain signals are convolved with HRTFs for the left and right ear, depending on the loudspeaker position, to create the binauralized signals.

Fig. 3: The VR scene. The sound is coming from the person, the radio, and the open window, each source marked with concentric circles. The microphone position for the FOA recording is marked by a cross. The user can walk in the area marked by the dashed rectangle on the floor.
4 Perceptual Evaluation
For the evaluation, a single scene in a virtual living
room is reproduced. Different rendering conditions are
used to reproduce three simultaneously active sound
sources. A novel MUSHRA-VR technique was used to
assess the quality with the help of test subjects.
4.1 The Scene
The virtual environment in the experiment is an indoor
room with three sound sources at different distances
from the recording position. At about 50 cm there is
a human speaker, at 1 m a radio and at 2 m an open
window, cf. Fig. 3.
The visual rendering is done using Unity and an HTC
VIVE. The audio processing is implemented with the
help of virtual studio technology (VST) plugins and
Max/MSP. The tracking data and conditions are ex-
changed via open sound control (OSC) messages. The
walking area is about 3.5 ×4.5 m.
4.2 MUSHRA-VR
While there are established standards for evaluation
of static audio reproduction, these are usually not di-
rectly applicable for VR. Especially for 6DoF, novel
approaches for evaluation of the audio quality have to
be developed as the experience is more complicated
than in audio-only evaluation, and the presented con-
tent depends on the unique motion path of each listener.
Novel methods such as wayfinding in VR [26] or phys-
iological responses to immersive experiences [27] are
actively researched, but traditional well-tested methods
can also be adapted to a VR environment to support
development work done today.
MUSHRA is a widely adopted audio quality evalua-
tion method applied to a wide range of use cases from
speech quality evaluation to multichannel spatial audio
setups [22]. It allows side-by-side comparison of a
reference with multiple renderings of the same audio
content and provides an absolute quality scale through
the use of a hidden reference and anchor test items. In
this test, the MUSHRA methodology is adopted into
a VR setting, and thus some departures from the recommended implementation are necessary. Specifically, the version implemented here does not allow looping of the audio content, and the anchor item is the 3DoF rendering.

Fig. 4: The signal paths for reference rendering and DirAC. In the reference case, the tracking data is used to change the positioning and rotation of the object-based B-format synthesis (top left). In the other conditions C1-C3, the tracking data is applied in the DirAC domain (right).
The different conditions are presented in randomized
order in each run. Each participant is asked to
evaluate the audio quality of each condition and give a
score on a scale of 0 to 100. They know that one of the
conditions is, in fact, identical to the reference and as
such to be scored with 100 points. The worst ‘anchor’
condition is to be scored 20 (bad) or lower; all other
conditions should be scored in between.
The MUSHRA panel was designed in such a way that
ratings of systems-under-test can be done at any time
while having an unobtrusive interface in the virtual
environment. By pressing a button on the hand-held
controller, a semi-transparent interface is instantiated
at eye level in the user’s field of view (FoV), at a dis-
tance suitable for natural viewing. A laser pointer is
present that replicates mouse-over states (inactive, ac-
tive, pressed, highlighted) for buttons to assist with
interaction. Pressing the same button on the hand-held
controller removes the panel but maintains all current
ratings and condition selection playback. All ratings
are logged in real-time to a file including a legend for
the randomization of conditions.
4.3 Conditions
A total of four different conditions were implemented
for the experiment.
REF
Object-based rendering. This is the reference
condition. The B-format is generated on the fly for
the listener’s current position and then rendered
via the virtual speakers.
C1 3DoF reproduction. The listener position is ignored, i.e., $\mathbf{l}(n) = \mathbf{0}$, but the head rotation $\mathbf{o}(n)$ is still applied. The gain is set to that of sources at a distance of 2 m from the listener. This condition is used as an anchor.
C2 The proposed method for 6DoF reproduction without distance information. The listener position is used to change the direction vector. All sources are located on a sphere outside of the walking area. The radius of the sphere was fixed to 2 m, i.e., $m(\mathbf{r}) = 2$ m for all $\mathbf{r}$, and the distance-dependent gain is applied ($\gamma = 1$).
C3 The proposed method of 6DoF reproduction with distance information. The listener position $\mathbf{l}(n)$ is used to change the direction vector. The distance information $m(\mathbf{r})$ is used to compute the correct DoA at the listener position (5), and the distance-dependent gain (6) is applied ($\gamma = 1$).
4.4 Rendering
The same signal processing pipeline is used for all con-
ditions. This was done to ensure that the comparison is
focused on the spatial reproduction only and the result
is not influenced by coloration or other effects. The
pipeline is shown in Fig. 4. Two B-Format signals are
computed from the three mono source signals. A di-
rect (dry) signal is computed online. A reverberation
(wet) signal is precomputed off-line. These are added
together and processed by DirAC which renders to vir-
tual loudspeakers, which are then binauralized. The
difference lies in the application of the tracking data.
In the reference case, it is applied before the synthesis
of the B-format signal, such that it is virtually recorded
at the listener position. In the other cases, it is applied
in the DirAC domain.
4.4.1 Reference
Object-based rendering is used as a reference scenario.
Virtually, the listener is equipped with a B-format mi-
crophone on her/his head and produces a recording at
his/her head position and rotation. This is implemented
straightforwardly: The objects are placed relative to the
tracked listener position. An FOA signal is generated
from each source with distance attenuation. The synthetic direct B-format signal $\mathbf{s}_i$ for a source signal $s_i(t)$ at distance $d_i$, with azimuth $\theta$ and elevation $\varphi$, is:

$$\mathbf{s}_i(t) = \frac{1}{d_i}\begin{bmatrix} 1 \\ \cos\theta\,\cos\varphi \\ \sin\theta\,\cos\varphi \\ \sin\varphi \end{bmatrix} s_i(t - d_i/c), \tag{12}$$

where $c$ is the speed of sound in m/s. Thereafter, the tracked rotation is applied in the FOA domain [7].
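A minimal sketch of the object-based B-format synthesis in (12); the azimuth/elevation convention, the gain pattern exactly as written in (12), and the integer-sample delay are simplifying assumptions of this example.

```python
import numpy as np

def encode_source_foa(sig, src_pos, listener_pos, fs, c=343.0):
    """Sketch of the object-based B-format synthesis of eq. (12).

    sig: mono source signal; src_pos, listener_pos: 3D positions in meters
    The propagation delay d/c is rounded to whole samples for simplicity.
    """
    rel = np.asarray(src_pos, float) - np.asarray(listener_pos, float)
    d = np.linalg.norm(rel)
    theta = np.arctan2(rel[1], rel[0])                  # azimuth
    phi = np.arcsin(rel[2] / (d + 1e-12))               # elevation

    # Delay the signal by d/c, here as an integer sample shift
    delay = int(round(d / c * fs))
    delayed = np.concatenate([np.zeros(delay), sig])[:len(sig)]

    # Eq. (12): [W, X, Y, Z] pattern scaled by 1/d distance attenuation
    gains = np.array([1.0,
                      np.cos(theta) * np.cos(phi),
                      np.sin(theta) * np.cos(phi),
                      np.sin(phi)]) / (d + 1e-12)
    return gains[:, None] * delayed[None, :]            # (4, len(sig))
```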
4.4.2 Artificial Reverberation
Artificial reverberation is added to the source signal
in a time-invariant manner to enhance the realism of
the rendered in-door sound scene. Early reflections
from the boundaries of the shoebox-shaped room are
added with accurate delay, direction, and attenuation.
Late reverberation is generated with a spatial feedback
delay network (FDN) which distributes the multichannel
output to the virtual loudspeaker setup [28]. The
frequency-dependent reverberation time $T_{60}$ was between
90 and 150 ms with a mean of 110 ms. A tonal
correction filter with a lowpass characteristic was ap-
plied subsequently.
The reverberated signal is then converted from the 8+4+4
virtual speaker setup to B-format by multiplying each
of the virtual speaker signals with the B-format pattern
of their DoA as in (12). The reverberant B-format
signal is added to the direct signal.
4.4.3 Rendering and Binauralization
The summed B-format is processed in the DirAC do-
main. The encoding is done using a quadrature mirror
filter (QMF) filterbank with 128 bands, chosen due to
its high temporal resolution and low temporal aliasing.
Both direction and diffuseness are estimated with a
moving average smoothing of 42 ms. The decoder
generates 8+4+4 virtual loudspeaker signals. These 16
signals are then convolved with non-individual HRTFs
for binaural playback.
Fig. 5: MUSHRA ratings (N=21) as box plots. The dotted line represents the median score, the boxes the 1st to 3rd quartile, the whiskers are at ±1.5 inter-quartile range (IQR). Stars indicate significant differences according to a pairwise permutation test.
5 Results
The test was done with 24 subjects. Three subjects' results were excluded as they scored the reference below or equal to 70. The remaining N=21 subjects were on average 29.3 (SD=4.7) years old; 5 were female, and none reported any hearing impairment. All but five reported some experience with VR, most in the range of 0.5-2 hours total. Only five of those reported having done listening experiments with active tracking. All but seven subjects had done a MUSHRA test before. On average, it took 8.5 (SD=6) minutes to complete the test.
Fig. 5 shows the overall score distribution as box
plots. It can be clearly seen that the proposed method
achieved the second highest rating after the reference.
Both the 3DoF condition and the reproduction without
distance information were scored lower. All pairs of con-
ditions were compared with a permutation test. No signif-
icant difference was found between C1 and C2. All other
pairs were significantly different (p ≤ 0.01, N = 250000).
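The paper does not detail the permutation scheme; the following sketch shows one common paired (sign-flip) variant on per-listener score differences, with illustrative names, as a rough indication of how such a pairwise test can be run.

```python
import numpy as np

def paired_permutation_test(a, b, n_perm=250000, seed=None):
    """Sketch of a paired (sign-flip) permutation test on MUSHRA scores.

    a, b: score arrays of two conditions, one entry per listener.
    Returns a two-sided p-value; the exact test used in the paper may differ.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(a, float) - np.asarray(b, float)
    observed = np.abs(diff.mean())

    # Randomly flip the sign of each listener's difference and re-average
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    return (np.sum(null >= observed) + 1) / (n_perm + 1)
```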
6 Discussion
Even though the conditions were found to be signif-
icantly different, the variance in the responses was
relatively large. One reason for this could be the dif-
ferent experience levels of the test subjects with VR.
However, the fact that significant effects were still found
with participants ranging from novice to expert in VR and
listening tests shows that the results hold across these factors.
Some participants had difficulty spotting the 3DoF condition as the anchor. This is understandable, as the reproduction without distance information also led to the sound coming from the wrong direction, i.e., when walking behind the person or radio. This effect may have been perceived as stronger than the missing distance attenuation. However, it may simplify the procedure and help with consistency to provide an additional, non-spatial anchor, such as a mono mix of the sound sources.
Regarding the proposed reproduction method, we see
that it allows for reproduction of FOA content, recorded
at a single point in space, in 6DoF. While the test
participants rated the ideal B-Format signal reference
higher, the proposed method achieved the highest mean
score for reproduction among the other conditions. The
proposed method works even when the sound sources
in the recording are located at different distances from
the microphone. In that case, the distances have to be
recorded as meta-data for reproduction in 6DoF. The
results show that the distance reproduction enhances
the quality of the experience. The effect is stronger
when the walking area allows for the users to walk
around sound sources.
The MUSHRA-VR method worked well for comparing the
different conditions. Most participants liked the interface and
used the possibility to open and close it at will. None
required additional instructions after the introduction
before putting on the HMD. We did not encounter any
severe usability issues.
7 Summary
A novel method of audio reproduction in six-degrees-
of-freedom was proposed. The audio is recorded as
first-order Ambisonics at a single position and distance
data for the sound sources is acquired as side informa-
tion. Using this information, the audio is reproduced
with respect to the live tracking of the listener in the
parametric directional audio coding domain.
A subjective test showed that the proposed method is
ranked close to object-based rendering. This implies
that the proposed reproduction method can successfully
provide a virtual playback beyond three degrees of
freedom when the distance information is taken into
account.
References
[1] Liitola, T., Headphone Sound Externalization, Ph.D. thesis, Helsinki University of Technology, Department of Electrical and Communications Engineering, Laboratory of Acoustics and Audio Signal Processing, 2006.

[2] Blauert, J., Spatial Hearing – Revised Edition: The Psychophysics of Human Sound Localization, The MIT Press, 1996, ISBN 0262024136.

[3] Zhang, W., Samarasinghe, P. N., Chen, H., and Abhayapala, T. D., “Surround by Sound: A Review of Spatial Audio Recording and Reproduction,” Applied Sciences, 7(5), p. 532, 2017.

[4] Bates, E. and Boland, F., “Spatial Music, Virtual Reality, and 360 Media,” in Audio Eng. Soc. Int. Conf. on Audio for Virtual and Augmented Reality, Los Angeles, CA, USA, 2016.

[5] Anderson, R., Gallup, D., Barron, J. T., Kontkanen, J., Snavely, N., Esteban, C. H., Agarwal, S., and Seitz, S. M., “Jump: Virtual Reality Video,” ACM Transactions on Graphics, 35(6), p. 198, 2016.

[6] Merimaa, J., Analysis, Synthesis, and Perception of Spatial Sound: Binaural Localization Modeling and Multichannel Loudspeaker Reproduction, Ph.D. thesis, Helsinki University of Technology, 2006.

[7] Kronlachner, M. and Zotter, F., “Spatial Transformations for the Enhancement of Ambisonic Recordings,” in 2nd International Conference on Spatial Audio, Erlangen, Germany, 2014.

[8] Noisternig, M., Sontacchi, A., Musil, T., and Holdrich, R., “A 3D Ambisonic Based Binaural Sound Reproduction System,” in Audio Eng. Soc. Conf., 2003.

[9] Taylor, M., Chandak, A., Mo, Q., Lauterbach, C., Schissler, C., and Manocha, D., “Guided multi-view ray tracing for fast auralization,” IEEE Trans. Visualization & Comp. Graphics, 18, pp. 1797–1810, 2012.

[10] Rungta, A., Schissler, C., Rewkowski, N., Mehra, R., and Manocha, D., “Diffraction Kernels for Interactive Sound Propagation in Dynamic Environments,” IEEE Trans. Visualization & Comp. Graphics, 24(4), pp. 1613–1622, 2018.

[11] Ziegler, M., Keinert, J., Holzer, N., Wolf, T., Jaschke, T., op het Veld, R., Zakeri, F. S., and Foessel, S., “Immersive Virtual Reality for Live-Action Video using Camera Arrays,” in IBC, Amsterdam, Netherlands, 2017.

[12] Mariette, N., Katz, B. F. G., Boussetta, K., and Guillerminet, O., “SoundDelta: A Study of Audio Augmented Reality Using WiFi-Distributed Ambisonic Cell Rendering,” in Audio Eng. Soc. Conv. 128, 2010.

[13] Thiergart, O., Galdo, G. D., Taseska, M., and Habets, E. A. P., “Geometry-Based Spatial Sound Acquisition using Distributed Microphone Arrays,” IEEE Trans. Audio, Speech, Language Process., 21(12), pp. 2583–2594, 2013.

[14] Schörkhuber, C., Hack, P., Zaunschirm, M., Zotter, F., and Sontacchi, A., “Localization of Multiple Acoustic Sources with a Distributed Array of Unsynchronized First-Order Ambisonics Microphones,” in Congress of Alps-Adria Acoustics Association, Graz, Austria, 2014.

[15] Tylka, J. G. and Choueiri, E. Y., “Soundfield Navigation using an Array of Higher-Order Ambisonics Microphones,” in Audio Eng. Soc. Conf. on Audio for Virtual and Augmented Reality, Los Angeles, CA, USA, 2016.

[16] Schultz, F. and Spors, S., “Data-Based Binaural Synthesis Including Rotational and Translatory Head-Movements,” in Audio Eng. Soc. Conf.: Sound Field Control - Engineering and Perception, 2013.

[17] Tylka, J. G. and Choueiri, E., “Comparison of Techniques for Binaural Navigation of Higher-Order Ambisonic Soundfields,” in Audio Eng. Soc. Conv. 139, 2015.

[18] Kowalczyk, K., Thiergart, O., Taseska, M., Del Galdo, G., Pulkki, V., and Habets, E. A. P., “Parametric Spatial Sound Processing: A Flexible and Efficient Solution to Sound Scene Acquisition, Modification, and Reproduction,” IEEE Signal Process. Mag., 32(2), pp. 31–42, 2015.

[19] Pulkki, V., “Spatial Sound Reproduction with Directional Audio Coding,” J. Audio Eng. Soc., 55(6), pp. 503–516, 2007.

[20] Thiergart, O., Kowalczyk, K., and Habets, E. A. P., “An Acoustical Zoom based on Informed Spatial Filtering,” in Int. Workshop on Acoustic Signal Enhancement, pp. 109–113, 2014.

[21] Khaddour, H., Schimmel, J., and Rund, F., “A Novel Combined System of Direction Estimation and Sound Zooming of Multiple Speakers,” Radioengineering, 24(2), 2015.

[22] International Telecommunication Union, “ITU-R BS.1534-3, Method for the subjective assessment of intermediate quality level of audio systems,” 2015.

[23] Thiergart, O., Del Galdo, G., Kuech, F., and Prus, M., “Three-Dimensional Sound Field Analysis with Directional Audio Coding Based on Signal Adaptive Parameter Estimators,” in Audio Eng. Soc. Conv. Spatial Audio: Sense the Sound of Space, 2010.

[24] Kuttruff, H., Room Acoustics, Taylor & Francis, 4th edition, 2000.

[25] Borß, C., “A polygon-based panning method for 3D loudspeaker setups,” in Audio Eng. Soc. Conv., pp. 343–352, Los Angeles, CA, USA, 2014.

[26] Rummukainen, O., Schlecht, S., Plinge, A., and Habets, E. A. P., “Evaluating Binaural Reproduction Systems from Behavioral Patterns in a Virtual Reality – A Case Study with Impaired Binaural Cues and Tracking Latency,” in Audio Eng. Soc. Conv. 143, New York, NY, USA, 2017.

[27] Engelke, U., Darcy, D. P., Mulliken, G. H., Bosse, S., Martini, M. G., Arndt, S., Antons, J.-N., Chan, K. Y., Ramzan, N., and Brunnström, K., “Psychophysiology-Based QoE Assessment: A Survey,” IEEE Journal of Selected Topics in Signal Processing, 11(1), pp. 6–21, 2017.

[28] Schlecht, S. J. and Habets, E. A. P., “Sign-Agnostic Matrix Design for Spatial Artificial Reverberation with Feedback Delay Networks,” in Audio Eng. Soc. Conf. on Spatial Reproduction, Tokyo, Japan, 2018.