Audio Engineering Society
Conference Paper
Presented at the Conference on
Audio for Virtual and Augmented Reality
2018 August 20 – 22, Redmond, WA, USA
This paper was peer-reviewed as a complete manuscript for presentation at this conference. This paper is available in the AES E-Library (http://www.aes.org/e-lib), all rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.
Six-Degrees-of-Freedom Binaural Audio Reproduction of
First-Order Ambisonics with Distance Information
Axel Plinge¹, Sebastian J. Schlecht¹, Oliver Thiergart¹, Thomas Robotham¹, Olli Rummukainen¹, and Emanuël A. P. Habets¹

¹International Audio Laboratories Erlangen, Germany (a joint institution of the Friedrich-Alexander-University Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits (IIS))

Correspondence should be addressed to Axel Plinge (axel.plinge@iis.fraunhofer.de)
ABSTRACT
First-order Ambisonics (FOA) recordings can be processed and reproduced over headphones. They can be rotated to account for the listener's head orientation. However, virtual reality (VR) systems allow the listener to move with six degrees of freedom (6DoF), i.e., three rotational plus three translational degrees of freedom. Here, the apparent angles and distances of the sound sources depend on the listener's position. We propose a technique to facilitate 6DoF reproduction. In particular, an FOA recording is described using a parametric model, which is modified based on the listener's position and information about the distances to the sources. We evaluate our method in a listening test, comparing different binaural renderings of a synthetic sound scene in which the listener can move freely.
1 Introduction
Reproduction of sound scenes has often focused on loudspeaker setups, as this was the typical reproduction in private contexts, e.g., the living room, and professional contexts, e.g., cinemas. Here, the relation of the scene to the reproduction geometry is static, as it accompanies a two-dimensional image that forces the listener to look in the front direction. Consequently, the spatial relation of the sound and visual objects is defined and fixed at production time.
In virtual reality (VR), the immersion is explicitly
achieved by allowing the user to move freely in the
scene. Therefore, it is necessary to track the user’s
movement and adjust the visual and auditory reproduc-
tion to the user’s position. Typically, the user is wearing
a head-mounted display (HMD) and headphones. For
an immersive experience with headphones, the audio
has to be binauralized. Binauralization is a simulation
of how the human head, ears, and upper torso change
the sound of a source depending on its direction and
distance. This is achieved by convolution of the signals
with head-related transfer functions (HRTFs) for their relative direction [1, 2]. Binauralization also makes the sound appear to be coming from the scene rather than from inside the head [3].
Regarding the VR rendering, we can distinguish meth-
ods by the number of degrees of freedom. Early tech-
niques supported three-degrees-of-freedom (3DoF), where the user has three rotational degrees of movement (pitch, yaw, roll). This has been realized for 360° video reproduction [4, 5]. Here, the user is either wearing an HMD or holding a tablet or phone in their hands. By moving her/his head or the device, the user can look around in any direction. Visually, this is realized by projecting the video on a sphere around the user. Audio is often recorded with a spatial microphone [6], e.g., first-order Ambisonics (FOA), close to the video camera. In the Ambisonics domain, the user's head rotation is adapted directly [7]. The audio is then, e.g., rendered to virtual loudspeakers placed around the user. These virtual loudspeaker signals are then binauralized [8].
Modern VR applications allow for six-degrees-of-freedom (6DoF). In addition to the head rotation, the user can move around, resulting in translation of her/his position in three spatial dimensions.
We may distinguish 6DoF rendering methods by the
type of information that is used. The first kind of 6DoF
rendering is synthetic, as is commonly encountered in
VR games. Here, the whole scene is synthetic with
computer-generated imagery (CGI). The audio is often
generated using object-based rendering, where each
audio object is rendered with a distance-dependent gain
and a relative direction from the user based on the
tracking data. Realism can be enhanced by artificial
reverberation and diffraction [9, 10].
The second kind is 6DoF reproduction of recorded con-
tent based on spatially distributed recording positions.
For video, arrays of cameras can be employed to generate light-field rendering [11]. For audio, spatially distributed microphone arrays or Ambisonics microphones are used. Different methods for interpolation of their signals have been proposed [12, 13, 14, 15].
Thirdly, 6DoF reproduction can be realized from
a recording at a single spatial position. Ambison-
ics recordings can be reproduced binaurally for off-
center positions via simulating higher-order Ambison-
ics (HOA) playback and listener movement within a
virtual loudspeaker array [8], translating along plane waves [16], or re-expanding the sound field [17]. As
there is no information on the distance of the sources
from the microphone, it is often assumed that all sound
sources are beyond the range of the listener.
In order to realize spatial sound modifications in a tech-
nically convenient way, parametric sound processing
or coding techniques can be employed (cf. [18] for an overview). Directional audio coding (DirAC) [19]
is a popular method to transform the recording into a
representation that consists of an audio spectrum and
parametric side information on the sound direction and
diffuseness.
An example of parametric spatial sound manipulation
from a single recording position is that of ‘acoustic zoom’ techniques [20, 21]. Here, the listener position is virtually moved into the recorded scene, similar to zooming into an image. The user chooses one direction or image portion and can then listen to this from a translated point. This entails that all the directions of arrival (DoAs) change relative to the original, non-zoomed reproduction. An interesting application for the second
kind of 6DoF reproduction is using recordings of dis-
tributed microphone arrays to generate the signal of a
‘virtual microphone’ placed at an arbitrary position in
the room [13].
The method proposed in this paper uses the parametric
representation of DirAC for 6DoF reproduction from
the recording of a single FOA microphone. In order
to correctly reproduce the sound of nearby objects from the perspective of the listener, information about the
distance of the sound sources in the recording is incor-
porated.
For evaluation with a listening test, the multiple stim-
uli with hidden reference and anchor (MUSHRA)
paradigm [22] is adapted for VR. By using CGI and synthetically generated sound, we can create an object-based reference for comparison. A virtual FOA recording takes place at the tracked position of the user, rendering the 6DoF-adjusted signals. In addition to the proposed method, reproduction without distance information and reproduction without translation were presented as conditions in the listening test.
The remainder of this paper is organized in six parts: A
problem statement explaining the goal and introducing
the notation (Section 2), a detailed description of the
method itself (Section 3), a description of the experi-
ment and its methodology (Section 4), presentation of the results (Section 5), a discussion thereof (Section 6), and a short summary (Section 7).
2 Problem Statement
Our goal is to obtain a virtual binaural signal at the
listener’s position given a signal at the original record-
ing position and information about the distances of
sound sources from the recording position. The physical sources are assumed to be separable by their angle towards the recording position.

Fig. 1: The 6DoF reproduction of spatial audio. A sound source is recorded by a microphone with the direction of arrival (DoA) $\mathbf{r}_r$ at the distance $d_r$ relative to the microphone's position and orientation (black line and arc). It has to be reproduced relative to the moving listener with the DoA $\mathbf{r}_l$ and distance $d_l$ (red dashed). This has to consider the listener's translation $\mathbf{l}$ and rotation $\mathbf{o}$ (blue dotted).
The scene is recorded from the point of view (PoV) of
the microphone, the position of which is used as the
origin of the reference coordinate system. The scene
has to be reproduced from the PoV of the listener, who
is tracked in 6DoF, cf. Fig. 1. A single sound source is shown here for illustration; the relation holds for each time-frequency bin.
The sound source at the coordinates $\mathbf{d}_r \in \mathbb{R}^3$ is recorded from the direction of arrival (DoA) expressed by the unit vector $\mathbf{r}_r = \mathbf{d}_r / \lVert\mathbf{d}_r\rVert$. This DoA can be estimated from analysis of the recording. It is coming from the distance $d_r = \lVert\mathbf{d}_r\rVert$. We assume this information can be estimated automatically, e.g., using a time-of-flight camera, to obtain distance information in the form of a depth map $m(\mathbf{r})$, which maps each direction $\mathbf{r}$ from the recording position to the distance of the closest sound source in meters.
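As an illustration of how such a depth map could be represented in practice, the following sketch (not part of the original implementation) stores distances on a coarse azimuth/elevation grid; the 10° resolution, the `DepthMap` class name, and the fallback distance are assumptions made for this example only.

```python
import numpy as np

# Minimal sketch (not the authors' implementation): a depth map m(r) stored on a
# coarse azimuth/elevation grid; each cell holds the distance in meters to the
# closest sound source seen in that direction from the recording position.
class DepthMap:
    def __init__(self, distances, default=2.0):
        # distances: dict mapping (azimuth_deg, elevation_deg) grid cells to meters
        self.distances = distances
        self.default = default  # assumed fallback distance for unmapped directions

    def lookup(self, r):
        """Return m(r) for a unit direction vector r = [x, y, z]."""
        az = np.degrees(np.arctan2(r[1], r[0]))
        el = np.degrees(np.arcsin(np.clip(r[2], -1.0, 1.0)))
        # quantize to a 10-degree grid (assumed resolution)
        cell = (int(round(az / 10.0)) * 10, int(round(el / 10.0)) * 10)
        return self.distances.get(cell, self.default)

# Example: a source roughly in front of the microphone at 0.5 m
m = DepthMap({(0, 0): 0.5})
print(m.lookup(np.array([1.0, 0.0, 0.0])))  # -> 0.5
```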
The listener is tracked in 6DoF. At a given time, the listener is at a position $\mathbf{l} \in \mathbb{R}^3$ relative to the microphone and has a rotation $\mathbf{o} \in \mathbb{R}^3$ relative to the microphone's coordinate system. We deliberately choose the recording position as the origin of our coordinate system to simplify the notation.
Thus, the sound has to be reproduced with a different distance $d_l$, leading to a changed volume, and a different DoA $\mathbf{r}_l$ that is the result of both translation and subsequent rotation.
We propose a method for obtaining a virtual signal from
the listener's perspective by dedicated transformations
based on a parametric representation, as explained in
the following section.
3 Method for 6DoF Reproduction
The proposed method is based on the basic DirAC approach for parametric spatial sound encoding, cf. [19].
The recording is transformed into a time-frequency representation using the short-time Fourier transform (STFT). We denote the time frame index with $n$ and the frequency index with $k$. It is assumed that there is one dominant direct source per time-frequency instance of the analyzed spectrum and that these instances can be treated independently. The transformed recording is then analyzed, estimating directions $\mathbf{r}_r(k,n)$ and diffuseness $\psi(k,n)$ for each time-frequency bin of the complex spectrum $P(k,n)$. In the synthesis, the signal is divided into a direct and a diffuse part. Here, loudspeaker signals are computed by panning the direct part depending on the speaker positions and adding the diffuse part.
Fig. 2: Proposed method of 6DoF reproduction. The recorded FOA signal in B-format is processed by a DirAC encoder that computes direction and diffuseness values for each time-frequency bin of the complex spectrum. The direction vector is then transformed by the listener's tracked position and according to the distance information given in a distance map. The resulting direction vector is then rotated according to the head rotation. Finally, signals for 8+4+4 virtual loudspeaker channels are synthesized in the DirAC decoder. These are then binauralized.
The method for transforming an FOA signal according to the listener's perspective in 6DoF can be divided into five steps, cf. Fig. 2. The input signal is analyzed in the DirAC encoder, the distance information is added from the distance map $m(\mathbf{r})$, and then the listener's tracked translation and rotation are applied in the novel transforms. The DirAC decoder synthesizes signals for 8+4+4 virtual loudspeakers, which are in turn binauralized for headphone playback. Note that, as the rotation of the sound scene after the translation is an independent operation, it could alternatively be applied in the binaural renderer. The only parameter transformed for 6DoF is the direction vector. By the model definition, we assume the diffuse part to be omnidirectional and thus keep it unchanged.
3.1 DirAC Encoding
The input to the DirAC encoder is an FOA sound
signal in B-format representation. It consists of four
channels, i.e., the omnidirectional sound pressure and
the three first-order gradients, which under certain
assumptions are proportional to the particle velocity.
This signal is encoded in a parametric way, cf. [23]. The parameters are derived from the complex sound pressure $P(k,n)$, which is the transformed omnidirectional signal, and the complex particle velocity vector $\mathbf{U}(k,n) = [U_X(k,n), U_Y(k,n), U_Z(k,n)]^{\mathrm{T}}$ corresponding to the transformed gradient signals.

The DirAC representation consists of the signal $P(k,n)$, the diffuseness $\psi(k,n)$, and the direction $\mathbf{r}(k,n)$ of the sound wave at each time-frequency bin. To derive the latter, first the active sound intensity vector $\mathbf{I}_a(k,n)$ is computed as the real part (denoted by $\mathrm{Re}(\cdot)$) of the product of the pressure with the complex conjugate (denoted by $(\cdot)^*$) of the velocity vector [23]:

$$\mathbf{I}_a(k,n) = \tfrac{1}{2}\,\mathrm{Re}\bigl(P(k,n)\,\mathbf{U}^*(k,n)\bigr). \tag{1}$$
The diffuseness is estimated from the coefficient of variation of this vector [23]:

$$\psi(k,n) = \sqrt{1 - \frac{\lVert \mathrm{E}\{\mathbf{I}_a(k,n)\}\rVert}{\mathrm{E}\{\lVert\mathbf{I}_a(k,n)\rVert\}}}, \tag{2}$$
where $\mathrm{E}$ denotes the expectation operator along time frames, implemented as a moving average.

Since we intend to manipulate the sound using a direction-based distance map, the variance of the direction estimates should be low. As the frames are typically short, this is not always the case. Therefore, a moving average is applied to obtain a smoothed direction estimate $\overline{\mathbf{I}}_a(k,n)$. The DoA of the direct part of the signal is then computed as the unit-length vector in the opposite direction:

$$\mathbf{r}_r(k,n) = -\frac{\overline{\mathbf{I}}_a(k,n)}{\lVert\overline{\mathbf{I}}_a(k,n)\rVert}. \tag{3}$$
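A minimal numerical sketch of this analysis stage, Eqs. (1)–(3), is given below. It is not the authors' implementation: the moving-average length, the epsilon regularization, and the array layout (frequency bins × frames) are assumptions made for illustration; in the paper the smoothing corresponds to 42 ms (cf. Section 4.4.3).

```python
import numpy as np

def dirac_encode(P, U, avg_len=8):
    """Sketch of the DirAC parameter estimation in Eqs. (1)-(3).

    P: complex pressure spectrum, shape (K, N) (frequency bins x frames).
    U: complex particle-velocity spectra, shape (3, K, N).
    avg_len: moving-average length in frames (the expectation operator E{.}).
    Returns the DoA unit vectors r_r (3, K, N) and diffuseness psi (K, N).
    """
    # Eq. (1): active sound intensity vector
    Ia = 0.5 * np.real(P[np.newaxis, :, :] * np.conj(U))          # (3, K, N)

    # Expectation along time frames, implemented as a moving average
    kernel = np.ones(avg_len) / avg_len
    smooth = lambda x: np.apply_along_axis(
        lambda v: np.convolve(v, kernel, mode='same'), -1, x)
    Ia_mean = smooth(Ia)                                           # E{Ia}
    norm_mean = smooth(np.linalg.norm(Ia, axis=0))                 # E{||Ia||}

    # Eq. (2): diffuseness from the coefficient of variation of Ia
    eps = 1e-12
    psi = np.sqrt(np.clip(1.0 - np.linalg.norm(Ia_mean, axis=0)
                          / (norm_mean + eps), 0.0, 1.0))

    # Eq. (3): DoA as unit vector opposite to the smoothed intensity
    r_r = -Ia_mean / (np.linalg.norm(Ia_mean, axis=0, keepdims=True) + eps)
    return r_r, psi
```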
3.2 Translation Transformation
As the direction is encoded as a three-dimensional vec-
tor of unit length for each time-frequency bin, it is
straightforward to integrate the distance information.
We multiply the direction vectors with their correspond-
ing map entry such that the vector length represents the
distance of the corresponding sound source $d_r(k,n)$:

$$\mathbf{d}_r(k,n) = \mathbf{r}_r(k,n)\, d_r(k,n) = \mathbf{r}_r(k,n)\, m\bigl(\mathbf{r}_r(k,n)\bigr), \tag{4}$$

where $\mathbf{d}_r(k,n)$ is a vector pointing from the recording position of the microphone to the sound source active at time $n$ and frequency bin $k$.
The listener position, measured with respect to the recording position, is given by the tracking system for the current processing frame as $\mathbf{l}(n)$. With the vector representation of source positions, we can subtract the tracking position vector $\mathbf{l}(n)$ to yield the new, translated direction vector $\mathbf{d}_l(k,n)$ with the length $d_l(k,n) = \lVert\mathbf{d}_l(k,n)\rVert$, cf. Fig. 1. The distances from the listener's PoV to the sound sources are derived, and the DoAs are adapted in a single step:

$$\mathbf{d}_l(k,n) = \mathbf{d}_r(k,n) - \mathbf{l}(n). \tag{5}$$
An important aspect of realistic reproduction is the distance attenuation or amplification. We assume the adjustment is a function of the distance between the sound source and the listener [24]. The length of the direction vector is used to encode this adjustment. We have the distance to the recording position encoded in $\mathbf{d}_r(k,n)$ according to the distance map, and the distance to be reproduced encoded in $\mathbf{d}_l(k,n)$. If we normalize the vector to unit length and then multiply by the ratio of new and old distance, we see that the required length is given by dividing $\mathbf{d}_l(k,n)$ by the length of the original vector:

$$\mathbf{d}_v(k,n) = \frac{\mathbf{d}_l(k,n)}{\lVert\mathbf{d}_l(k,n)\rVert}\,\frac{\lVert\mathbf{d}_l(k,n)\rVert}{\lVert\mathbf{d}_r(k,n)\rVert} = \frac{\mathbf{d}_l(k,n)}{\lVert\mathbf{d}_r(k,n)\rVert}. \tag{6}$$
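The following sketch illustrates Eqs. (4)–(6) for a single time-frequency bin; the variable names mirror the notation above and are not taken from any reference implementation.

```python
import numpy as np

def translate_direction(r_r, m_of_r, l):
    """Sketch of the translation transform, Eqs. (4)-(6), for one
    time-frequency bin. r_r: unit DoA at the microphone; m_of_r: distance m(r_r)
    from the depth map in meters; l: tracked listener position, shape (3,)."""
    d_r = r_r * m_of_r                 # Eq. (4): vector from microphone to source
    d_l = d_r - l                      # Eq. (5): vector from listener to source
    d_v = d_l / np.linalg.norm(d_r)    # Eq. (6): length encodes d_l / d_r
    return d_v

# Example: source 0.5 m in front of the microphone, listener 0.25 m toward it.
d_v = translate_direction(np.array([1.0, 0.0, 0.0]), 0.5,
                          np.array([0.25, 0.0, 0.0]))
print(d_v, np.linalg.norm(d_v))  # length 0.5 -> later gain ||d_p||^-gamma = 2
```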
3.3 Rotation Transformation
We apply the changes for the listener's orientation in the following step. The orientation given by the tracking system can be written as a vector composed of the pitch, yaw, and roll, $\mathbf{o}(n) = [o_X(n), o_Z(n), o_Y(n)]^{\mathrm{T}}$, relative to the recording position as the origin. The source direction is rotated according to the listener orientation, which is implemented using 2D rotation matrices, cf. eqn. (23) in [7]:

$$\mathbf{d}_p(k,n) = \mathbf{R}_Y\bigl(o_Y(n)\bigr)\,\mathbf{R}_Z\bigl(o_Z(n)\bigr)\,\mathbf{R}_X\bigl(o_X(n)\bigr)\,\mathbf{d}_v(k,n). \tag{7}$$

The resulting DoA for the listener is then given by the vector normalized to unit length:

$$\mathbf{r}_p(k,n) = \frac{\mathbf{d}_p(k,n)}{\lVert\mathbf{d}_p(k,n)\rVert}. \tag{8}$$
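A sketch of Eqs. (7) and (8) using elementary rotation matrices is shown below. The axis conventions and rotation signs are assumptions for illustration; the paper follows the 2D rotation matrices of eqn. (23) in [7], which may differ in sign and ordering conventions.

```python
import numpy as np

def Rx(a):  # rotation about the x axis, angle in radians (assumed convention)
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def Ry(a):  # rotation about the y axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def Rz(a):  # rotation about the z axis
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def rotate_to_listener(d_v, o):
    """Apply Eq. (7) and Eq. (8): o = [oX, oZ, oY] holds the tracked pitch, yaw,
    and roll in radians, ordered as in the text above."""
    d_p = Ry(o[2]) @ Rz(o[1]) @ Rx(o[0]) @ d_v   # Eq. (7)
    r_p = d_p / np.linalg.norm(d_p)              # Eq. (8)
    return d_p, r_p
```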
3.4 DirAC Decoding and Binauralization
The transformed direction vector, the diffuseness, and the complex spectrum are used to synthesize signals for a uniformly distributed 8+4+4 virtual loudspeaker setup. Eight virtual speakers are located in 45° azimuth steps on the listener plane (elevation 0°), and four each in a 90° cross formation above and below at ±45° elevation. The synthesis is split into a direct and a diffuse part for each loudspeaker channel $1 \le i \le I$, where $I = 16$ is the number of loudspeakers [19]:

$$Y_i(k,n) = Y_{i,S}(k,n) + Y_{i,D}(k,n). \tag{9}$$

For the direct part, edge fading amplitude panning (EFAP) is applied to reproduce the sound from the right direction given the virtual loudspeaker geometry [25]. Given the DoA vector $\mathbf{r}_p(k,n)$, this provides a panning gain $G_i(\mathbf{r})$ for each virtual loudspeaker channel $i$. The distance-dependent gain for each DoA is derived from the resulting length of the direction vector, $d_p(k,n)$. The direct synthesis for channel $i$ becomes:

$$Y_{i,S}(k,n) = \sqrt{1-\psi(k,n)}\; P(k,n)\; G_i\bigl(\mathbf{r}_p(k,n)\bigr)\, \lVert\mathbf{d}_p(k,n)\rVert^{-\gamma}, \tag{10}$$

where the exponent $\gamma$ is a tuning factor that is typically set to about 1 [24]. Note that with $\gamma = 0$ the distance-dependent gain is turned off, which is equivalent to the non-VR version of the decoder [19].

The pressure $P(k,n)$ is used to generate $I$ decorrelated signals $\tilde{P}_i(k,n)$. These decorrelated signals are added to the individual loudspeaker channels as the diffuse component. This follows the standard method [19]:

$$Y_{i,D}(k,n) = \sqrt{\psi(k,n)}\,\frac{1}{\sqrt{I}}\,\tilde{P}_i(k,n). \tag{11}$$
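The per-bin synthesis of Eqs. (9)–(11) can be sketched as follows. The EFAP panner and the decorrelation filters are not reproduced here; `G` and `P_decorr` are assumed to be supplied by such components, and the fallback for `P_decorr` is a placeholder for illustration only.

```python
import numpy as np

def synthesize_channels(P, psi, d_p, G, gamma=1.0, P_decorr=None):
    """Sketch of the per-bin loudspeaker synthesis, Eqs. (9)-(11).

    P: complex pressure of this time-frequency bin; psi: diffuseness in [0, 1];
    d_p: rotated direction vector (its length carries the distance ratio);
    G: panning gains for the I virtual loudspeakers, e.g. from an EFAP panner
       (not shown here); P_decorr: I decorrelated copies of P (decorrelation
       filters are also omitted in this sketch).
    """
    I = len(G)
    if P_decorr is None:
        P_decorr = np.full(I, P)  # placeholder: real decorrelators are required

    dist_gain = np.linalg.norm(d_p) ** (-gamma)      # distance-dependent gain
    Y_S = np.sqrt(1.0 - psi) * P * G * dist_gain     # Eq. (10): direct part
    Y_D = np.sqrt(psi) * P_decorr / np.sqrt(I)       # Eq. (11): diffuse part
    return Y_S + Y_D                                 # Eq. (9)
```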
Fig. 3:
The VR scene. The sound is coming from the person, the radio and the open window, each source marked
with concentric circles. The microphone position for the FOA recording is marked by a cross. The user can
walk in the area marked by the dashed rectangle on the floor.
The diffuse and direct parts of each channel are added together, and the signals are transformed back into the time domain by an inverse STFT. These time-domain channel signals are convolved with HRTFs for the left and right ear, depending on the loudspeaker position, to create the binauralized signals.
4 Perceptual Evaluation
For the evaluation, a single scene in a virtual living
room is reproduced. Different rendering conditions are
used to reproduce three simultaneously active sound
sources. A novel MUSHRA-VR technique was used to
assess the quality with the help of test subjects.
4.1 The Scene
The virtual environment in the experiment is an indoor
room with three sound sources at different distances
from the recording position. At about 50 cm there is
a human speaker, at 1 m a radio and at 2 m an open
window, cf. Fig. 3.
The visual rendering is done using Unity and an HTC
VIVE. The audio processing is implemented with the
help of virtual studio technology (VST) plugins and
Max/MSP. The tracking data and conditions are ex-
changed via open sound control (OSC) messages. The
walking area is about 3.5 ×4.5 m.
4.2 MUSHRA-VR
While there are established standards for evaluation
of static audio reproduction, these are usually not di-
rectly applicable for VR. Especially for 6DoF, novel
approaches for evaluation of the audio quality have to
be developed as the experience is more complicated
than in audio-only evaluation, and the presented con-
tent depends on the unique motion path of each listener.
Novel methods such as wayfinding in VR [26] or physiological responses to immersive experiences [27] are
actively researched, but traditional well-tested methods
can also be adapted to a VR environment to support
development work done today.
MUSHRA is a widely adopted audio quality evalua-
tion method applied to a wide range of use cases from
speech quality evaluation to multichannel spatial audio
setups [22]. It allows side-by-side comparison of a
reference with multiple renderings of the same audio
content and provides an absolute quality scale through
the use of a hidden reference and anchor test items. In
this test, the MUSHRA methodology is adopted into
a VR setting, and thus some departures from the recommended implementation are necessary. Specifically, the version implemented here does not allow looping of the audio content, and the anchor item is the 3DoF rendering.

Fig. 4: The signal paths for reference rendering and DirAC. In the reference case, the tracking data is used to change the positioning and rotation of the object-based B-format synthesis (top left). In the other conditions C1–C3, the tracking data is applied in the DirAC domain (right).
The different conditions are randomly assigned to the
test conditions in each run. Each participant is asked to
evaluate the audio quality of each condition and give a
score on a scale of 0 to 100. They know that one of the
conditions is, in fact, identical to the reference and as
such to be scored with 100 points. The worst ‘anchor’
condition is to be scored 20 (bad) or lower; all other
conditions should be scored in between.
The MUSHRA panel was designed in such a way that
ratings of systems-under-test can be done at any time
while having an unobtrusive interface in the virtual
environment. By pressing a button on the hand-held
controller, a semi-transparent interface is instantiated
at eye level in the user’s field of view (FoV), at a dis-
tance suitable for natural viewing. A laser pointer is
present that replicates mouse-over states (inactive, ac-
tive, pressed, highlighted) for buttons to assist with
interaction. Pressing the same button on the hand-held
controller removes the panel but maintains all current
ratings and condition selection playback. All ratings
are logged in real-time to a file including a legend for
the randomization of conditions.
4.3 Conditions
A total of four different conditions were implemented
for the experiment.
REF
Object-based rendering. This is the reference
condition. The B-format is generated on the fly for
the listener’s current position and then rendered
via the virtual speakers.
C1 3DoF reproduction. The listener position is ignored, i.e., $\mathbf{l}(n) = \mathbf{0}$, but the head rotation $\mathbf{o}(n)$ is still applied. The gain is set to that of sources at a distance of 2 m from the listener. This condition is used as an anchor.

C2 The proposed method for 6DoF reproduction without distance information. The listener position is used to change the direction vector. All sources are located on a sphere outside of the walking area. The radius of the sphere was fixed to 2 m, i.e., $m(\mathbf{r}) = 2\;\forall\,\mathbf{r}$, and the distance-dependent gain is applied ($\gamma = 1$).

C3 The proposed method of 6DoF reproduction with distance information. The listener position $\mathbf{l}(n)$ is used to change the direction vector. The distance information $m(\mathbf{r})$ is used to compute the correct DoA at the listener position (5), and the distance-dependent gain (6) is applied ($\gamma = 1$).
4.4 Rendering
The same signal processing pipeline is used for all con-
ditions. This was done to ensure that the comparison is
focused on the spatial reproduction only and the result
is not influenced by coloration or other effects. The
pipeline is shown in Fig. 4. Two B-Format signals are
computed from the three mono source signals. A di-
rect (dry) signal is computed online. A reverberation
(wet) signal is precomputed off-line. These are added
together and processed by DirAC, which renders to virtual loudspeakers that are then binauralized. The
difference lies in the application of the tracking data.
In the reference case, it is applied before the synthesis
of the B-format signal, such that it is virtually recorded
at the listener position. In the other cases, it is applied
in the DirAC domain.
4.4.1 Reference
Object-based rendering is used as a reference scenario. Virtually, the listener is equipped with a B-format microphone on her/his head and produces a recording at her/his head position and rotation. This is implemented straightforwardly: the objects are placed relative to the tracked listener position. An FOA signal is generated from each source with distance attenuation. The synthetic direct B-format signal $\mathbf{s}_i$ for a source signal $s_i(t)$ at distance $d_i$, with azimuth $\theta$ and elevation $\varphi$, is:

$$\mathbf{s}_i(t) = \frac{1}{d_i}\begin{bmatrix} 1 \\ \cos\theta\,\cos\varphi \\ \sin\theta\,\cos\varphi \\ \sin\varphi \end{bmatrix} s_i(t - d_i/c), \tag{12}$$

where $c$ is the speed of sound in m/s. Thereafter, the tracked rotation is applied in the FOA domain [7].
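A sketch of the B-format synthesis in Eq. (12) for one source is given below. Rounding the propagation delay to an integer number of samples and the function interface are simplifications assumed here, not details taken from the paper.

```python
import numpy as np

def foa_point_source(s, d, azimuth, elevation, fs, c=343.0):
    """Sketch of Eq. (12): synthesize a direct B-format (FOA) signal for a mono
    source s at distance d (m), with azimuth/elevation in radians and sample
    rate fs. The delay d/c is rounded to whole samples for simplicity."""
    delay = int(round(d / c * fs))                         # samples for d/c
    s_del = np.concatenate([np.zeros(delay), s])[:len(s)]  # delayed source
    pattern = np.array([1.0,
                        np.cos(azimuth) * np.cos(elevation),
                        np.sin(azimuth) * np.cos(elevation),
                        np.sin(elevation)])
    # 1/d distance attenuation applied to all four channels, shape (4, len(s))
    return (pattern[:, np.newaxis] / d) * s_del[np.newaxis, :]
```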
4.4.2 Artificial Reverberation
Artificial reverberation is added to the source signal
in a time-invariant manner to enhance the realism of
the rendered in-door sound scene. Early reflections
from the boundaries of the shoebox-shaped room are
added with accurate delay, direction, and attenuation.
Late reverberation is generated with a spatial feedback delay network (FDN), which distributes the multichannel output to the virtual loudspeaker setup [28]. The frequency-dependent reverberation time $T_{60}$ was between 90 and 150 ms with a mean of 110 ms. A tonal
correction filter with a lowpass characteristic was ap-
plied subsequently.
The reverberated signal is then converted from the 8+4+4 virtual speaker setup to B-format by multiplying each
of the virtual speaker signals with the B-format pattern
of their DoA as in (12). The reverberant B-format
signal is added to the direct signal.
4.4.3 Rendering and Binauralization
The summed B-format is processed in the DirAC do-
main. The encoding is done using a quadrature mirror
filter (QMF) filterbank with 128 bands, chosen due to
its high temporal resolution and low temporal aliasing.
Both direction and diffuseness are estimated with a
moving average smoothing of 42 ms. The decoding generates 8+4+4 virtual loudspeaker signals. These 16
signals are then convolved with non-individual HRTFs
for binaural playback.
Fig. 5: MUSHRA ratings (N=21) for conditions C1 (3DoF), C2 (no distance), C3 (6DoF), and REF (objects), shown as box plots of the MUSHRA scores (0–100). The dotted line represents the median score, the boxes the 1st to 3rd quartile, and the whiskers are at ±1.5 inter-quartile range (IQR). Stars indicate significant differences according to a pairwise permutation test.
5 Results
The test was done with 24 subjects. Three subjects’
results were excluded as they scored the reference be-
low or equal to 70. The remaining N=21 subjects were around 29.3 (SD=4.7) years old; 5 were female, and none reported any hearing impairment. All but five reported some experience with VR, most in the range of 0.5–2 hours total. Only five of those reported having done listening experiments with active tracking. All but seven subjects
had done a MUSHRA test before. On average, it took
8.5 (SD=6) minutes to complete the test.
Fig. 5 shows the overall score distribution as box
plots. It can be clearly seen that the proposed method
achieved the second highest rating after the reference.
Both the 3DoF reproduction and the reproduction without distance information were scored lower. All pairs of conditions were compared with a permutation test. No significant difference was found between C1 and C2. All other pairs were significantly different (p ≤ 0.01, N = 250000).
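For reference, a paired permutation test of the kind used for such pairwise comparisons can be sketched as follows; the sign-flipping statistic and the use of the mean difference are assumptions, as the paper does not detail the exact test implementation.

```python
import numpy as np

def paired_permutation_test(a, b, n_perm=250000, rng=None):
    """Sketch of a pairwise permutation test on per-subject scores of two
    conditions a and b: randomly flip the signs of the paired differences and
    compare the observed mean difference against the resulting null distribution."""
    rng = np.random.default_rng(rng)
    diff = np.asarray(a, float) - np.asarray(b, float)
    observed = abs(diff.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    null = np.abs((signs * diff).mean(axis=1))
    return (np.sum(null >= observed) + 1) / (n_perm + 1)   # two-sided p-value
```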
6 Discussion
Even though the conditions were found to be signif-
icantly different, the variance in the responses was
relatively large. One reason for this could be the dif-
ferent experience levels of the test subjects with VR.
However, the fact that significant effects were found with subjects ranging from novice to expert in VR and listening tests shows that the results hold across these factors.
Some participants had difficulty spotting the 3DoF condition as the anchor. This is understandable, as the reproduction without distance information also led to the sound coming from the wrong direction, i.e., when walking behind the person or radio. This effect may have been perceived more strongly than the missing distance attenuation. However, it may simplify the procedure and help
with consistency to provide an additional, non-spatial
anchor, such as a mono mix of the sound sources.
Regarding the proposed reproduction method, we see
that it allows for reproduction of FOA content, recorded
at a single point in space, in 6DoF. While the test
participants rated the ideal B-Format signal reference
higher, the proposed method achieved the highest mean
score for reproduction among the other conditions. The
proposed method works even when the sound sources
in the recording are located at different distances from the microphone. In that case, the distances have to be recorded as metadata for reproduction in 6DoF. The results show that the distance reproduction enhances the quality of the experience. The effect is stronger when the walking area allows the users to walk around the sound sources.
The MUSHRA-VR approach worked for comparing the different conditions. Most participants liked the interface and
used the possibility to open and close it at will. None
required additional instructions after the introduction
before putting on the HMD. We did not encounter any
severe usability issues.
7 Summary
A novel method of audio reproduction in six-degrees-
of-freedom was proposed. The audio is recorded as
first-order Ambisonics at a single position, and distance data for the sound sources is acquired as side informa-
tion. Using this information, the audio is reproduced
with respect to the live tracking of the listener in the
parametric directional audio coding domain.
A subjective test showed that the proposed method is
ranked close to object-based rendering. This implies
that the proposed reproduction method can successfully
provide a virtual playback beyond three degrees of
freedom when the distance information is taken into
account.
References
[1] Liitola, T., Headphone Sound Externalization, Ph.D. thesis, Helsinki University of Technology, Department of Electrical and Communications Engineering, Laboratory of Acoustics and Audio Signal Processing, 2006.

[2] Blauert, J., Spatial Hearing - Revised Edition: The Psychophysics of Human Sound Localization, The MIT Press, 1996, ISBN 0262024136.

[3] Zhang, W., Samarasinghe, P. N., Chen, H., and Abhayapala, T. D., "Surround by Sound: A Review of Spatial Audio Recording and Reproduction," Applied Sciences, 7(5), p. 532, 2017.

[4] Bates, E. and Boland, F., "Spatial Music, Virtual Reality, and 360 Media," in Audio Eng. Soc. Int. Conf. on Audio for Virtual and Augmented Reality, Los Angeles, CA, USA, 2016.

[5] Anderson, R., Gallup, D., Barron, J. T., Kontkanen, J., Snavely, N., Esteban, C. H., Agarwal, S., and Seitz, S. M., "Jump: Virtual Reality Video," ACM Transactions on Graphics, 35(6), p. 198, 2016.

[6] Merimaa, J., Analysis, Synthesis, and Perception of Spatial Sound: Binaural Localization Modeling and Multichannel Loudspeaker Reproduction, Ph.D. thesis, Helsinki University of Technology, 2006.

[7] Kronlachner, M. and Zotter, F., "Spatial Transformations for the Enhancement of Ambisonic Recordings," in 2nd International Conference on Spatial Audio, Erlangen, Germany, 2014.

[8] Noisternig, M., Sontacchi, A., Musil, T., and Holdrich, R., "A 3D Ambisonic Based Binaural Sound Reproduction System," in Audio Eng. Soc. Conf., 2003.

[9] Taylor, M., Chandak, A., Mo, Q., Lauterbach, C., Schissler, C., and Manocha, D., "Guided multiview ray tracing for fast auralization," IEEE Trans. Visualization & Comp. Graphics, 18, pp. 1797–1810, 2012.

[10] Rungta, A., Schissler, C., Rewkowski, N., Mehra, R., and Manocha, D., "Diffraction Kernels for Interactive Sound Propagation in Dynamic Environments," IEEE Trans. Visualization & Comp. Graphics, 24(4), pp. 1613–1622, 2018.

[11] Ziegler, M., Keinert, J., Holzer, N., Wolf, T., Jaschke, T., op het Veld, R., Zakeri, F. S., and Foessel, S., "Immersive Virtual Reality for Live-Action Video using Camera Arrays," in IBC, Amsterdam, Netherlands, 2017.

[12] Mariette, N., Katz, B. F. G., Boussetta, K., and Guillerminet, O., "SoundDelta: A Study of Audio Augmented Reality Using WiFi-Distributed Ambisonic Cell Rendering," in Audio Eng. Soc. Conv. 128, 2010.

[13] Thiergart, O., Galdo, G. D., Taseska, M., and Habets, E. A. P., "Geometry-Based Spatial Sound Acquisition using Distributed Microphone Arrays," IEEE Trans. Audio, Speech, Language Process., 21(12), pp. 2583–2594, 2013.

[14] Schörkhuber, C., Hack, P., Zaunschirm, M., Zotter, F., and Sontacchi, A., "Localization of Multiple Acoustic Sources with a Distributed Array of Unsynchronized First-Order Ambisonics Microphones," in Congress of Alps-Adria Acoustics Association, Graz, Austria, 2014.

[15] Tylka, J. G. and Choueiri, E. Y., "Soundfield Navigation using an Array of Higher-Order Ambisonics Microphones," in Audio Eng. Soc. Conf. on Audio for Virtual and Augmented Reality, Los Angeles, CA, USA, 2016.

[16] Schultz, F. and Spors, S., "Data-Based Binaural Synthesis Including Rotational and Translatory Head-Movements," in Audio Eng. Soc. Conf.: Sound Field Control - Engineering and Perception, 2013.

[17] Tylka, J. G. and Choueiri, E., "Comparison of Techniques for Binaural Navigation of Higher-Order Ambisonic Soundfields," in Audio Eng. Soc. Conv. 139, 2015.

[18] Kowalczyk, K., Thiergart, O., Taseska, M., Del Galdo, G., Pulkki, V., and Habets, E. A. P., "Parametric Spatial Sound Processing: A Flexible and Efficient Solution to Sound Scene Acquisition, Modification, and Reproduction," IEEE Signal Process. Mag., 32(2), pp. 31–42, 2015.

[19] Pulkki, V., "Spatial Sound Reproduction with Directional Audio Coding," J. Audio Eng. Soc., 55(6), pp. 503–516, 2007.

[20] Thiergart, O., Kowalczyk, K., and Habets, E. A. P., "An Acoustical Zoom based on Informed Spatial Filtering," in Int. Workshop on Acoustic Signal Enhancement, pp. 109–113, 2014.

[21] Khaddour, H., Schimmel, J., and Rund, F., "A Novel Combined System of Direction Estimation and Sound Zooming of Multiple Speakers," Radioengineering, 24(2), 2015.

[22] International Telecommunication Union, "ITU-R BS.1534-3, Method for the subjective assessment of intermediate quality level of audio systems," 2015.

[23] Thiergart, O., Del Galdo, G., Kuech, F., and Prus, M., "Three-Dimensional Sound Field Analysis with Directional Audio Coding Based on Signal Adaptive Parameter Estimators," in Audio Eng. Soc. Conv. Spatial Audio: Sense the Sound of Space, 2010.

[24] Kuttruff, H., Room Acoustics, Taylor & Francis, 4th edition, 2000.

[25] Borß, C., "A polygon-based panning method for 3D loudspeaker setups," in Audio Eng. Soc. Conv., pp. 343–352, Los Angeles, CA, USA, 2014.

[26] Rummukainen, O., Schlecht, S., Plinge, A., and Habets, E. A. P., "Evaluating Binaural Reproduction Systems from Behavioral Patterns in a Virtual Reality – A Case Study with Impaired Binaural Cues and Tracking Latency," in Audio Eng. Soc. Conv. 143, New York, NY, USA, 2017.

[27] Engelke, U., Darcy, D. P., Mulliken, G. H., Bosse, S., Martini, M. G., Arndt, S., Antons, J.-N., Chan, K. Y., Ramzan, N., and Brunnström, K., "Psychophysiology-Based QoE Assessment: A Survey," IEEE Selected Topics in Signal Processing, 11(1), pp. 6–21, 2017.

[28] Schlecht, S. J. and Habets, E. A. P., "Sign-agnostic Matrix Design for Spatial Artificial Reverberation with Feedback Delay Networks," in Audio Eng. Soc. Conf. on Spatial Reproduction, Tokyo, Japan, 2018.