The Role of Head Tracking in Binaural Rendering
P. Stitt1, E. Hendrickx2, J-C. Messonnier2, B. FG Katz1
1 LIMSI, CNRS, Orsay, France
2 Conservatoire National Supérieur de Musique et de Danse de Paris (CNSMdP), France
Binaural rendering can integrate, through the use of a head-tracker, the movements of the listener. This means the
rendering can be updated as a function of the listener’s head rotation and position, allowing for the virtual sound source
to be perceived as being fixed relative to the real world, as well as enhancing the externalisation of the sources. This
paper presents a summary of two recent experiments involving head-tracked binaural rendering. The first concerns
the influence of latency in the head-tracking system with regards to sound scene stability. The second examines the
influence of head-tracking on the perceived externalisation of the sound sources. A discussion on the advantages of
head-tracking with respect to realism in binaural rendering is provided.
1. Introduction
Humans interpret auditory scenes by analysing the pressure
signals reaching the ears. The spatial cues of each sound
source in the scene are encoded in these two signals, allowing
listeners to make sense of what is happening around them.
Each active sound source in the scene will emit a pressure
signal which radiates outwards until it eventually reaches
the listener. As the sound travels to the listener it will take
different paths to the two ears, diffracting round the head,
reflecting off the torso and resonating in the pinnae. The
differences in the acoustic paths provide a significant amount
of information for the listener. The main cues used for sound source localisation are interaural time difference (ITD), interaural level difference (ILD) and spectral cues [ ]. ITD and ILD are binaural cues used to evaluate lateralisation, while spectral cues are monaural and provide information about the elevation of the source. Front-back confusions can occur since ITD and ILD are not unique: the same ITD can occur on a contour of positions around the listener, known as a cone-of-confusion. Monaural spectral cues can sometimes be used to distinguish between front and back, but front-back confusions still occur in natural listening for anechoic sounds [ ]. Additional information can also be obtained by interacting with the sound scene, for example by performing head rotations, which can resolve front-back ambiguities.
The acoustic paths taken from the sound source to the two ears
can be represented as filters. These are known as head related
impulse responses (HRIRs) in the time domain and head related
transfer functions (HRTFs) in the frequency domain. For each
position of the sound source in space relative to the listener
there is a corresponding pair of filters (one for each ear). They
include all of the spatial information about the sound source at
their corresponding position.
Sound scenes can be recreated over headphones using binaural
processing. The principle is to recreate at the eardrums of
the listener the pressure signals corresponding to an intended
sound source or scene. This can be done for example by
recording, using a dummy head, or synthetically, using HRTFs
previously recorded in an anechoic chamber [ ]. If the sound
scene is recorded then the binaural and spectral cues encoded
will be that of the dummy head, which may not be appropriate
for all listeners. Furthermore, the scene can only be recorded
in a static manner, meaning head movements cannot later be
applied to the scene at playback. In synthesis, each individual
sound source in the scene is filtered using the pair of HRTFs
corresponding to the desired source position. The left and
right ear signals for each sound source are summed to generate
the final composite scene. The HRTFs used for creating a
synthetic sound scene should ideally be personalised to the
listener. Unfortunately, recording HRTFs requires specialist
equipment (appropriate microphones and an anechoic chamber) [ ]. Personalisation can be performed using the Boundary Element Method (BEM) to simulate the acoustic cues [ ], but this is limited by the acquisition of an appropriate scan of the listener. If individual HRTFs cannot be obtained then non-individual ones must be used instead, though these can cause problems such as an increased number of front-back or up-down confusions [ ]. This can be alleviated by training to the HRTF [ ] or by selecting the perceptually best-rated HRTF from a set [9].
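The synthesis step described above (filtering each source by its HRIR pair, then summing the ear signals) can be sketched in a few lines of Python. This is an illustrative sketch, not the renderer used in the paper: a direct-form convolution is used for clarity, and each HRIR pair is assumed to be a pair of equal-length lists of filter taps.

```python
def convolve(signal, kernel):
    """Direct-form FIR convolution; output length is len(signal) + len(kernel) - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def render_scene(sources):
    """Sum per-source binaural signals into one composite scene.

    `sources` is a list of (mono_signal, hrir_left, hrir_right) tuples,
    where each HRIR pair corresponds to the desired source position and
    both HRIRs of a pair have the same length.
    """
    length = max(len(sig) + len(hl) - 1 for sig, hl, hr in sources)
    left = [0.0] * length
    right = [0.0] * length
    for sig, hl, hr in sources:
        for channel, hrir in ((left, hl), (right, hr)):
            for i, v in enumerate(convolve(sig, hrir)):
                channel[i] += v
    return left, right
```

In practice the HRIR pair would be selected from a measured database for each source position, and the filtering would typically be done with FFT-based convolution for efficiency.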
If playback is from either a dummy head recording or a static
synthetic scene, then sources will remain fixed with respect
to the listener’s head - a source to the left will remain to the
left regardless of head orientation. This causes problems for
the realism of the sound scene because the sound source is
not fixed with respect to the external world but rather to the
head-centred coordinate system.
Head tracking can be used to correct for this and to increase
the realism of the scene. The position and orientation of the
listener’s head is tracked, for example, using optical camera
methods or gyroscopic sensors. The current position and
orientation of the head are used to determine the new relative
position of the source(s) to the listener’s head. The HRTFs used
for rendering are then updated and the position of the sources
appear to be fixed in space, independent of the motion of the
listener. Some head tracking systems, such as gyroscopic,
only give information on the relative orientation of the listener
ISBN 987-3-9812830-7-5
and so cannot account for parallax if the listener moves
translationally from their initial position. Other trackers, such
as optical, can be used for full interaction with and movement
around the sources, as long as one stays within the cameras’
field of view.
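The head-tracked update of the relative source direction can be illustrated with a minimal yaw-only sketch. This is a simplification (real trackers report full 3-DOF orientation, and possibly position), and the function names are hypothetical:

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees to the range (-180, 180]."""
    a = angle % 360.0
    return a - 360.0 if a > 180.0 else a

def relative_azimuth(source_world_az, head_yaw):
    """Azimuth of a world-fixed source in head-centred coordinates.

    With the source fixed in the world, turning the head by +yaw moves the
    source by -yaw relative to the head; the renderer then selects the HRTF
    pair corresponding to this relative angle.
    """
    return wrap_degrees(source_world_az - head_yaw)
```

For example, a source straight ahead in the world (0°) appears at -90° (to the left) after the listener turns 90° to the right, so the renderer swaps in the HRTFs for -90°.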
However, in practice, head tracking and binaural rendering
cannot be performed without some minimum level of system
latency, due to the time taken for the head tracker to transmit
the change in head position/orientation to the renderer, the
selection of the updated filters, and the convolution with the
sound source. If the latency is above a certain level then it can be perceived [ ]. Perceptually, latency can take the form of the sound source moving around the intended source position, meaning it is no longer fixed in absolute position in the external world nor independent of the listener. Previous studies on the perceived influence of latency on the rendered sound source have been evaluated for single-source sound scenes [ ].
The addition of reverberation to the sound source has been found to increase the perceived externalisation of the sound source [ ], and is related to the interaural cue fluctuations over the duration of the stimulus [ ]. Studies have also been performed to investigate the benefit of head tracking on the externalisation of sound sources [ ]. The results have been conflicting, with Begault et al. [ ] finding no benefit of head tracking on the externalisation rate, and Brimijoin et al. [ ] finding that head movements within a certain range can improve it. Brimijoin et al. [ ] found that sounds to the front and rear were less well externalised than those to the sides. Wenzel [ ], testing using individualised HRTFs, found that the addition of latency to the tracking system did not significantly degrade the externalisation rate, though it did reduce the localisation accuracy.
This paper summarises the results of two recent studies
focused on head movement in binaural rendering. The first
explores the influence of latency on the source stability with
respect to the external world of simple and complex sound
scenes. The second investigates the influence of head tracking
on the externalisation of sound sources with different head
movement/tracking combinations. Particular emphasis is on
sound source azimuth. These are followed by a discussion on the role of head tracking in the rendering of realistic sound scenes.
2. Head Movement on Source Stability
2.1. Overview
Previous studies on head tracker latency have tended to ask about the audibility of the latency using simple scenes consisting of only one sound source [ ]. In this study, source stability was used as the evaluation criterion because it can be linked to a physical property of the sound source, therefore giving it possible application to other variables of the binaural renderer. It also has the advantage of being a property, linked to the physical world, that can be understood by non-expert subjects.1
1 Portions of this study are presented in [17].
2.2. Method
Subjects performed an AB comparison as the main experiment
task. Subjects were presented with one of two scenes, followed
by a repeat of the same scene with a different latency applied to
the head tracker information. They were asked to pick which
of the two stimuli had the more stable sound sources relative to
the external world. A control condition (both scenes presented
without latency) was included. Subjects were presented two sound scenes, single and multi-source, consisting of anechoic recordings. The simple scene consisted of a 5 s sample of maracas at 0°. The complex scene added male and female talking voices, a piano at 110°, and a clarinet, thus distributing the sources in a 5.0 arrangement.
During the 5 s stimulus presentation for each scene, subjects were instructed to turn their head to 90°, either left or right, and then repeat to the other side, finishing at 0°. The initial direction was not prescribed, as long as it was the same for both scenes in a stimulus pair. Initial directions were unified by mirroring for the analysis of the results. The motion was to last the full duration of the stimulus. Subjects were given time to train to the stimulus duration, followed by an 8-trial training session of the main task. The simple and complex scenes were presented in blocks, since pre-testing found that the transition from a simple to a complex scene could cause confusion and disrupt concentration. The starting order of the scenes was alternated between subjects to ensure an equal distribution. Due to the physically repetitive task, subjects were given a short break at the halfway point. The experiment took approximately 40 min to complete, excluding the training session.
Before the main experiment, subjects performed an HRTF
selection task to determine how well the HRTFs could trace
two known trajectories - one horizontal and one in the median
plane. The HRTFs were from a set of 7 from the LISTEN database [ ] that have been shown to provide an adequate HRTF for most listeners [ ]. The perceptually best-rated HRTF was given to each of the subjects, unless their own measured HRTF was available. The ITD cues were personalised based on the circumference of the subject’s head.
The latency levels tested were 0, 25, 50, 75, 100, 150, 200,
and 250 ms. Each latency level was presented 12 times, 6 in
each half of the experiment. This gives a total of 192 trials
for the whole experiment (12 repetitions × 8 latency levels ×
2 scenes). 10 subjects took part in the experiment (mean age
32, standard deviation 9.2 years).
2.3. Results
2.3.1. Latency Detection
The psychometric curves of detection rate against additional
latency for both scenes are shown in Figure 1. As expected, the
control condition gives results at chance level. For the simple
scene the 70% latency threshold is approximately 50 ms, rising
to above 90% by 100 ms. There is a shift to larger latencies
for the complex scene, indicating a slightly lowered sensitivity
to source instability due to informational masking. The 70%
threshold is approximately 60 ms, while it reaches 90% only
at almost 150 ms of additional latency. There are also lower detection rates for very large latency values. The detection threshold, taken from the psychometric fits at 70% detection, increases by approximately 10 ms from the simple to the complex scene.
Fig. 1: The selection rates for the simple and complex sound scenes as a function of latency.
Fig. 2: Example of the mean azimuth over the stimulus duration for one subject. Dashed lines indicate the standard deviation.
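The 70% thresholds quoted above come from psychometric fits. A minimal way to extract such a threshold from measured detection rates is to linearly interpolate the first crossing of the criterion. The sketch below uses made-up illustrative data, not the paper's:

```python
def threshold_at(latencies_ms, detection_rates, criterion=0.70):
    """Linearly interpolate the latency at which the detection rate first
    crosses `criterion`. Returns None if it never does."""
    points = list(zip(latencies_ms, detection_rates))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 < criterion <= y1:
            # Linear interpolation between the two bracketing points.
            return x0 + (criterion - y0) * (x1 - x0) / (y1 - y0)
    return None

# Illustrative (fabricated) detection rates at the tested latency levels.
example = threshold_at([0, 25, 50, 75, 100], [0.50, 0.60, 0.70, 0.85, 0.95])
```

A full psychometric analysis would instead fit a sigmoid (e.g. logistic) to all points, but the interpolation above captures the idea of reading a criterion level off the curve.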
2.3.2. Head Movements
Since subjects were asked to perform the same head movement,
the head tracker data was stored and analysed to determine if
they were consistent with the instructions. Figure 2 shows the
mean and standard deviation over all trials for one listener as
an example of the consistency with which subjects performed
the task.
The average minimum and maximum azimuth angles (after mirroring to make the initial direction consistent) over all subjects were –86° and 98°, respectively, with standard deviations of 27° and 29°. This suggests a slight overshoot on the first turn, but that the instructions were generally well respected. Subjects were asked to make only yawing head motions during the scenes. Subjects were found to have stayed on average within 10° of zero pitch and roll angles, indicating no large motion in these dimensions.
Head movement speeds were analysed to examine differences between the two scene types and latency levels. A very weak correlation (r = 0.47) was found between the average subject head movement speed and their detection rates. An analysis of variance (ANOVA) found no difference in the distribution of mean head speeds at the different additional-latency levels (p = 0.53). However, a difference was found between the head speeds for the two scenes (p < 0.01). The mean speeds over all conditions were 91°/s and 97°/s for the simple and complex scenes, respectively. Table 1 shows the mean azimuthal speeds for both scenes and all latency levels. No significant change in speed was found with increasing numbers of trials, indicating consistency for the duration of the experiment.
Added latency (ms)   Simple   Complex
0                    91°/s    98°/s
25                   93°/s    99°/s
50                   91°/s    97°/s
75                   94°/s    98°/s
100                  92°/s    101°/s
150                  91°/s    96°/s
200                  90°/s    94°/s
250                  89°/s    96°/s
Tab. 1: The mean angular speed for each latency value tested for the simple and complex scenes.
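The correlation between head speed and detection rate reported above (r = 0.47) is a Pearson coefficient, which can be computed directly. The data in the test below are illustrative only, not the study's:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```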
3. Head Movement and externalisation
3.1. Overview
Head tracking has been shown to increase the externalisation of binaural sound scenes when using individualised HRTFs [ ]. That study asked listeners to make relatively small head movements. The study summarised here asked subjects to make the same head movement as in the source stability experiment, on the hypothesis that a large head movement can give rise to a strong perception of externalisation. Three different non-individual HRTFs were used in order to minimise any training to the HRTF, which can lead to improved externalisation over time [18].
3.2. Method
Ten expert subjects were asked to rate the externalisation of the sound image on a scale of 0 to 5, 0 being “at the center of the head” and 5 being “externalised and remote”. The subjects were sound engineers with experience listening to binaural content. The stimulus was a male vocal sample placed at 0°, recorded using the 6-microphone configuration shown in Figure 3. The microphones were cardioid, with the 0° microphone capturing the highest level of direct sound, while the others provide varying levels of direct-to-reverberant energy. The 6 microphone signals were placed at positions corresponding to the recording positions, using the binaural renderer. During the experiment, a number of azimuth conditions were tested. The scene was rotated so that the 0° microphone signal moved in steps of 30°, to determine the influence of the source direction on the perceived externalisation. When head tracking was active, head movements caused the positions of the 6 sources to rotate in the opposite direction, keeping the signals fixed with respect to the external world.
Fig. 3: The recording setup to capture the talker used as the stimulus.
There were 4 head movement/tracker conditions:
static, no head movement/no tracker (S0),
no head movement/head tracked (ST),
head movement/no tracker (M0),
head movement/head tracked (MT).
For the conditions requiring head movement, subjects were asked to turn to 90°, then to the opposite side, and back to 0° during a 5.5 s spoken phrase. After the head movement there was a 1 s pause in the stimulus, followed by 2.5 s more of speech. Subjects were instructed to keep their heads still at this point and to make their evaluation disregarding the initial 5.5 s phrase. Previous externalisation studies [ ] have asked for an evaluation of the extent of externalisation while the subject might be moving their head. While making head movements, frontal sources are likely to be laterally displaced relative to the centre of the subject’s head, and lateral sources are better externalised than those near the median plane. This presents a difficulty for subjects evaluating externalisation, which might vary for different head orientations. Hence, subjects were asked to evaluate externalisation only when their heads were still, at the end of the stimulus. This ensures that for all head movement conditions the same conditions are present at the point of evaluation.
3.3. Results
The distributions of ratings for each of the head tracker/movement conditions are shown in Figure 4. It is clear that condition MT has the highest rate of externalisation, followed by the still-head conditions S0 and ST. Condition M0 exhibits the least externalisation, due to the movement of the source with the subject’s head. There does not appear to be a large difference in the distribution of the ratings between S0 and ST. Deeper analysis of the head movements made by the listeners during these conditions suggests that they were often small enough as to be below the minimum audible movement angle (MAMA). The resolution of the binaural renderer might also have been a factor for these conditions.
Fig. 4: Comparison summary, over all subjects and HRTFs, of the percentage of trials each externalisation score was attributed for each condition.
Fig. 5: Mean externalisation rate obtained for each condition, mirrored azimuths combined, over all subjects and HRTFs.
Since a range of source azimuths was tested, it is useful to break the results down to determine whether some regions benefit more from head movements than others. Figure 5 shows the externalisation rate (proportion of 3–5 ratings) for the 4 movement/tracker conditions as a function of azimuth. No significant differences were found between the results to the left and right of the subjects, so the results have been collapsed to a single semicircle from 0° to 180°. It is clear that the addition of head movements and tracking is not beneficial at lateral positions, since the externalisation rates of MT are no greater than those of the still-head conditions at these positions. Condition M0 shows lower externalisation rates at all positions, but is still generally high for the lateral regions. The regions of greatest improvement due to head movements and tracking (MT) are those to the front and rear. For all targets except that at 0°, the externalisation rate is in line with that of the lateral sources, giving better and more uniform externalisation.
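The externalisation rate used in Figure 5 (the proportion of 3–5 ratings), with left/right azimuths mirrored onto a single semicircle before pooling, can be expressed compactly. The trial data in the example are hypothetical:

```python
def externalisation_rates(trials, threshold=3):
    """Proportion of ratings >= threshold per azimuth, with left/right
    azimuths mirrored onto 0..180 degrees before pooling.

    `trials` is an iterable of (azimuth_deg, rating) pairs, ratings 0-5.
    """
    counts = {}
    for az, rating in trials:
        key = abs(((az + 180) % 360) - 180)  # mirror onto 0..180
        hits, total = counts.get(key, (0, 0))
        counts[key] = (hits + (rating >= threshold), total + 1)
    return {az: hits / total for az, (hits, total) in counts.items()}

# Hypothetical trials: -30 and 30 pool into the same 30-degree bin.
rates = externalisation_rates([(30, 4), (-30, 2), (0, 5), (180, 1)])
```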
For the 0° source, the externalisation rates are at their lowest. The addition of head movement and tracking provides a significant increase in the externalisation rate compared to the other conditions. Even though the externalisation rate is below 50% (46.7%), it still highlights the advantage of head movements coupled with tracking. There was some improvement in the externalisation rate of condition MT with increasing numbers of repetitions, suggesting that the externalisation of frontal sources could perhaps be improved over time.
It should be noted that the subjects were asked to rate the externalisation only during a portion of the stimulus for which they were instructed to be forward-facing with a static head. The externalisation rates are, therefore, an after-effect of the head movement/tracking condition, since externalisation is not evaluated during the head movement. This is because externalisation varies with azimuth and, without some standardised response condition, it would be difficult to know whether subjects evaluated a frontal source during the point in the motion when it was lateral, or gave an average over the whole motion.
Note that the head movements made for the head-movement conditions followed the trend of the source stability experiment, with subjects making the required head movement during the initial phase and keeping their heads still during the final 2.5 s.
4. Discussion
Both studies presented here show how head tracking can in
some way impact on the realism of the sound scene. Stable
sound sources that appear exterior to the listener’s head are
important factors in creating a realistic scene.
Compared to a single sound source, the addition of more sound sources slightly reduces listener sensitivity to head tracker latency before sound sources are perceived as being unstable.
The study here used anechoic source material and an argument
could be made that reverberant sources provide additional
complexity in comparison to this. Lindau [ ] found that there was no significant difference in latency detection between anechoic and reverberant conditions, suggesting that reverberant sources are more like the simple scene than the
multi-source one. This is likely to be because the reflections in
a reverberant environment are in some way correlated with the
direct source, while the 5 sources used in this study all convey
uncorrelated information.
Assuming the latency of the head tracking system is low enough to avoid any source instability, then the externalisation
of frontal sources is one of the biggest challenges to realism.
Even with head movements and tracking, frontal sources are
less well externalised than even those to the rear or side.
This has implications for applications such as television and
cinema, in which a large proportion of the sounds (particularly
dialogue) are likely to be intended to originate from this
direction. Such a collapse in externalisation will reduce the
realism for the listener. It is possible that audio-visual media
can overcome this because the ventriloquism effect can create
the illusion of a sound emanating from a visible source. It
is unclear whether sounds can be drawn out of the head in a
similar manner.
Many head tracking applications use only orientation in-
formation. This is sufficient if the listener is required to
remain in their seat and will make only rotational movements.
More interactive virtual reality experiences will require full
translational information to be used alongside the orientation
information. This allows listeners to fully explore and interact
with the scene, something particularly important for game
environments. It is possible that more complex listener-source interactions involving translation could produce better externalisation for the more difficult frontal sources, since there are even more cues which place the source in the external world.
Beyond translational movements, the sound source could be
linked to a physical, tracked object, allowing the user to pick
up and interact with the sound source at another level. This
allows the user to gain direct feedback of the source position
from their (known) hand position. Using such a method has
been shown to allow users to adapt to non-individual HRTFs [ ]. This adds to the possibility of increasing the realism of
the scene by allowing faster adaptation to the HRTFs and more
accurate interaction with objects in the scene.
Another important factor, particularly for externalisation but
also for timbre, is the use of appropriate room reverberation.
This has been shown to improve externalisation with binaural rendering [ ]. A compromise must be made between computational efficiency and physical accuracy. For complete physical accuracy, either a real-time reverberation engine, which encodes the direction of the reflections, can be used, or the binaural room impulse responses (BRIRs) could be calculated or recorded for all possible positions and orientations the listener might inhabit. An alternative is to add some generic room reflections and diffuse reverberation, independent of the head tracking information. The first approach could potentially generate a more realistic scene for the listener but requires significant extra computational cost.
5. Conclusion
This paper presents a summary of two recent studies into perceptual properties of sound scenes reproduced using binaural rendering with head tracking. The first investigated source stability for
different head-tracker latencies with simple and complex sound
scenes. The source instability generated by increased latency
was found to be less audible for the complex scene than the
simple one. The second experiment investigated externalisation
for a variety of head movement and tracking conditions, using
a 6 microphone recording of a male talker in a reverberant
environment. It was found that lateral sources are generally
well externalised, whether or not the head is tracked, while
frontal and rear sources benefit from the addition of tracked
head movement.
6. References
[1] Jens Blauert. Spatial hearing: The psychophysics of human sound localization. MIT Press, Cambridge, MA.
[2] James C. Makous and John C. Middlebrooks. Two-dimensional sound localization by human listeners. The Journal of the Acoustical Society of America, 87(5):2188–2200, 1990.
[3] IRCAM LISTEN HRTF database. http://recherche.ircam.fr/equipes/salles/listen/, 2003. Last accessed: 14th March.
[4] Thibaut Carpentier, Hélène Bahu, Markus Noisternig, and Olivier Warusfel. Measurement of a head-related transfer function database with high spatial resolution. In Forum Acusticum, pages 1–6, Kraków, 2014.
[5] Brian F. G. Katz. Boundary element method calculation of individual head-related transfer function. I. Rigid model calculation. The Journal of the Acoustical Society of America, 110(5):2440, 2001. doi: 10.1121/1.1412440.
[6] Harald Ziegelwanger, Piotr Majdak, and Wolfgang Kreuzer. Numerical calculation of listener-specific head-related transfer functions and sound localization: Microphone model and mesh discretization. The Journal of the Acoustical Society of America, 138(1):208–222, 2015. doi: 10.1121/1.4922518.
[7] Elizabeth M. Wenzel, Marianne Arruda, Doris J. Kistler, and Frederic L. Wightman. Localization using nonindividualized head-related transfer functions. The Journal of the Acoustical Society of America, 94(1):111–123, 1993.
[8] Gaëtan Parseihian and Brian F. G. Katz. Rapid head-related transfer function adaptation using a virtual auditory environment. The Journal of the Acoustical Society of America, 131(4):2948–2957, 2012.
[9] Brian F. G. Katz and Gaëtan Parseihian. Perceptually based head-related transfer function database optimization. The Journal of the Acoustical Society of America, 131(2):EL99–EL105, 2012. doi: 10.1121/1.3672641.
[10] Douglas S. Brungart, Alexander J. Kordik, and Brian D. Simpson. Effects of headtracker latency in virtual audio displays. Journal of the Audio Engineering Society, 54(1-2):32–44, 2006.
[11] Alexander Lindau. The perception of system latency in dynamic binaural synthesis. In Fortschritte der Akustik: Tagungsband der 35. DAGA, pages 1063–1066, 2009.
[12] Satoshi Yairi, Yukio Iwaya, and Yôiti Suzuki. Investigation of system latency detection threshold of virtual auditory display. In Proceedings of the 12th International Conference on Auditory Display, pages 217–222, 2006.
[13] Elizabeth M. Wenzel. Effect of increasing system latency on localization of virtual sounds with short and long duration. In Proceedings of the 7th International Conference on Auditory Display (ICAD2001), pages 185–190, 2001.
[14] Durand R. Begault, Elizabeth M. Wenzel, and Mark R. Anderson. Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual speech source. Journal of the Audio Engineering Society, 49(10):904–916, 2001.
[15] Jasmina Catic, Sébastien Santurette, and Torsten Dau. The role of reverberation-related binaural cues in the externalization of speech. The Journal of the Acoustical Society of America, 138(2):1154–1167, 2015.
[16] W. Owen Brimijoin, Alan W. Boyd, and Michael A. Akeroyd. The contribution of head movement to the externalization and internalization of sounds. PLoS ONE, 8(12):1–12, 2013. doi: 10.1371/journal.pone.0083068.
[17] Peter Stitt, Etienne Hendrickx, Jean-Christophe Messonnier, and Brian F. G. Katz. The influence of head tracking latency on binaural rendering in simple and complex sound scenes. In Audio Engineering Society Convention 140, pages 1–8, Paris, France, 2016.
[18] Catarina Mendonça, Guilherme Campos, Paulo Dias, and Jorge A. Santos. Learning auditory space: Generalization and long-term effects. Frontiers in Neuroscience, 8(10):1–14, 2013.
... At the same time, quick and exact head tracking is also very important. Head Tracking stability rapidly declines, when binaural rendering delays increase over certain thresholds [10]. Stable sound sources are very important for an immersive and realistic VR scene. ...
... This shows up when moving the head very fast and the binaural rendering adapts with a slight delay. The latency of the head tracking system needs to be low enough to avoid any source instability, which we were not able to measure [10]. ...
Conference Paper
This paper analyzes the visual influence on auditive perception in binauralized 3D Audio signals in the context of virtual reality. Demo recordings were done using a coincident and a near-coincident 3D-miking technique while recording 360 video at the same time. Information on the acceptance was gathered by the means of a comparative listening test, showing an overall improvement in the perceived quality when being confronted with matching visual input simultaneously.
... Although issues can arise from nonindividualized HRTFs (e.g. poor externalization) these issues can be offset if head-tracking is employed [10]. The BBC Audio R&D team undertook a custom build of the Chromium browser to implement a HRTF set of their choosing [6] though as previously stated, this is likely a prohibitive step for most practitioners. ...
This paper examines the current eco-system of tools for implementing dynamic 3D audio through the browser, from the perspective of spatial sound practitioners. It presents a survey of some existing tools to assess usefulness, and ease of use. This takes the forms of case studies, interviews with other practitioners, and initial testing comparisons between the authors. The survey classifies and summarizes their relative advantages, disadvantages and potential use cases. It charts the specialist knowledge needed to employ them or enable others to.The recent and necessary move to online exhibition of works, has seen many creative practitioners grapple with a disparate eco-system of software. Such technologies are diverse in their both their motivations and applications. From formats which overcome the limits of WebGL’s lack of support for Ambisonics, to the creative deployment of Web Audio API (WAA), to third-party tools based on WAA, the field can seem prohibitively daunting for practitioners. The current range of possible acoustic results may be too unclear to justify the learning curve.Through this evaluation of the current available tools, we hope to demystify and make accessible these novel technologies to composers, musicians, artists and other learners, who might otherwise be dissuaded from engaging with this rich territory. This paper is based on a special session at Soundstack 2021.
... Finally, when it comes to dynamic listening situations (involving listener or source movements), the MAAs further increase [83]. In order to achieve sufficient spatial resolution when applying HRTFs in dynamic listening scenarios, the movement of the listener has to be monitored in addition to the modelling of sound source movement [84][85][86]. The minimum number of directions and the specific measurement points for a sufficiently sparse HRTF set are still current topics of research [87]. ...
Head-related transfer functions (HRTFs) describe the spatial filtering of acoustic signals by a listener’s anatomy. With the increase of computational power, HRTFs are nowadays more and more used for the spatialised headphone playback of 3D sounds, thus enabling personalised binaural audio playback. HRTFs are traditionally measured acoustically and various measurement systems have been set up worldwide. Despite the trend to develop more user-friendly systems and as an alternative to the most expensive and rather elaborate measurements, HRTFs can also be numerically calculated, provided an accurate representation of the 3D geometry of head and ears exists. While under optimal conditions, it is possible to generate said 3D geometries even from 2D photos of a listener, the geometry acquisition is still a subject of research. In this chapter, we review the requirements and state-of-the-art methods for obtaining personalised HRTFs, focusing on the recent advances in numerical HRTF calculation.
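The sparse-grid issue raised above is typically handled at runtime by selecting (or interpolating) the measured HRTF direction nearest to the requested one. A minimal nearest-neighbour sketch, using the spherical law of cosines for angular distance — the function names are ours, not from any of the cited works:

```python
import math

def angular_distance(az1, el1, az2, el2):
    """Great-circle angle (degrees) between two directions given as
    azimuth/elevation pairs in degrees."""
    a1, e1, a2, e2 = map(math.radians, (az1, el1, az2, el2))
    # Spherical law of cosines; clamp to guard against rounding errors.
    cos_d = (math.sin(e1) * math.sin(e2)
             + math.cos(e1) * math.cos(e2) * math.cos(a1 - a2))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_d))))

def nearest_hrtf_direction(target, measured):
    """Return the measured (az, el) pair closest to the target direction."""
    return min(measured, key=lambda d: angular_distance(*target, *d))
```

For example, with a target of (10°, 0°) and a sparse set {(0°, 0°), (30°, 0°), (0°, 30°)}, the frontal measurement (0°, 0°) is selected. Real systems usually go further and interpolate between the nearest measured filters rather than switching between them.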
... In order to account for the real-time influence of the listener's pose, head tracking systems can be utilized [4]. It has been shown that this method improves sound source realism and externalization, and reduces localization confusion, when low-latency and stable sensors are used [5][6][7]. ...
Binaural rendering is a technique that seeks to generate virtual auditory environments that replicate the natural listening experience, including the three-dimensional perception of spatialized sound sources. As such, real-time knowledge of the listener's position, or more specifically, their head and ear orientations, allows the transfer of movement from the real world to virtual spaces, which consequently enables a richer immersion and interaction with the virtual scene. This study presents the use of a simple laptop-integrated camera (webcam) as a head-tracking sensor, removing the need to mount any hardware on the listener's head. The software was built on top of a state-of-the-art face landmark detection model from Google's MediaPipe library for Python. Manipulations of the coordinate system are performed in order to translate the origin from the camera to the center of the subject's head and adequately extract rotation matrices and Euler angles. Low-latency communication is enabled via User Datagram Protocol (UDP), allowing the head tracker to run in parallel and asynchronously with the main application. Empirical experiments have demonstrated reasonable accuracy and quick response, indicating suitability for real-time applications that do not necessarily require methodical precision. Furthermore, cross-validation with existing hardware head trackers revealed an adequate agreement on measured head orientation, confirming its potential as a contactless head tracking device.
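The pose-extraction and UDP steps described in this abstract can be sketched as follows. The function names are hypothetical, and a ZYX (yaw-pitch-roll) convention is assumed for the 3×3 row-major rotation matrix, which the abstract does not specify:

```python
import json
import math
import socket

def euler_from_rotation(R):
    """Extract (yaw, pitch, roll) in degrees from a 3x3 rotation matrix
    given as row-major nested lists, assuming a ZYX convention."""
    sp = -R[2][0]                      # sin(pitch) for R = Rz*Ry*Rx
    pitch = math.asin(max(-1.0, min(1.0, sp)))
    if abs(sp) < 0.9999:
        yaw = math.atan2(R[1][0], R[0][0])
        roll = math.atan2(R[2][1], R[2][2])
    else:
        # Gimbal lock: yaw and roll are coupled; fold their sum/difference
        # into yaw and report roll as zero.
        yaw = math.atan2(-R[0][1], R[1][1])
        roll = 0.0
    return tuple(math.degrees(a) for a in (yaw, pitch, roll))

def send_pose(angles, host="127.0.0.1", port=9000):
    """Fire-and-forget UDP datagram carrying the current head orientation."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(json.dumps({"yaw": angles[0], "pitch": angles[1],
                            "roll": angles[2]}).encode(), (host, port))
    sock.close()
```

UDP fits this use case because a stale orientation packet is worthless once a newer one exists, so retransmission (as TCP would provide) only adds latency.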
... Humans use head movements to localize sound sources with a higher accuracy (Blauert, 1997). It has previously been shown that providing the listener with the option of interactive head rotation in a binaural reproduction improves externalization (Brimijoin et al., 2013; Stitt et al., 2016; Hendrickx et al., 2017), reduces front-back confusions (Begault and Wenzel, 2001) and supports localization accuracy (McAnally and Martin, 2014; Mackensen, 2004). Furthermore, when listeners evaluated the timbre of a source, the range of head motion was relatively small compared to the movements made when evaluating listener envelopment and source width (Kim et al., 2007). ...
It is pointed out that beyond reproducing the physically correct sound pressure at the eardrums, more effects play a significant role in the quality of the auditory illusion. In some cases, these can dominate perception and even overcome physical deviations. Perceptual effects like the room-divergence effect, additional visual influences, personalization, pose and position tracking as well as adaptation processes are discussed. These effects are described individually, and the interconnections between them are highlighted. With the results from experiments performed by the authors, the perceptual effects can be quantified. Furthermore, concepts are proposed to optimize reproduction systems with regard to those effects. One example could be a system that adapts to varying listening situations as well as individual listening habits, experience and preference.
... In the future, we plan to enhance the measurement system further. The external hardware can, for example, be extended to determine local and remote haptic delays in VR systems, as well as binaural audio latencies as described by Stitt et al. [SHMK16]. ...
Distributed Virtual Reality systems enable globally dispersed users to interact with each other in a shared virtual environment. In such systems, different types of latencies occur. For a good VR experience, they need to be controlled. The time delay between the user's head motion and the corresponding display output of the VR system might lead to adverse effects such as a reduced sense of presence or motion sickness. Additionally, high network latency among worldwide locations makes collaboration between users more difficult and leads to misunderstandings. To evaluate the performance and optimize dispersed VR solutions it is therefore important to measure those delays. In this work, a novel, easy to set up, and inexpensive method to measure local and remote system latency will be described. The measuring setup consists of a microcontroller, a microphone, a piezo buzzer, a photosensor, and a potentiometer. With these components, it is possible to measure motion-to-photon and mouth-to-ear latency of various VR systems. By using GPS-receivers for timecode-synchronization it is also possible to obtain the end-to-end delays between different worldwide locations. The described system was used to measure local and remote latencies of two HMD based distributed VR systems.
... The six-channel equal-segment microphone array was also selected because equal segmentation of the sound field enables continuous and homogeneous sound field capture in the horizontal plane (Williams, 1991), and because informal comparative studies of several microphone arrays with ten subjects suggested that this configuration provided the most natural audio scene when binauralized. The four tracking conditions were not referred to using the condition labels SØ, ST, MØ, and MT in Brimijoin et al. (2013); these labels are proposed by the authors of the present study in order to simplify the presentation of subsequent results. Some early analysis of preliminary results of this study has been previously presented (Stitt et al., 2016b). ...
Binaural reproduction aims at recreating a realistic audio scene at the ears of the listener using headphones. In the real acoustic world, sound sources tend to be externalized (that is perceived to be emanating from a source out in the world) rather than internalized (that is perceived to be emanating from inside the head). Unfortunately, several studies report a collapse of externalization, especially with frontal and rear virtual sources, when listening to binaural content using non-individualized Head-Related Transfer Functions (HRTFs). The present study examines whether or not head movements coupled with a head tracking device can compensate for this collapse. For each presentation, a speech stimulus was presented over headphones at different azimuths, using several intermixed sets of non-individualized HRTFs for the binaural rendering. The head tracker could either be active or inactive, and the subjects could either be asked to rotate their heads or to keep them as stationary as possible. After each presentation, subjects reported to what extent the stimulus had been externalized. In contrast to several previous studies, results showed that head movements can substantially enhance externalization, especially for frontal and rear sources, and that externalization can persist once the subject has stopped moving his/her head.
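The head-tracked compensation underlying this experiment reduces, for horizontal-plane rotations, to re-expressing the world-fixed source azimuth in head-relative coordinates before HRTF selection, so that the source stays put when the head turns. A minimal sketch (function names are hypothetical):

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def head_relative_azimuth(source_az_world, head_yaw):
    """Azimuth at which to render a world-fixed source, given the current
    tracked head yaw (both in degrees, same sign convention)."""
    return wrap_degrees(source_az_world - head_yaw)
```

For example, a source fixed straight ahead in the world (0°) must be rendered at -30° after the head turns 30° toward it being "passed"; with the tracker inactive, the same source would stay head-locked at 0° regardless of movement, which is exactly the condition the study associates with reduced externalization.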
In line with other work from the binaural sound community, we believe that spatialised sound can be an effective tool for guiding blind people, including in sports practice. A system intended for guidance by binaural audio must be sufficiently accurate and responsive to take each movement of the guided subject into account, which required the development and implementation of a real-time localisation system and low-latency binaural spatialisation software; finally, we integrated the whole into an embedded device. The most advanced global navigation satellite techniques (augmented and multi-band) are not always available indoors or in urban environments, so we worked on alternative methods for real-time indoor localisation and tracking. First, we developed a robust real-time calibration and lateration method for a network of Ultra-WideBand beacons using a Kalman filter (UKF). We also developed an original localisation method based on a network of unmodulated continuous-wave Doppler radars, showing that the amplitude of the Doppler signal can be used to estimate the distance to a moving object. We then implemented a particle filter that enables real-time localisation by fusing the distance data, Doppler radial-velocity measurements, and the heading provided by an inertial measurement unit. In the field of acoustics and binaural audio, we sought to better understand people's ability to localise and follow a moving sound, conducting experiments using both natural sounds and binaurally spatialised sounds.
We were able to show that, in the azimuthal plane, spatialised audio stimuli allowed localisation comparable to natural sounds, including with non-individualised and interpolated HRTFs (head-related transfer functions). Furthermore, we showed that, even in the azimuthal plane, stimuli obtained by HRTF convolution were superior to panning (ITD+ILD) for both static and moving sounds. Building on the team's previous work, we implemented efficient algorithms for real-time sound spatialisation on resource-constrained embedded platforms. An efficient real-time implementation required a thorough understanding of the sources of latency, whether related to head tracking or to the audio subsystem of modern operating systems. Finally, we applied these localisation methods and audio techniques to build a guidance device in which the sound source continuously precedes the person to indicate the path to follow. It was designed together with visually impaired people, through an iterative, user-needs-centred approach. We then conducted guidance experiments with blind people, in collaboration with our partner associations, to evaluate different control strategies. We were thus able to confirm that spatialised sound can be an effective tool for guiding blind people without imposing a penalising cognitive load, for sports activities such as walking, running or roller-skating in partial autonomy, including in a performance-oriented context.
Head-related transfer functions (HRTFs) can be numerically calculated by applying the boundary element method on the geometry of a listener's head and pinnae. The calculation results are defined by geometrical, numerical, and acoustical parameters like the microphone used in acoustic measurements. The scope of this study was to estimate requirements on the size and position of the microphone model and on the discretization of the boundary geometry as triangular polygon mesh for accurate sound localization. The evaluation involved the analysis of localization errors predicted by a sagittal-plane localization model, the comparison of equivalent head radii estimated by a time-of-arrival model, and the analysis of actual localization errors obtained in a sound-localization experiment. While the average edge length (AEL) of the mesh had a negligible effect on localization performance in the lateral dimension, the localization performance in sagittal planes, however, degraded for larger AELs with the geometrical error as dominant factor. A microphone position at an arbitrary position at the entrance of the ear canal, a microphone size of 1 mm radius, and a mesh with 1 mm AEL yielded a localization performance similar to or better than observed with acoustically measured HRTFs.
When stimuli are presented over headphones, they are typically perceived as internalized; i.e., they appear to emanate from inside the head. Sounds presented in the free-field tend to be externalized, i.e., perceived to be emanating from a source in the world. This phenomenon is frequently attributed to reverberation and to the spectral characteristics of the sounds: those sounds whose spectrum and reverberation matches that of free-field signals arriving at the ear canal tend to be more frequently externalized. Another factor, however, is that the virtual location of signals presented over headphones moves in perfect concert with any movements of the head, whereas the location of free-field signals moves in opposition to head movements. The effects of head movement have not been systematically disentangled from reverberation and/or spectral cues, so we measured the degree to which movements contribute to externalization. We performed two experiments: (1) using motion tracking and free-field loudspeaker presentation, we presented signals that moved in their spatial location to match listeners' head movements; (2) using motion tracking and binaural room impulse responses, we presented filtered signals over headphones that appeared to remain static relative to the world. The results from experiment 1 showed that free-field signals from the front that move with the head are less likely to be externalized (23%) than those that remain fixed (63%). Experiment 2 showed that virtual signals whose position was fixed relative to the world are more likely to be externalized (65%) than those fixed relative to the head (20%), regardless of the fidelity of the individual impulse responses. Head movements play a significant role in the externalization of sound sources. These findings imply tight integration between binaural cues and self motion cues and underscore the importance of self motion for spatial auditory perception.
Previous findings have shown that humans can learn to localize with altered auditory space cues. Here we analyze such learning processes and their effects up to one month on both localization accuracy and sound externalization. Subjects were trained and retested, focusing on the effects of stimulus type in learning, stimulus type in localization, stimulus position, previous experience, externalization levels, and time. We trained listeners in azimuth and elevation discrimination in two experiments. Half participated in the azimuth experiment first and half in the elevation first. In each experiment, half were trained in speech sounds and half in white noise. Retests were performed at several time intervals: just after training and one hour, one day, one week and one month later. In a control condition, we tested the effect of systematic retesting over time with post-tests only after training and either one day, one week, or one month later. With training all participants lowered their localization errors. This benefit was still present one month after training. Participants were more accurate in the second training phase, revealing an effect of previous experience on a different task. Training with white noise led to better results than training with speech sounds. Moreover, the training benefit generalized to untrained stimulus-position pairs. Throughout the post-tests externalization levels increased. In the control condition the long-term localization improvement was not lower without additional contact with the trained sounds, but externalization levels were lower. Our findings suggest that humans adapt easily to altered auditory space cues and that such adaptation spreads to untrained positions and sound types. We propose that such learning depends on all available cues, but each cue type might be learned and retrieved differently. The process of localization learning is global, not limited to stimulus-position pairs, and it differs from externalization processes.
It is important in a virtual auditory display (VAD) system to reproduce not only static sound information, but also dynamic variation of sound. Thus, to achieve a highly precise virtual auditory display system, the system should be responsive to a listener's head movement. However, system latency (SL), in which the listener's head movement is reflected in the sound, certainly exists. If SL is detectable to the listener, it results in incongruousness. Consequently, the detection threshold (DT) of SL must be well investigated and SL should be sufficiently smaller than it. However, there have been relatively few studies on the DT of SL. Moreover, as inter-subject differences have been reported, it is necessary to examine DT in more detail. In this study, the DT and difference limen (DL) were investigated using two kinds of experiments and compared. As a result, averaged DT and DL over listeners were estimated to be 94 ms and 70 ms, respectively. Moreover, a strong correlation between the DT and DL (r=0.81 (p < .01)) was observed. This may mean that DL can be regarded as DT when the minimum system latency of the system is sufficiently small. Therefore, by taking the average of our results and previous studies, DT of SL was estimated as being around 75 ms.
In a virtual acoustic environment, the total system latency (TSL) refers to the time elapsed from the transduction of an event or action, such as movement of the head, until the consequences of that action cause the equivalent change in the virtual sound source. This paper reports on the impact of increasing TSL on localization accuracy when head motion is enabled. A previous study [1] investigated long duration stimuli of 8 s to provide subjects with substantial opportunity for exploratory head movements. Those data indicated that localization was generally accurate, even with a latency as great as 500 ms. In contrast, Sandvad [2] has observed deleterious effects on localization with latencies as small as 96 ms when using stimuli of shorter duration (~1.5 to 2.5 s). In an effort to investigate stimuli more comparable to Sandvad [2], the present study repeated the experimental conditions of [1] but with a stimulus duration of 3 s. Five subjects estimated the location of 12 virtual sound sources (individualized head-related transfer functions) with latencies of 33.8, 100.4, 250.4 or 500.3 ms in an absolute judgement paradigm. Subjects also rated the perceived latency on each trial. Comparison of the data for the 3- and 8-s duration stimuli indicates that localization accuracy as a function of latency is moderately affected by the overall duration of the sound. For example, for the 8-s stimuli, front-back confusions were minimal and increased only slightly with increasing latency. For the 3-s stimuli, the increase in front-back confusions with latency was more pronounced, particularly for the longest latency tested (500 ms). Mean latency ratings indicated that latency had to be at least 250 ms to be readily perceived. The fact that accuracy was generally comparable for the shortest and longest latencies suggests that listeners are able to ignore latency during active localization, even though delays of this magnitude produce an obvious spatial "slewing" of the source such that it is no longer stabilized in space. There is some suggestion that listeners are less able to compensate for latency with the short duration stimuli, although the effect is not as pronounced as in [2].
Head tracking has been shown to improve the quality of multiple aspects of binaural rendering for single sound sources, such as reduced front-back confusions. This paper presents the results of an AB experiment to investigate the influence of tracker latency on the perceived stability of virtual sounds. The stimuli used are a single frontal sound source and a complex (5 source) sound scene. A comparison is performed between the results for the simple and complex sound scenes and the head motions of the subjects for various latencies. The perceptibility threshold was found to be 10 ms higher for the complex scene compared to the simple one. The subject head movement speeds were found to be 6 degrees/s faster for the complex scene.
A critical parameter for the design of interactive virtual audio displays is the maximum acceptable amount of delay between the movement of the listener's head and the corresponding change in the spatialized signal presented to the listener's ears. Two studies that used a low-latency virtual audio display to evaluate the effects of headtracker latency on auditory localization are presented. The first study examined the effects of headtracker delay on the localization on broad-band sounds. The results show that latency values in excess of 73 ms result in increased localization errors for brief sounds and increased localization response times for continuous sound sources. The second study measured how well listeners could detect the presence of headtracker latency in a virtual sound. The results show that the best listeners can detect latency values of 60-70 ms for isolated sounds, and that their detection thresholds are 25 ms lower for sounds presented in conjunction with a low-latency reference tone. These results suggest that headtracker latency values lower than 60 ms are likely to be adequate for most virtual audio applications, and that delays of less than 30 ms are difficult to detect even in very demanding virtual auditory environments.
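The thresholds reported in this study (detection at roughly 60-70 ms, with delays under 30 ms difficult to detect even in demanding environments) suggest a simple sanity check when budgeting the end-to-end latency of a tracked rendering chain. The component names and values below are illustrative only, not figures from the study:

```python
# Thresholds taken from the reported results (milliseconds):
DETECTABLE_MS = 60.0  # best listeners detect latency around this value
SAFE_MS = 30.0        # delays below this are hard to detect

def latency_budget_ok(components_ms, threshold_ms=SAFE_MS):
    """True if the summed latency of all chain components stays below
    the chosen threshold. components_ms maps component name -> latency,
    e.g. {"tracker": 8, "processing": 12, "audio_io": 6}."""
    return sum(components_ms.values()) < threshold_ms
```

Such a check is only a first approximation: the studies summarized here show that detectability also depends on stimulus duration, scene complexity, and the listener's own movement speed, so a fixed scalar threshold should be treated as a guideline rather than a guarantee.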
The perception of externalization of speech sounds was investigated with respect to the monaural and binaural cues available at the listeners' ears in a reverberant environment. Individualized binaural room impulse responses (BRIRs) were used to simulate externalized sound sources via headphones. The measured BRIRs were subsequently modified such that the proportion of the response containing binaural vs monaural information was varied. Normal-hearing listeners were presented with speech sounds convolved with such modified BRIRs. Monaural reverberation cues were found to be sufficient for the externalization of a lateral sound source. In contrast, for a frontal source, an increased amount of binaural cues from reflections was required in order to obtain well externalized sound images. It was demonstrated that the interaction between the interaural cues of the direct sound and the reverberation strongly affects the perception of externalization. An analysis of the short-term binaural cues showed that the amount of fluctuations of the binaural cues corresponded well to the externalization ratings obtained in the listening tests. The results further suggested that the precedence effect is involved in the auditory processing of the dynamic binaural cues that are utilized for externalization perception.