The Role of Head Tracking in Binaural Rendering


Binaural rendering can integrate, through the use of a head-tracker, the movements of the listener. This means the rendering can be updated as a function of the listener’s head rotation and position, allowing the virtual sound source to be perceived as fixed relative to the real world, as well as enhancing the externalisation of the sources. This paper presents a summary of two recent experiments involving head-tracked binaural rendering. The first concerns the influence of latency in the head-tracking system with regard to sound scene stability. The second examines the influence of head-tracking on the perceived externalisation of the sound sources. A discussion on the advantages of head-tracking with respect to realism in binaural rendering is provided.
1. Introduction
Humans interpret auditory scenes by analysing the pressure
signals reaching the ears. The spatial cues of each sound
source in the scene are encoded in these two signals, allowing
listeners to make sense of what is happening around them.
Each active sound source in the scene will emit a pressure
signal which radiates outwards until it eventually reaches
the listener. As the sound travels to the listener it will take
different paths to the two ears, diffracting round the head,
reflecting off the torso and resonating in the pinnae. The
differences in the acoustic paths provide a significant amount
of information for the listener. The main cues used for
sound source localisation are interaural time difference (ITD),
interaural level difference (ILD) and spectral cues [
]. ITD and
ILD are binaural cues and are used to evaluate lateralisation,
while spectral cues are monaural and provide information about
the elevation of the source. Front-back confusions can occur
since the ITD and ILDs are not unique - the same ITD can
occur on a contour of positions around the listener, known as
cones-of-confusion. Monaural spectral cues can sometimes
be used to distinguish between front and back, but front-back
confusions still occur in natural listening for anechoic sounds
]. Additional information can also be obtained by interaction
with the sound scene, such as by performing head rotations,
meaning front-back ambiguities can be resolved.
The acoustic paths taken from the sound source to the two ears
can be represented as filters. These are known as head related
impulse responses (HRIRs) in the time domain and head related
transfer functions (HRTFs) in the frequency domain. For each
position of the sound source in space relative to the listener
there is a corresponding pair of filters (one for each ear). They
include all of the spatial information about the sound source at
their corresponding position.
Sound scenes can be recreated over headphones using binaural
processing. The principle is to recreate at the eardrums of
the listener the pressure signals corresponding to an intended
sound source or scene. This can be done for example by
recording, using a dummy head, or synthetically, using HRTFs
previously recorded in an anechoic chamber [
]. If the sound
scene is recorded then the binaural and spectral cues encoded
will be that of the dummy head, which may not be appropriate
for all listeners. Furthermore, the scene can only be recorded
in a static manner, meaning head movements cannot later be
applied to the scene at playback. In synthesis, each individual
sound source in the scene is filtered using the pair of HRTFs
corresponding to the desired source position. The left and
right ear signals for each sound source are summed to generate
the final composite scene. The HRTFs used for creating a
synthetic sound scene should ideally be personalised to the
listener. Unfortunately, recording HRTFs requires specialist
equipment (appropriate microphones and an anechoic chamber)
]. Personalisation can be performed using the Boundary
Element Method (BEM) to simulate the acoustic cues [
but this is limited by the acquisition of an appropriate scan
of the listener. If individual HRTFs cannot be obtained then
non-individual ones must be used instead, though these can
cause problems such as increased numbers of front-back or
up-down confusions [
]. This can be alleviated by training to
the HRTF [
] or selecting the perceptually best-rated HRTF
from a set [9].
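The synthesis step described above (filter each source with the HRIR pair for its position, then sum per ear) can be sketched as follows. This is a minimal illustration, not the renderer used in the studies: the function names are hypothetical, the HRIRs are toy values rather than measured responses, and a practical renderer would use fast (FFT-based) convolution.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_scene(sources):
    """Mix a static binaural scene.

    sources: list of (mono_signal, hrir_left, hrir_right) tuples, one per
    sound source, where the HRIR pair matches the desired source position.
    Returns the (left, right) ear signals.
    """
    length = max(len(sig) + max(len(hl), len(hr)) - 1 for sig, hl, hr in sources)
    left = [0.0] * length
    right = [0.0] * length
    for sig, hl, hr in sources:
        for n, v in enumerate(convolve(sig, hl)):
            left[n] += v
        for n, v in enumerate(convolve(sig, hr)):
            right[n] += v
    return left, right


# Toy (hypothetical) HRIR pair for a source on the left: the right ear
# receives the click one sample later and at half the level.
click = [1.0, 0.0]
left, right = render_scene([(click, [1.0], [0.0, 0.5])])
```

The one-sample delay and the level drop at the far ear stand in for the ITD and ILD cues that real HRIRs encode.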
If playback is from either a dummy head recording or a static
synthetic scene, then sources will remain fixed with respect
to the listener’s head - a source to the left will remain to the
left regardless of head orientation. This causes problems for
the realism of the sound scene because the sound source is
not fixed with respect to the external world but rather to the
head-centred coordinate system.
Head tracking can be used to correct for this and to increase
the realism of the scene. The position and orientation of the
listener’s head is tracked, for example, using optical camera
methods or gyroscopic sensors. The current position and
orientation of the head are used to determine the new relative
position of the source(s) to the listener’s head. The HRTFs used
for rendering are then updated and the position of the sources
appear to be fixed in space, independent of the motion of the
listener. Some head tracking systems, such as gyroscopic,
only give information on the relative orientation of the listener
and so cannot account for parallax if the listener moves
translationally from their initial position. Other trackers, such
as optical, can be used for full interaction with and movement
around the sources, as long as one stays within the cameras’
field of view.
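The update described above amounts to recomputing each source direction in head-centred coordinates before selecting the HRTFs. A minimal horizontal-plane sketch is given below; the coordinate convention (x forward, y to the left, azimuth positive towards the left) and the function names are assumptions made for illustration, not details taken from the systems discussed.

```python
import math


def wrap_degrees(angle):
    """Wrap an angle in degrees to the interval [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_azimuth(source_xy, head_xy, head_yaw_deg):
    """Azimuth of a world-fixed source as seen from the tracked head.

    source_xy, head_xy: (x, y) positions in metres in world coordinates.
    head_yaw_deg: head orientation, positive when turned to the left.
    """
    dx = source_xy[0] - head_xy[0]
    dy = source_xy[1] - head_xy[1]
    world_azimuth = math.degrees(math.atan2(dy, dx))
    return wrap_degrees(world_azimuth - head_yaw_deg)


# Orientation-only tracking: turning the head 90 degrees to the left moves
# a frontal source to the listener's right.
rotated = relative_azimuth((1.0, 0.0), (0.0, 0.0), 90.0)

# Position tracking adds parallax: stepping 1 m forward and 1 m to the
# right puts the same source directly to the listener's left.
translated = relative_azimuth((1.0, 0.0), (1.0, -1.0), 0.0)
```

A tracker that reports orientation only corresponds to holding `head_xy` fixed, which is why it cannot reproduce the parallax case.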
However, in practice, head tracking and binaural rendering
cannot be performed without some minimum level of system
latency, due to the time taken for the head tracker to transmit
the change in head position/orientation to the renderer, the
selection of the updated filters, and the convolution with the
sound source. If the latency is above a certain level then it can
be perceived [
]. Perceptually, latency can take the
form of the sound source moving around the intended source
position, meaning it is no longer fixed in absolute position in
the external world nor independent of the listener. Previous
studies on the perceived influence of latency on the rendered
sound source have been evaluated for single source sound
The addition of reverberation to the sound source has been
found to increase the perceived externalisation of the sound
source [
], and is related to the interaural cue fluctuations
over the duration of the stimulus [
]. Studies have also
been performed to investigate the benefit of head tracking
on the externalisation of the sound sources [
]. The
results have been conflicting, with Begault et al. [
] finding
no benefit of head tracking on the externalisation rate, and
Brimijoin et al. [
] finding that head movements in the range
can improve it. Brimijoin et al. [
] found that sounds
to the front and rear were less well externalised than those to
the sides. Wenzel [
], testing using individualised HRTFs,
found that the addition of latency to the tracking system did
not significantly degrade the externalisation rate, though it did
reduce the localization accuracy.
This paper summarises the results of two recent studies
focused on head movement in binaural rendering. The first
explores the influence of latency on the source stability with
respect to the external world of simple and complex sound
scenes. The second investigates the influence of head tracking
on the externalisation of sound sources with different head
movement/tracking combinations. Particular emphasis is on
sound source azimuth. These are followed by a discussion on
the role of head tracking in the rendering of realistic sound
2. Head Movement on Source Stability
2.1. Overview
Previous studies on head tracker latency have tended to
ask about the audibility of the latency using simple scenes
consisting of only one sound source [
]. In this
study, source stability was used as the evaluation criterion
because it can be linked to a physical property of the
sound source, therefore giving it possible application to other
variables of the binaural renderer. It also has the advantage
of being a property linked to the physical world that can be
understood by non-expert subjects.
2.2. Method
Subjects performed an AB comparison as the main experiment
task. Subjects were presented with one of two scenes, followed
by a repeat of the same scene with a different latency applied to
the head tracker information. They were asked to pick which
of the two stimuli had the more stable sound sources relative to
the external world. A control condition (both scenes presented
without latency) was included. Subjects were presented with
two sound scenes - single and multi-source - consisting of
anechoic recordings. The simple scene consisted of a 5 s
sample of maracas at 0°. The complex scene added male and
female talking voices at
, piano at 110°, and clarinet
, thus distributing the sources in a 5.0 arrangement.
During the 5 s stimulus presentation for each scene, subjects
were instructed to turn their head to 90° either left or right
and then repeat to the other side, finishing at 0°. The initial
direction was not prescribed, as long as it was the same for
both scenes in a stimulus-pair. Initial direction was unified
by mirroring for analysis of the results. The motion was to
last the full duration of the stimulus. Subjects were given
time to train to the stimulus duration, followed by an 8 trial
training session of the main task. The simple and complex
scenes were presented in blocks since pre-testing found that
the transition from a simple to complex scene could cause
confusion and disrupt concentration. The starting order of
the scenes was alternated between subjects to ensure an equal
distribution. Due to the physically repetitive task, subjects
were given a short break at the halfway point. The experiment
took approximately 40 min to complete, excluding the training
Before the main experiment, subjects performed an HRTF
selection task to determine how well the HRTFs could trace
two known trajectories - one horizontal and one in the median
plane. The HRTFs were from a set of 7 from the LISTEN
database [
] that have been shown to provide an adequate
HRTF for most listeners [
]. The perceptually best-rated HRTF
was given to each of the subjects, unless their own measured
HRTF was available. The ITD cues were personalised based
on the circumference of the subject’s head.
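The text does not state which model was used to derive the ITD from head circumference. One common choice, assumed here purely for illustration, is the Woodworth spherical-head approximation, ITD = (a / c)(θ + sin θ), with head radius a taken as circumference / 2π, speed of sound c, and azimuth θ. The function name and the default speed of sound are likewise assumptions.

```python
import math


def woodworth_itd(azimuth_deg, head_circumference_m, speed_of_sound=343.0):
    """Spherical-head ITD in seconds for a distant source.

    Uses the Woodworth approximation, valid for azimuths between
    -90 and +90 degrees; the sign follows the sign of the azimuth.
    """
    if abs(azimuth_deg) > 90.0:
        raise ValueError("azimuth must lie between -90 and 90 degrees")
    radius = head_circumference_m / (2.0 * math.pi)
    theta = math.radians(azimuth_deg)
    return radius / speed_of_sound * (theta + math.sin(theta))


# Hypothetical 57 cm head circumference, source fully to one side.
itd_at_90 = woodworth_itd(90.0, 0.57)
```

For this example head size the maximum ITD comes out at roughly 0.68 ms, so a larger circumference scales every ITD up proportionally.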
The latency levels tested were 0, 25, 50, 75, 100, 150, 200,
and 250 ms. Each latency level was presented 12 times, 6 in
each half of the experiment. This gives a total of 192 trials
for the whole experiment (12 repetitions × 8 latency levels ×
2 scenes). Ten subjects took part in the experiment (mean age
32, standard deviation 9.2 years).
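The method above does not describe how the additional latency was introduced. One plausible approach, sketched here under that assumption, is a fixed delay line that holds back head-tracker samples before they reach the renderer. The class name and the tracker update period are hypothetical, and the added latency is quantised to whole update periods.

```python
from collections import deque


class DelayedTracker:
    """Delay head-tracker yaw samples by a fixed number of update periods."""

    def __init__(self, added_latency_ms, update_period_ms, initial_yaw_deg=0.0):
        if added_latency_ms < 0 or update_period_ms <= 0:
            raise ValueError("latency must be >= 0 and update period > 0")
        self.delay_samples = int(round(added_latency_ms / update_period_ms))
        # Pre-fill so the renderer sees the initial orientation until the
        # first real sample has aged by the requested latency.
        self.buffer = deque([initial_yaw_deg] * self.delay_samples)

    def push(self, yaw_deg):
        """Store the newest yaw and return the yaw the renderer should use."""
        self.buffer.append(yaw_deg)
        return self.buffer.popleft()


# Hypothetical 25 ms tracker update period with 50 ms of added latency:
# the renderer lags the head by two updates.
tracker = DelayedTracker(added_latency_ms=50, update_period_ms=25)
seen_by_renderer = [tracker.push(yaw) for yaw in [10.0, 20.0, 30.0, 40.0]]
```

With zero added latency the buffer is empty and each sample passes straight through, which corresponds to the control condition.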
2.3. Results
2.3.1. Latency Detection
The psychometric curves of detection rate against additional
latency for both scenes are shown in Figure 1. As expected, the
control condition gives results at chance level. For the simple
scene the 70% latency threshold is approximately 50 ms, rising
to above 90% by 100 ms. There is a shift to larger latencies
for the complex scene, indicating a slightly lowered sensitivity
to source instability due to informational masking. The 70%
threshold is approximately 60 ms, while it reaches 90% only
at almost 150 ms additional latency. There are also lower
detection rates for very large latency values. The increase in
the detection thresholds, from the psychometric fits at 70%
detection, is approximately 10 ms.

Fig. 1: The selection rates for the simple and complex sound scenes
as a function of latency. (Axes: additional latency (ms), 0 to 250,
against % low latency system selected; curves: single source and
complex scene.)

Fig. 2: Example of the mean azimuth over the stimulus duration for
one subject. Dashed lines indicate the standard deviation. (Axes:
time (ms), 0 to 5000, against azimuth angle (degrees).)
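Reading a threshold off a psychometric fit can be illustrated with a two-alternative logistic function that rises from chance (50%) to 100%. The functional form is an assumption, since the fitting function is not stated here, and the parameter values in the example are hypothetical rather than the fitted values from the experiment.

```python
import math


def psychometric(latency_ms, midpoint_ms, slope_ms, chance=0.5):
    """Logistic psychometric function rising from chance level to 1.0."""
    z = (latency_ms - midpoint_ms) / slope_ms
    return chance + (1.0 - chance) / (1.0 + math.exp(-z))


def threshold(p, midpoint_ms, slope_ms, chance=0.5):
    """Latency at which the curve reaches proportion p (chance < p < 1)."""
    if not chance < p < 1.0:
        raise ValueError("p must lie strictly between chance and 1")
    q = (p - chance) / (1.0 - chance)  # rescale to the 0..1 logistic range
    return midpoint_ms + slope_ms * math.log(q / (1.0 - q))


# Hypothetical fit parameters; the 70% point follows from inverting the curve.
t70 = threshold(0.70, midpoint_ms=55.0, slope_ms=15.0)
```

Comparing the 70% points of two such fitted curves, one per scene, gives a threshold shift of the kind reported above.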
2.3.2. Head Movements
Since subjects were asked to perform the same head movement,
the head tracker data was stored and analysed to determine if
they were consistent with the instructions. Figure 2 shows the
mean and standard deviation over all trials for one listener as
an example of the consistency with which subjects performed
the task.
The average minimum and maximum azimuth angles (after
mirroring to make the initial direction consistent) over all
subjects were –86° and 98°, with standard
deviations of 27° and 29°, respectively. This suggests a
slight overshoot on the first turn, but that the instructions were
generally well respected. Subjects were asked only to make
yawing head motion during the scenes. Subjects were found to
have stayed on average within 10° of zero pitch and roll angles,
indicating no large motion in these dimensions.
Head movement speeds were analysed to examine differences
between the two scene types and latency levels. Very weak
correlation (r = 0.47) was found between the average subject
head movement speed and their detection rates. An analysis
of variance (ANOVA) found no difference in the distribution
of mean head speeds at the different additional-latency levels
(p = 0.53). However, a difference was found between the head
speeds for the two scenes (p < 0.01). The mean speeds over
all conditions were 91°/s and 97°/s for the simple and complex
scenes respectively. Table 1 shows the mean azimuthal speeds
for both scenes and all latency levels. No significant change in
speed was found with increasing numbers of trials, indicating
consistency for the duration of the experiment.
Added latency (ms)   Simple (°/s)   Complex (°/s)
0                    91             98
25                   93             99
50                   91             97
75                   94             98
100                  92             101
150                  91             96
200                  90             94
250                  89             96

Tab. 1: The mean angular speed for each latency value tested for the
simple and complex scenes.
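The head speeds in Table 1 give a feel for how far a rendered source can lag behind the head. As a first-order approximation, assumed here and not a result of the study, a head turning at constant angular speed sees the source displaced by speed × latency. The function name is hypothetical.

```python
def angular_offset_deg(head_speed_deg_per_s, latency_ms):
    """Approximate source displacement for a head turning at constant speed."""
    return head_speed_deg_per_s * latency_ms / 1000.0


# Mean speed for the simple scene (91 deg/s) at 50 ms and 100 ms of latency.
offset_50 = angular_offset_deg(91.0, 50.0)
offset_100 = angular_offset_deg(91.0, 100.0)
```

At 91°/s, 50 ms of latency corresponds to about 4.55° of lag and 100 ms to about 9.1°. The approximation ignores acceleration at the start and end of each turn.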
3. Head Movement and Externalisation
3.1. Overview
Head tracking has been shown to increase the externalisation
of binaural sound scenes when using individualised HRTFs
]. That study asked listeners to make relatively small
head movements. The study summarised here asked subjects
to make the same head movement as in the source stability
experiment on the hypothesis that a large head movement
can give rise to a strong perception of externalisation. Three
different non-individual HRTFs were used in order to minimise
any training to the HRTF, which can lead to improved
externalisation with time [18].
3.2. Method
Ten expert subjects were asked to rate the externalisation of
the sound image on a scale of 0 to 5, 0 being “at the center
of the head” and 5 being “externalised and remote”. The
subjects were sound engineers with experience listening to
binaural content. The stimulus was a male vocal sample
placed at 0°, recorded using the 6 microphone configuration
shown in Figure 3. The microphones were cardioid pattern,
with the 0° microphone capturing the highest level of direct
sound, while the others provided varying levels of
direct-to-reverberant energy. The 6 microphone signals were
placed at positions corresponding to the recording positions,
using the binaural renderer. During the experiment, a
number of azimuth conditions were tested. The scene was
rotated so that the 0° microphone signal moved in steps of 30°
to determine the influence of the source direction on the
perceived externalisation. When head tracking was active,
head movements caused the position of the 6 sources to rotate
in the opposite direction, keeping the signals fixed with respect
to the external world. There were 4 head movement/tracker
conditions:
static or no head movement/no tracker (S0),
no head movement/head tracked (ST),
head movement/no tracker (M0),
head movement/head tracked (MT).

Fig. 3: The recording setup to capture the talker used as the stimulus.
For the conditions requiring head movement, subjects were
asked to turn to 90°, then to the other side, and back to 0° during a 5.5 s
spoken phrase. After the head movement there was a 1 s pause
in the stimulus followed by 2.5 s more of speech. Subjects were
instructed to have their heads still at this point and to make
their evaluation disregarding the initial 5.5 s phrase. Previous
externalisation studies [
] have asked for evaluation of
the extent of externalisation while the subject might be moving
their head. While making head movements, frontal sources
are likely to be laterally displaced relative to the centre of
the subject’s head, and lateral sources are better externalised
than those near the median plane. This presents a difficulty
for subjects evaluating externalisation, which might vary for
different head orientations. Hence, subjects were asked to
evaluate externalisation only when their heads were still,
at the end of the stimulus. This ensures that for all head
movement conditions the same conditions are present at the
point of evaluation.
3.3. Results
The distributions of ratings for each of the head
tracker/movement conditions are shown in Figure 4. It is clear
that condition MT has the highest rate of externalisation,
followed by the still-head conditions S0 and ST. Condition M0
exhibits the least externalisation, due to the movement of
the source with the subject’s head. There does not appear to
be a large difference in the distribution of the ratings between
S0 and ST. Deeper analysis of the head movements made by
the listeners during these conditions suggests that they were
often small enough to be below the minimum
audible movement angle (MAMA). The resolution of the
binaural renderer might also have been a factor for these

Fig. 4: Comparison summary, over all subjects and HRTFs, of the
percentage of trials each externalisation score was attributed for each
condition.

Fig. 5: Mean externalisation rate obtained for each condition,
mirrored azimuths combined, over all subjects and HRTFs.
(Axes: azimuth (degrees), 0 to 180 in steps of 30, against
externalisation rate (%).)

Since a range of source azimuths were tested, it is useful to
break the results down to determine if some regions benefit
more from head movements than others. Figure 5 shows
the externalisation rate (proportion of 3–5 ratings) for the 4
movement/tracker conditions as a function of azimuth. No
significant differences were found between the results to
the left and right of the subjects, so the results have been
collapsed to a single semicircle from 0° to 180°. It is clear
that the addition of head movements and tracking is not
beneficial at lateral positions (
) since
the externalisation rates of MT
are no greater than those of
at these positions. Condition M0 shows lower
externalisation rates at all positions, but is still generally high
for the lateral regions. The regions of greatest improvement due
to head movements and tracking (MT) are those to the front
and rear. For all targets except that at 0°, the externalisation
rate is in line with that of the lateral sources, giving better and
more uniform externalisation.
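The externalisation rate used above, the proportion of ratings of 3 or higher on the 0 to 5 scale, can be computed as in the following sketch. The function name and the example ratings are hypothetical and are not data from the experiment.

```python
def externalisation_rate(ratings, threshold=3):
    """Percentage of ratings on the 0-5 scale at or above the threshold."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 0 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 0 to 5 scale")
    externalised = sum(1 for r in ratings if r >= threshold)
    return 100.0 * externalised / len(ratings)


# Hypothetical ratings for one condition at one azimuth.
example_rate = externalisation_rate([0, 2, 3, 5, 4, 1])
```

Grouping the ratings by condition and by mirrored azimuth before calling the function gives one value per point in a plot such as Figure 5.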
For the 0° source, the externalisation rates are at their lowest.
The addition of head movement and tracking provides a
significant increase in the externalisation rate compared to the
other conditions. Even though the externalisation rate is below
50% (46.7%), it still highlights the advantage of head
movements coupled with tracking. There was some improvement
in condition
of the externalisation rate with increasing
numbers of repetitions, suggesting that externalisation of the
frontal sources could perhaps be improved over time.
It should be noted that the subjects were asked to rate
the externalisation only during a portion of the stimulus
at which they were instructed to be forward-facing and have a
static head. The externalisation rates are, therefore, an
aftereffect of the head movement/tracking condition, since
externalisation is not evaluated during the head movement.
This is because externalisation varies with azimuth and,
without some standardised response condition, it would be
difficult to know if subjects evaluated a frontal source during
the point in the motion when it was lateral, or gave an average
over the whole motion.
Note that the head movements made for condition
followed the trend of the source stability experiment,
with subjects making the required head movement during the
initial phase and keeping their heads still during the final 2.5 s.
4. Discussion
Both studies presented here show how head tracking can in
some way impact on the realism of the sound scene. Stable
sound sources that appear exterior to the listener’s head are
important factors in creating a realistic scene.
Compared to a single sound source, the addition of more sound
sources slightly reduces listener sensitivity to head tracker
latency, so larger latencies are tolerated before sound sources
are perceived as being unstable.
The study here used anechoic source material and an argument
could be made that reverberant sources provide additional
complexity in comparison to this. Lindau [
] found that there
was no significant difference in latency detection for different
reverberant conditions (anechoic and reverberant), suggesting
that reverberant sources are more like the simple scene than the
multi-source one. This is likely to be because the reflections in
a reverberant environment are in some way correlated with the
direct source, while the 5 sources used in this study all convey
uncorrelated information.
Assuming the latency of the head tracking system is low
enough to avoid any source instability then the externalisation
of frontal sources is one of the biggest challenges to realism.
Even with head movements and tracking, frontal sources are
less well externalised than even those to the rear or side.
This has implications for applications such as television and
cinema, in which a large proportion of the sounds (particularly
dialogue) are likely to be intended to originate from this
direction. Such a collapse in externalisation will reduce the
realism for the listener. It is possible that audio-visual media
can overcome this because the ventriloquism effect can create
the illusion of a sound emanating from a visible source. It
is unclear whether sounds can be drawn out of the head in a
similar manner.
Many head tracking applications use only orientation
information. This is sufficient if the listener is required to
remain in their seat and will make only rotational movements.
More interactive virtual reality experiences will require full
translational information to be used alongside the orientation
information. This allows listeners to fully explore and interact
with the scene, something particularly important for game
environments. It is possible that more complex listener-
source interactions involving translation could produce better
externalisation for the more difficult frontal sources since there
are even more cues which place the source in the external
Beyond translational movements, the sound source could be
linked to a physical, tracked object, allowing the user to pick
up and interact with the sound source at another level. This
allows the user to gain direct feedback of the source position
from their (known) hand position. Using such a method has
been shown to allow users to adapt to non-individual HRTFs
]. This adds to the possibility of increasing the realism of
the scene by allowing faster adaptation to the HRTFs and more
accurate interaction with objects in the scene.
Another important factor, particularly for externalisation but
also for timbre, is the use of appropriate room reverberation.
This has been shown to improve externalisation with binaural
]. A compromise must be made between
computational efficiency and physical accuracy. For complete
physical accuracy either a real-time reverberation engine,
which encodes the direction of the reflections, can be used
or the binaural room impulse responses (BRIRs) could
be calculated or recorded for all possible positions and
orientations the listener might inhabit. An alternative is to
add some generic room reflections and diffuse reverberation,
independent of the head tracking information. The first
approach could potentially generate a more realistic scene
for the listener but requires significant extra computational
5. Conclusion
This paper presents a summary of two recent studies into
perceptual properties of sound scenes reproduced using binaural
rendering with head tracking. The first investigated source stability for
different head-tracker latencies with simple and complex sound
scenes. The source instability generated by increased latency
was found to be less audible for the complex scene than the
simple one. The second experiment investigated externalisation
for a variety of head movement and tracking conditions, using
a 6 microphone recording of a male talker in a reverberant
environment. It was found that lateral sources are generally
well externalised, whether or not the head is tracked, while
frontal and rear sources benefit from the addition of tracked
head movement.
