Audio Engineering Society
Convention Paper
Presented at the 129th Convention
2010 November 4–7 San Francisco, CA, USA
The papers at this Convention have been selected on the basis of a submitted abstract and extended précis that have been peer
reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance
manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents.
Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New
York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof,
is not permitted without direct permission from the Journal of the Audio Engineering Society.
On the improvement of auditory accuracy
with non-individualized HRTF-based sounds
Catarina Mendonça1, Jorge A. Santos1, Guilherme Campos2, Paulo Dias2, José Vieira2, and João
Ferreira3
1 School of Psychology, University of Minho, Portugal
2 Department of Electronics, Telecommunications and Informatics, University of Aveiro, Portugal
3 School of Engineering, University of Minho, Portugal
ABSTRACT
Auralization is a powerful tool to increase the realism and sense of immersion in Virtual Reality environments. The
Head Related Transfer Function (HRTF) filters commonly used for auralization are non-individualized, as obtaining
individualized HRTFs poses very serious practical difficulties. It is therefore extremely important to understand to
what extent this hinders sound perception. In this paper, we address this issue from a learning perspective. In a set of
experiments, we observed that mere exposure to virtual sounds processed with generic HRTFs did not improve the subjects' performance in sound source localization, but short training periods involving active learning and feedback led to significantly better results. We propose that the use of auralization with non-individualized HRTFs should always be preceded by a learning period.
1. INTRODUCTION
Auralization is the recreation of spatial sound.
The aim is to accurately simulate acoustic environments
and provide vivid and compelling auditory experiences.
It has applications in many fields; examples range from
flight control systems to tools for helping the visually
impaired. It also has a strong potential in virtual reality
(VR) settings and in the entertainment industry.
Acoustic simulation needs to take into account the
influence not only of the room itself (wall reflections,
attenuation effects,…) but also of the listener’s physical
presence in it. In fact, the interaction of sound waves
with the listener’s body – particularly torso, head,
pinnae (outer ears) and ear canals – has extremely
important effects on sound perception, notably interaural
time and level differences (ITD and ILD, respectively),
the main cues for source localization. Such effects can
be mathematically described by the binaural impulse
response for the corresponding source position, known
as Head Related Impulse Response (HRIR), or, more
commonly, by its Fourier transform, the Head Related
Transfer Function (HRTF). It is possible to
appropriately externalize headphone-delivered sounds
by processing anechoic recordings of the source
material through the HRTF filters corresponding to the
desired virtual source position [1], [2]. The localization
cues are particularly effective for sources in the median
plane [3], [4].
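In signal processing terms, this amounts to a pair of convolutions of the anechoic signal with the left- and right-ear HRIRs. The minimal sketch below (Python/NumPy; an illustration, not the authors' implementation) assumes a mono signal `x` and a measured HRIR pair `hrir_left`/`hrir_right` for the desired position:

```python
import numpy as np

def auralize(x, hrir_left, hrir_right):
    """Binaural synthesis sketch: convolve an anechoic mono signal
    with the HRIR pair measured for the desired source position."""
    left = np.convolve(x, hrir_left)
    right = np.convolve(x, hrir_right)
    # Scale both channels by a common factor so that the interaural
    # level difference (ILD) encoded in the HRIRs is preserved.
    peak = max(np.abs(left).max(), np.abs(right).max())
    return left / peak, right / peak
```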
Since they depend on anatomic features such as the size
and shape of head and ears, HRTFs vary considerably
from person to person. Moreover, even for the same
person they will vary with age and reveal no symmetry
between left and right ear responses. Given this
variability, spatial audio simulations should use
individualized HRTFs [5]. However, these are difficult to obtain in practice: HRTF recordings are laborious and expensive, requiring
anechoic rooms, arrays of speakers (or accurate speaker
positioning systems), miniature microphones, and
specialized software and technicians. Due to these
practical difficulties, most systems resort to generic
(non-individualized) HRTFs, measured on manikins or
head-and-torso systems equipped with artificial pinnae
designed to approximate an ‘average’ human subject as closely as possible.
It has been suggested that satisfactory auralization can
be obtained using generic HRTFs [6]. Wenzel et al. [5]
compared the localization accuracy when listening to
external free-field acoustic sources and to virtual sounds
filtered by non-individualized HRTFs. Several front-
back and up-down confusions were found, but there was
overall similarity between the results obtained in the
two test situations. A similar result was found in the auralization of speech signals [7]: most listeners could obtain useful azimuth information from speech filtered with non-individualized HRTFs.
On the other hand, there are indications that
individualized HRTF-based systems do differ from
generic ones. There is a significant increase in the
feeling of presence when virtual sounds are processed
with individualized binaural filters instead of generic
HRTFs. Differences in convincingness and intensity of
auditory experience are also reported [8]. Interestingly,
some authors have suggested that the perception of
spatial sound with non-individualized HRTFs might
change over time. Begault and Wenzel [7] observed
several individual differences, which suggest that some
listeners are able to adapt more easily to the spectral
cues of the non-individualized HRTFs than others.
Asano et al. [9] also claimed that reversal errors
decrease as subjects adapt to the unfamiliar cues in
static anechoic stimuli.
In this context, our primary research question in this
paper is: can humans learn to accurately localize sound
sources when provided with spatial cues from HRTF
sets different from their own? There is evidence that the
mature brain is not immutable, but instead holds the
capacity for reorganization as a consequence of sensory
pattern changes or behavioral training [10]. Shinn-
Cunningham and Durlach [11] trained listeners with
“supernormal” cues, which resulted from the spectral
intensification of the peak frequencies. With repeated
testing, during a single session, subjects adapted to the
altered relationship between auditory cues and spatial
position. Hofman et al. [12] addressed the consequences of manipulating spectral cues over long periods of time by fitting moulds to the subjects' outer ears. Elevation perception (which depends exclusively on monaural cues) was initially disrupted. These elevation errors
were greatly reduced after several weeks, suggesting
that subjects learned to associate the new patterns with
positions in space.
The broad intention of this study was to assess how training may influence the use of non-individualized HRTFs. Our main concern was ensuring that users of such generically spatialized sounds become able to fully enjoy their listening experience in as short a time as possible. The experiments were intended to establish under which conditions subjects will be readily prepared, by tackling two questions: do listeners adapt spontaneously, without feedback? (experiment 1); and can we accelerate the adaptation process? (experiments 2 and 3).
2. EXPERIMENT 1
This experiment was intended to assess the localization accuracy of inexperienced subjects as they became gradually more familiar with sounds processed through non-individualized HRTFs. We tested their ability to
discriminate sounds at fixed elevation and variable
azimuth in 10 consecutive experimental sessions
(blocks), without feedback on the accuracy of their
responses. We analyzed the evolution of the subjects’
performance across the blocks.
2.1. Method
2.1.1. Participants
Four naïve and inexperienced young adults participated
in the experiment. They all had normal hearing, verified
by standard audiometric screening at 500, 750, 1000,
1500 and 2000 Hz. All auditory thresholds were below
10 dB SPL and none had significant interaural
sensitivity differences.
2.1.2. Stimuli
The stimuli consisted of pink noise sounds.
The sounds were auralized at 8 different azimuths: 0º (front), 180º (back), and 45º, 90º and 135º (left and right). They had constant
elevation (0º) and distance (1m). For this purpose, the
original (anechoic) sound was convolved with the
HRTF pair corresponding to the desired source position.
The resulting pair of signals – for the left and the right
ear – was then reproduced through earphones.
The HRTF set was recorded using a KEMAR dummy-head microphone at the Massachusetts Institute of Technology [13]. Sounds were reproduced with a Realtec Intel 8280 IBA sound card and presented through Etymotic ER-4B MicroPro in-ear earphones.
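The paper does not specify how the pink noise was generated; a common method, sketched below under that assumption, is to impose a 1/f power spectrum on white noise in the frequency domain:

```python
import numpy as np

def pink_noise(n_samples, rng=None):
    """Generate approximate pink (1/f) noise by spectrally shaping
    white Gaussian noise and transforming back to the time domain."""
    rng = np.random.default_rng() if rng is None else rng
    white = rng.standard_normal(n_samples)
    spectrum = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples)
    scale = np.ones_like(freqs)
    scale[1:] = 1.0 / np.sqrt(freqs[1:])  # 1/f power = 1/sqrt(f) amplitude
    pink = np.fft.irfft(spectrum * scale, n=n_samples)
    return pink / np.abs(pink).max()      # normalize to full scale
```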
2.1.3. Procedure
All sounds were presented pseudo-randomly for 3 seconds, with a 1-second interstimulus interval. There were 10 blocks of 10 stimulus repetitions each.
Participants were told to indicate the perceived sound
source location for each stimulus.
The answers were recorded by selecting, on a touch
screen, one of the eight possible stimulus positions.
2.2. Results
The average accuracy of azimuth localization was 65% correct, above chance in all cases, but no ceiling performance was observed. The left and right 90º sounds were the most accurately located, with a correct response rate of 78%. As in previous studies [5], there were several front-back confusions, which account for the lower accuracy at 0º (62% correct), 180º (43%), left/right 45º (60%) and left/right 135º (69%).
Analyzing the average participant performance over time (Figure 1), we see that overall accuracy remained constant. There were individual differences between participants: listener 1 was less accurate (50.4% correct), listeners 2 and 3 performed near the average (61.9% and 71.1%, respectively), and listener 4 had the best azimuth localization performance (85.1%). However, none of the participants showed a tendency to improve their performance.
Figure 1 Percentage of correct answers by experimental
block and linear regression.
The linear regression revealed a slope coefficient close to zero (0.04), meaning virtually no tendency for the percentage of correct responses to change. The correlation values confirmed that the experimental block number does not account for the listeners' accuracy (r²=0.00) and that the relation between them is not significant (p=0.958).
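For illustration, these statistics correspond to an ordinary least-squares fit of accuracy on block number, as in the sketch below (the per-block values are hypothetical stand-ins for those plotted in Figure 1):

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical per-block accuracies (% correct) averaged over listeners.
blocks = np.arange(1, 11)
accuracy = np.array([66, 64, 67, 65, 66, 68, 65, 66, 67, 66])

result = linregress(blocks, accuracy)
print(f"slope = {result.slope:.2f}")       # near zero: no trend across blocks
print(f"r^2   = {result.rvalue ** 2:.2f}")
print(f"p     = {result.pvalue:.3f}")
```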
Our results reveal that naïve participants are able to discriminate sounds at several azimuths well above chance (random responses would have resulted in an average of 12.5% correct). However, their accuracy did not evolve across the exposure blocks, leading to the conclusion that simple exposure is not enough for significant localization improvement in short periods of time.
In view of these conclusions, a second experiment was
developed where, in the same amount of time, listeners
were trained to discriminate sound source locations.
3. EXPERIMENT 2
In experiment 2, we tested the participants’ accuracy in
localizing sounds at several azimuths before and after a
short training program. In this program, we selected
only a small number of sounds and trained them through
active learning and response feedback.
3.1. Method
3.1.1. Participants
Four young adults participated. None of them had any
previous experience with virtual sounds. They all had
normal hearing, tested with a standard audiometric
screening, as described in experiment 1.
3.1.2. Stimuli
As in experiment 1, all stimuli consisted of pink noise
sounds, auralized with the same algorithms and
software.
All stimuli varied in azimuth, with elevation (0º) and distance (1 m) fixed. Azimuths ranged from the front of the subject's head to beyond the right ear, at 6º intervals (from 6º left to 96º right). Only these azimuths were used, to ensure that effects outside the scope of this study, such as front-back biases and individual lateral accuracy asymmetries, did not emerge. Stimuli were not restricted to the 0º-90º range, to avoid reducing the response options, which would artificially increase accuracy at those azimuths. All sounds lasted 3 seconds, with a 1-second interval between stimuli.
3.1.3. Procedure
Both experiments 2 and 3 started with a pre-test. In the
pre-test, all sounds were presented pseudo-randomly
with 4 repetitions each. Participants had to indicate, on a
continuum displayed on a touch screen (Figure 2A, blue
area), the point in space where they estimated the sound
source to be.
Figure 2 Touch screen in the pre-test and post-test (A).
Touch screen in the training period (B).
After the pre-test, participants engaged in a training period. The trained sounds corresponded to the frontal (0º), lateral (90º) and three intermediate azimuths (21º, 45º and 66º) (see white areas in Figure 2B).
The training conformed to the following steps:
• Active Learning: Participants were presented with a sound player in which they could play the training sounds at will. The sounds were selected via several buttons on the screen, arranged according to the corresponding spatial source positions. The participants were informed that they had 5 minutes to practice and that they would be tested afterwards.
• Passive Feedback: After the 5 minutes of active learning, participants heard the training sounds and had to point to their locations on a touch screen (Figure 2B). After each trial, they were told the correct answer. The passive feedback period continued until participants answered correctly in 80 percent of the trials (5 consecutive repetitions of all stimuli with at least 20 correct answers; see the sketch below).
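A sketch of this stopping rule follows (hypothetical helper name `run_trial`; the paper does not state whether the criterion window slides trial by trial or advances in whole repetition sets, so a sliding window over the last 25 trials is assumed):

```python
from collections import deque

WINDOW = 25      # 5 repetitions x 5 trained azimuths
CRITERION = 20   # 80 percent of the window

def passive_feedback_phase(run_trial):
    """run_trial() is assumed to play one training sound, record the
    response, show the correct answer, and return True if correct."""
    recent = deque(maxlen=WINDOW)
    while True:
        recent.append(run_trial())
        if len(recent) == WINDOW and sum(recent) >= CRITERION:
            return  # criterion reached: proceed to the post-test
```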
When the training period ended, participants performed a post-test, identical to the pre-test for comparison purposes.
3.2. Results
Pre-Test
Results from the pre-test and post-test sessions are displayed in Figure 3. Orange and purple bars display the average distance (in degrees) between the responses and the true stimulus positions. Gray bars display the mean hypothetical error (in degrees) that would be obtained if participants responded randomly.
Analyzing the pre-test results (Figure 3, orange bars), we observe that azimuth discrimination is easier for frontal stimuli: the average error is below 5 degrees. The absence of rear stimuli, which prevented any front-back confusions, may help explain these results. As in experiment 1, listeners were fairly precise in identifying lateral source positions. Sounds were most difficult to locate at intermediate azimuths (between 40º and 60º). For these positions, pre-test localization was at chance level, revealing an overall inability of the subjects to discriminate such source positions.
Figure 3 Average response error in the Pre-Test and
Post-Test sessions, and theoretical error level if listeners
responded randomly.
On average, participants missed the stimulus position in
the pre-test by 15.67º.
Training Period
The training sessions were very successful for all
participants. All took less than 30 minutes and, in
average, they lasted 22 minutes.
Learning curves are displayed in Figure 4, where
individual azimuth discrimination accuracy is plotted as
a function of the time elapsed since the start of the
training period.
Figure 4 Individual accuracy evolution in the azimuth
localization training sessions.
All participants reached the 80% criterion. Despite differences in learning speed, a smooth progression was observed for all of them.
Post-Test
The post-test results (Figure 3, purple bars) revealed a large error reduction (7.23º on average). Despite individual differences, all participants showed similar learning effects. In the post-test, the mean localization error was 8.44º. This difference was statistically significant in a paired-samples t-test (t(287)=14.94, p≤0.001). The error reduction was most pronounced at the intermediate azimuths, where the average error decreased by 20 degrees. Analyzing the trained azimuths (0º, 21º, 45º, 66º, 90º), we observe that the performance enhancement was substantial not only for these stimuli but also for untrained ones. For example, the largest error reduction was obtained at the 48º azimuth, a non-trained stimulus. In contrast, the 90º azimuth, a trained one, yielded similar results in both sessions. These findings allow us to conclude that the discrimination abilities trained at some stimulus positions generalize to other, non-trained, positions.
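The reported test corresponds to a paired comparison over the 288 pre/post response pairs (presumably 4 listeners × 18 azimuths × 4 repetitions, which matches the 287 degrees of freedom). A sketch with simulated data standing in for the real responses:

```python
import numpy as np
from scipy.stats import ttest_rel

# Simulated per-trial absolute errors (degrees), standing in for the
# 288 paired pre-test/post-test responses collected in the experiment.
rng = np.random.default_rng(0)
pre_error = np.abs(rng.normal(15.67, 10.0, size=288))
post_error = np.abs(rng.normal(8.44, 6.0, size=288))

t_stat, p_value = ttest_rel(pre_error, post_error)
print(f"t(287) = {t_stat:.2f}, p = {p_value:.4g}")
```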
4. EXPERIMENT 3
In experiment 3, an elevation discrimination task was
carried out under the same methodology as in
experiment 2. Elevation is known to be perceived less
accurately than azimuth or distance, probably because it
depends mostly upon monoaural information. This
experiment was designed to investigate whether or not
the learning effect found in experiment 2 could be
attributed to an improved interpretation of the binaural
information contained in the HRTFs.
4.1. Method
4.1.1. Participants
Four inexperienced subjects took part in the experiment
after undergoing auditory testing with the same standard
screening as previously described (experiment 1).
4.1.2. Stimuli
As in experiments 1 and 2, all stimuli consisted of pink
noise sounds, auralized with the same algorithms and
software. In experiment 3, the stimuli varied in
elevation, but not in azimuth (0º) or distance (1m). They
ranged from the front of the listeners’ head (0º in
elevation) to the top (90º in elevation) in 10º intervals.
Stimuli did not go beyond 90º, as the HRTF database
was limited to these elevations. Participants were aware
that no back stimuli were present, but no instruction was
given regarding stimuli below 0º.
All sounds lasted 3 seconds, with 1-second interstimulus intervals.
4.1.3. Procedure
Experiment 3 followed the same procedure as
experiment 2.
In the training period, the sounds were positioned at
elevations of 0º, 50º and 90º. Figure 5 shows the touch
screen used in the pre-test and post-test sessions (A), as
well as the touch screen with the three trained elevations (B).
Figure 5 Touch screen in the pre-test and post-test (A).
Touch screen in the training period (B).
4.2. Results
Pre-Test
Figure 6 presents the average distance (in degrees) between the subjects' answers and the stimulus elevations in the pre- and post-test sessions. It also shows the hypothetical errors that would be obtained if subjects responded at chance; these are unequal across elevations, as subjects were allowed to respond farther into the lower elevations than into the upper ones.
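These chance levels can be estimated by simulation. The sketch below assumes, purely for illustration, a response continuum extending from -45º to +90º (the actual extent of the on-screen continuum below 0º is not stated in the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
LOW, HIGH = -45.0, 90.0   # assumed response range (hypothetical)

for elev in range(0, 91, 10):   # stimulus elevations, 0º to 90º
    responses = rng.uniform(LOW, HIGH, size=100_000)
    error = np.abs(responses - elev).mean()
    print(f"{elev:2d}º: expected random error = {error:5.1f}º")
```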
In the pre-test session, the average error was 40.8º, close to chance. The subjects were unable to localize the sounds ranging from 0º to 50º; the worst results were for the frontal (0º) stimuli (55º average error). Overall, participants were less accurate in estimating sound position in elevation than in azimuth.
Figure 6 Average response error in the Pre-Test and
Post-Test sessions, and theoretical response errors if
listeners responded randomly.
Training Period
Training sessions were faster than those of experiment 2, as there were only 3 trained elevations. On average, they took 17 minutes (Figure 7).
Figure 7 Individual accuracy evolutions in the elevation
training sessions.
Only one subject (listener 3) did not progress as expected. After 10 minutes of testing, this subject was still making too many mistakes and was allowed a second active learning phase (5 minutes), after which the 80 percent criterion was rapidly reached.
Post-Test
The post-test results were better than those of the pre-test for all subjects. This difference was significant in a paired-samples t-test (t(159)=16.678, p≤0.001). The average error decreased by 14.75 degrees, to a mean of 26.5º, a larger effect than that found in experiment 2. The training effect was most pronounced for the upper stimuli, namely at the 80º, 40º and 50º elevations. Among these stimuli, the only trained one was at 50º. On the other hand, sounds at 0º elevation, a trained stimulus, showed no error decrease in the post-test session.
Similarly to what was found in experiment 2, training
was highly effective and generalized well to other
stimuli.
5. FINAL DISCUSSION
Auralization is of great interest to practitioners in
several scientific and technological areas, as well as in a
variety of commercial applications. Developmental
efforts in this field have led to sophisticated simulations,
which include the effect of the individual anatomical
shaping upon the sound waves that reach the listeners’
ears. But, as such shaping varies considerably among
different people, the influence of using non-
individualized approaches must be investigated. In this
paper, we were specifically interested in better
understanding the evolution of perceptual accuracy as a subject becomes familiar with non-individualized HRTFs. We intended to understand whether listeners adapt spontaneously without feedback in a reasonably short time, and whether the adaptation process can somehow be accelerated.
In experiment 1, we addressed the listeners' adaptation to static non-individualized azimuth sounds without feedback. Throughout 10 short consecutive experimental sessions, we measured the percentage of correct answers in position discrimination. The results revealed an overall absence of performance improvement in all subjects. We concluded that simple exposure is not enough for significant accuracy improvement in short periods of time. Such exposure learning had been claimed in previous works [7], [9], in an attempt to explain individual differences in accuracy; our results did not reveal those effects. Adaptation without training has, however, been demonstrated before [12], but over long periods of time (weeks) and with spatial feedback, as the participants in those experiments wore the moulds in their ears in their daily lives throughout the whole period.
Pursuing the intent of preparing untrained listeners to
take full advantage of non-individualized HRTFs, we
designed a second experiment, where subjects could
train with sample sounds in a short program combining
active learning and feedback. In a pre-test, participants
revealed good discrimination abilities for frontal
stimuli, but performed very poorly in the intermediate
(40º to 60º) azimuths. After the training sessions, in a
post-test, all azimuths were identified above chance,
with results significantly better than the pre-test ones.
More importantly, the training benefit was observed not only at the trained sample azimuths but generalized to other stimulus positions. To interpret these results, one might argue that an overall learning of the new HRTF-based cues took place, which was then applied to the untrained stimuli.
One could speculate that the learning effect found in experiment 2 might be explained by a fast recalibration to new ITD and ILD values, rather than an adaptation to the new binaural and spectral cues altogether. In experiment 3, we tested the same training program with stimuli varying in elevation and with fixed azimuth. Elevation alone is known to be poorly discriminated compared to azimuth, mostly because it depends upon monaural cues (ITD and ILD values are fixed), such as the spectral shaping by the pinna and ear canal.
Results in the pre-test of this experiment revealed poor source discrimination at almost all elevations, particularly from 0º to 50º. Indeed, with unfamiliar HRTF filters, auralized sounds carried little elevation information for the untrained subjects. A large difference was found in the post-test, where some discriminability arose. Again, the performance benefit generalized across stimuli and was not restricted to the trained elevations. This finding further supports the assumption that the new HRTF spectral cues were indeed learned.
In both experiments 2 and 3, the training sessions lasted approximately 20 minutes. Longer training sessions might have led to greater performance improvements. We stress, however, that in preparing listeners for auralized interfaces, time should not be the criterion. In our sessions, each participant revealed a different profile and learned at a different speed. Fixing a performance goal (such as 80% accuracy) provides a way of ensuring that all listeners reach an acceptable level of adaptation.
We conclude that, in binaural auralization using generic HRTFs, it is possible to significantly improve the auditory performance of an untrained listener in a short period of time. However, natural adaptation to static
stimuli is unlikely to occur in a timely manner. Without any training, several source positions are poorly discriminated. In view of this, we argue that virtual sounds processed through non-individualized HRTFs should only be used after learning sessions. We propose that these sessions involve a small sample of sounds, active learning and feedback.
Future studies in this field should focus on the persistence of the learned abilities over time, the limits of generalization, and the effects of training on the final virtual auditory experience.
6. ACKNOWLEDGEMENTS
This work was supported by FCT - Portuguese
Foundation for Science and Technology
(SFRH/BD/36345/2007 and PTDC/TRA/67859/2006).
7. REFERENCES
[1] G. Plenge. "On the difference between localization and lateralization." J. Acoust. Soc. Am., 56, pp. 944-951, 1974.

[2] F. L. Wightman, D. J. Kistler, M. E. Perkins. "A new approach to the study of human sound localization." In Directional Hearing, W. Yost, G. Gourevitch, Eds. New York: Springer-Verlag, 1987.

[3] J. Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization. Cambridge: MIT Press, 1983.

[4] C. L. Searle, L. D. Braida, D. R. Cuddy, and M. F. Davis. "Model for auditory localization." J. Acoust. Soc. Am., 60, pp. 1164-1175, 1976.

[5] E. M. Wenzel, M. Arruda, D. J. Kistler, F. L. Wightman. "Localization using nonindividualized head-related transfer functions." J. Acoust. Soc. Am., 94, pp. 111-123, 1993.

[6] J. M. Loomis, R. L. Klatzky, and R. G. Golledge. "Auditory distance perception in real, virtual and mixed environments." In Mixed Reality: Merging Real and Virtual Worlds, Y. Ohta, H. Tamura, Eds. Tokyo: Ohmsha, 1999.

[7] D. R. Begault, E. M. Wenzel. "Headphone localization of speech." Hum. Fact., 35(2), pp. 361-376, 1993.

[8] A. Valjamae, P. Larsson, D. Vastfjall, M. Kleiner. "Auditory presence, individualized head-related transfer functions, and illusory ego-motion in virtual environments." Proceedings of the Seventh Annual Workshop on Presence, Spain, 2004.

[9] F. Asano, Y. Suzuki, and T. Sone. "Role of spectral cues in median plane localization." J. Acoust. Soc. Am., 88, pp. 159-168, 1990.

[10] C. D. Gilbert. "Adult cortical dynamics." Physiol. Rev., 78, pp. 467-485, 1998.

[11] B. G. Shinn-Cunningham, N. I. Durlach, R. M. Held. "Adapting to supernormal auditory localization cues. I. Bias and resolution." J. Acoust. Soc. Am., 103, pp. 3656-3666, 1998.

[12] P. M. Hofman, J. G. A. Van Riswick, A. J. Van Opstal. "Relearning sound localization with new ears." Nat. Neurosci., 1, pp. 417-421, 1998.

[13] B. Gardner, K. Martin. HRTF Measurements of a KEMAR Dummy-Head Microphone. URL: http://sound.media.mit.edu/resources/KEMAR.html (visited June 2010).