Audio Engineering Society
Convention Paper
Presented at the 129th Convention
2010 November 4–7 San Francisco, CA, USA
The papers at this Convention have been selected on the basis of a submitted abstract and extended précis that have been peer
reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance
manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents.
Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New
York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof,
is not permitted without direct permission from the Journal of the Audio Engineering Society.
On the improvement of auditory accuracy
with non-individualized HRTF-based sounds
Catarina Mendonça1, Jorge A. Santos1, Guilherme Campos2, Paulo Dias2, José Vieira2, and João
Ferreira3
1 School of Psychology, University of Minho, Portugal
2 Department of Electronics, Telecommunications and Informatics, University of Aveiro, Portugal
3 School of Engineering, University of Minho, Portugal
ABSTRACT
Auralization is a powerful tool to increase the realism and sense of immersion in Virtual Reality environments. The
Head Related Transfer Function (HRTF) filters commonly used for auralization are non-individualized, as obtaining
individualized HRTFs poses very serious practical difficulties. It is therefore extremely important to understand to
what extent this hinders sound perception. In this paper, we address this issue from a learning perspective. In a set of
experiments, we observed that mere exposure to virtual sounds processed with generic HRTF did not improve the
subjects’ performance in sound source localization, but short training periods involving active learning and feedback
led to significantly better results. We propose that using auralization with non-individualized HRTF should always
be preceded by a learning period.
1. INTRODUCTION
Auralization consists of the recreation of spatial sound.
The aim is to accurately simulate acoustic environments
and provide vivid and compelling auditory experiences.
It has applications in many fields; examples range from
flight control systems to tools for helping the visually
impaired. It also has a strong potential in virtual reality
(VR) settings and in the entertainment industry.
Acoustic simulation needs to take into account not only the influence of the room itself (wall reflections, attenuation effects, etc.) but also that of the listener's physical presence in it. In fact, the interaction of sound waves with the listener's body, particularly the torso, head, pinnae (outer ears) and ear canals, has extremely
important effects on sound perception, notably interaural
time and level differences (ITD and ILD, respectively),
the main cues for source localization. Such effects can
be mathematically described by the binaural impulse
response for the corresponding source position, known
as Head Related Impulse Response (HRIR), or, more
commonly, by its Fourier transform, the Head Related
Transfer Function (HRTF). It is possible to
appropriately externalize headphone-delivered sounds
by processing anechoic recordings of the source
material through the HRTF filters corresponding to the
desired virtual source position [1], [2]. The localization
cues are particularly effective for sources in the median
plane [3], [4].
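To make this filtering step concrete, the sketch below convolves a mono anechoic signal with an HRIR pair. It is a minimal Python illustration under the assumption that the two impulse responses for the desired source position are already available as arrays; the function and variable names are ours, not from any particular toolkit.

```python
# Minimal sketch of binaural rendering by HRIR convolution (assumed
# inputs: a mono anechoic signal and the left/right HRIRs measured
# for the desired virtual source position).
import numpy as np
from scipy.signal import fftconvolve

def auralize(anechoic, hrir_left, hrir_right):
    """Filter a mono anechoic signal with an HRIR pair.

    Returns an (N, 2) array holding the left- and right-ear signals,
    normalized for headphone playback.
    """
    left = fftconvolve(anechoic, hrir_left)
    right = fftconvolve(anechoic, hrir_right)
    stereo = np.stack([left, right], axis=1)
    return stereo / np.max(np.abs(stereo))  # avoid clipping
```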
Since they depend on anatomic features such as the size
and shape of head and ears, HRTFs vary considerably
from person to person. Moreover, even for the same
person they will vary with age and reveal no symmetry
between left and right ear responses. Given this
variability, spatial audio simulations should use
individualized HRTFs [5]. However, these would be
extremely difficult to obtain in practice; HRTF
recordings are effortful and expensive, requiring
anechoic rooms, arrays of speakers (or accurate speaker
positioning systems), miniature microphones, and
specialized software and technicians. Due to these
practical difficulties, most systems resort to generic
(non-individualized) HRTFs, measured on manikins or
head-and-torso systems equipped with artificial pinnae
designed to approximate an ‘average’ human subject as closely as possible.
It has been suggested that satisfactory auralization can
be obtained using generic HRTFs [6]. Wenzel et al. [5]
compared the localization accuracy when listening to
external free-field acoustic sources and to virtual sounds
filtered by non-individualized HRTFs. Several front-
back and up-down confusions were found, but there was
overall similarity between the results obtained in the
two test situations. A similar result was found in the
auralization of speech signals [7]. Most listeners can
obtain useful azimuth information from speech filtered
with non-individualized HRTFs.
On the other hand, there are indications that
individualized HRTF-based systems do differ from
generic ones. There is a significant increase in the
feeling of presence when virtual sounds are processed
with individualized binaural filters instead of generic
HRTFs. Differences in convincingness and intensity of
auditory experience are also reported [8]. Interestingly,
some authors have suggested that the perception of
spatial sound with non-individualized HRTFs might
change over time. Begault and Wenzel [7] observed
several individual differences, which suggest that some
listeners are able to adapt more easily to the spectral
cues of the non-individualized HRTFs than others.
Asano et al. [9] also claimed that reversal errors
decrease as subjects adapt to the unfamiliar cues in
static anechoic stimuli.
In this context, our primary research question in this
paper is: can humans learn to accurately localize sound
sources when provided with spatial cues from HRTF
sets different from their own? There is evidence that the
mature brain is not immutable, but instead holds the
capacity for reorganization as a consequence of sensory
pattern changes or behavioral training [10]. Shinn-
Cunningham and Durlach [11] trained listeners with
“supernormal” cues, which resulted from the spectral
intensification of the peak frequencies. With repeated
testing, during a single session, subjects adapted to the
altered relationship between auditory cues and spatial
position. Hofman [12] addressed the consequences of manipulating spectral cues over long periods of time by fitting moulds to the subjects' outer ears. Elevation perception (which depends exclusively on monaural cues) was initially disrupted. These elevation errors
were greatly reduced after several weeks, suggesting
that subjects learned to associate the new patterns with
positions in space.
The broad aim of this study was to assess how training may influence the use of non-individualized HRTFs. Our main concern was ensuring that users of such generically spatialized sounds become able to fully enjoy their listening experiences in as short a time as possible. The experiments were intended to establish under which conditions subjects will be readily prepared, namely by tackling the questions: Do listeners adapt spontaneously without feedback? (experiment 1); and Can we accelerate the adaptation process? (experiments 2 and 3).
2. EXPERIMENT 1
This experiment was intended to assess the localization accuracy of inexperienced subjects as they gradually became more familiar with non-individualized HRTF-processed sounds. We tested their ability to
discriminate sounds at fixed elevation and variable
azimuth in 10 consecutive experimental sessions
(blocks), without feedback on the accuracy of their
responses. We analyzed the evolution of the subjects’
performance throughout each block.
2.1. Method
2.1.1. Participants
Four naïve young adults with no previous experience of virtual sounds participated in the experiment. They all had normal hearing, verified
by standard audiometric screening at 500, 750, 1000,
1500 and 2000 Hz. All auditory thresholds were below
10 dB SPL and none had significant interaural
sensitivity differences.
2.1.2. Stimuli
The stimuli consisted of pink noise sounds.
The sounds were auralized at 8 different azimuths: 0º (front), 180º (back), 90º (left and right), 45º (left and right), and 135º (left and right). They had constant elevation (0º) and distance (1 m). For this purpose, the original (anechoic) sound was convolved with the HRTF pair corresponding to the desired source position. The resulting pair of signals (for the left and the right ear) was then reproduced through earphones.
The HRTF set was recorded using a KEMAR dummy-head microphone at the Massachusetts Institute of Technology [13]. Sounds were reproduced with a Realtek Intel 8280 IBA sound card and presented through a pair of Etymotic ER-4B MicroPro in-ear earphones.
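As an illustration of the stimulus generation, the sketch below produces a 3 s pink noise burst by spectral shaping; the generation method and constants are our assumptions, since the paper does not describe how its pink noise was produced. The resulting burst would then be convolved with the HRIR pair for the intended azimuth, as outlined in Section 1.

```python
# Illustrative pink noise generator (an assumption: the paper does not
# specify its noise synthesis). White Gaussian noise is shaped in the
# frequency domain with a 1/sqrt(f) envelope, giving the -3 dB/octave
# slope that defines pink noise.
import numpy as np

FS = 44100          # sampling rate of the MIT KEMAR HRIRs
DURATION_S = 3.0    # stimulus duration used in the experiments

def pink_noise(duration_s=DURATION_S, fs=FS):
    n = int(duration_s * fs)
    spectrum = np.fft.rfft(np.random.randn(n))
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    freqs[0] = freqs[1]            # avoid dividing by zero at DC
    spectrum /= np.sqrt(freqs)     # 1/f power spectrum -> pink
    noise = np.fft.irfft(spectrum, n)
    return noise / np.max(np.abs(noise))
```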
2.1.3. Procedure
All sounds were presented pseudo-randomly for 3
seconds, with 1 second interstimulus interval. There
were 10 blocks of 10 stimulus repetitions each.
Participants were told to indicate the perceived sound
source location for each stimulus.
The answers were recorded by selecting, on a touch
screen, one of the eight possible stimulus positions.
2.2. Results
The average accuracy of azimuth localization (65% correct answers) was above chance in all cases, but no ceiling performances were observed. The left and right 90º sounds were the most accurately located, with a correct response rate of 78%. Similarly to what had been found in previous studies [5], there were several front-back confusions, which account for the lower accuracy at 0º (62% correct answers), 180º (43%), left/right 45º (60%) and left/right 135º (69%).
Analyzing the participants' average performance over time (Figure 1), we see that the overall accuracy
remained constant. There were individual differences
between participants. Listener 1 was less accurate
(50.4% correct answers), listeners 2 and 3 performed
near average (61.9% and 71.1%, respectively) and
listener 4 had the best azimuth localization performance
(85.1%). However, none of the participants revealed a
tendency to improve their performance.
Figure 1 Percentage of correct answers by experimental
block and linear regression.
The linear regression revealed a slope coefficient close to zero (0.04), indicating virtually no tendency for the percentage of correct responses to change across blocks. The correlation values confirmed that the experimental block number does not account for the listeners' accuracy (r²=0.00) and that the relation between the two is not significant (p=0.958).
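The trend analysis can be sketched as follows, with illustrative block accuracies standing in for the experimental data:

```python
# Sketch of the regression of accuracy on block number. The accuracy
# values below are illustrative stand-ins, not the paper's data.
import numpy as np
from scipy.stats import linregress

blocks = np.arange(1, 11)  # 10 exposure blocks
accuracy = np.array([64.0, 66.0, 63.0, 65.0, 66.0,
                     64.0, 65.0, 63.0, 66.0, 65.0])  # % correct per block

fit = linregress(blocks, accuracy)
print(f"slope={fit.slope:.2f}, r2={fit.rvalue**2:.2f}, p={fit.pvalue:.3f}")
# A slope near zero with r2 ~ 0, as reported, means block number explains
# none of the variance in accuracy: no learning from mere exposure.
```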
Our results reveal that naïve participants are able to
discriminate sounds at several azimuths well above
chance (random responses would have resulted in
averages of 12.5% correct responses). However,
throughout the exposure blocks, their accuracy did not improve, leading to the conclusion that simple exposure is not enough for significant localization improvement in short periods of time.
In view of these conclusions, a second experiment was
developed where, in the same amount of time, listeners
were trained to discriminate sound source locations.
3. EXPERIMENT 2
In experiment 2, we tested the participants’ accuracy in
localizing sounds at several azimuths before and after a
short training program. In this program, we selected
only a small number of sounds and trained listeners on them through active learning and response feedback.
3.1. Method
3.1.1. Participants
Four young adults participated. None of them had any
previous experience with virtual sounds. They all had
normal hearing, tested with a standard audiometric
screening, as described in experiment 1.
3.1.2. Stimuli
As in experiment 1, all stimuli consisted of pink noise
sounds, auralized with the same algorithms and
software.
All stimuli varied in azimuth, with elevation (0º) and distance (1 m) fixed. Azimuths ranged from the front of the subject's head to beyond their right ear, spaced at 6º intervals (from 6º left to 96º right). Only these azimuths were used, to ensure that effects outside the scope of our study, such as front-back biases and individual lateral accuracy asymmetries, did not emerge. Stimuli extended slightly beyond the 0º to 90º range to avoid restricting the response options, which would artificially inflate accuracy at the boundary azimuths.
sounds had a 3 second duration, with an interval of 1
second between each stimulus.
3.1.3. Procedure
Both experiments 2 and 3 started with a pre-test. In the
pre-test, all sounds were presented pseudo-randomly
with 4 repetitions each. Participants had to indicate, on a
continuum displayed on a touch screen (Figure 2A, blue
area), the point in space where they estimated the sound
source to be.
Figure 2 Touch screen in the pre-test and post-test (A).
Touch screen in the training period (B).
After the pre-test, participants engaged in a training
period. The trained sounds corresponded to the frontal (0º), lateral (90º) and three intermediate azimuths (21º, 45º and 66º) (see white areas in Figure 2B).
The training comprised the following steps:
Active Learning: Participants were presented with a sound player where they could listen to the training sounds at will. To select the sounds, there were several buttons on the screen, arranged according to the corresponding spatial source positions. The participants were informed that they had 5 minutes to practice and that afterwards they would be tested.
Passive Feedback: After the 5 minutes of active learning, participants heard the training sounds and had to point to their location on a touch screen (Figure 2B). After each trial, they were told the correct answer. The passive feedback period continued until participants could answer correctly in 80 percent of the trials (5 consecutive repetitions of all stimuli with at least 20 correct answers).
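The stopping rule for this phase can be sketched as follows, assuming one round presents every trained stimulus once and a hypothetical run_trial helper plays a sound, collects the response, gives feedback, and reports correctness:

```python
# Sketch of the 80%-over-5-rounds stopping rule. With 5 trained azimuths,
# 5 consecutive rounds give 25 trials, so the criterion is >= 20 correct.
from collections import deque
import random

TRAINED_AZIMUTHS = [0, 21, 45, 66, 90]  # degrees
WINDOW_ROUNDS = 5
CRITERION = 0.80

def run_trial(azimuth):
    """Hypothetical stand-in for one feedback trial: play the stimulus,
    record the pointed location, reveal the answer, return correctness."""
    return random.random() < 0.85  # placeholder success rate

def passive_feedback():
    window = deque(maxlen=WINDOW_ROUNDS * len(TRAINED_AZIMUTHS))
    rounds = 0
    while True:
        for azimuth in random.sample(TRAINED_AZIMUTHS, len(TRAINED_AZIMUTHS)):
            window.append(run_trial(azimuth))
        rounds += 1
        if len(window) == window.maxlen and sum(window) / len(window) >= CRITERION:
            return rounds  # criterion reached

print(f"criterion reached after {passive_feedback()} rounds")
```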
When the training period ended, participants performed a post-test, identical to the pre-test for comparison purposes.
3.2. Results
Pre-Test
Results from the pre-test and post-test sessions are
displayed in Figure 3. Orange and purple bars display
the average distance (in degrees) to the given stimulus
position. Gray bars display the mean hypothetical error
(in degrees) that would be obtained if participants
responded randomly.
Analysing the pre-test results (Figure 3, orange bars),
we observe that azimuth discrimination is easier for
frontal stimuli: the average error is below 5 degrees.
The absence of rear stimuli, which prevented any front-back confusions, may help explain these results. As in experiment 1, listeners were fairly precise in identifying lateral source positions. Sounds were most difficult to locate at intermediate azimuths (between 40º and 60º).
For these positions, pre-test localisation was at chance
level, revealing an overall inability of the subjects to
discriminate such sound positions.
Figure 3 Average response error in the Pre-Test and
Post-Test sessions, and theoretical error level if listeners
responded randomly.
On average, participants missed the stimulus position in
the pre-test by 15.67º.
Training Period
The training sessions were very successful for all
participants. All took less than 30 minutes and, on average, lasted 22 minutes.
Learning curves are displayed in figure 4, where
individual azimuth discrimination accuracy is plotted as
a function of the time elapsed since the start of the
training period.
Figure 4 Individual accuracy evolution in the azimuth
localization training sessions.
All participants reached the 80% criterion. Despite the
differences in learning speed, a smooth progression was observed for all of them.
Post-Test
The post-test results (Figure 3, purple bars) revealed a large error reduction (7.23º on average). Despite individual differences, all participants revealed similar learning effects. In the post-test, the mean localization error was 8.44º. This difference was statistically significant in a paired samples t-test (t(287)=14.94, p<0.001). The error reduction was most pronounced in the intermediate azimuths, where the average error decreased by 20 degrees. Analysing the
trained azimuths (0º, 21º, 45º, 66º, 90º), we observe that
performance enhancement was substantial not only for
these stimuli, but also for others, not trained. As an
example, the best error reduction was obtained with the
48º azimuth, a non-trained stimulus. In contrast, the 90º
azimuth, a trained one, revealed similar results in both
sessions. These findings allow us to conclude that discrimination abilities trained at some stimulus positions generalize to other, non-trained, auditory positions.
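For concreteness, the pre/post comparison reduces to a paired t-test over per-trial absolute errors; the sketch below uses synthetic values (t(287) implies 288 paired observations), not the experimental data:

```python
# Sketch of the paired-samples t-test on localization errors. The error
# distributions below are synthetic stand-ins for the real data.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
pre = np.abs(rng.normal(15.7, 8.0, size=288))   # pre-test errors (degrees)
post = np.abs(rng.normal(8.4, 6.0, size=288))   # post-test errors (degrees)

t, p = ttest_rel(pre, post)
print(f"t({len(pre) - 1}) = {t:.2f}, p = {p:.3g}")
```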
4. EXPERIMENT 3
In experiment 3, an elevation discrimination task was
carried out with the same methodology as in experiment 2. Elevation is known to be perceived less accurately than azimuth or distance, probably because it depends mostly upon monaural information. This
experiment was designed to investigate whether or not
the learning effect found in experiment 2 could be
attributed to an improved interpretation of the binaural
information contained in the HRTFs.
4.1. Method
4.1.1. Participants
Four inexperienced subjects took part in the experiment
after undergoing auditory testing with the same standard
screening as previously described (experiment 1).
4.1.2. Stimuli
As in experiments 1 and 2, all stimuli consisted of pink
noise sounds, auralized with the same algorithms and
software. In experiment 3, the stimuli varied in
elevation, but not in azimuth (0º) or distance (1m). They
ranged from the front of the listeners’ head (0º in
elevation) to the top (90º in elevation) in 10º intervals.
Stimuli did not go beyond 90º, as the HRTF database
was limited to these elevations. Participants were aware
that no back stimuli were present, but no instruction was
given regarding stimuli below 0º.
All sounds had a 3 second duration, with 1 second
interstimulus intervals.
4.1.3. Procedure
Experiment 3 followed the same procedure as
experiment 2.
In the training period, the sounds were positioned at
elevations of 0º, 50º and 90º. Figure 5 shows the touch
screen used in the pre-test and post-test sessions (A), as
well as the touch screen with the 3 defined elevations,
which were trained (B).
Figure 5 Touch screen in the pre-test and post-test (A).
Touch screen in the training period (B).
4.2. Results
Pre-Test
Figure 6 presents the average distance (in degrees)
between the subjects’ answers and the stimulus
elevations in the pre and post-test sessions. It also shows
the hypothetical errors that would be obtained if
subjects responded at chance (unequally distributed, as the response scale allowed answers farther below the lowest elevations than above the upper ones).
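This chance-level benchmark can be estimated by simulation, as sketched below; the exact extent of the response continuum below 0º is our assumption, since the paper only states that responses below 0º were possible:

```python
# Monte Carlo estimate of the error expected from random responding.
# Listeners could answer below 0º but not beyond 90º, so the expected
# random error differs across target elevations (hence the unequal
# chance bars in Figure 6). The lower bound of -40º is an assumption.
import numpy as np

rng = np.random.default_rng(0)
RESPONSE_RANGE = (-40.0, 90.0)  # assumed response continuum (degrees)

def chance_error(target_deg, n=100_000):
    responses = rng.uniform(*RESPONSE_RANGE, size=n)
    return float(np.mean(np.abs(responses - target_deg)))

for target in range(0, 100, 10):
    print(f"{target:3d}º: expected random error {chance_error(target):5.1f}º")
```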
In the pre-test session, the average error was 40.8º, close to chance level. The subjects were unable to localise sounds in the 0º to 50º range; the worst results were for the frontal (0º) stimuli (55º average error). Overall, participants were less accurate in estimating source position in elevation than in azimuth.
Figure 6 Average response error in the Pre-Test and
Post-Test sessions, and theoretical response errors if
listeners responded randomly.
Training Period
Training sessions were faster than those of experiment 2, as there were only 3 trained elevations. On average, they took 17 minutes (Figure 7).
Figure 7 Individual accuracy evolutions in the elevation
training sessions.
Only one subject (listener 3) did not progress as expected. After 10 minutes of testing, this subject was still making too many mistakes and was allowed a second active learning phase (5 minutes), after which the 80 percent accuracy criterion was rapidly achieved.
Post-Test
The post-test results were better than those of the pre-test for all subjects. This difference was significant in a paired samples t-test (t(159)=16.678, p<0.001). The average error decreased by 14.75 degrees, to a mean of 26.5º, an effect larger than that found in experiment 2. The training effect was most pronounced for the upper stimuli, namely at the 80º, 40º and 50º elevations. Among these stimuli, the only trained one was at 50º. On the other hand, sounds at 0º elevation, a trained stimulus, revealed no error decrease in the post-test session.
Similarly to what was found in experiment 2, training
was highly effective and generalized well to other
stimuli.
5. FINAL DISCUSSION
Auralization is of great interest to practitioners in
several scientific and technological areas, as well as in a
variety of commercial applications. Development efforts in this field have led to sophisticated simulations,
which include the effect of the individual anatomical
shaping upon the sound waves that reach the listeners’
ears. But, as such shaping varies considerably among
different people, the influence of using non-
individualized approaches must be investigated. In this
paper, we were specifically interested in better understanding the evolution of perceptual accuracy as a subject becomes familiar with non-individualized HRTFs. We intended to understand whether listeners adapt spontaneously without feedback in a reasonably short time and/or whether the adaptation process can somehow be accelerated.
In experiment 1, we addressed the listeners’ adaptation
process to static non-individualized azimuth sounds
without feedback. Throughout 10 short consecutive experimental sessions, we measured the percentage of correct answers in position discrimination. Results revealed an overall absence of performance improvement in all subjects. We concluded that simple exposure is not enough for significant accuracy gains to be achieved in short periods of time. Such exposure learning had been claimed in previous works [7], [9], in an attempt to explain individual differences in accuracy results. Our results did not reveal those effects. Adaptation without training has, however, been demonstrated before [12], but over long periods of time (weeks) and with spatial feedback, as the participants in those experiments wore the moulds in their ears in their daily lives throughout the whole period.
Pursuing the intent of preparing untrained listeners to
take full advantage of non-individualized HRTFs, we
designed a second experiment, where subjects could
train with sample sounds in a short program combining
active learning and feedback. In a pre-test, participants
revealed good discrimination abilities for frontal
stimuli, but performed very poorly in the intermediate
(40º to 60º) azimuths. After the training sessions, in a
post-test, all azimuths were identified above chance,
with results significantly better than the pre-test ones.
More importantly, the training benefit was observed not
only in the trained sample azimuths, but was
generalized to other stimulus positions. In an attempt to
interpret such results, one might argue that an overall
learning of the new HRTF-based cues took place, and
was then applied to the other untrained stimuli.
One could speculate that the learning effect found in
experiment 2 might be explained by a fast recalibration
to new ITD and ILD values, rather than an adaptation to
the new binaural and spectral cues altogether. In
experiment 3, we tested the same training program, with
stimuli varying in elevation and with fixed azimuth.
Elevation alone is known to be poorly discriminated when compared to azimuth, mostly because it depends upon monaural cues (ITD and ILD values are fixed), such as the spectral shaping of the pinnae and inner ear.
Results in the pre-test of this experiment revealed poor
source discrimination ability at almost all elevations,
particularly from 0º to 50º. Indeed, with unfamiliar
HRTF filters, auralized sounds carried little elevation
information for the untrained subjects. A large
difference was found in the post-test, where some
discriminability arose. Again, the performance benefit generalized across stimuli and was not restricted to the trained elevations. This finding further supports the assumption that the new HRTF-shaped spectral cues were indeed learned.
In both experiments 2 and 3, the training sessions lasted approximately 20 minutes. Longer training sessions might have led to greater performance improvements. We stress, however, that in preparing listeners for auralized interfaces, time should not be the criterion. In our sessions, each participant revealed a different profile and learned at a different speed. Fixing a performance goal (such as 80% accuracy) provides a way of ensuring that all listeners reach an acceptable level of adaptation.
We conclude that in binaural auralization using generic HRTFs it is possible to significantly improve the
auditory performance of an untrained listener in a short
period of time. However, natural adaptation to static
stimuli is unlikely to occur in a timely manner. Without
any training, several source positions are poorly
discriminated. In view of this, we argue that virtual
sounds processed through non-individualised HRTFs
should only be used after learning sessions. We propose that these sessions combine a small sample of sounds, active learning and feedback.
Future studies in this field should focus on the
endurance of the learned capabilities over time,
generalization limits, and the effects of training on the final virtual auditory experience.
6. ACKNOWLEDGEMENTS
This work was supported by FCT - Portuguese
Foundation for Science and Technology
(SFRH/BD/36345/2007 and PTDC/TRA/67859/2006).
7. REFERENCES
[1] G. Plenge. "On the difference between localization and lateralization." J. Acoust. Soc. Am., 56, pp. 944-951, 1974.
[2] F. L. Wightman, D. J. Kistler, M. E. Perkins. "A new approach to the study of human sound localization," in Directional Hearing, W. Yost, G. Gourevitch, Eds. New York: Springer-Verlag, 1987.
[3] J. Blauert. Spatial Hearing: The psychophysics of
human sound localization. Cambridge: MIT Press,
1983.
[4] C. L. Searle, L. D. Braida, D. R. Cuddy, and M. F. Davis. "Model for auditory localization." J. Acoust. Soc. Am., 60, pp. 1164-1175, 1976.
[5] E. M. Wenzel, M. Arruda, D. J. Kistler, F. L. Wightman. "Localization using nonindividualized head-related transfer functions." J. Acoust. Soc. Am., 94, pp. 111-123, 1993.
[6] J. M. Loomis, R. L. Klatzky, and R. G. Golledge. "Auditory distance perception in real, virtual and mixed environments," in Mixed Reality: Merging Real and Virtual Worlds, Y. Ohta, H. Tamura, Eds. Tokyo: Ohmsha, 1999.
[7] D. R. Begault, E. M. Wenzel. "Headphone localization of speech." Hum. Fact., 35(2), pp. 361-376, 1993.
[8] A. Väljamäe, P. Larsson, D. Västfjäll, M. Kleiner. "Auditory presence, individualized head-related transfer functions, and illusory ego-motion in virtual environments." Proceedings of the Seventh Annual Workshop on Presence, 2004, Spain.
[9] F. Asano, Y. Suzuki, and T. Sone. "Role of spectral cues in median plane localization." J. Acoust. Soc. Am., 88, pp. 159-168, 1990.
[10] C. D. Gilbert. “Adult cortical dynamics.” Physiol.
Rev., 78, pp. 467-485, 1998.
[11] B. G. Shinn-Cunningham, N. I. Durlach, R. M. Held. "Adapting to supernormal auditory localization cues. I. Bias and resolution." J. Acoust. Soc. Am., 103, pp. 3656-3666, 1998.
[12] P. M. Hofman, J. G. A. Van Riswick, A. J. Van Opstal. "Relearning sound localization with new ears." Nat. Neurosci., 1, pp. 417-421, 1998.
[13] B. Gardner, K. Martin. HRTF Measurements of a KEMAR Dummy-Head Microphone. http://sound.media.mit.edu/resources/KEMAR.html, visited June 2010.
Article
The role of spectral cues in the sound source to ear transfer function in median plane sound localization is investigated in this paper. At first, transfer functions were measured and analyzed. Then, these transfer functions were used in experiments where sounds from a source on the median plane were simulated and presented to subjects through headphones. In these simulation experiments, the transfer functions were smoothed by ARMA models with different degrees of simplification to investigate the role of microscopic and macroscopic patterns in the transfer functions for median plane localization. The results of the study are summarized as follows: (1) For front-rear judgment, information derived from microscopic peaks and dips in the low-frequency region (below 2 kHz) and the macroscopic patterns in the high-frequency region seems to be utilized; (2) for judgment of elevation angle, major cues exist in the high-frequency region above 5 kHz. The information in macroscopic patterns is utilized instead of that in small peaks and dips.