Vol. 98 (2012) 61–71, DOI 10.3813/AAA.918492
Distance Perception in Interactive Virtual Acoustic Environments using First and Higher Order Ambisonic Sound Fields
Gavin Kearney 1), Marcin Gorzel 2), Henry Rice 3), Frank Boland 2)
1) Department of Theatre, Film and Television, University of York, United Kingdom.
2) Department of Electronic and Electrical Engineering, Trinity College Dublin, Ireland.
[gorzelm, fboland]
3) Department of Mechanical and Manufacturing Engineering, Trinity College Dublin, Ireland.
In this paper, we present an investigation into the perception of source distance in interactive virtual auditory environments in the context of First (FOA) and Higher Order Ambisonic (HOA) reproduction. In particular, we investigate the accuracy of sound field reproduction over virtual loudspeakers (headphone reproduction) with increasing Ambisonic order. Performance of 1st, 2nd and 3rd order Ambisonics in representing distance cues is assessed in subjective audio perception tests. Results demonstrate that 1st order sound fields can be sufficient in representing distance cues for Ambisonic-to-binaural decodes.
PACS no. 43.20.-f, 43.55.-n, 43.58.-e, 43.60.-c, 43.71.-k, 43.75.-z
1. Introduction
Recent advances in interactive entertainment technology have led to visual displays with a convincing perception of source distance, based not only on stereo vision techniques, but also on real time graphics rendering technology for correct motion parallax [1, 2].

Typically, such presentations are accompanied by loudspeaker surround technology based on amplitude panning techniques and aimed at multiple listeners. However, in interactive virtual environments, headphone listening allows for greater control over personalized sound field reproduction. One method of auditory spatialization is to incorporate Head Related Transfer Functions (HRTFs) into the headphone reproduction signals. HRTFs describe the interaction of a listener's head and pinnae on impinging source wavefronts. It has been shown that for effective externalization and localization to occur, head-tracking should be employed to control this spatialization process [3], particularly where non-individualised HRTFs are used. However, the switching of the directionally dependent HRTFs with head movement can lead to auditory artifacts caused by wave discontinuity in the convolved binaural signals [4]. A more flexible solution is to form 'virtual loudspeakers' from HRTFs, where the listener is placed at the centre of an imaginary loudspeaker array. Here, the loudspeaker feeds are changed relative to the head position and any technique for sound source spatialization over loudspeakers can be used. Many different spatialization systems have been proposed for such application in the literature, most notably Vector Based Amplitude Panning (VBAP) [5] and Wavefield Synthesis [6]. However, the Ambisonics system [7], which is based on the spherical harmonic decomposition of the sound field, represents a practical and asymptotically holographic approach to spatialization. It is well known in Ambisonic loudspeaker reproduction that as the order of sound field representation gets higher, the localization accuracy increases due to greater directional resolution.

However, there are many unanswered questions about the capability of Ambisonic techniques with regard to the perception of depth and distance. In this paper, we want to investigate whether enhanced directional accuracy of direct sound and early reflections in a sound field can possibly lead to better perception of environmental depth, and thus better localization of the sound source distance in this environment. We approach the problem by means of subjective listening tests in which we compare the perception of distance of real sound sources to First Order Ambisonic (FOA) and Higher Order Ambisonic (HOA) sound fields presented over headphones.

This paper is outlined as follows: We begin by presenting a succinct review of the relevant psychoacoustical aspects of auditory localization and distance perception. We then outline the incorporation of Ambisonic techniques into virtual loudspeaker reproduction and the subsequent re-synthesis of measured FOA sound fields into higher orders. A case study investigating the perception of source distance at higher Ambisonic orders is then presented through subjective listening tests.

Received 25 February 2011, accepted 1 October 2011.
© S. Hirzel Verlag · EAA
ACTA ACUSTICA UNITED WITH ACUSTICA, Vol. 98 (2012). Kearney et al.: Distance perception
2. Distance Perception
It is important to note that throughout the literature there exists a clear distinction between 'distance' and 'depth', both understood as perceptual attributes of sound. According to [8], 'distance' is related to the physical range between the sound source and a listener, whereas 'depth' relates to the recreated auditory scene as a whole and concerns a sense of perspective in that scene.
2.1. Distance Perception in a Free Field
Although the human ability to perceive sources at different distances is not fully understood, there are several key factors which are known to contribute to distance perception. In the first case, changes in distance lead to changes in the monaural transfer function (the sound pressure at one ear). This is shown in Figure 1 for a spherical model of a head. We see that for sources at less than 1 m distance, the sound pressure level varies depending on the angle of incidence, due to the shadowing effects of the head. Beyond 1 m, the intensity of the source decays according to the inverse square law.

However, absolute monaural cues will only be meaningful if we have some prior knowledge of the source level, i.e., how familiar we are with the source. In other words, a form of semiosis occurs, where the perception of localization is based on anticipation and experience [9]. For example, for normal level speech (approximately 60 dB at 1 m), we expect nearer sources to be loud, and quieter sources to be further away. However, this is more difficult to assess for synthetic sounds or sounds that we are unfamiliar with.

It is interesting to note that for sources in the median plane, the level at distances less than 1 m does not change as dramatically as for sources located at the ipsilateral point. This will not significantly affect the low frequency Interaural Time Difference (ITD), but it is reflected in the Interaural Level Difference (ILD), as shown in Figure 2. We note that the most extreme ILD is exhibited at the side of the head (90°), due to the maximum head shadowing effect. For a similar reason, subconscious head movements may be regarded as another important cue, since level changes close to the source will be more apparent than far from it [10]. Thus, near-field ILD cues exist which aid us in discriminating source distance.

On the other hand, for larger distances and high sound pressure levels, the propagation speed of a sound wave in a medium ceases to be constant with frequency, which may lead to distortion of the waveform [11]. Furthermore, sound waves travelling a substantial distance also undergo a process of energy absorption by water molecules in the atmosphere. This is more apparent for the high-frequency energy of the wave and leads to spectral changes (low-pass filtering) of the sound being heard.
2.2. Distance Perception in a Reverberant Field
In reverberant rooms, the ratio of the direct to reverberant sound plays an extremely important role in distance perception. For near sources, where the direct field energy is much greater than the reverberant field, the sound pressure level changes approximately in accordance with free-field conditions. However, for source-listener distances greater than the critical distance, the level of reverberation is in general independent of the source position due to the homogeneous level of the diffuse field, and the direct to reverberant ratio changes by approximately 6 dB per doubling of distance from the source.

Figure 1. RMS monaural transfer function for a spherical head model at the left ear for a broadband source at different angles with varying source distance (reference = plane wave at (0°, 0°)).

Figure 2. Interaural level difference of a spherical head model for a broadband source at different angles with varying source distance.
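The roughly 6 dB-per-doubling behaviour of the direct-to-reverberant ratio described above can be sketched numerically. This is a minimal illustration, not part of the original study; the function name and the idealized constant-reverberant-level assumption are ours:

```python
import math

def direct_to_reverberant_db(r, r_crit):
    """Idealized direct-to-reverberant energy ratio in dB.

    Assumes the direct level follows the inverse square law while the
    reverberant (diffuse) level stays constant beyond the critical
    distance r_crit, so the ratio is 0 dB at r = r_crit and falls by
    about 6 dB per doubling of r.
    """
    return 10.0 * math.log10((r_crit / r) ** 2)

# direct_to_reverberant_db(2.0, 2.0) -> 0.0
# direct_to_reverberant_db(4.0, 2.0) -> approx. -6.02
```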
The directions of arrival of the early reflections are another parameter which changes according to the source-listener position and can be regarded as an important factor in creating environmental depth. Whether it is useful to listeners in determining the distance to the sound source in the presence of other cues, like sound intensity, direct to reverberant energy ratio or the arrival pattern of delays, remains an open question that needs to be addressed. Ambisonics allows for enhanced directional reproduction of the deterministic components of a sound field by increasing the order of spherical harmonic decomposition. Importantly, better directional localization can be achieved without affecting other important cues for distance estimation, like overall sound intensity or direct to reverberant energy ratio. Thus it can constitute an ideal framework for testing whether less apparent properties of a sound field can influence the perception of distance.
2.3. Former Psychoacoustical Studies on Distance
The perception of distance has been shown to be one that is not linearly proportional to the source distance. For example, both Nielsen et al. [12] and Gardner [13] have shown that the localization of speech signals is consistently underestimated in an anechoic environment. This underestimation has also been shown by other authors in the context of reverberant environments, both real and virtual. In [14], Bronkhorst et al. demonstrate that in a damped virtual environment, sources are consistently perceived to be closer than in a reverberant virtual environment, due to the direct to reverberant ratio. In their studies, the room simulation is conducted using simulated Binaural Room Impulse Responses (BRIRs) created from the image source method [15]. They show how perceived distance increases rapidly with the number and amplitude of the reflections.

In a similar study, Rychtarikova et al. [16] investigated the difference in localization accuracy between real rooms and computationally derived BRIRs. Their findings show that at 1 m, localization accuracy in both the virtual and real environments is in good agreement with the true source position. However, at 2.4 m the accuracy degrades, and high frequency localization errors were found in the virtual acoustics, pertaining to the difference in HRTFs between the model and the subject. In the same vein, Chan et al. [17] have shown that distance perception using recordings made from in-ear microphones on individual subjects again leads to underestimation of the source distance in virtual reverberant environments, more so than with real sources.

Waller [18] and Ashmead et al. [10] have identified that one of the factors improving distance perception is listener movement in the virtual or real space. It is therefore crucial to account for any listener movements (or lack thereof) in the experimental design. Similarly, for headphone reproduction of virtual acoustic environments, small, subconscious head rotations may lead to improvements in distance perception by providing enhanced ILD and ITD cues. Therefore, the sound field transformations should reflect well the small changes of orientation of the listener's head.
3. Ambisonic Spatialization

Ambisonics was originally developed by Gerzon, Barton and Fellgett [7] as a unified system for the recording, reproduction and transmission of surround sound. The theory of Ambisonics is based on the decomposition of the sound field measured at a single point in space into spherical harmonic functions defined as

Y^σ_mn(Φ, Θ) = A_mn P_mn(sin Θ) · { cos(nΦ) for σ = +1; sin(nΦ) for σ = −1 }   (1)

where m is the order and n is the degree of the spherical harmonic, and P_mn is the fully normalized (N3D) associated Legendre function. The coordinate system used comprises x, y and z axes pointing to the front, left and up respectively; Φ is the azimuthal angle with clockwise rotation and Θ is the elevation angle from the x-y plane. For each order m there are (2m + 1) spherical harmonics.
In order for plane wave representation over a loudspeaker array we must ensure that

s Y^σ_mn(Φ, Θ) = Σ_{i=1..I} g_i Y^σ_mn(Φ_i, Θ_i)   (2)

where s is the pressure of the source signal from direction (Φ, Θ) and g_i is the i-th loudspeaker gain from direction (Φ_i, Θ_i). We can then express the left hand side of equation (2) in vector notation, giving the Ambisonic channels

B = s [Y^1_00(Φ, Θ), . . . , Y^σ_mn(Φ, Θ)]^T   (3)

Equation (2) can then be rewritten as

B = C g   (4)

where C are the encoding gains associated with the loudspeaker positions and g is the loudspeaker signal vector. In order to obtain g, we require a decode matrix, D, which is the inverse of C. However, to invert C we need the matrix to be square, which is only possible when the number of Ambisonic channels is equal to the number of loudspeakers. When the number of loudspeaker channels is greater than the number of Ambisonic channels, which is usually the case, we then obtain the pseudo-inverse of C, where

D = pinv(C) = C^T (C C^T)^{-1},  so that  g = D B.   (5)
Since the sound field is represented by a spherical coordinate system, sound field transformation matrices can be used to rotate, tilt and tumble the sound fields. In this way, the Ambisonic signals themselves can be controlled by the user, allowing for the virtual loudspeaker approach to be employed. For 3-D reproduction, the number I of virtual loudspeakers employed with the Ambisonics approach is dependent on the Ambisonic order m, where

I ≥ (m + 1)²   (6)
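The encode/decode chain of equations (2)-(5) can be sketched with a first order example. This is a hedged illustration: the unity W gain convention, the helper names and the cube layout below are our assumptions, not the paper's exact implementation:

```python
import numpy as np

def encode_foa(az, el):
    # First order encoding gains (W, X, Y, Z) for a plane wave from
    # azimuth az / elevation el in radians; the unity W gain is one
    # of several possible conventions.
    return np.array([1.0,
                     np.cos(az) * np.cos(el),
                     np.sin(az) * np.cos(el),
                     np.sin(el)])

def decode_matrix(speaker_dirs):
    # Columns of C hold the encoding gains of each loudspeaker
    # direction; the pseudo-inverse gives the decoder D (equation 5).
    C = np.column_stack([encode_foa(az, el) for az, el in speaker_dirs])
    return np.linalg.pinv(C), C

# Cube layout (8 loudspeakers), as used for the FOA-to-binaural decode.
cube = [(np.deg2rad(az), np.deg2rad(el))
        for az in (45, 135, 225, 315) for el in (-35.26, 35.26)]
D, C = decode_matrix(cube)            # D has shape (8, 4)
B = encode_foa(np.deg2rad(30), 0.0)   # Ambisonic channels of a source
g = D @ B                             # loudspeaker gains, g = D B
```

Since the cube directions span all four first order harmonics, re-encoding the loudspeaker gains reproduces the original Ambisonic channels: `C @ g` matches `B`.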
4. Virtual Loudspeaker Reproduction

In the 'virtual loudspeaker' approach, HRTFs are measured at the 'sweet-spot' (the limited region in the centre of a reproduction array where an adequate spatial impression is generally guaranteed) in a multi-loudspeaker reproduction setup, and the resultant binaural playback is formed from the convolution of the loudspeaker feeds with the virtual loudspeakers. This concept is illustrated in Figure 3. For the left ear we have

p_L = Σ_{i=1..I} h_Li ∗ q_i   (7)

where ∗ denotes convolution, h_Li is the left ear HRIR corresponding to the i-th virtual loudspeaker and q_i is the i-th loudspeaker feed. Similar relations apply for the right ear signal. This method was first introduced by McKeag and McGrath [19] and examples of its adoption can be found in [20] and [21]. This approach has major computational advantages, since a complex filter kernel is not required and head rotation can be simulated by changing the loudspeaker feeds as opposed to the HRTFs. Whilst the HRTFs in this case play an important role in the spatialization, ultimately it is the sound field creation over the virtual loudspeakers which gives the overall spatial impression. Most existing research uses a block frequency domain approach to this convolution. However, given that the virtual loudspeaker feeds are controlled via head-tracking in real-time, a time-domain filtering approach can also be utilized. For short filter lengths, obtaining the output in a point-wise manner avoids the inherent latencies introduced by block convolution in the frequency domain. A strategy for significant reduction of the filter length without artifacts has been proposed in [22].
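Equation (7) amounts to a sum of per-loudspeaker convolutions for each ear, which a few lines of code make concrete. This is a sketch; the array shapes and function name are our choices:

```python
import numpy as np

def binauralize(feeds, hrirs_left, hrirs_right):
    """Virtual loudspeaker rendering: convolve each loudspeaker feed
    q_i with the left/right ear HRIR of virtual loudspeaker i and sum
    (equation 7 and its right ear counterpart).

    feeds:                 array of shape (I, n_samples)
    hrirs_left/hrirs_right: arrays of shape (I, n_taps)
    """
    n = feeds.shape[1] + hrirs_left.shape[1] - 1
    left, right = np.zeros(n), np.zeros(n)
    for q, h_l, h_r in zip(feeds, hrirs_left, hrirs_right):
        left += np.convolve(q, h_l)
        right += np.convolve(q, h_r)
    return left, right
```

Head rotation only changes `feeds` (via the sound field rotation matrices), so the HRIR set stays fixed, which is the computational advantage noted above.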
5. Higher Order Synthesis

In order to compare the distance perception of different orders of Ambisonic sound fields, it is desirable to take real world sound field measurements. However, the formation of higher order spherical harmonic directional patterns is non-trivial. Thus, in order to change FOA impulse responses to HOA representations, we employ a perceptually based approach which allows us to synthesize the increased directional resolution that would be achieved with a HOA sound field recording. For this we adopt the directional analysis method of Pulkki and Merimaa, found in [23]. Here the B-format signals are analyzed in terms of sound intensity and energy in order to derive time-frequency based direction of arrival and diffuseness. The instantaneous intensity vector is given from the pressure p and particle velocity u as

I(t) = p(t) u(t)   (8)

Since we are using FOA impulse response measurements, the pressure can be approximated by the 0th order Ambisonic component w(t), which is omnidirectional,

p(t) ≈ w(t)   (9)

and the particle velocity by

u(t) ≈ (1 / (√2 Z_0)) [x(t) e_x + y(t) e_y + z(t) e_z]   (10)

where e_x, e_y and e_z represent Cartesian unit vectors, x(t), y(t), z(t) are the FOA signals and Z_0 is the characteristic acoustic impedance of air.

Figure 3. The virtual loudspeaker reproduction concept.
The instantaneous intensity represents the direction of the energy transfer of the sound field, and the direction of arrival can be determined simply as the opposite direction of I. For FOA, we can calculate the intensity for each coordinate axis, and in the frequency domain. Since a portion of the energy will also oscillate locally, a diffuseness estimate can be made from the ratio of the magnitude of the intensity vector to the overall energy density E, given as

ψ = 1 − ||⟨I⟩|| / (c ⟨E⟩)   (11)

where ⟨·⟩ denotes time averaging, ||·|| denotes the norm of the vector and c is the speed of sound. The diffuseness estimate will yield a value of zero for incident plane waves from a particular direction, but will give a value of 1 where there is no net transport of acoustic energy, such as in the cases of reverberation or standing waves. Time averaging is used since it is difficult to determine an instantaneous measure of diffuseness.
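As a rough illustration of the diffuseness estimate, the following computes it broadband over a single frame, whereas the analysis above works per time-frequency bin; the constants and the function signature are our assumptions:

```python
import numpy as np

RHO0, C_SOUND = 1.2, 343.0  # air density (kg/m^3), speed of sound (m/s)

def diffuseness(p, u):
    """psi = 1 - ||<I>|| / (c <E>) from pressure p (shape [n]) and
    particle velocity u (shape [3, n]), averaged over the whole frame."""
    I = p * u                                        # instantaneous intensity
    E = 0.5 * RHO0 * (u ** 2).sum(axis=0) \
        + p ** 2 / (2.0 * RHO0 * C_SOUND ** 2)       # energy density
    return 1.0 - np.linalg.norm(I.mean(axis=1)) / (C_SOUND * E.mean())

# A plane wave travelling along +x (u = p/(rho0*c) * e_x) gives psi ~ 0;
# zero net particle velocity (e.g. a standing wave node) gives psi = 1.
```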
The output of the analysis is then subject to smoothing based on the Equivalent Rectangular Bandwidth (ERB) scale, such that the resolution of the human auditory system is approximated. Since the frequency dependent direction of arrival of the non-diffuse portion of the sound field can be determined, HOA reproduction can be achieved by re-encoding point-like sources, corresponding to the direction indicated in each temporal average and frequency band, into a higher order spherical harmonic representation. The resultant Ambisonic signals are then weighted in each frequency band k according to (1 − ψ_k). However, it is only vital to re-encode non-diffuse components to higher order, and the diffuse field can be obtained by multiplying the FOA signals by ψ_k and forming a first order decode. This is justified since source localisation is dependent on the direction of arrival of the direct sound and early reflections, and not on late room reverberation [24]. Thus, from the perceptual point of view, it is questionable whether there is a need to preserve the full directional accuracy of the reverberant field. Furthermore, if there exists a general directional distribution to the diffuse field, this will still be preserved in first order form. On the other hand, the diffuse component should not be simply derived from the 0th order signal. One can easily see that such a solution would provide perfectly correlated versions of the diffuse field to the left and right ear signals, which has no equivalent in the physical world (i.e. a real, physical sound field). Moreover, interaural decorrelation is an important factor in providing spatial impression in enclosed environments [25].
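The per-band up-mix logic (re-encode the non-diffuse part at the estimated direction, leave the diffuse part first order) can be sketched as follows. This is a deliberately simplified, horizontal-only stand-in: the 2D circular harmonics, the direct application of the (1 − ψ) weight to the omni signal, and all names are our assumptions rather than the paper's exact processing:

```python
import numpy as np

def encode_hoa2d(az, order):
    # Horizontal-only (2D) Ambisonic encoding gains up to `order`:
    # [1, cos(az), sin(az), cos(2*az), sin(2*az), ...]
    gains = [1.0]
    for m in range(1, order + 1):
        gains += [np.cos(m * az), np.sin(m * az)]
    return np.array(gains)

def upmix_band(w_band, az_est, psi, order=3):
    # Non-diffuse part of the omni band signal, re-encoded at the
    # estimated direction of arrival and weighted by (1 - psi).
    # The diffuse remainder (weight psi) would stay in first order form.
    direct = (1.0 - psi) * np.asarray(w_band)
    return np.outer(encode_hoa2d(az_est, order), direct)
```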
Figure 4 shows an example of the first 20 ms of a 1st order impulse response taken in a reverberant hall [26]. Here the source was located 3 m from a Soundfield ST350 microphone, and the Spatial Room Impulse Response (SRIR) was captured using the exponentially swept-sine tone technique [27]. In these plots, particular attention is drawn to the direct sound (coming from directly in front of the microphone) and a left wall reflection at approximately 14 ms. It can be seen that the directional resolution increases significantly with the HOA representation. It should be noted that the A-format capsules on sound field microphones only display adequate directionality up to 10 kHz [28]. Spatial aliasing is therefore an issue for high frequencies and, as a result, the directional information above 10 kHz cannot be relied upon.
6. Method: Localization of Distance of Test Sounds
Different protocols have been used in the literature for subjective assessment of distance perception, most notably verbal report [29, 30], direct or indirect blind walking [31, 32] and imagined timed walking [32]. All of these methods have proved to provide reliable and comparable results for both auditory and visual stimuli, with direct blind walking exhibiting the least between-subject variability [31, 32].
In former work [26], the authors of this paper developed a method where subjects indicated the perceived distance of real and virtual sound sources by selecting one of several physical loudspeakers lined up (and slightly offset in order to provide 'acoustic transparency') in front of their eyes.

Figure 4. Ambisonic sound field from a 1st order measurement with a Soundfield ST350: (a) 1st order representation, (b) 3rd order up-mix. Annotations mark the direct sound and a left wall reflection.
However, for the present study, in order to completely eliminate any possible anchors as well as visual cues, it was decided to utilize the method of direct blind walking. One of the main concerns in the experiment was a direct comparison of distance perception of real sound sources versus virtual sound sources presented over headphones. Due to different apparatus requirements, the experiment had to be conducted in two separate phases.
6.1. Participants
Seven participants aged 24–58 took part in the experiment. All subjects were of good hearing and were either music technology students or practitioners actively involved in audio research or production. Prior to the test, HRIR data for all the participants was obtained in a sound-proof, large (18 × 15 × 10 m³) but quite damped (T60 @ 1000 Hz = 0.57 s) multipurpose room (the Black Box) in the Department of Theatre, Film and Television at the University of York. Additional damping was assured by thick, heavy curtains covering all four walls and a carpet on the floor. The measurement process consisted of a standard procedure where miniature omnidirectional microphones (Knowles FG-23629-P16) were placed at the entrance of the blocked ear canal in order to capture the acoustic pressure generated by one loudspeaker at a time, located at constant distance and varying angular direction.

Figure 5. Measuring Head Related Impulse Responses with miniature microphones.
Subjects were seated on an elevated platform so that their ears were 2.20 m above the ground and their head was in the centre of a spherical loudspeaker array, arranged in diametrically opposed pairs. The ear height was calibrated using a laser guide, as shown in Figure 5. The array consisted of 16 full range Genelec 8050A loudspeakers, since the intention was to reproduce Ambisonic sound fields up to and including 3rd order. This 3-D setup, shown in Figure 6, comprised a flat-front, horizontal octagon and a cube (four loudspeakers on top, and four on the bottom). The radius of the loudspeaker array (and thus the virtual loudspeaker array) was 3.27 m. For the FOA-to-binaural decode, only virtual loudspeakers from the cube configuration were utilized, since no directional resolution is gained by using a higher number of loudspeakers. Furthermore, despite careful alignment, oversampling of the sound field with higher numbers of speakers has the potential to yield sound field distortions [33]. Note that for 2nd and 3rd order reproduction, all 16 loudspeakers were used. Although the oversampled configuration was not optimal from the 2nd order reproduction point of view, it was not possible to easily and accurately rearrange the loudspeaker array in order to accommodate a different layout.

Figure 6. Array of 16 loudspeakers used for HRIR measurements.

HRIRs were captured using the exponentially swept-sine tone technique [27] at a 44.1 kHz sampling rate and 16-bit resolution. Since the measurement environment was not fully anechoic, further processing of the measured data was necessary. The HRIRs were tapered before the arrival of the first reflection (from the floor), yielding filter kernels with 257 taps, and were subsequently diffuse-field equalized.
6.2. Stimuli
The stimuli used in the experiment were pink noise bursts and phonetically balanced phrases selected from the TIMIT Acoustic-Phonetic Continuous Speech Corpus database, recorded by a female reader [34]. A sampling rate of 44.1 kHz and 16-bit resolution was used in both cases. These two sample types were selected in order to represent both unfamiliar and familiar sound sources. They were presented to the subjects in a pseudo-randomized manner to avoid any ordering effects.
For headphone reproduction, prior to the test phase, FOA impulse response measurements were taken from the listener position of each loudspeaker using the exponentially swept-sine tone technique [27]. From these measurements, 2nd and 3rd order impulse response sets were extracted using the directional analysis approach outlined in section 5. 0th order Ambisonics does not provide any directional information, which means that it would lack the cues that are investigated in the higher order renderings. Therefore, it was decided not to include it in this comparison.

The only psychoacoustical optimization applied to the Ambisonic decodes was shelf filtering, intended to satisfy Gerzon's localization criteria for a maximized velocity decode at low frequencies and an energy decode at higher frequencies [35]. This involved changing the ratio of the pressure to velocity components at low and high frequencies. Whilst the crossover frequency for the high frequency boost in the pressure channel at first order is normally in the region of 400 Hz for regular loudspeaker listening, here we restore the crossover point to 700 Hz, since the subject is always perfectly centred in the virtual loudspeaker array.
6.3. Test Environment and Apparatus
Aseries of subjective listening tests wasconducted in the
Large Rehearsal Room in the Department of Theatre, Film
and Television in the University of Yo rk. The room dimen-
sions were 12 ×9×3.5m
3and the spatially averaged T60
at 1kHz was0.26 s. Alow T60 wasdesired for this study,
so the walls were covered with thick, heavy curtains, as
shown in Figure 7. Since the up-mix from 1st to 2nd and 3rd
order Ambisonics concerned only the deterministic part of
the measured SRIRs, it wasassumed that no advantage
would be gained from using amore reverberant space.
Aprofessional camera dolly track wasset up roughly
in the direction of the diagonal of the room. It not only
allowed for testing distances of the real loudspeaker up
to 8m butits non-symmetrical position also assured that
early reflections of the same order from dierent surfaces
did not easily coincide at the subjects ears, butinstead ar-
rivedatdierent times. Asingle full-range loudspeaker
(Genelec 8050A)was mounted on acamera dolly which
enabled it to be noiselessly translated by the experiment
assistant to dierent locations. The guiding rope washung
along the dolly track which wasintended to help and guide
the participants when walking toward the sound source.
Since it wasnot possible to walk exactly on the dolly track,
it wasdecided that the walking path would be directly next
to it, as shown in Figure 7. The only weakness of this so-
lution wasthat the sound source horizontal angle varied
from 14.04 degrees at the closest distance (2 m) to 3.58 de-
grees at the furthest distance (8 m).However,this did not
have anyeect on the distance judgments for tworeasons:
Firstly,the subjects were allowed (orevenencouraged)to
rotate their head in order to fully utilize the available ITD
and ILD cues. Secondly,the initial head orientation was
not in anyway fixed. This, combined with the fact that
there were no clear cues to the subject’sinitial orientation
in the room at the origin, made this small initial angular
oset unimportant. Furthermore, none of the participants
reported anybias in their assessment based on the horizon-
tal oset of the sound source.
Fortrials with binaural presentation, high quality open
back headphones (AKG-K601)were used, which exhibit
lowlevels of interaural magnitude and group delay dis-
tortion. Sound field rotation, tilt and tumble control was
implemented via the TrackIR 5infra-red head tracking
system [36], resulting in stable virtual images with head
rotations. The system responsible for playback of virtual-
ized sound sources wascompletely built in the Pure Data
visual programming environment [37] and its combined
latency(including head-tracker data porting and audio up-
date rate)was 20 ms.
Figure 7. Participant performing a trial during the experiment.
6.4. Procedure
In the experiment, subjects entered the test environment blindfolded and without any prior expectation regarding the room dimensions, its acoustic properties or the test apparatus. They were guided by the experimenter to the reference point (the 'origin'). After a short explanation of the experiment objectives, a training session began with a short (3–5 min) walking-only trial until participants felt comfortable with walking blindfolded and using a guide rope. Next, they performed 4–6 training trials in which the same test stimuli to be used in the experiment (speech and pink noise) were played by the loudspeaker at randomly chosen distances. No feedback was given and no results were recorded after each training trial. The end of the training session was clearly announced and, after a 1 minute interval, the first phase of the test began.
In test phase I, participants were asked to listen to static sound sources at randomly chosen points, focusing on the perceived distance. They could listen to any audio sample as many times as they wished. During playback they were instructed to stay still and refrain from any translational head movements. However, they were encouraged to rotate their head freely. After the playback had stopped, they were asked to walk, guided by the rope, to the point where they thought the sound originated from. The distance walked was subsequently recorded by the assistant using a laser measuring tool, after which the participant walked backwards to the origin. In the meantime, the loudspeaker was noiselessly translated to its new position and the test proceeded. As in the training session, no feedback was given at any stage.

During the first test phase, participants had to indicate the perceived distance for sound sources randomly located at 2 m, 4 m, 6 m or 8 m. Taking into account that both speech and pink noise burst samples were used (in a pseudo-random order), the number of trials in the first phase added up to 8. Each subject performed all the trials only once.
Upon completion of the first phase of the test there was
a short (approximately 2-minute) interval, required
in order to put on the headphones and calibrate
the head-tracking system. In phase II, subjects were also
asked to identify the sound source distance, but this time
using Ambisonic sound fields presented over headphones.
Other than the fact that headphones and the head-tracking
system were used, the test protocol remained the same as
in phase I. However, since there were three
playback configurations to be tested (1st, 2nd and 3rd order
Ambisonics), participants had to perform 24 trials instead
of 8. Instead of separate phases for each Ambisonic order,
all samples were randomly presented to the subject within
the same test phase. Again, subjects performed all the trials
only once and no feedback was given at any stage.
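The trial budget described above can be sanity-checked with a short sketch (a hypothetical reconstruction of the schedule; the labels and the randomization are ours, not the authors' test software):

```python
import random

# Phase II: 3 Ambisonic orders x 4 distances x 2 stimuli = 24 trials,
# presented in one randomized block (phase I omits the order factor: 8 trials).
orders = ["FOA", "SOA", "TOA"]
distances_m = [2, 4, 6, 8]
stimuli = ["speech", "pink noise"]

phase2 = [(o, d, s) for o in orders for d in distances_m for s in stimuli]
random.shuffle(phase2)          # pseudo-random presentation order
phase1 = [(d, s) for d in distances_m for s in stimuli]

print(len(phase1), len(phase2))  # -> 8 24
```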
7. Results
The perceived sound source distance (indicated by the distance
walked) was collected from 7 subjects for 4 presentation
points (2 m, 4 m, 6 m and 8 m), two stimuli (female
speech and pink noise bursts) and four playback options:
1st, 2nd and 3rd order Ambisonics and real loudspeakers,
which for analysis we will denote FOA, SOA, TOA and
REAL respectively. In the headphone trials, none of the
participants reported in-head localization; however, there
were 3 cases where the proximity of the sound source was
so apparent that participants decided not to move at all.
In some cases, the virtual sound source was initially localized
behind the subject, but all participants were able to
resolve the confusion by applying head rotation.
We computed the mean values of walked distances µ for
each test condition along with the corresponding standard
errors se(µ). The results are presented separately for each
stimulus type, with 95% confidence intervals.
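The per-condition means µ and standard errors se(µ) can be computed as follows (a minimal sketch; the response values shown are hypothetical, not the study's data):

```python
import math

def mean_se(walked):
    """Mean walked distance, its standard error, and a 95% CI half-width."""
    n = len(walked)
    mu = sum(walked) / n
    var = sum((x - mu) ** 2 for x in walked) / (n - 1)   # sample variance
    se = math.sqrt(var / n)
    return mu, se, 1.96 * se                             # normal-approximation CI

# Hypothetical walked distances of 7 subjects for one condition:
walked = [3.1, 3.6, 4.2, 2.9, 3.8, 3.4, 3.3]
mu, se, half = mean_se(walked)
print(f"mu = {mu:.2f} m, se(mu) = {se:.2f} m, 95% CI = +/-{half:.2f} m")
```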
As expected, the perception of distance for the real sources
was more accurate for near sources. Beyond 4 m, distance
was consistently underestimated, which
is congruent with the previous studies outlined in section 2.
Furthermore, the standard deviation of localization
increases as the source moves further into the diffuse
field. We also see that unfamiliar stimuli produce greater
variability in subjects' answers. The mean localization of
the virtual sources follows the reference source localization
well. The answers for virtual sources deviate from
their means in roughly the same fashion as the answers for
reference sources, as localization becomes more difficult
within the diffuse field.
Since the study followed a within-subject factorial design
with 2 (stimuli) × 4 (playback conditions), a two-way ANOVA
was performed for each presentation distance in order to
investigate the effects of these two factors (referred to later
as factors A and B) as well as potential interaction effects.

Figure 8. Mean localization of real and virtual sound sources
(female speech): distance walked [m] versus real distance [m].

Figure 9. Mean localization of real and virtual sound sources
(pink noise bursts): distance walked [m] versus real distance [m].

The null hypothesis being tested here is that the mean perceived
distances for all the stimuli and playback methods do not differ
significantly:

H0: µFOA = µSOA = µTOA = µREAL = µ,
H1: not all localization means µi are the same.
No statistically significant effect of stimulus (familiar vs.
unfamiliar) on the perception of distance was found
(2 m: F(3,48) = 0.835, p = 0.365; 4 m: F(3,48) = 2.0462,
p = 0.159; 6 m: F(3,48) = 2.575, p = 0.115; 8 m: F(3,48) =
2.0462, p = 0.159). For distances of 4 m and more, the
playback option also had no statistically significant effect
(4 m: F(3,48) = 2.192, p = 0.101; 6 m: F(3,48) = 0.665,
p = 0.577; 8 m: F(3,48) = 0.202, p = 0.894).
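The two-way analysis can be sketched in pure Python as below (synthetic, purely additive data; the factor labels and values are illustrative only — the reported F and p values come from the authors' measured walked distances, with p obtained from the F distribution):

```python
def two_way_anova(data):
    """Fixed-effects two-way ANOVA with replication.
    data[(a, b)] is the list of replicate observations for levels a, b.
    Returns {effect: (F, df_effect, df_error)}; p follows from the F distribution."""
    A = sorted({a for a, _ in data})
    B = sorted({b for _, b in data})
    n = len(next(iter(data.values())))                    # replicates per cell
    grand = sum(x for v in data.values() for x in v) / (len(data) * n)
    cmean = {k: sum(v) / n for k, v in data.items()}      # cell means
    amean = {a: sum(cmean[(a, b)] for b in B) / len(B) for a in A}
    bmean = {b: sum(cmean[(a, b)] for a in A) / len(A) for b in B}
    ss_a = n * len(B) * sum((amean[a] - grand) ** 2 for a in A)
    ss_b = n * len(A) * sum((bmean[b] - grand) ** 2 for b in B)
    ss_ab = n * sum((cmean[(a, b)] - amean[a] - bmean[b] + grand) ** 2
                    for a in A for b in B)
    ss_e = sum((x - cmean[k]) ** 2 for k, v in data.items() for x in v)
    df_a, df_b = len(A) - 1, len(B) - 1
    df_ab, df_e = df_a * df_b, len(A) * len(B) * (n - 1)
    mse = ss_e / df_e
    return {"stimulus": (ss_a / df_a / mse, df_a, df_e),
            "playback": (ss_b / df_b / mse, df_b, df_e),
            "interaction": (ss_ab / df_ab / mse, df_ab, df_e)}

# Synthetic additive data: 2 stimuli x 4 playback conditions, 2 replicates per cell.
data = {(a, b): [10 + 2 * a + b - 1, 10 + 2 * a + b + 1]
        for a in (0, 1) for b in (0, 1, 2, 3)}
res = two_way_anova(data)
```

With purely additive cell means the interaction F statistic is exactly zero, which is the pattern reported above for factors A and B.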
However, a statistically significant difference was detected
for the distance of 2 m. In larger study designs with
multiple levels it is advisable to use the Honestly
Significant Difference (HSD) approach, since there is an
increased risk of a spuriously significant difference arising
purely by chance. So, in order to investigate where the
difference occurs, an HSD was computed (HSD = 1.423 m).

Table I. Mean localization [m] of virtual and real sound sources
at 2 m.

         FOA     SOA     TOA     REAL
Speech   1.119   1.389   0.841   1.638
Noise    0.877   1.001   0.902   1.641

Table II. Correlation coefficients ρ and corresponding p-values
for pairs of distance estimations for real and virtual sound
sources (Speech).

               ρ        p
Real vs FOA    0.9828   0.0172
Real vs SOA    0.9960   0.0040
Real vs TOA    0.9590   0.0410

Table III. Correlation coefficients ρ and corresponding p-values
for pairs of distance estimations for real and virtual sound
sources (Noise).

               ρ        p
Real vs FOA    0.9913   0.0087
Real vs SOA    0.9857   0.0143
Real vs TOA    0.9972   0.0028

If we now compile the table of mean perceived distances for
the sound sources located at 2 m (Table I), we can see that
all of the values clearly lie within a single HSD of each
other and cannot be distinguished. We can therefore safely
assume an ANOVA false alarm (Type I error) and no
statistically significant effect of playback method for the
sources at the distance of 2 m as well.
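Using the speech-stimulus means at 2 m together with the reported HSD, the pairwise check can be reproduced directly (assigning the four tabulated values to FOA, SOA, TOA and REAL in the order used in the text is our assumption):

```python
from itertools import combinations

# Mean perceived distances [m] at the 2 m point (speech stimulus, Table I):
means = {"FOA": 1.119, "SOA": 1.389, "TOA": 0.841, "REAL": 1.638}
HSD = 1.423  # Honestly Significant Difference reported in the text [m]

# Two playback conditions differ significantly only if their means lie
# more than one HSD apart; here even the largest gap is well inside it.
gaps = {(a, b): abs(means[a] - means[b]) for a, b in combinations(means, 2)}
largest = max(gaps.values())
print(f"largest pairwise gap = {largest:.3f} m (HSD = {HSD} m)")
```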
Lastly, for all the distances, no synergetic (interaction)
effects of factors A (stimuli) and B (playback conditions)
have been found.
Additionally, we calculated correlation coefficients ρ for
pairs of distance estimations for real and virtual sound
sources (either 1st, 2nd or 3rd order) and the two stimuli. In
all cases, high correlation coefficients were obtained,
which confirms our finding that, for these particular test
conditions, the perception of distance of binaurally rendered
Ambisonic sound fields of orders 1 to 3 cannot be
distinguished from the perception of distance of real
sound sources.
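The correlation measure itself is the sample Pearson coefficient; a minimal implementation follows (the mean-distance vectors shown are hypothetical — the published ρ values in Tables II and III come from the measured means):

```python
import math

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical mean walked distances at the 2, 4, 6 and 8 m presentation points:
real = [2.1, 3.7, 4.9, 5.8]   # REAL condition
foa  = [1.9, 3.5, 5.1, 5.6]   # FOA condition
print(f"rho = {pearson(real, foa):.4f}")
```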
8. Discussion
The results presented for real sources corroborate the classic
underestimation of source distance reported in the
literature. These results were used as a basis against which
to measure the ability of Ambisonic sound fields of different
orders to present sources at different distances. It was
expected that a further underestimation of the source distance
would ensue with the binaural rendering, as reported
in [17]. However, this was not the case, even for first order
presentations, and the apparent distances of the virtual
sources matched the real source distances well. One
should note that the major difference between this study
and that of [17] is our use of head-tracking, indicating
the importance of head movements in perceiving source
distance, which develops the findings of Waller [18] and
Ashmead et al. [10] on user interaction in a virtual space.
Further work is required to quantify this effect.
Moreover, the presented study demonstrates that the enhanced
directional accuracy gained by presenting sound
sources in HOA through head-tracked binaural rendering
does not yield a significant improvement in the perception
of source distance. What is noteworthy is that for each
order there is no significant difference in the perception of
the source location when compared to real-world sources.
We therefore conclude that 1st order playback provides
sufficient sound field directionality for distance perception.
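For context, the signal path implied by an Ambisonic-to-binaural decode over virtual loudspeakers begins with a first-order (B-format) encode and a per-speaker decode of the following kind (a sketch using classic FuMa conventions with W attenuated by 1/√2; the decode gains, normalization and the subsequent HRIR convolution stage are our assumptions, not the authors' exact decoder):

```python
import math

def encode_foa(s, az, el):
    """First-order B-format encode of a mono sample s at azimuth/elevation [rad]."""
    return (s / math.sqrt(2.0),                    # W: omnidirectional
            s * math.cos(az) * math.cos(el),       # X
            s * math.sin(az) * math.cos(el),       # Y
            s * math.sin(el))                      # Z

def speaker_feed(bfmt, az, el):
    """Feed for one virtual loudspeaker (cardioid-style decode); in a binaural
    renderer each feed is then convolved with the HRIR pair for its direction."""
    w, x, y, z = bfmt
    dot = (x * math.cos(az) * math.cos(el)
           + y * math.sin(az) * math.cos(el)
           + z * math.sin(el))
    return w / math.sqrt(2.0) + 0.5 * dot

b = encode_foa(1.0, math.radians(30), 0.0)
front = speaker_feed(b, math.radians(30), 0.0)     # speaker at the source: full level
back  = speaker_feed(b, math.radians(210), 0.0)    # opposite speaker: silent
```

With this cardioid-style weighting, a virtual speaker aligned with the source reproduces it at full level while the diametrically opposite speaker is silent.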
The presence of the ANOVA false alarm at the 2 m point
is of interest. It is noteworthy that the 2 m point represents
a source inside the virtual array geometry. It is a known
issue that virtual sound sources rendered inside the loudspeaker
array cannot be reproduced in a straightforward
way without artifacts. These artifacts include incorrect
wave-front curvature and insufficient bass boost.
In the first case, there is ample evidence in the literature
to suggest that wavefront curvature translates to
significant binaural cues for sound sources near the head
[30, 38]. It was already shown in section 2.1 that as a
source moves closer to the head, the levels of the monaural
transfer function and the ILD both change significantly
with source angle. However, this effect is not strong at 1 m
and beyond. For sources further away, it has been shown
in [39] that it is very difficult to assess distance by binaural
cues alone.
In the second case, the requirement for distance compensation
filtering due to near-field effects for the large
loudspeaker radius (3.27 m) and the given source distances
(>2 m) is only prominent below 100 Hz. For the female
speech test stimuli this will have no effect, since the
first formant frequencies do not go below 180 Hz.
Also, the current method employed for capturing HRIRs
allowed filters to be reliably obtained with a frequency
response reaching down to around 170 Hz, thereby also
band-limiting the delivery of the pink noise stimuli.
Finally, there was no significant difference in the results
presented for the different sources, although the greater
variance in the results for pink noise suggests that familiarity
with the source does indeed play a role in the perception
of source distance, as mentioned in section 2.3. Future
studies will investigate the use of these monaural cues
further, and will utilize 0th order sound field rendering, since
it removes the influence of any directional information.
Considering the aforementioned study of Bronkhorst et
al. [14], in which the accuracy of distance perception for
binaural playback increases with the number of reflections,
our findings demonstrate that the net effect of the monaural
cues of direct-to-reverberant ratio, level difference and time
of arrival of early reflections is of greater importance in
distance perception for binaural rendering than Ambisonic
directional accuracy beyond 1st order.
9. Conclusions
We have assessed through subjective analysis the perceived
source distance in virtual Ambisonic sound fields
in comparison to real-world sources. The hypothesis tested
was that enhanced directional accuracy of the deterministic
part of the sound field may lead to better reconstruction
of environmental depth and thus improve the perception
of sound source distance. However, it was shown that
Ambisonic reproduction matches the perceived real-world
source distances well even at 1st order, and no improvement
in this regard was observed when increasing the order. It
must be emphasized, though, that this analysis applies to
Ambisonic-to-binaural decodes with higher order synthesis
achieved using the directional analysis method of [23].
Therefore, further work will examine this topic for loudspeaker
reproduction, for both centre and off-centre listening,
as well as investigate the effectiveness of HOA synthesis
in comparison to real-world HOA measurements.
Acknowledgements

The authors gratefully acknowledge the participation of
the test subjects for both their time and constructive comments,
as well as the technical support staff at the Department
of Theatre, Film and Television at the University of
York for their assistance in the experimental setups. This
research is supported by Science Foundation Ireland.
References

[1] L. Fauster: Stereoscopic techniques in computer graphics. Technical paper, TU Wien, 2007.
[2] J. Lee: Head tracking for desktop VR displays using the [...], accessed 30th Sept. 2011.
[3] D. R. Begault: Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual sound source. J. Audio Eng. Soc. 49 (2001) 904–916.
[4] M. Otani, T. Hirahara: Auditory artifacts due to switching head-related transfer functions of a dynamic virtual auditory display. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E91-A (2008) 1320–1328.
[5] V. Pulkki: Virtual sound source positioning using Vector Base Amplitude Panning. J. Audio Eng. Soc. 45 (1997).
[6] A. J. Berkhout: A Holographic Approach to Acoustic Control. J. Audio Eng. Soc. 36 (1988) 977–995.
[7] M. A. Gerzon: Periphony: With-height sound reproduction. J. Audio Eng. Soc. 21 (1973) 2–10.
[8] F. Rumsey: Spatial quality evaluation for reproduced sound: Terminology, meaning, and a scene-based paradigm. J. Audio Eng. Soc. 50 (2002) 651–666.
[9] J. Blauert: Communication acoustics. Springer, 2008.
[10] D. H. Ashmead, D. L. Davis, A. Northington: Contribution of listeners' approaching motion to auditory distance perception. J. Exp. Psy.: Hum. Percep. and Perform. 21 (1995).
[11] E. Czerwinski, A. Voishvillo, S. Alexandrov, A. Terekhov: Propagation distortion in sound systems: Can we avoid it? J. Audio Eng. Soc. 48 (2000) 30–48.
[12] S. H. Nielsen: Auditory distance perception in different rooms. J. Audio Eng. Soc. 41 (1993) 755–770.
[13] M. B. Gardner: Distance estimation of 0° or apparent 0°-oriented speech signals in anechoic space. J. Acoust. Soc. Am. 45 (1969) 47–53.
[14] A. W. Bronkhorst, T. Houtgast: Auditory distance perception in rooms. Nature 397 (1999) 517–520.
[15] J. B. Allen, D. A. Berkley: Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 65.
[16] M. Rychtarikova, T. V. d. Bogaert, G. Vermeir, J. Wouters: Binaural sound source localization in real and virtual rooms. J. Audio Eng. Soc. 57 (2009) 205–220.
[17] J. S. Chan, C. Maguinness, D. Lisiecka, C. Ennis, M. Larkin, C. O'Sullivan, F. Newell: Comparing audiovisual distance perception in various real and virtual environments. Proc. of the 32nd Euro. Conf. on Vis. Percep., Regensburg, Germany, 2009.
[18] D. Waller: Factors affecting the perception of interobject distances in virtual environments. Presence: Teleoper. Virtual Environ. 8 (1999) 657–670.
[19] A. McKeag, D. McGrath: Sound field format to binaural decoder with head-tracking. Proc. of the 6th Australian Regional Convention of the AES, 1996.
[20] M. Noisternig, A. Sontacchi, T. Musil, R. Höldrich: A 3D Ambisonic based binaural sound reproduction system. Proc. of the 24th Int. Conf. of the Audio Eng. Soc., Alberta, Canada, 2003.
[21] B.-I. Dalenbäck, M. Strömberg: Real time walkthrough auralization - the first year. Proc. of the Inst. of Acous., Copenhagen, Denmark, 2006.
[22] C. Masterson, S. Adams, G. Kearney, F. Boland: A method for head related impulse response simplification. Proc. of the 17th European Signal Processing Conference (EUSIPCO), Glasgow, Scotland, 2009.
[23] J. Merimaa, V. Pulkki: Spatial impulse response rendering I: Analysis and synthesis. J. Audio Eng. Soc. 53 (2005).
[24] W. M. Hartmann: Localization of sound in rooms. J. Acoust. Soc. Am. 74 (1983) 1380–1391.
[25] D. Griesinger: Spatial impression and envelopment in small rooms. Proc. of the 103rd Conv. of the Audio Eng. Soc., New York, USA, 1997.
[26] G. Kearney, M. Gorzel, H. Rice, F. Boland: Depth perception in interactive virtual acoustic environments using higher order ambisonic soundfields. Proc. of the 2nd Int. Ambisonics Symp., Paris, France, 2010.
[27] A. Farina: Simultaneous measurement of impulse response and distortion with a swept-sine technique. Proc. of the 108th Conv. of the Audio Eng. Soc., Paris, France, 2000.
[28] M. Gerzon: The design of precisely coincident microphone arrays for stereo and surround sound. Proc. of the 50th Conv. of the Audio Eng. Soc., London, UK, 1975.
[29] C. Guastavino, B. F. G. Katz: Perceptual evaluation of multi-dimensional spatial audio reproduction. J. Acoust. Soc. Am. 116 (2004) 1105–1115.
[30] P. Zahorik: Assessing auditory distance perception using virtual acoustics. J. Acoust. Soc. Am. 111 (2002) 1832–.
[31] J. M. Loomis, R. L. Klatzky, J. W. Philbeck, R. G. Golledge: Assessing auditory distance perception using perceptually directed action. Perception and Psychophysics 60.
[32] T. Y. Grechkin, T. D. Nguyen, J. M. Plumert, J. F. Cremer, J. K. Kearney: How does presentation method and measurement protocol affect distance estimation in real and virtual environments? ACM Trans. Appl. Percept. 7 (2010) 26:1–.
[33] S. Bertet: Formats audio 3D hiérarchiques: Caractérisation objective et perceptive des systèmes ambisonics d'ordres supérieurs. Ph.D. dissertation, INSA Lyon, 2008.
[34] W. M. Fisher, G. R. Doddington, K. M. Goudie-Marshall: The DARPA speech recognition research database: Specifications and status. Proc. of the DARPA Workshop on Speech Recognition, 1986.
[35] M. A. Gerzon, G. J. Barton: Ambisonic decoders for HDTV. Proc. of the 92nd Conv. of the Audio Eng. Soc., Vienna, Austria, 1992.
[36] NaturalPoint: TrackIR 5. http://www.naturalpoint.com/trackir/, accessed 30th Sept. 2011.
[37] M. Puckette: Pure Data. [...], accessed 30th Sept. 2011.
[38] P. Zahorik, D. S. Brungart, A. W. Bronkhorst: Auditory distance perception in humans: A summary of past and present research. Acta Acustica united with Acustica 91 (2005).
[39] H. Wittek: Perceptual differences between Wavefield Synthesis and Stereophony. Department of Music and Sound Recording, School of Arts, Communication and Humanities, University of Surrey, UK, 2007.
... It is important to note that, overall, participants' perceived distance of sounds was about 15% more accurate for close stimuli than far stimuli. This result may be due, at least in part, to a general underestimation of distances in auditory representation for sounds located at the edge or beyond the reaching space (e.g., Zahorik et al. 2005;Kearney et al. 2012), though this aspect is still not well understood (Kolarik et al. 2016). ...
Full-text available
Previous studies have identified a ‘defensive graded field’ in the peripersonal front space where potential threatening stimuli induce stronger blink responses, mainly modulated by top–down mechanisms, which include various factors, such as proximity to the body, stimulus valence, and social cues. However, very little is known about the mechanisms responsible for representation of the back space and the possible role of bottom–up information. By means of acoustic stimuli, we evaluated individuals’ representation for front and back space in an ambiguous environment that offered some degree of uncertainty in terms of both distance (close vs. far) and front–back egocentric location of sound sources. We aimed to consider verbal responses about localization of sound sources and EMG data on blink reflex. Results suggested that stimulus distance evaluations were better explained by subjective front–back discrimination, rather than real position. Moreover, blink response data were also better explained by subjective front–back discrimination. Taken together, these findings suggest that the mechanisms that dictate blink response magnitude might also affect sound localization (possible bottom–up mechanism), probably interacting with top–down mechanisms that modulate stimuli location and distance. These findings are interpreted within the defensive peripersonal framework, suggesting a close relationship between bottom–up and top–down mechanisms on spatial representation.
... In its most equivalent conditions, it is possible to interpret that the transition from overestimation to underestimation occurred approximately at 40-45 cm (Hüg et al., 2019), 45-60 cm (Brungart, 1999), and 50-55 cm (Parseihian et al., 2014). Comparison with other studies of distance perception that included some close body positions is difficult due to methodological reasons, mainly related to the type of response, range of distances evaluated, the stimuli employed, and the test administration conditions (e.g., Zahorik, 2002;Fontana and Rocchesso, 2008;Kearney et al., 2012). ...
Full-text available
Despite the recognized importance of bodily movements in spatial audition, few studies have integrated action-based protocols with spatial hearing in the peripersonal space. Recent work shows that tactile feedback and active exploration allow participants to improve performance in auditory distance perception tasks. However, the role of the different aspects involved in the learning phase, such as voluntary control of movement, proprioceptive cues, and the possibility of self-correcting errors, is still unclear. We study the effect of guided reaching exploration on perceptual learning of auditory distance in peripersonal space. We implemented a pretest-posttest experimental design in which blindfolded participants must reach for a sound source located in this region. They were divided into three groups that were differentiated by the intermediate training phase: Guided, an experimenter guides the participant’s arm to contact the sound source; Active, the participant freely explores the space until contacting the source; and Control, without tactile feedback. The effects of exploration feedback on auditory distance perception in the peripersonal space are heterogeneous. Both the Guided and Active groups change their performance. However, participants in the Guided group tended to overestimate distances more than those in the Active group. The response error of the Guided group corresponds to a generalized calibration criterion over the entire range of reachable distances. Whereas the Active group made different adjustments for proximal and distal positions. The results suggest that guided exploration can induce changes on the boundary of the auditory reachable space. 
We postulate that aspects of agency such as initiation, control, and monitoring of movement, assume different degrees of involvement in both guided and active tasks, reinforcing a non-binary approach to the question of activity-passivity in perceptual learning and supporting a complex view of the phenomena involved in action-based learning.
... La raison en est que c'est la plus compliquée, l'incertitude dépendant fortement du stimulus sonore et des conditions. De manière générale, la distance sera sur-estimée pour des sources proches et sous-estimée pour des sources lointaines (Kearney et al. (2015); Zahorik and Wightman (2001b)). ...
La localisation sonore est le procédé utilisé par les êtres humains pour repérer un son dans l’espace. Afin de localiser ces sons, le cerveau traite l’information reçue, et crée des indices acoustiques. L’approche de la thèse pour la localisation sonore perceptive, reposant sur le travail d’Harald Viste pour la localisation de l’azimut, consiste à utiliser ces indices acoustiques dans un algorithme. L’algorithme initial est légèrement simplifié dans cette thèse, et testé dans des conditions réelles. De plus, une approche perceptive innovante pour la localisation de l’élévation est également présentée. La spatialisation sonore est le procédé inverse, permettant de produire un son que l’on percevra à la position souhaitée dans l’espace. Du fait de l’impossibilité d’avoir un système de diffusion en tout point de l’espace, il est nécessaire de recourir à des algorithmes de spatialisation, permettant par exemple des diffusions via des hautparleurs. L’approche perceptive de la thèse, basée sur le travail de Joan Mouba, est d’utiliser les indices acoustiques de la localisation sonore, dans ce travail en les créant dans les sources sonores spatialisées. Ce travail de thèse approfondit les recherches initiales, crée des outils pour aboutir à une proposition de méthode de spatialisation sonore perceptive 3D nommée STAR (Synthetic Transaural Audio Rendering), tout en validant la méthode par des tests rigoureusement menés.
... Results in Figure 8.4 demonstrates that FW group's participants that were exposed to a longer spatial boundary, present a smaller compression effect. This result can be compared to past studies concluding that the presence of congruent visual cues could enhance the accuracy of auditory distance perception [7,33,87]. ...
Full-text available
This thesis aims to investigate a variety of effects linking the auditory distance perception of virtual sound sources to the context of audio-only augmented reality (AAR) applications. It focuses on how its specific perceptual context and primary objectives impose constraints on the design of the distance rendering approach used to generate virtual sound sources for AAR applications. AAR refers to a set of technologies that aim to merge computer-generated auditory content into a user's acoustic environment. AAR systems have fundamental requirements as an audio playback system must enable a seamless integration of virtual sound events within the user's environment. Different challenges arise from these critical requirements. The first part of the thesis concerns the critical role of acoustic cue reproduction in the auditory distance perception of virtual sound sources in the context of audio-only augmented reality. Auditory distance perception is based on a range of cues categorized as acoustic, and cognitive. We examined which strategies for weighting auditory cues are used by the auditory system to create the perception of sound distance. By considering different spatial and temporal segmentations, we attempted to characterize how early energy is perceived in relation to reverberation. The second part of the thesis's motivations focuses on how, in AAR applications, environment-related cues could impact the perception of virtual sound sources. In AAR applications, the geometry of the environment is not always completely considered. In particular, the calibration effect induced by the perception of the visual environment on the auditory perception is generally overlooked. We also became interested in the instance in which co-occurring real sound sources whose placements are unknown to the user could affect the auditory distance perception of virtual sound sources through an intra-modal calibration effect.
... Following the upper reproduction path, the respective method-dependent loudspeaker signals are subsequently convolved with the individual playback HRTFs, representing a binaural downmix (Jot, Wardle, & Larcher, 1998;Kearney et al., 2012). As necessary for the respective evaluation approach, cf. ...
Full-text available
Hearing loss (HL) has multifaceted negative consequences for individuals of all age groups. Despite individual fitting based on clinical assessment, consequent usage of hearing aids (HAs) as a remedy is often discouraged due to unsatisfactory HA performance. Consequently, the methodological complexity in the development of HA algorithms has been increased by employing virtual acoustic environments which enable the simulation of indoor scenarios with plausible room acoustics. Inspired by the research question of how to make such environments accessible to HA users while maintaining complete signal control, a novel concept addressing combined perception via HAs and residual hearing is proposed. The specific system implementations employ a master HA and research HAs for aided signal provision, and loudspeaker-based spatial audio methods for external sound field reproduction. Systematic objective evaluations led to recommendations of configurations for reliable system operation, accounting for perceptual aspects. The results from perceptual evaluations involving adults with normal hearing revealed that the characteristics of the used research HAs primarily affect sound localisation performance, while allowing comparable egocentric auditory distance estimates as observed when using loudspeaker-based reproduction. To demonstrate the applicability of the system, school-age children with HL fitted with research HAs were tested for speech-in-noise perception in a virtual classroom and achieved comparable speech reception thresholds as a comparison group using commercial HAs, which supports the validity of the HA simulation. The inability to perform spatial unmasking of speech compared to their peers with normal hearing implies that reverberation times of 0.4 s already have extensive disruptive effects on spatial processing in children with HL. 
Collectively, the results from evaluation and application indicate that the proposed systems satisfy core criteria towards their use in HA research.
... Results in Section 4.4 demonstrate that the FW group, whose participants were exposed to a longer spatial boundary, presents a smaller compression effect. This result can be compared to past studies concluding that the presence of congruent visual cues could enhance the accuracy of auditory distance perception [13,29,41]. ...
Full-text available
Audio-only augmented reality consists of enhancing a real environment with virtual sound events. A seamless integration of the virtual events within the environment requires processing them with artificial spatialization and reverberation effects that simulate the acoustic properties of the room. However, in augmented reality, the visual and acoustic environment of the listener may not be fully mastered. This study aims to gain some insight into the acoustic cues (intensity and reverberation) that are used by the listener to form an auditory distance judgment, and to observe if these strategies can be influenced by the listener’s environment. To do so, we present a perceptual evaluation of two distance-rendering models informed by a measured Spatial Room Impulse Response. The choice of the rendering methods was made to design stimuli categories in which the availability and reproduction quality of acoustic cues are different. The proposed models have been evaluated in an online experiment gathering 108 participants who were asked to provide judgments of auditory distance about a stationary source. To evaluate the importance of environmental cues, participants had to describe the environment in which they were running the experiment, and more specifically the volume of the room and the distance to the wall they were facing. It could be shown that these context cues had a limited, but significant, influence on the perceived auditory distance.
... In reflective environments, the direct to reverberation energy ratio is a distance perception cue. Some experiments demonstrated that the first-order binaural Ambisonics is sufficient to recreate auditory distance perception in a reflective environment [49]. Based on a similar psychoacoustic principle, in practice, WFS is able to recreate free-field perceived virtual sources at different distances [50], although the experiment, which was based on a dynamic binaural reproduction experiment, indicated that in practice, WFS may cause some artifacts in recreating focused source. ...
Full-text available
One purpose of spatial audio is to create perceived virtual sources at various spatial positions in terms of direction and distance with respect to the listener. The psychoacoustic principle of spatial auditory perception is essential for creating perceived virtual sources. Currently, the technical means for recreating virtual sources in different directions of various spatial audio techniques are relatively mature. However, perceived distance control in spatial audio remains a challenging task. This article reviews the psychoacoustic principle, methods, and problems with perceived distance control and compares them with the principles and methods of directional localization control in spatial audio, showing that the validation of various methods for perceived distance control depends on the principle and method used for spatial audio. To improve perceived distance control, further research on the detailed psychoacoustic mechanisms of auditory distance perception is required.
Full-text available
p style="text-align: justify;">Localization of sound in space is an important component of auditory perception, which is involved in the selection of various sound streams, the perception of speech in noise, and the organization of auditory images. Research over the past century has shown that sound localization is achieved through: differences in the intensity and time delay of sound waves arriving at different ears; spectral distortions arising from the anatomical features of the structure of the auricles, head, torso; dynamic cues (listener head movements), etc. However, some scientific and methodological issues (primarily related to the perception of natural sounds and the ecological validity of studies) have not been resolved. The development of digital audio techniques also leads to the emergence of new areas of research, including the processing of sound for the transmission of spatial information in headphones (which is solved using the head related transfer function — HRTF) and the creation of auditory interfaces. The tasks facing researchers in these areas are to improve the perception of spatial information (by manipulating the characteristics of the sound, prompts or training) and the creation of such sound events that can be perceived as object-related, i.e., inextricably linked with the purpose of the operator's activity. The methodology of the perceived quality of events, which makes it possible to distinguish which properties of the auditory image become the most important in human activity and which physical properties of the event they correspond to, can help in solving the tasks set and increasing the ecological validity of research.</p
In the communication process of the 5G big data Internet of Things, because communication data differ, the data interaction process requires mutual information recognition scheduling, which greatly reduces the effectiveness of the interactive perception system. To improve the real-time performance of platform interaction in the Internet of Things environment, an Internet of Things information interaction perception system based on 5G mobile communication technology is designed; this is particularly relevant in hospital networks. The system's perception layer acquires Internet of Things information through a data acquisition module and transmits it to the application layer via a 5G mobile communication module in the network layer. A context-aware intelligent interaction module in the application layer encapsulates the heterogeneous, unstructured Internet of Things sensor data in context as a unified data processing object. A Storm-based context rule-matching algorithm achieves accurate matching between the heterogeneous, unstructured perceptual data and the rules, and an interactive perceptual parallel query scheduling strategy achieves optimal scheduling of mutual information perception in the Internet of Things. The results show a positive correlation between the validity of sensed temperature information and the amount of sensed data. The average data processing delay of the system is stable at 50 ms, which meets the real-time requirement of data processing.
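The abstract does not specify the rule format, so the following is a hypothetical sketch of the core idea only: sensor readings wrapped in a uniform context structure are matched against predicate rules (all names and the rule shape are illustrative, not from the paper):

```python
# Hypothetical sketch of context-rule matching: readings are wrapped in a
# uniform context dict and checked against per-field predicate rules.

def matches(context, rule):
    """A rule fires when every constrained field satisfies its predicate."""
    return all(pred(context.get(field)) for field, pred in rule.items())

rules = {
    "overheat": {"type": lambda t: t == "temperature",
                 "value": lambda v: v is not None and v > 38.0},
}

reading = {"type": "temperature", "value": 39.2, "ward": "ICU"}
fired = [name for name, rule in rules.items() if matches(reading, rule)]
print(fired)  # ['overheat']
```

In a Storm-style deployment the same matching function would run inside parallel bolts over a stream of such context objects; the sketch shows only the matching step.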
A study of sound localization performance was conducted using headphone-delivered virtual speech stimuli, rendered via HRTF-based acoustic auralization software and hardware, and blocked-meatus HRTF measurements. The independent variables were chosen to evaluate commonly held assumptions in the literature regarding improved localization: inclusion of head tracking, individualized HRTFs, and early and diffuse reflections. Significant effects were found for azimuth and elevation error, reversal rates, and externalization.
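The error metrics such a study relies on can be made concrete. The sketch below uses common but here illustrative conventions: azimuth error wrapped to a signed difference, and a front-back reversal flagged when the response lands near the target's mirror image across the interaural axis:

```python
def azimuth_error_deg(target, response):
    """Smallest signed angular difference between two azimuths, in degrees."""
    return (response - target + 180.0) % 360.0 - 180.0

def is_front_back_reversal(target, response, tol=30.0):
    """True when the response sits near the target mirrored front-to-back."""
    mirrored = (180.0 - target) % 360.0
    return abs(azimuth_error_deg(mirrored, response)) < tol

print(azimuth_error_deg(350.0, 10.0))       # 20.0 (wraps across 0 degrees)
print(is_front_back_reversal(30.0, 150.0))  # True (30 deg mirrors to 150 deg)
```

Reversal rate is then simply the fraction of trials for which the reversal test fires, typically computed before angular errors so that reversed responses do not inflate the error statistics.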
Spatial impulse response rendering (SIRR) is a recent technique for the reproduction of room acoustics with a multichannel loudspeaker system. SIRR analyzes the time-dependent direction of arrival and diffuseness of measured room responses within frequency bands. Based on the analysis data, a multichannel response suitable for reproduction with any chosen surround loudspeaker setup is synthesized. When loaded to a convolving reverberator, the synthesized responses create a very natural perception of space corresponding to the measured room. A technical description of the analysis-synthesis method is provided. Results of formal subjective evaluation and further analysis of SIRR are presented in a companion paper to be published in JAES in 2006 Jan./Feb.
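The intensity-based analysis underlying this kind of method can be sketched as follows. The formulas assume the FuMa B-format convention (W carries a 1/sqrt(2) gain) and follow the usual textbook form of direction-of-arrival and diffuseness estimation, not the exact implementation described in the paper:

```python
import numpy as np

def sirr_analysis(W, X, Y, Z):
    """Estimate a DOA unit vector and diffuseness from B-format frames."""
    u = np.stack([X, Y, Z])
    intensity = np.mean(W * u, axis=1)  # time-averaged W*u per axis
    energy = np.mean(W**2 + 0.5 * np.sum(u**2, axis=0))
    doa = intensity / (np.linalg.norm(intensity) + 1e-12)
    diffuseness = 1.0 - np.sqrt(2) * np.linalg.norm(intensity) / (energy + 1e-12)
    return doa, diffuseness

# Plane wave from straight ahead: X = s, Y = Z = 0, W = s / sqrt(2).
s = np.sin(2 * np.pi * np.arange(512) / 64)
doa, psi = sirr_analysis(s / np.sqrt(2), s, 0 * s, 0 * s)
print(doa.round(3), round(psi, 3))  # ~[1. 0. 0.] and ~0.0
```

A single plane wave yields a diffuseness near zero and a DOA pointing at the source; a fully diffuse field drives the averaged intensity toward zero and the diffuseness toward one, which is the quantity SIRR uses to split the response into directional and diffuse streams.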
On the basis of a statistical evaluation of binaural sound source localization performance during listening tests by human subjects, it is shown that the convolution of a measured head-related transfer function (HRTF) with the room impulse response generated by a hybrid image source model with a stochastic scattering process using secondary sources provides an adequate model for the prediction of binaural room impulse responses (BRIRs) for directional sound localization in the frontal horizontal plane. The source localization performance for sound stimuli presented to test subjects in a natural way was compared to that for presentation via headphones. Listening via headphones tends to decrease localization performance only slightly, and only when localizing narrow-band high-frequency stimuli. Binaural sound presented via headphones was generated on the basis of both measurements and simulations, and two different headphone conditions were evaluated. The overall localization performance for headphone sound obtained using a simulated BRIR was found to be equivalent to that using a measured BRIR. The study also confirms expectations that the deterioration of sound source localization performance in reverberant rooms is closely linked to the direct-to-reverberant ratio for given loudspeaker and listener positions, rather than to the reverberation time of the room as such.
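The direct-to-reverberant ratio that the study links to localization degradation can be estimated from an impulse response by splitting it shortly after the direct-path peak; the 2.5 ms window below is a typical choice, not a value taken from the paper:

```python
import numpy as np

def direct_to_reverb_db(rir, fs, direct_window_ms=2.5):
    """DRR: energy up to a few ms after the direct-path peak vs. the rest."""
    peak = int(np.argmax(np.abs(rir)))
    split = peak + int(fs * direct_window_ms / 1000.0)
    direct = np.sum(rir[:split] ** 2)
    reverb = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(direct / (reverb + 1e-12))

# Toy RIR: a unit direct spike followed by an exponentially decaying
# noise tail standing in for the reverberant field.
fs = 8000
rir = np.zeros(fs // 2)  # 0.5 s at 8 kHz
rir[10] = 1.0            # direct path
t = np.arange(200, rir.size)
rng = np.random.default_rng(0)
rir[t] = 0.02 * np.exp(-(t - 200) / 2000.0) * rng.standard_normal(t.size)
drr = direct_to_reverb_db(rir, fs)
print(round(drr, 1))  # positive: the direct spike dominates this toy RIR
```

Moving the source further away shrinks the direct spike while the diffuse tail stays roughly constant, which is exactly why DRR, rather than reverberation time, tracks both perceived distance and localization difficulty.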
Loud sound reproduction is accompanied by air propagation distortion, depending on a sound signal's frequency spectrum, sound pressure level, and distance. Potential limits of suppression of the propagation distortion are discussed. The dependence of predistorted waveforms on coordinates, frequency, and sound pressure level is investigated. Stronger spreading waves can be better linearized over distance from a source than slower ones. Nonlinear interaction of waves propagating noncollinearly is discussed. Nonparallel wave propagation is responsible for a decrease in nonlinear and intermodulation products as compared to collinear wave propagation.
This paper describes the most important techniques for creating stereoscopy. It begins with a brief summary of the history of stereoscopic viewing. Classical stereoscopy methods such as anaglyph or polarized images and more modern techniques such as head-mounted displays and three-dimensional LCDs are described and compared, with the advantages and disadvantages of each method stated. Finally, volumetric and holographic displays are described, although they do not belong to stereoscopy techniques proper.
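Of the surveyed methods, the anaglyph is simple enough to sketch: the left view supplies the red channel and the right view the green and blue channels, so red/cyan glasses route each image to one eye. A minimal illustration on synthetic images (not code from the paper):

```python
import numpy as np

def make_anaglyph(left_rgb, right_rgb):
    """Red/cyan anaglyph: red from the left view, green+blue from the right."""
    out = right_rgb.copy()
    out[..., 0] = left_rgb[..., 0]  # channel 0 = red
    return out

left = np.zeros((4, 4, 3)); left[..., 0] = 1.0    # pure red test image
right = np.zeros((4, 4, 3)); right[..., 2] = 1.0  # pure blue test image
ana = make_anaglyph(left, right)
print(ana[0, 0])  # [1. 0. 1.]: red carries the left view, blue the right
```

This channel-splitting approach is what makes anaglyphs cheap to produce and distribute, at the cost of the color fidelity problems the survey attributes to them.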
This paper reports the results of tests of the ability of an observer to estimate the distance of speech signals that originate, or appear to originate, at the intersection of the horizontal and median planes in anechoic space. Both live and recorded sources were used over a range of distances from 3 to 30 ft. For such a range (i.e., where differences in the selective effect of air absorption with frequency are relatively small), the results showed a wide difference in observer ability to estimate the distance for these two types of sources. When loudspeaker sources of recorded speech were employed, the judgments reported were essentially independent of the actual distance involved. This was true whether single, multiple symmetrical, or asymmetrical arrays were employed. When a live voice was used as a source, a rather marked degree of ability to estimate relative distance was found depending on the type of vocal output employed and on the degree to which normal level changes with distance were eliminated as a parameter. Use of a shouted voice resulted in overestimating the distance, whereas the apparent distance was foreshortened when a whispered voice was employed.
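Eliminating "normal level changes with distance as a parameter," as the abstract puts it, amounts to equalizing stimulus levels before presentation. A simple sketch of RMS equalization (the target level and signals are arbitrary illustrations):

```python
import numpy as np

def equalize_rms(signals, target_rms=0.1):
    """Scale each stimulus to the same RMS, removing the level-distance cue."""
    out = []
    for s in signals:
        rms = np.sqrt(np.mean(s**2))
        out.append(s * (target_rms / (rms + 1e-12)))
    return out

near = 0.80 * np.sin(np.linspace(0, 20, 1000))  # loud "near" stimulus
far = 0.08 * np.sin(np.linspace(0, 20, 1000))   # quiet "far" stimulus
eq = equalize_rms([near, far])
print([round(float(np.sqrt(np.mean(s**2))), 3) for s in eq])  # [0.1, 0.1]
```

With intensity removed in this way, any remaining distance judgments must rest on other cues, such as the spectral and effort-related changes between whispered, normal, and shouted voice that the study found listeners exploiting.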