ACTA ACUSTICA UNITED WITH ACUSTICA
Vol. 98 (2012) 61–71, DOI 10.3813/AAA.918492
Distance Perception in Interactive Virtual Acoustic Environments using First and Higher Order Ambisonic Sound Fields
Gavin Kearney 1), Marcin Gorzel 2), Henry Rice 3), Frank Boland 2)
1) Department of Theatre, Film and Television, University of York, United Kingdom. gavin.kearney@york.ac.uk
2) Department of Electronic and Electrical Engineering, Trinity College Dublin, Ireland. [gorzelm, fboland]@tcd.ie
3) Department of Mechanical and Manufacturing Engineering, Trinity College Dublin, Ireland. hrice@tcd.ie
Summary
In this paper, we present an investigation into the perception of source distance in interactive virtual auditory environments in the context of First (FOA) and Higher Order Ambisonic (HOA) reproduction. In particular, we investigate the accuracy of sound field reproduction over virtual loudspeakers (headphone reproduction) with increasing Ambisonic order. The performance of 1st, 2nd and 3rd order Ambisonics in representing distance cues is assessed in subjective audio perception tests. Results demonstrate that 1st order sound fields can be sufficient in representing distance cues for Ambisonic-to-binaural decodes.
PACS no. 43.20.-f, 43.55.-n, 43.58.-e, 43.60.-c, 43.71.-k, 43.75.-z
Received 25 February 2011, accepted 1 October 2011.
1. Introduction
Recent advances in interactive entertainment technology have led to visual displays with a convincing perception of source distance, based not only on stereo vision techniques, but also on real-time graphics rendering technology for correct motion parallax [1, 2].
Typically, such presentations are accompanied by loudspeaker surround technology based on amplitude panning techniques and aimed at multiple listeners. However, in interactive virtual environments, headphone listening allows for greater control over personalized sound field reproduction. One method of auditory spatialization is to incorporate Head Related Transfer Functions (HRTFs) into the headphone reproduction signals. HRTFs describe the interaction of a listener's head and pinnae with impinging source wavefronts. It has been shown that for effective externalization and localization to occur, head-tracking should be employed to control this spatialization process [3], particularly where non-individualised HRTFs are used. However, the switching of the directionally dependent HRTFs with head movement can lead to auditory artifacts caused by wave discontinuity in the convolved binaural signals [4]. A more flexible solution is to form 'virtual loudspeakers' from HRTFs, where the listener is placed at the centre of an imaginary loudspeaker array. Here, the loudspeaker feeds are changed relative to the head position and any technique for sound source spatialization over loudspeakers can be used.
Many different spatialization systems have been proposed for such an application in the literature, most notably Vector Base Amplitude Panning (VBAP) [5] and Wavefield Synthesis [6]. However, the Ambisonics system [7], which is based on the spherical harmonic decomposition of the sound field, represents a practical and asymptotically holographic approach to spatialization. It is well known in Ambisonic loudspeaker reproduction that, as the order of the sound field representation increases, localization accuracy improves due to greater directional resolution.
However, there remain many unanswered questions about the capability of Ambisonic techniques with regard to the perception of depth and distance. In this paper, we investigate whether enhanced directional accuracy of the direct sound and early reflections in a sound field can lead to a better perception of environmental depth, and thus better localization of the sound source distance in that environment. We approach the problem by means of subjective listening tests in which we compare the perception of distance of real sound sources to First Order Ambisonic (FOA) and Higher Order Ambisonic (HOA) sound fields presented over headphones.
This paper is outlined as follows: We begin by presenting a succinct review of the relevant psychoacoustical aspects of auditory localization and distance perception. We then outline the incorporation of Ambisonic techniques into virtual loudspeaker reproduction and the subsequent re-synthesis of measured FOA sound fields into higher orders. A case study investigating the perception of source distance at higher Ambisonic orders is then presented through subjective listening tests.
2. Distance Perception
It is important to note that throughout the literature there exists a clear distinction between 'distance' and 'depth', both understood as perceptual attributes of sound. According to [8], 'distance' is related to the physical range between the sound source and a listener, whereas 'depth' relates to the recreated auditory scene as a whole and concerns a sense of perspective in that scene.
2.1. Distance Perception in a Free Field
Although the human ability to perceive sources at different distances is not fully understood, there are several key factors which are known to contribute to distance perception. In the first case, changes in distance lead to changes in the monaural transfer function (the sound pressure at one ear). This is shown in Figure 1 for a spherical model of a head. We see that for sources at less than 1 m distance, the sound pressure level varies depending on the angle of incidence, due to the shadowing effects of the head. Beyond 1 m, the intensity of the source decays according to the inverse square law.
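As a simple worked illustration of this relationship (not an example taken from the paper), the free-field level change between two source distances d_1 and d_2 follows

\Delta L = 20 \log_{10}\!\left(\frac{d_1}{d_2}\right)\ \mathrm{dB},

so that each doubling of distance (d_2 = 2 d_1) lowers the received level by approximately 6 dB.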
However, absolute monaural cues will only be meaningful if we have some prior knowledge of the source level, i.e., how familiar we are with the source. In other words, a form of semiosis occurs, where the perception of localization is based on anticipation and experience [9]. For example, for normal-level speech (approximately 60 dB at 1 m), we expect nearer sources to be louder and quieter sources to be further away. However, this is more difficult to assess for synthetic sounds or sounds with which we are unfamiliar.
It is interesting to note that for sources in the median plane, the level at distances less than 1 m does not change as dramatically as for sources located at the ipsilateral point. This will not significantly affect the low-frequency Interaural Time Difference (ITD), but it is reflected in the Interaural Level Difference (ILD), as shown in Figure 2. We note that the most extreme ILD is exhibited at the side of the head (90°), due to the maximum head-shadowing effect. For a similar reason, subconscious head movements may be regarded as another important cue, since level changes close to the source will be more apparent than far from it [10]. Thus, near-field ILD cues exist which aid us in discriminating source distance.
On the other hand, for larger distances and high sound pressure levels, the propagation speed of a sound wave in a medium ceases to be constant with frequency, which may lead to distortion of the waveform [11]. Furthermore, sound waves travelling a substantial distance also undergo a process of energy absorption by water molecules in the atmosphere. This is more apparent for the high-frequency energy of the wave and leads to spectral changes (low-pass filtering) of the sound being heard.
Figure 1. RMS monaural transfer function for a spherical head model at the left ear, for a broadband source at different angles with varying source distance (reference = plane wave at (0°, 0°)).

Figure 2. Interaural level difference of a spherical head model for a broadband source at different angles with varying source distance.

2.2. Distance Perception in a Reverberant Field

In reverberant rooms, the ratio of the direct to reverberant sound plays an extremely important role in distance perception.
For near sources, where the direct field energy is much greater than the reverberant field, the sound pressure level changes approximately in accordance with free-field conditions. However, for source-listener distances greater than the critical distance, the level of reverberation is in general independent of the source position, due to the homogeneous level of the diffuse field, and the direct to reverberant ratio changes by approximately 6 dB per doubling of distance from the source.
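For illustration only, a rough estimate of this direct-to-reverberant ratio can be computed from a measured room impulse response by splitting the energy around the direct-sound arrival. The split window used below is a common heuristic, not a value or procedure taken from the paper.

```python
import numpy as np

def direct_to_reverberant_db(ir, fs, direct_ms=5.0):
    """Rough direct-to-reverberant energy ratio of a room impulse response.

    The split point (a few ms after the direct-sound peak) is an assumed
    heuristic; the paper does not prescribe this computation.
    """
    onset = np.argmax(np.abs(ir))                 # index of the direct sound
    split = onset + int(direct_ms * 1e-3 * fs)    # end of the 'direct' window
    e_direct = np.sum(ir[:split] ** 2)
    e_reverb = np.sum(ir[split:] ** 2)
    return 10.0 * np.log10(e_direct / e_reverb)
```

Doubling the source distance lowers the direct energy by roughly 6 dB while leaving the diffuse energy nearly unchanged, which is the change in this ratio described above.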
The directions of arrival of the early reflections are another parameter which changes according to the source-listener position and can be regarded as an important factor in creating environmental depth. Whether this is useful to listeners in determining the distance to the sound source, in the presence of other cues like sound intensity, the direct to reverberant energy ratio or the arrival pattern of delays, remains an open question that needs to be addressed. Ambisonics allows for enhanced directional reproduction of the deterministic components of a sound field by increasing the order of the spherical harmonic decomposition.
Moreover, better directional localization can be achieved without affecting other important cues for distance estimation, such as the overall sound intensity or the direct to reverberant energy ratio. Ambisonics can thus constitute an ideal framework for testing whether these less apparent properties of a sound field influence the perception of distance.
2.3. Previous Psychoacoustical Studies on Distance Perception
Perceived distance has been shown not to be linearly proportional to the physical source distance. For example, both Nielsen [12] and Gardner [13] have shown that the localization of speech signals is consistently underestimated in an anechoic environment. This underestimation has also been shown by other authors in the context of reverberant environments, both real and virtual. In [14], Bronkhorst and Houtgast demonstrate that in a damped virtual environment, sources are consistently perceived to be closer than in a reverberant virtual environment, due to the direct to reverberant ratio. In their studies, the room simulation is conducted using simulated Binaural Room Impulse Responses (BRIRs) created with the image source method [15]. They show how perceived distance increases rapidly with the number and amplitude of the reflections.
In a similar study, Rychtarikova et al. [16] investigated the difference in localization accuracy between real rooms and computationally derived BRIRs. Their findings show that at 1 m, localization accuracy in both the virtual and real environments is in good agreement with the true source position. However, at 2.4 m the accuracy degrades, and high-frequency localization errors were found in the virtual acoustic environment, pertaining to the difference in HRTFs between the model and the subject. In the same vein, Chan et al. [17] have shown that distance perception using recordings made with in-ear microphones on individual subjects again leads to underestimation of the source distance in virtual reverberant environments, more so than with real sources.
Waller [18] and Ashmead et al. [10] have identified that one of the factors improving distance perception is listener movement in the virtual or real space. It is therefore crucial to account for any listener movements (or lack thereof) in the experimental design. Similarly, for headphone reproduction of virtual acoustic environments, small, subconscious head rotations may lead to improvements in distance perception by providing enhanced ILD and ITD cues. Therefore, the sound field transformations should accurately reflect small changes in the orientation of the listener's head.
3. Ambisonic Spatialization
Ambisonics was originally developed by Gerzon, Barton and Fellgett [7] as a unified system for the recording, reproduction and transmission of surround sound. The theory of Ambisonics is based on the decomposition of the sound field measured at a single point in space into spherical harmonic functions defined as

Y_{mn}^{\sigma}(\Phi,\Theta) = A_{mn}\, P_{mn}(\sin\Theta)\cdot
\begin{cases}
\cos(m\Phi) & \text{if } \sigma = +1\\
\sin(m\Phi) & \text{if } \sigma = -1,
\end{cases}    (1)

where m is the order and n is the degree of the spherical harmonic, and P_{mn} is the fully normalized (N3D) associated Legendre function. The coordinate system used comprises x, y and z axes pointing to the front, left and up respectively; \Phi is the azimuthal angle with clockwise rotation and \Theta is the elevation angle from the x-y plane. For each order m there are (2m+1) spherical harmonics.
In order for plane wave representation over a loudspeaker array, we must ensure that

s\, Y_{mn}^{\sigma}(\Phi,\Theta) = \sum_{i=1}^{I} g_i\, Y_{mn}^{\sigma}(\phi_i,\theta_i),    (2)

where s is the pressure of the source signal from direction (\Phi,\Theta) and g_i is the i-th loudspeaker gain from direction (\phi_i,\theta_i). We can then express the left hand side of equation (2) in vector notation, giving the Ambisonic channels

\mathbf{B} = \mathbf{Y}_{\Phi\Theta}\, s = \left[\, Y_{0,0}^{1}(\Phi,\Theta),\; Y_{1,0}^{1}(\Phi,\Theta),\; \ldots,\; Y_{mm}^{\sigma}(\Phi,\Theta)\,\right]^{T} s.    (3)
Equation (2) can then be rewritten as

\mathbf{B} = \mathbf{C}\cdot\mathbf{g},    (4)

where \mathbf{C} contains the encoding gains associated with the loudspeaker positions and \mathbf{g} is the loudspeaker signal vector. In order to obtain \mathbf{g}, we require a decode matrix \mathbf{D}, which is the inverse of \mathbf{C}. However, to invert \mathbf{C} we need the matrix to be square, which is only possible when the number of Ambisonic channels is equal to the number of loudspeakers. When the number of loudspeaker channels is greater than the number of Ambisonic channels, which is usually the case, we then obtain the pseudo-inverse of \mathbf{C}, where

\mathbf{D} = \mathrm{pinv}(\mathbf{C}) = \mathbf{C}^{T}(\mathbf{C}\mathbf{C}^{T})^{-1}.    (5)
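As a sketch of equations (2)-(5), rather than the authors' implementation, the pseudo-inverse decoder can be computed directly. The first-order channel convention and the cube loudspeaker layout used below are illustrative assumptions.

```python
import numpy as np

def foa_encoding_gains(az, el):
    """First-order spherical-harmonic gains for one direction (radians).

    A simple W/X/Y/Z convention is assumed here for illustration; the paper
    uses fully normalized (N3D) harmonics, which differ by per-order scale
    factors.
    """
    return np.array([1.0,
                     np.cos(az) * np.cos(el),   # X (front)
                     np.sin(az) * np.cos(el),   # Y (left)
                     np.sin(el)])               # Z (up)

# Hypothetical cube layout of 8 virtual loudspeakers (azimuth, elevation).
cube = [(az, el) for el in (np.radians(35.26), np.radians(-35.26))
                 for az in np.radians([45, 135, 225, 315])]

# C: encoding gains for the loudspeaker directions (channels x speakers).
C = np.column_stack([foa_encoding_gains(az, el) for az, el in cube])

# D = pinv(C) = C^T (C C^T)^(-1), as in equation (5).
D = np.linalg.pinv(C)                        # shape: speakers x channels

B = foa_encoding_gains(np.radians(30), 0.0)  # encode a test plane wave
g = D @ B                                    # loudspeaker gains
```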
Since the sound field is represented in a spherical coordinate system, sound field transformation matrices can be used to rotate, tilt and tumble the sound fields. In this way, the Ambisonic signals themselves can be controlled by the user, allowing for the virtual loudspeaker approach to be employed. For 3-D reproduction, the number I of virtual loudspeakers employed with the Ambisonics approach is dependent on the Ambisonic order m, where

I \geq N = (m+1)^{2}.    (6)
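A minimal sketch of such a transformation is given below, under the coordinate conventions stated above; the sign of the rotation depends on the convention used and on whether the field or the head is rotated, so it is easily flipped.

```python
import numpy as np

def rotate_foa_yaw(w, x, y, z, alpha):
    """Rotate a first-order (W, X, Y, Z) sound field about the vertical axis.

    alpha is the rotation angle in radians; to compensate a tracked head
    rotation, the field is rotated by the negative of the head yaw.
    """
    c, s = np.cos(alpha), np.sin(alpha)
    x_r = c * x - s * y
    y_r = s * x + c * y
    return w, x_r, y_r, z     # W and Z are unaffected by a yaw rotation
```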
4. Virtual Loudspeaker Reproduction
In the 'virtual loudspeaker' approach, HRTFs are measured at the 'sweet-spot' (the limited region in the centre of a reproduction array where an adequate spatial impression is generally guaranteed) in a multi-loudspeaker reproduction setup, and the resultant binaural playback is formed from the convolution of the loudspeaker feeds with the virtual loudspeakers. This concept is illustrated in Figure 3.

Figure 3. The virtual loudspeaker reproduction concept.

For the left ear we have

L = \sum_{i=1}^{I} h_{Li} * q_i,    (7)

where * denotes convolution, h_{Li} is the left-ear HRIR corresponding to the i-th virtual loudspeaker and q_i is the i-th loudspeaker feed. A similar relation applies to the right-ear signal. This method was first introduced by McKeag and McGrath [19] and examples of its adoption can be found in [20] and [21]. The approach has major computational advantages, since a complex filter kernel is not required and head rotation can be simulated by changing the loudspeaker feeds rather than the HRTFs. Whilst the HRTFs in this case play an important role in the spatialization, ultimately it is the sound field creation over the virtual loudspeakers which gives the overall spatial impression. Most existing research uses a block frequency-domain approach to this convolution. However, given that the virtual loudspeaker feeds are controlled via head-tracking in real time, a time-domain filtering approach can also be utilized. For short filter lengths, obtaining the output in a point-wise manner avoids the inherent latencies introduced by block convolution in the frequency domain. A strategy for significant reduction of the filter length without artifacts has been proposed in [22].
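A compact sketch of equation (7) is given below; the HRIR and feed arrays are assumed inputs, and scipy's FFT-based convolution stands in for whichever block or time-domain scheme is used in practice.

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_from_virtual_speakers(feeds, hrirs_left, hrirs_right):
    """Render the left/right ear signals of equation (7).

    feeds       : I loudspeaker feed signals q_i (equal length)
    hrirs_left  : I left-ear HRIRs h_Li, one per virtual loudspeaker
    hrirs_right : I right-ear HRIRs
    """
    n = len(feeds[0]) + len(hrirs_left[0]) - 1
    left = np.zeros(n)
    right = np.zeros(n)
    for q, h_l, h_r in zip(feeds, hrirs_left, hrirs_right):
        left += fftconvolve(h_l, q)    # h_Li * q_i
        right += fftconvolve(h_r, q)
    return left, right
```

In a head-tracked system, only the feeds are recomputed (for example via a sound field rotation and the decode matrix D) while the HRIRs stay fixed, which is the computational advantage described above.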
5. Higher Order Synthesis
In order to compare the distance perception of different orders of Ambisonic sound fields, it is desirable to take real-world sound field measurements. However, the formation of higher order spherical harmonic directional patterns is non-trivial. Thus, in order to change FOA impulse responses into HOA representations, we will employ a perceptually based approach which allows us to synthesize the increased directional resolution that would be achieved with a HOA sound field recording. For this we adopt the directional analysis method of Pulkki and Merimaa, found in [23]. Here the B-format signals are analyzed in terms of sound intensity and energy in order to derive time-frequency based direction of arrival and diffuseness. The instantaneous intensity vector is given from the pressure p and particle velocity \mathbf{u} as

\mathbf{I}(t) = p(t)\,\mathbf{u}(t).    (8)

Since we are using FOA impulse response measurements, the pressure can be approximated by the 0th order Ambisonic component w(t), which is omnidirectional,

p(t) = w(t),    (9)

and the particle velocity by

\mathbf{u}(t) = \frac{1}{\sqrt{2}\,Z_0}\left[\, x(t)\,\mathbf{e}_x + y(t)\,\mathbf{e}_y + z(t)\,\mathbf{e}_z \,\right],    (10)

where \mathbf{e}_x, \mathbf{e}_y and \mathbf{e}_z represent Cartesian unit vectors, x(t), y(t), z(t) are the FOA signals and Z_0 is the characteristic acoustic impedance of air.
The instantaneous intensity represents the direction of the energy transfer of the sound field, and the direction of arrival can be determined simply as the opposite direction of \mathbf{I}. For FOA, we can calculate the intensity for each coordinate axis, and in the frequency domain. Since a portion of the energy will also oscillate locally, a diffuseness estimate can be made from the ratio of the magnitude of the intensity vector to the overall energy density E, given as

\psi = 1 - \frac{\lVert\langle \mathbf{I}\rangle\rVert}{c\,\langle E\rangle},    (11)

where \langle\cdot\rangle denotes time averaging, \lVert\cdot\rVert denotes the norm of the vector and c is the speed of sound. The diffuseness estimate will yield a value of zero for incident plane waves from a particular direction, but will give a value of 1 where there is no net transport of acoustic energy, such as in the cases of reverberation or standing waves. Time averaging is used since it is difficult to determine an instantaneous measure of diffuseness.
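The analysis of equations (8)-(11) can be sketched as follows; this broadband, time-domain version is for illustration only, whereas the method described above operates per time-frequency bin with ERB-band smoothing. The impedance and speed-of-sound values are assumed.

```python
import numpy as np

def foa_direction_and_diffuseness(w, x, y, z, z0=413.0, win=256):
    """Broadband sketch of the intensity/diffuseness analysis of eqs. (8)-(11)."""
    c = 343.0
    p = w                                           # eq. (9)
    u = np.vstack([x, y, z]) / (np.sqrt(2.0) * z0)  # eq. (10)
    intensity = p * u                               # eq. (8), per sample

    # Short-time averages of intensity and energy density.
    kernel = np.ones(win) / win
    avg_i = np.vstack([np.convolve(ix, kernel, mode="same") for ix in intensity])
    energy = 0.5 * (p**2 / (z0 * c) + (z0 / c) * np.sum(u**2, axis=0))
    avg_e = np.convolve(energy, kernel, mode="same")

    # Direction of arrival opposes the direction of energy transport.
    azimuth = np.arctan2(-avg_i[1], -avg_i[0])
    norm_i = np.linalg.norm(avg_i, axis=0)
    elevation = np.arcsin(np.clip(-avg_i[2] / (norm_i + 1e-12), -1.0, 1.0))
    diffuseness = 1.0 - norm_i / (c * avg_e + 1e-12)   # eq. (11)
    return azimuth, elevation, diffuseness
```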
The output of the analysis is then subject to smoothing based on the Equivalent Rectangular Bandwidth (ERB) scale, such that the resolution of the human auditory system is approximated. Since the frequency-dependent direction of arrival of the non-diffuse portion of the sound field can be determined, HOA reproduction can be achieved by re-encoding point-like sources, corresponding to the direction indicated in each temporal average and frequency band, into a higher order spherical harmonic representation. The resultant Ambisonic signals are then weighted in each frequency band k according to 1 - \psi_k. However, it is only necessary to re-encode the non-diffuse components to higher order; the diffuse field can be obtained by multiplying the FOA signals by \sqrt{\psi_k} and forming a first order decode. This is justified since source localisation is dependent on the direction of arrival of the direct sound and early reflections, and not on late room reverberation [24]. Thus, from a perceptual point of view, it is questionable whether there is a need to preserve the full directional accuracy of the reverberant field. Furthermore, if there exists a general directional distribution to the diffuse field, this will still be preserved in first order form. On the other hand, the diffuse component should not simply be derived from the 0th order signal: such a solution would provide perfectly correlated versions of the diffuse field to the left and right ear signals, which has no equivalent in a real, physical sound field. Moreover, interaural decorrelation is an important factor in providing spatial impression in enclosed environments [25].
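A schematic per-band view of this up-mix is sketched below; the use of the omnidirectional signal as the re-encoded carrier and the placeholder HOA encoding function are assumptions for illustration, not the exact procedure of [23].

```python
import numpy as np

def upmix_band(foa_band, azimuth, elevation, psi, encode_hoa):
    """Schematic per-band up-mix following the weighting described above.

    foa_band   : (4, n) array of band-limited W, X, Y, Z signals
    azimuth,
    elevation  : analysed direction of arrival for this band (radians)
    psi        : analysed diffuseness for this band (0..1)
    encode_hoa : placeholder callable returning higher-order encoding gains
                 for a direction, e.g. 16 gains for 3rd order
    """
    w = foa_band[0]                                  # assumed carrier signal
    gains = encode_hoa(azimuth, elevation)           # point-like source gains
    non_diffuse = (1.0 - psi) * np.outer(gains, w)   # re-encoded, weighted by 1 - psi_k
    diffuse_foa = np.sqrt(psi) * foa_band            # kept first order, decoded separately
    return non_diffuse, diffuse_foa
```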
Figure 4 shows an example of the first 20 ms of a 1st order impulse response taken in a reverberant hall [26]. Here the source was located 3 m from a Soundfield ST350 microphone, and the Spatial Room Impulse Response (SRIR) was captured using the exponentially swept-sine tone technique [27]. In these plots, particular attention is drawn to the direct sound (coming from directly in front of the microphone) and a left wall reflection at approximately 14 ms. It can be seen that the directional resolution increases significantly with the HOA representation. It should be noted that the A-format capsule on sound field microphones only displays adequate directionality up to 10 kHz [28]. Spatial aliasing is therefore an issue for high frequencies and, as a result, the directional information above 10 kHz cannot be relied upon.

Figure 4. Ambisonic sound field from a 1st order measurement with a Soundfield ST350: (a) 1st order representation, (b) 3rd order up-mix. The direct sound and the left wall reflection are indicated in each panel.
6. Method: Localization of Distance of Test Sounds
Different protocols have been used in the literature for subjective assessment of distance perception, most notably verbal report [29, 30], direct or indirect blind walking [31, 32] or imagined timed walking [32]. All of these methods have been shown to provide reliable and comparable results for both auditory and visual stimuli, with direct blind walking exhibiting the least between-subject variability [31, 32].
In former work [26], the authors of this paper developed a method where subjects indicated the perceived distance of real and virtual sound sources by selecting one of several physical loudspeakers lined up (and slightly offset in order to provide 'acoustic transparency') in front of their eyes.
However, for the present study, in order to completely eliminate any possible anchors as well as visual cues, it was decided to utilize the method of direct blind walking. One of the main concerns of the experiment was a direct comparison of distance perception of real sound sources versus virtual sound sources presented over headphones. Due to different apparatus requirements, the experiment had to be conducted in two separate phases.
6.1. Participants
Seven participants aged 24–58 took part in the experiment. All subjects had good hearing and were either music technology students or practitioners actively involved in audio research or production.
Prior to the test, HRIR data for all the participants was obtained in a sound-proof, large (18 × 15 × 10 m³) but quite damped (T60 at 1000 Hz = 0.57 s) multipurpose room (the 'Black Box') in the Department of Theatre, Film and Television at the University of York. Additional damping was assured by thick, heavy curtains covering all four walls and a carpet on the floor. The measurement process consisted of a standard procedure in which miniature omnidirectional microphones (Knowles FG-23629-P16) were placed at the entrance of the blocked ear canal in order to capture the acoustic pressure generated by one loudspeaker at a time, located at a constant distance and varying angular direction.

Figure 5. Measuring Head Related Impulse Responses with miniature microphones.
Subjects were seated on an elevated platform so that their ears were 2.20 m above the ground and their head was in the centre of a spherical loudspeaker array, arranged in diametrically opposed pairs. The ear height was calibrated using a laser guide, as shown in Figure 5. The array consisted of 16 full-range Genelec 8050A loudspeakers, since the intention was to reproduce Ambisonic sound fields up to and including 3rd order. This 3-D setup, shown in Figure 6, comprised a flat-front horizontal octagon and a cube (four loudspeakers on top, and four on the bottom). The radius of the loudspeaker array (and thus of the virtual loudspeaker array) was 3.27 m. For the FOA-to-binaural decode, only the virtual loudspeakers from the cube configuration were utilized, since no directional resolution is gained by using a higher number of loudspeakers. Furthermore, despite careful alignment, oversampling of the sound field with higher numbers of speakers has the potential to yield sound field distortions [33].

Figure 6. Array of 16 loudspeakers used for HRIR measurements.
Note that for 2nd and 3rd order reproduction, all 16 loudspeakers were used. Although the oversampled configuration was not optimal from the 2nd order reproduction point of view, it was not possible to easily and accurately rearrange the loudspeaker array to accommodate a different layout.
HRIRs were captured using the exponentially swept-sine tone technique [27] at a 44.1 kHz sampling rate and 16-bit resolution. Since the measurement environment was not fully anechoic, further processing of the measured data was necessary. The HRIRs were tapered before the arrival of the first reflection (from the floor), yielding filter kernels with 257 taps, and were subsequently diffuse-field equalized.
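As an illustrative sketch of this truncation step (the fade length below is an assumed value, and the diffuse-field equalization is not shown):

```python
import numpy as np

def truncate_hrir(hrir, fs, first_reflection_s, fade_ms=1.0, n_taps=257):
    """Taper a measured HRIR to zero before the first floor reflection
    and cut it to a fixed kernel length."""
    cut = int(first_reflection_s * fs)
    fade = int(fade_ms * 1e-3 * fs)
    window = np.ones(cut)
    window[cut - fade:] = 0.5 * (1 + np.cos(np.linspace(0, np.pi, fade)))  # half-Hann fade-out
    tapered = hrir[:cut] * window
    out = np.zeros(n_taps)
    out[:min(n_taps, cut)] = tapered[:n_taps]
    return out
```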
6.2. Stimuli
The stimuli used in the experiment were pink noise bursts and phonetically balanced phrases selected from the TIMIT Acoustic-Phonetic Continuous Speech Corpus database and recorded by a female reader [34]. A sampling rate of 44.1 kHz and 16-bit resolution was used in both cases. These two sample types were selected in order to represent both unfamiliar and familiar sound sources. They were presented to the subjects in a pseudo-randomized order to avoid any ordering effects.
For headphone reproduction, prior to the test phase, FOA impulse response measurements were taken at the listener position for each loudspeaker, using the exponentially swept-sine tone technique [27]. From these measurements, 2nd and 3rd order impulse response sets were extracted using the directional analysis approach outlined in section 5. 0th order Ambisonics does not provide any directional information, which means that it would lack the cues investigated in the higher order renderings; it was therefore not included in this comparison.
The only psychoacoustical optimization applied to the Ambisonic decodes was shelf filtering, intended to satisfy Gerzon's localization criteria for a maximized velocity decode at low frequencies and energy decode at higher frequencies [35].
This involved changing the ratio of the pressure to velocity components at low and high frequencies. Whilst the crossover frequency for the high-frequency boost in the pressure channel at first order is normally in the region of 400 Hz for regular loudspeaker listening, here we set the crossover point to 700 Hz, since the subject is always perfectly centred in the virtual loudspeaker array.
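The following sketch indicates how such a dual-band decode could be realized; the crossover filters and the per-order band weights below are illustrative textbook-style values and are not the figures used by the authors.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def dual_band_foa_weights(b_format, fs, fc=700.0,
                          low_gains=(1.0, 1.0),
                          high_gains=(np.sqrt(2.0), np.sqrt(2.0 / 3.0))):
    """Rough sketch of a dual-band (shelf-filtered) first-order decode weighting.

    b_format : (4, n) array of W, X, Y, Z signals.
    The (order-0, order-1) gain pairs applied below/above the crossover are
    illustrative 'velocity at low frequencies / energy-optimized at high
    frequencies' weights, NOT the exact values used in the paper.
    """
    lp = butter(2, fc, btype="lowpass", fs=fs, output="sos")
    hp = butter(2, fc, btype="highpass", fs=fs, output="sos")
    low = sosfilt(lp, b_format, axis=-1)
    high = sosfilt(hp, b_format, axis=-1)
    order = np.array([0, 1, 1, 1])                      # order of W, X, Y, Z
    lo_w = np.where(order == 0, low_gains[0], low_gains[1])[:, None]
    hi_w = np.where(order == 0, high_gains[0], high_gains[1])[:, None]
    return lo_w * low + hi_w * high
```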
6.3. Test Environment and Apparatus
A series of subjective listening tests was conducted in the Large Rehearsal Room in the Department of Theatre, Film and Television at the University of York. The room dimensions were 12 × 9 × 3.5 m³ and the spatially averaged T60 at 1 kHz was 0.26 s. A low T60 was desired for this study, so the walls were covered with thick, heavy curtains, as shown in Figure 7. Since the up-mix from 1st to 2nd and 3rd order Ambisonics concerned only the deterministic part of the measured SRIRs, it was assumed that no advantage would be gained from using a more reverberant space.
A professional camera dolly track was set up roughly along the diagonal of the room. This not only allowed the real loudspeaker to be tested at distances of up to 8 m, but its non-symmetrical position also ensured that early reflections of the same order from different surfaces did not easily coincide at the subject's ears, but instead arrived at different times. A single full-range loudspeaker (Genelec 8050A) was mounted on a camera dolly, which enabled it to be noiselessly translated by the experiment assistant to different locations. A guiding rope was hung along the dolly track, intended to help guide the participants when walking toward the sound source. Since it was not possible to walk exactly on the dolly track, it was decided that the walking path would be directly next to it, as shown in Figure 7. The only weakness of this solution was that the sound source horizontal angle varied from 14.04 degrees at the closest distance (2 m) to 3.58 degrees at the furthest distance (8 m). However, this did not have any effect on the distance judgments, for two reasons: firstly, the subjects were allowed (or even encouraged) to rotate their heads in order to fully utilize the available ITD and ILD cues; secondly, the initial head orientation was not in any way fixed. This, combined with the fact that there were no clear cues to the subject's initial orientation in the room at the origin, made this small initial angular offset unimportant. Furthermore, none of the participants reported any bias in their assessment based on the horizontal offset of the sound source.
For trials with binaural presentation, high quality open-back headphones (AKG K601) were used, which exhibit low levels of interaural magnitude and group delay distortion. Sound field rotation, tilt and tumble control was implemented via the TrackIR 5 infra-red head tracking system [36], resulting in stable virtual images during head rotations. The system responsible for playback of the virtualized sound sources was built entirely in the Pure Data visual programming environment [37], and its combined latency (including head-tracker data porting and audio update rate) was 20 ms.
Figure 7. Participant performing a trial during the experiment.
6.4. Procedure
In the experiment, subjects entered the test environment blindfolded and without any prior expectation regarding the room dimensions, its acoustic properties or the test apparatus. They were guided by the experimenter to the reference point (the 'origin'). After a short explanation of the experiment objectives, a training session began with a short (3–5 min) walking-only trial until participants felt comfortable walking blindfolded and using the guide rope. Next, they performed 4–6 training trials in which the same test stimuli to be used in the experiment (speech and pink noise) were played by the loudspeaker at randomly chosen distances. No feedback was given and no results were recorded for these training trials. The end of the training session was clearly announced and, after a 1 minute interval, the first phase of the test began.
In test phase I, participants were asked to listen to static sound sources at randomly chosen points, focusing on the perceived distance. They could listen to any audio sample as many times as they wished. During the playback they were instructed to stay still and refrain from any translational head movements; however, they were encouraged to rotate their heads freely. After the playback had stopped, they were asked to walk, guided by the rope, to the point from which they thought the sound had originated. The distance walked was subsequently recorded by the assistant using a laser measuring tool, after which the participant walked backwards to the origin.
In the meantime, the loudspeaker was noiselessly translated to its new position and the test proceeded. As in the training session, no feedback was given at any stage.
During the first test phase, participants had to indicate the perceived distance for sound sources randomly located at 2 m, 4 m, 6 m or 8 m. Taking into account that both speech and pink noise burst samples were used (in a pseudo-random order), the number of trials in the first phase added up to 8. Each subject performed all the trials only once.
Upon completion of the first phase of the test there was a short (approximately 2 minute) interval, required in order to put on the headphones and calibrate the head-tracking system. In phase II, subjects were again asked to identify the sound source distance, but this time using Ambisonic sound fields presented over headphones. Other than the fact that headphones and the head-tracking system were used, the test protocol remained the same as in phase I. However, because there were three playback configurations to be tested (1st, 2nd and 3rd order Ambisonics), participants had to perform 24 trials instead of 8. Instead of separate phases for each Ambisonic order, all samples were randomly presented to the subject within the same test phase. Again, subjects performed all the trials only once and no feedback was given at any stage.
7. Results
The perceived sound source distance (indicated by the distance walked) was collected from 7 subjects for 4 presentation points (2 m, 4 m, 6 m and 8 m), two stimuli (female speech and pink noise bursts) and four playback options: 1st, 2nd and 3rd order Ambisonics and real loudspeakers, which for analysis we denote FOA, SOA, TOA and REAL respectively. With the headphone trials, none of the participants reported in-head localization; however, there were 3 cases where the proximity of the sound source was so apparent that the participants decided not to move at all. In some cases, the virtual sound source was initially localized behind the subject, but all participants were able to resolve the confusion by applying head rotation.
We computed the mean values of the walked distances µ for each test condition, along with the corresponding standard errors se(µ). The results are presented separately for each stimulus type with 95% confidence intervals (Figures 8 and 9).
As expected, the perception of distance for the real sources was more accurate for near sources. Beyond 4 m, distance perception was continuously underestimated, which is congruent with the previous studies outlined in section 2. Furthermore, the standard deviation of the localization increases as the source moves further into the diffuse field. We also see that the unfamiliar stimulus produces greater variability in subjects' answers. The mean localization of the virtual sources follows the reference source localization well. The answers for the virtual sources deviate from their means in roughly the same fashion as the answers for the reference sources, as localization becomes more difficult within the diffuse field.
Figure 8. Mean localization of real and virtual sound sources (female speech).

Figure 9. Mean localization of real and virtual sound sources (pink noise bursts).
Since the study followed a within-subject factorial design with 2 (stimuli) × 4 (playback conditions), a two-way ANOVA was performed for each presentation distance in order to investigate the effects of these two factors (referred to later as factors A and B) as well as potential interaction effects. The null hypothesis being tested is that the mean perceived distances for all stimuli and playback methods do not differ significantly:

H0: µ_FOA = µ_SOA = µ_TOA = µ_Real = µ,
H1: not all localization means (µ_i) are the same.
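For illustration, such an analysis could be set up as follows; the file and column names are hypothetical, and this simple sketch treats the factors as between-subjects rather than modelling the repeated-measures structure of the actual experiment.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical long-format table: one row per trial, with the walked
# distance and the two factors (stimulus type, playback condition).
df = pd.read_csv("walked_distances_2m.csv")   # columns: subject, stimulus, playback, walked

model = smf.ols("walked ~ C(stimulus) * C(playback)", data=df).fit()
print(anova_lm(model, typ=2))                 # main effects and interaction
```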
No statistically significant effect of stimuli (familiar vs. unfamiliar) on the perception of distance has been found (F_2m(3,48) = 0.835, p = 0.365; F_4m(3,48) = 2.0462, p = 0.159; F_6m(3,48) = 2.575, p = 0.115; F_8m(3,48) = 2.0462, p = 0.159). For distances of 4 m and more, the playback option also had no statistically significant effect (F_4m(3,48) = 2.192, p = 0.101; F_6m(3,48) = 0.665, p = 0.577; F_8m(3,48) = 0.202, p = 0.894).
However, a statistically significant difference was detected for the distance of 2 m. In larger study designs with multiple levels it is advisable to use the Honestly Significant Difference (HSD) approach, since there is an increased risk of a spuriously significant difference arising purely by chance. So, in order to investigate further where the difference occurs, an HSD has been computed (HSD = 1.423 m). If we now compile the table of mean perceived distances for the sound sources located at 2 m (Table I), we can see that all of these values clearly lie within a single HSD of each other and cannot be distinguished. We can then safely assume an ANOVA false alarm (type I error) and no statistically significant effect of playback method for the sources at the distance of 2 m as well. Lastly, for all distances, no interaction effects of factors A (stimuli) and B (playback conditions) have been detected.

Table I. Mean localization [m] of virtual and real sound sources at 2 m.

         µ_FOA   µ_SOA   µ_TOA   µ_Real
Speech   1.119   1.389   0.841   1.638
Noise    0.877   1.001   0.902   1.641

Table II. Correlation coefficients ρ and corresponding p-values for pairs of distance estimations for real and virtual sound sources (Speech).

              ρ        p-value
Real vs FOA   0.9828   0.0172
Real vs SOA   0.9960   0.0040
Real vs TOA   0.9590   0.0410

Table III. Correlation coefficients ρ and corresponding p-values for pairs of distance estimations for real and virtual sound sources (Noise).

              ρ        p-value
Real vs FOA   0.9913   0.0087
Real vs SOA   0.9857   0.0143
Real vs TOA   0.9972   0.0028
Additionally, we calculated correlation coefficients ρ for pairs of distance estimations for real and virtual sound sources (either 1st, 2nd or 3rd order) and the two stimuli; these are given in Tables II and III. In all cases, high correlation coefficients were obtained, which confirms our finding that, for these particular test conditions, the perception of distance of binaurally rendered Ambisonic sound fields of orders 1 to 3 cannot be distinguished from the perception of distance of the real sound sources.
8. Discussion
The results presented for the real sources corroborate the classic underestimation of source distance reported in the literature. These results were used as a basis against which to measure the ability of Ambisonic sound fields of different orders to present sources at different distances. It was expected that a further underestimation of the source distance would ensue with the binaural rendering, as reported in [17]. However, this was not the case, even for first order presentations, and the apparent distances of the virtual sources matched the real source distances well. One should note that the major difference between this study and that of [17] is our use of head-tracking, indicating the importance of head movements in perceiving source distance, which develops the findings of Waller [18] and Ashmead et al. [10] on user interaction in a virtual space. Further work is required to quantify this effect.
Moreover, the presented study demonstrates that the enhanced directional accuracy gained by presenting sound sources in HOA through head-tracked binaural rendering does not yield a significant improvement in the perception of the source distance. What is noteworthy is that, for each order, there is no significant difference in the perception of the source location when compared to real-world sources. We therefore conclude that 1st order playback provides sufficient sound field directionality for distance perception.
The presence of the ANOVA false alarm at the 2 m point is of interest. It is noteworthy that the 2 m point represents a source inside the virtual array geometry. It is a known issue that virtual sound sources rendered inside the array of loudspeakers cannot be reproduced in a straightforward way without artifacts. Some of these artifacts include incorrect wave-front curvature and insufficient bass boost. In the first case, there is ample evidence in the literature to suggest that the wavefront curvature translates to significant binaural cues for sound sources near the head [30, 38]. It was already shown in section 2.1 that, as a source moves closer to the head, the levels of the monaural transfer function and the ILD both change significantly with source angle; however, this effect is not strong at 1 m and beyond. For sources further away, it has been shown in [39] that it is very difficult to assess distance by binaural cues alone.
In the second case, the requirement for distance compensation filtering due to near-field effects for the large loudspeaker radius (3.27 m) and the given source distances (>2 m) is only prominent below 100 Hz. For the female speech test stimuli this will not have an effect, since the first formant frequencies do not go below 180 Hz. Also, the current method employed for capturing HRIRs allowed filters to be reliably obtained with a frequency response reaching down to around 170 Hz, thereby also band-limiting the delivery of the pink noise stimuli.
Finally, there was no significant difference in the results presented for the different sources, although the greater variance in the results for pink noise suggests that the familiarity of the source does indeed play a role in the perception of source distance, as mentioned in section 2.3. Future studies will investigate the use of these monaural cues further, and will utilize 0th order sound field rendering, since this will remove the influence of any directional information.
Considering the aforementioned study of Bronkhorst and Houtgast [14], in which the accuracy of distance perception for binaural playback increased with the number of reflections, our findings demonstrate that the net effect of the monaural cues of direct to reverberant ratio, level difference and time of arrival of early reflections is of greater importance in distance perception for binaural rendering than Ambisonic directional accuracy beyond 1st order.
9. Conclusions
We have assessed through subjective analysis the perceived source distance in virtual Ambisonic sound fields in comparison to real-world sources. The hypothesis tested was that enhanced directional accuracy of the deterministic part of the sound field may lead to better reconstruction of environmental depth and thus improve the perception of sound source distance. However, it was shown that Ambisonic reproduction matches the perceived real-world source distances well even at 1st order, and no improvement in this regard was observed when increasing the order. It must be emphasized, though, that this analysis applies to Ambisonic-to-binaural decodes with higher order synthesis achieved using the directional analysis method of [23]. Therefore, further work will examine this topic for loudspeaker reproduction, for both centre and off-centre listening, as well as investigate the effectiveness of HOA synthesis in comparison to real-world HOA measurements.
Acknowledgments
The authors gratefully acknowledge the participation of the test subjects for both their time and constructive comments, as well as the technical support staff at the Department of Theatre, Film and Television at the University of York for their assistance in the experimental setups. This research is supported by Science Foundation Ireland.
References
[1] L. Fauster: Stereoscopic techniques in computer graphics. Technical paper, TU Wien, 2007.
[2] J. Lee: Head tracking for desktop VR displays using the Wii remote. http://johnnylee.net/projects/wii/, accessed 30th Sept. 2011.
[3] D. R. Begault: Direct comparison of the impact of head tracking, reverberation, and individualized head-related transfer functions on the spatial perception of a virtual sound source. J. Audio Eng. Soc. 49 (2001) 904–916.
[4] M. Otani, T. Hirahara: Auditory artifacts due to switching head-related transfer functions of a dynamic virtual auditory display. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. E91-A (2008) 1320–1328.
[5] V. Pulkki: Virtual sound source positioning using Vector Base Amplitude Panning. J. Audio Eng. Soc. 45 (1997) 456–466.
[6] A. J. Berkhout: A holographic approach to acoustic control. J. Audio Eng. Soc. 36 (1988) 977–995.
[7] M. A. Gerzon: Periphony: With-height sound reproduction. J. Audio Eng. Soc. 21 (1973) 2–10.
[8] F. Rumsey: Spatial quality evaluation for reproduced sound: Terminology, meaning, and a scene-based paradigm. J. Audio Eng. Soc. 50 (2002) 651–666.
[9] J. Blauert: Communication acoustics. Springer, 2008.
[10] D. H. Ashmead, D. L. Davis, A. Northington: Contribution of listeners' approaching motion to auditory distance perception. J. Exp. Psy.: Hum. Percep. and Perform. 21 (1995) 239–256.
[11] E. Czerwinski, A. Voishvillo, S. Alexandrov, A. Terekhov: Propagation distortion in sound systems: Can we avoid it? J. Audio Eng. Soc. 48 (2000) 30–48.
[12] S. H. Nielsen: Auditory distance perception in different rooms. J. Audio Eng. Soc. 41 (1993) 755–770.
[13] M. B. Gardner: Distance estimation of 0° or apparent 0°-oriented speech signals in anechoic space. J. Acoust. Soc. Am. 45 (1969) 47–53.
[14] A. W. Bronkhorst, T. Houtgast: Auditory distance perception in rooms. Nature 397 (1999) 517–520.
[15] J. B. Allen, D. A. Berkley: Image method for efficiently simulating small-room acoustics. J. Acoust. Soc. Am. 65 (1979) 943–950.
[16] M. Rychtarikova, T. V. d. Bogaert, G. Vermeir, J. Wouters: Binaural sound source localization in real and virtual rooms. J. Audio Eng. Soc. 57 (2009) 205–220.
[17] J. S. Chan, C. Maguinness, D. Lisiecka, C. Ennis, M. Larkin, C. O'Sullivan, F. Newell: Comparing audiovisual distance perception in various real and virtual environments. Proc. of the 32nd Euro. Conf. on Vis. Percep., Regensburg, Germany, 2009.
[18] D. Waller: Factors affecting the perception of interobject distances in virtual environments. Presence: Teleoper. Virtual Environ. 8 (1999) 657–670.
[19] A. McKeag, D. McGrath: Sound field format to binaural decoder with head-tracking. Proc. of the 6th Australian Regional Convention of the AES, 1996.
[20] M. Noisternig, A. Sontacchi, T. Musil, R. Holdrich: A 3D Ambisonic based binaural sound reproduction system. Proc. of the 24th Int. Conf. of the Audio Eng. Soc., Alberta, Canada, 2003.
[21] B.-I. Dalenbäck, M. Strömberg: Real time walkthrough auralization - the first year. Proc. of the Inst. of Acous., Copenhagen, Denmark, 2006.
[22] C. Masterson, S. Adams, G. Kearney, F. Boland: A method for head related impulse response simplification. Proc. of the 17th European Signal Processing Conference (EUSIPCO), Glasgow, Scotland, 2009.
[23] J. Merimaa, V. Pulkki: Spatial impulse response rendering I: Analysis and synthesis. J. Audio Eng. Soc. 53 (2005).
[24] W. M. Hartmann: Localization of sound in rooms. J. Acoust. Soc. Am. 74 (1983) 1380–1391.
[25] D. Griesinger: Spatial impression and envelopment in small rooms. Proc. of the 103rd Conv. of the Audio Eng. Soc., New York, USA, 1997.
[26] G. Kearney, M. Gorzel, H. Rice, F. Boland: Depth perception in interactive virtual acoustic environments using higher order Ambisonic soundfields. Proc. of the 2nd Int. Ambisonics Symp., Paris, France, 2010.
[27] A. Farina: Simultaneous measurement of impulse response and distortion with a swept-sine technique. Proc. of the 108th Conv. of the Audio Eng. Soc., Paris, France, 2000.
[28] M. Gerzon: The design of precisely coincident microphone arrays for stereo and surround sound. Proc. of the 50th Conv. of the Audio Eng. Soc., London, UK, 1975.
[29] C. Guastavino, B. F. G. Katz: Perceptual evaluation of multi-dimensional spatial audio reproduction. J. Acoust. Soc. Am. 116 (2004) 1105–1115.
[30] P. Zahorik: Assessing auditory distance perception using virtual acoustics. J. Acoust. Soc. Am. 111 (2002) 1832–1846.
[31] J. M. Loomis, R. L. Klatzky, J. W. Philbeck, R. G. Golledge: Assessing auditory distance perception using perceptually directed action. Perception and Psychophysics 60 (1998) 966–980.
[32] T. Y. Grechkin, T. D. Nguyen, J. M. Plumert, J. F. Cremer, J. K. Kearney: How does presentation method and measurement protocol affect distance estimation in real and virtual environments? ACM Trans. Appl. Percept. 7 (2010) 26:1–26:18.
[33] S. Bertet: Formats audio 3D hiérarchiques: Caractérisation objective et perceptive des systèmes Ambisonics d'ordres supérieurs. Ph.D. dissertation, INSA Lyon, 2008.
[34] W. M. Fisher, G. R. Doddington, K. M. Goudie-Marshall: The DARPA speech recognition research database: Specifications and status. Proc. of the DARPA Workshop on Speech Recognition, 1986.
[35] M. A. Gerzon, G. J. Barton: Ambisonic decoders for HDTV. Proc. of the 92nd Conv. of the Audio Eng. Soc., Vienna, Austria, 1992.
[36] NaturalPoint: TrackIR 5. http://www.naturalpoint.com/trackir/, accessed 30th Sept. 2011.
[37] M. Puckette: Pure Data. http://puredata.info/, accessed 30th Sept. 2011.
[38] P. Zahorik, D. S. Brungart, A. W. Bronkhorst: Auditory distance perception in humans: A summary of past and present research. Acta Acustica united with Acustica 91 (2005) 409–420.
[39] H. Wittek: Perceptual differences between Wavefield Synthesis and Stereophony. Department of Music and Sound Recording, School of Arts, Communication and Humanities, University of Surrey, UK, 2007.