
Disparity in Horizontal Correspondence of Sound and Source Positioning: The Impact on Spatial Presence for Cinematic VR



This study examines the extent to which disparity in azimuth location between a sound cue and image target can be varied in cinematic virtual reality (VR) content, before presence is broken. It applies disparity consistently and inconsistently across five otherwise identical sound-image events. The investigation explores spatial presence, a sub-construct of presence, hypothesizing that consistently applied disparity in horizontal audio-visual correspondence elicits higher tolerance before presence is broken, than inconsistently applied disparity. Guidance about the interactions of subjective judgments and spatial presence for sound positioning is needed for non-specialists to leverage VR’s spatial sound environment. Although approximate compared to visual localization, auditory localization is paramount for VR: it is lighting condition-independent, omnidirectional, not as subject to occlusion, and creates presence.
Angela McArthur
Media & Arts Technology (MAT), Queen Mary University London, UK
Correspondence should be addressed to Angela McArthur (
This study examines the extent to which azimuth disparity between sound-cue and image-target can be varied
in cinematic virtual reality (VR), before spatial presence is broken. Exploring conscious self-reporting and
autonomic arousal via galvanic skin response, it varies displacement of sound source along the horizontal plane,
in five otherwise identical sound-image events, for stimuli of human and object types. The displacements are
applied either consistently (at 5°, 10° or 15° offset to image) or inconsistently (two orders of randomness using
5°, 10° and 15° offsets) at uniform distance from the user/camera position. It hypothesizes that consistently
applied disparity in audio-visual lateral correspondence elicits higher tolerance before presence is broken.
Stimuli of a bell alarm clock, and separately, a human actor, were produced and rendered for headset viewing
and dynamic binaural headphone listening. Content was presented to participants under controlled conditions
including familiarization trials. For both experiments the visual targets were associated with the sound cues
through synchronous corresponding visual movement. Thirty-five participants were tested during 70 trials, using
the BBC R&D 360TV player and BBC Spatial Sound Renderer. No support was found for the hypothesis
relating consistency to presence. Consistency as a determinant of arousal proved significant, but inversely to the
expected direction. Presence ratings were high throughout conditions, and though some participants verbally
discriminated precision in localization judgments, this did not impact ratings. The novice status of participants
for both VR and dynamic binaural sound, could account for high ratings, despite familiarization trials. The
multimodal environment of VR has many potentially confounding factors which affected the current study;
ultimately, azimuth sound-image disparity did not noticeably affect novice experience. Further work in study
design, and on the impact (both subjective and objective) of exposure on presence in VR is required at this time,
to better analyze the perceptual correlates of its spatial sound technologies.
McArthur Disparity in horizontal correspondence of sound & source positioning
Page 2 of 10
1 Introduction
Virtual reality technologies, evolving at speed,
promise the ultimate user experience. Because they
employ more of our senses, more powerfully,
understanding the perceptual correlates of such
technology is vital as the medium becomes closer
to, and less distinct from, the recipient.
Presence, the sense of being in an environment,
whether real or virtual [1], is an indicator of
involvement, affecting the richness of experience.
The importance of audio for presence has been
demonstrated in numerous studies [1–5]. Older
consumer-level VR technologies lacked many of the
current advantages to make best use of spatial audio
for VR. Even now, rendering is often achieved using
subjective aural judgments, and is affected and
constrained by hardware and system configurations,
process, and timescales. Understanding perceptual
tolerance for disparity in sound-image relations may
help professionals avoid breaking user presence.
Research has supported the idea that cross-modal
consistency is a main determinant of presence,
leading to “synergetic effects […which] enhance the
perception of information as compared to unimodal
conditions” [2]. How then, do notions of parity or
disparity in sound-image relations interact with
temporal sequencing and the establishment of
expectation for a coherent and compelling virtual
experience? How far from the simulation of real-
world correspondence is feasible before presence is
compromised? These questions bear relevance to
practitioners who may be apprehensive about the
imprecision of their spatial sound rendering in the
absence of any end-to-end solution, as well as for
those wishing to employ sound design that aims not
just to replicate reality, but to apply sound
imaginatively, leveraging the unique multimodal
affordances of the medium.
Auditory perception has been considered a
“hypothesis generation and testing process […]
constructed from the available information and
tested against subsequent experience (often over a
very short time interval)” [6]. This could prove
useful for creative compositional strategies - VR
content is currently almost exclusively short-form,
providing limited exposure times. Further, given the
relative novelty of the medium, expectations of users
are uninformed by pre-existing bodies of work
(though may be informed by expectations transposed
from other media or real-world spatial sound).
2 Background
Under discussion here is spatial presence as the most
popular construct of presence [7]. Spatial presence is
a concept developed in psychology [8] and
communication science [9] with various measures of
assessment available [10,11]. A theoretical two-
stage model of its structure was proposed [12] in
which users need to develop a mental model of the
space depicted, then accept this model as their own
(egocentric) locus of experience. If both stages are
passed, spatial presence is assumed. The later stage
can be considered unconscious and automatic.
Accordingly, users will activate the most convincing
(consistent and error-free) mental models from
alternatives to define and maintain their egocentric
position. This implies that spatial presence increases
with consistency.
Consistency between sound and image is of key
concern for VR, which is necessarily a multimodal
environment. With no distinct ‘frame’ within which
to focus, users have potentially more information to
process, and this processing involves complex
interactions. Sensory information received is not
processed independently. Each modality initially
encodes environmental information differently, but
ultimately we are presented with a coherent
perceptual experience [13–16].
Take, for example, the way in which we resolve
conflicting sensory data from different modalities, to
a near-optimal (though inaccurate) combination.
The ‘Ventriloquist Effect’ [17,18] demonstrates this
in terms of localization: sensory information is
constructed in such a way to support an illusion.
Cinemas provide an example of our willingness to
accept dialogue not coming from the image source.
Spatially at least, vision over audition seems to exert
a dominant influence on the resolution of cross-
modal conflict, biasing perceived location [19]. This raises a
concern about the regular sequencing of the sound
events, as the expectation of (particularly a visual)
target may influence attentional shifts in other
modalities (herein sound) [20,21].
Audio and presence
Understanding how sound positively impacts
presence in VEs is important. Improved realism and
quality may seem key; indeed, they exert a powerful
effect, but we should be cautious in speculating that
ever-increasing fidelity and refinement produce
proportionate returns in presence. Work to assess
the influence of perceived quality of audiovisual
reproduction when either audio or video is varied, is
largely inconclusive [22–25]. A direct link between
perceived quality and presence cannot be inferred,
however. In fact, the uncanny valley effect [26],
where artificial renderings that come close to realism
are seen as more uncanny (invoking different criteria
for evaluation) than those which are less realistic,
may apply to sound as much as image [27,28].
Certainly, advances in head tracking and motion
tracking, the convolution of sound with head-related
transfer functions (HRTFs) and room acoustic cues,
as well as binaural synthesis [29], improve
externalization [2,30] (where sound is located in the
environment, rather than within the head) thus also
localization. They also enable presence, though as
stated, not always realism [31].
Auditory localization
This study focuses on horizontal plane localization
which, in contrast to the vertical plane, provides
more precise localization acuity. It also attests to the
importance of the auditory processing of spatial
information in dealing with omnidirectional
information, a unique concern for VR content.
The human ability to localize sounds depends on the
central auditory processing of binaural and monaural
acoustic information. Binaural inter-aural time
difference (ITD) and inter-aural level differences
(ILD) are the primary cues for localization in the
horizontal dimension. Generally, we are better able
to localize sounds frontally than laterally - the
minimum audible angle (MAA) [32] needed to
discriminate sources is lowest at central positions
(with reported accuracy of about 1°). This
declines at lateral positions, to about 10° [33].
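As a rough sense of scale for these binaural cues, the interaural time difference at the offsets used in this study can be estimated with Woodworth's spherical-head approximation, ITD ≈ (r/c)(θ + sin θ). The sketch below is illustrative only and is not part of the study's method: the head radius (8.75 cm), the speed of sound (343 m/s) and the function name `woodworth_itd` are assumptions introduced here, and the formula ignores frequency dependence and individual head shape.

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius, not a value from the paper
SPEED_OF_SOUND = 343.0   # assumed speed of sound in air, m/s


def woodworth_itd(azimuth_deg: float) -> float:
    """Approximate ITD in seconds for a far-field source at 0..90 degrees azimuth."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (theta + math.sin(theta))


# Offsets used in the study, evaluated relative to a frontal (0 degree) source.
for offset in (5, 10, 15):
    print(f"{offset:>2} deg -> {woodworth_itd(offset) * 1e6:.1f} microseconds")
```

Under these assumptions, each additional 5° of offset near the front changes the ITD by a little under 45 microseconds; at lateral positions the change per degree is smaller, which is in keeping with the poorer lateral acuity noted above.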
Methodological background
In the present study, margins of displacement (5°,
10° and 15°) are sufficiently large to allow for
broadband source discrimination, particularly given
users’ freedom of head movement (which, as for
real-world sound cues, crucially aids disambiguation
of sound localization). This does however make
experimental design problematic, though more
ecologically valid.
With narrower sound bandwidth, azimuth
localization worsens [34–37]; thus an alarm clock
(which allows both ITDs and ILDs to be used for
localization) was chosen. The human stimulus type
served two purposes: as a useful comparison to the
clock, and because informational content of sound
stimuli has been shown to be critical for auditory
attentional capture [38–41]. A semantically
impoverished utterance (“laaaa”) was chosen to
control for any confounding semantic effect.
Sound events were presented in identical order to
control for primacy (where order of presentation is
privileged) in a non-reverberant room to control for
precedence effect (early reflections being privileged)
[42]. It was necessary to duplicate the object and
human to control for image variation and auditory
spectral variation. Binaural room impulse responses
(BRIRs) were added for realistic reproduction. Such
signal processing has been shown to improve
presence ratings [31,43] whilst reflecting that
azimuth localization error does not differ
significantly in generic vs. individually measured
HRTFs when head-tracking is utilized [33,44–48].
Thus, a steady broadband source in a dry room
offered optimal localization conditions [35].
Participants in this study were not discounted on the
basis of partial hearing loss, which “hardly detracts
[…] from directional hearing in the horizontal
plane….” [49] and is not alone a good predictor of
localization performance [50–55].
Including self-reporting measures in this study
reflects the fact that spatial presence is considered a
conscious experience [56–59]. The Swedish Viewer-
User Presence Questionnaire (SVUP) [60] unusually
includes questions dealing specifically with sound.
In this study, five items were used, relating to sound
localization, sound quality, and spatial presence.
Self-reporting can be limited when examining
expectation, which is temporally contingent;
questionnaires are by nature reflexive, and allow
retrospective ‘fitting’ by participants. An autonomic
measure was considered useful to make detail
available on the first potential stage [12] of spatial
presence - the unconscious construction and
acceptance of a mental schema.
Changes in the electrical conductance of the skin,
broadly within the term ‘galvanic skin response’
(GSR), have been correlated to both reported
behavioral presence [11,61] and reported breaks in
presence [62]. GSR captures arousal data from the
sympathetic nervous system which mediates stress
responses, and plays a significant role in motivation,
emotion and orienting response to novelty [63–65].
3 Method
Of 35 adult participants, 5 were non-BBC, and were
consequently compensated for their time. Three
‘non-responder’ participants were discounted. More
males (n = 24) than females (n = 11) took part, of varying
ages (n = 21 were 25-34 years, n = 9 were 35-44 years,
n = 3 were 18-24 years, n = 2 were 45-54 years). Most
participants (n = 21) reported having neither eyesight nor
hearing difficulties, with many (n = 12) reporting eyesight
difficulties (ranging from short sightedness to
astigmatism) and very few (n = 3) reporting hearing
difficulties (ranging from mild tinnitus to slight
hearing loss). Where possible, participants’ eyewear
was worn under the headset during trials; however, in
some cases this was not feasible due to size, fogging,
or discomfort.
Most participants (n = 13) had never worn a headset
before or had used one for less than an hour (n = 11).
Some (n = 7) had used one for 1-5 hours, with very few
(n = 4) having used one for more than 15 hours (n = 2 as
consumers, n = 2 as professionals). Most participants
(n = 26) had no specific interest in audio, whereas
some (n = 7) identified themselves as enthusiasts.
Consequently, the tests reflect novice listener
Ten unique stimuli (Table 1) of 105 seconds
duration were produced containing five sound events
at equal intervals, each of 3 sec duration to allow for
head-movement disambiguation.
Table 1. The ten experimental conditions
Kodak PixPro SP360 4K cameras and an AKG
C414B ULS microphone recorded the alarm clock
and human actor, separately, in the BBC R&D
listening room, at 0°, 45°, 100°, 250° and 315°
azimuths (Fig.1) at 150cm distance to camera, and
close distance to microphone, sampling at 48kHz.
Fig.1 Stimuli showing sequence of sound-image
events (1–5) and potential sound displacements
The clock was elevated to ensure it was the same
height as the actor’s face. An LED light was
attached to the clock so that - in addition to the
hammer striking the bells on the top of the clock -
there was a clear visual cue to indicate which clock
was sounding. The actor was instructed to
exaggerate mouth opening during sounding, to
produce a similarly clear visual cue. The utterance
frequency was a consistent 278Hz.
Video was stitched in PixPro 360 at maximum
resolution, before editing in Adobe Premiere, to
create five identical clocks (Fig.2) and five identical
actors. It was then down-sampled for smoothness of
playback (1920 x 1080 resolution, H.264 format).
Sound was post-processed in Cockos’ Reaper as a
multi-channel WAV, transcoded into uncompressed
multichannel PCM audio, and placed into an MKV
container, with the video file.
Fig.2 An equirectangular view of the object stimulus
Human stimuli, consistent condition     Human stimuli, random condition
5° displacement                         5 - 15° displacement - first
10° displacement                        5 - 15° displacement - second
15° displacement

Object stimuli, consistent condition    Object stimuli, random condition
5° displacement                         5 - 15° displacement - first
10° displacement                        5 - 15° displacement - second
15° displacement
Upon arrival, participants were asked to complete a
questionnaire about their experience of VR and
audio. They then began trials in an enclosed room,
sitting on an office swivel chair for ease of (limited)
movement. They were connected to a BioTrace
NeXus 10mkii, an Oculus Rift DK2 headset, and to
Beyerdynamic DT 770 PRO, 250 ohms headphones.
The stimuli were loaded into the BBC R&D 360TV
player, which directly decoded video, and output
audio to the BBC Spatial Sound Renderer (a data-
based dynamic auralization system) [66] which
processed it with BRIRs using the SOFA
MultispeakerBRIR convention [67]. BRIRs were
previously measured with a dummy head
microphone on a rotational mount [68] at head yaw
rotations in steps in the BBC R&D listening room
[69]. Orientation data from the DK2 was relayed to
the 360TV player (Fig.3). Volume was kept at a
consistent 72dB SPL across trials.
Fig.3 Experimental playback configuration
Participants each experienced 4 trials, in one 30-
minute session (Fig.4). Trials were randomly
ordered and equally balanced so that each participant
had two trials of human stimuli, two of object
stimuli, and so that each stimulus was presented an
equal number of times across the 140 total trials.
The first two trials acted as familiarization trials and
were discounted from analyses.
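The trial counts implied by this design can be checked with a few lines of arithmetic. All figures below are taken from the text; only the variable names are introduced here.

```python
PARTICIPANTS = 35
TRIALS_PER_PARTICIPANT = 4
FAMILIARIZATION_PER_PARTICIPANT = 2   # first two trials, discounted from analyses
UNIQUE_STIMULI = 10                   # five human and five object conditions

# Every trial that was run, including familiarization trials.
total_trials = PARTICIPANTS * TRIALS_PER_PARTICIPANT

# Trials remaining once each participant's familiarization trials are removed.
analyzed_trials = total_trials - PARTICIPANTS * FAMILIARIZATION_PER_PARTICIPANT

# Equal balancing means each stimulus appears the same number of times overall.
presentations_per_stimulus = total_trials // UNIQUE_STIMULI

print(total_trials, analyzed_trials, presentations_per_stimulus)  # 140 70 14
```

The 70 analyzed trials match the figure given in the abstract, before any trials lost to technical malfunction are excluded.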
After each trial, participants verbally scored (SVUP)
presence ratings, on a Likert scale 1-7. No feedback
was given during any of the sessions.
The BioTrace native software captured GSR at a rate
of 32 samples per second (sps).
Fig.4 A participant during trials
4 Results
Three trials were discounted due to technical
malfunctioning. Data was analyzed in Matlab - GSR
measurements were normalized and differences in
peak amplitude windows (pre and post stimulus)
were derived. These were processed using a moving
average filter of 64sps. To create skin conductance
responses (SCR) - the phasic component of the
signal [70] - linear detrending at breakpoints of 30
seconds was performed. This removed any negative
trends due to cumulative effects between the skin
and sensor, of charge over time [71].
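The smoothing and detrending steps described above can be sketched as follows. The study's analysis was carried out in Matlab; this Python version is a stand-in for illustration, not the author's code. Several details are assumptions: the "moving average filter of 64sps" is read here as a 64-sample trailing window, each 30-second segment is detrended independently (a simplification of a continuous piecewise-linear fit with breakpoints), and all function names are hypothetical. Normalization and the peak-amplitude windows are not shown.

```python
SAMPLE_RATE = 32        # samples per second, as reported for the GSR capture
SEGMENT_SECONDS = 30    # breakpoint spacing reported for the detrending
WINDOW = 64             # assumed moving-average window length, in samples


def moving_average(signal, window):
    """Trailing moving average; early samples average over what is available."""
    out, running = [], 0.0
    for i, x in enumerate(signal):
        running += x
        if i >= window:
            running -= signal[i - window]
        out.append(running / min(i + 1, window))
    return out


def detrend_segment(segment):
    """Subtract the least-squares straight line from one segment."""
    n = len(segment)
    if n < 2:
        return [0.0] * n
    mean_t = (n - 1) / 2
    mean_y = sum(segment) / n
    var_t = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(segment)) / var_t
    return [y - (mean_y + slope * (t - mean_t)) for t, y in enumerate(segment)]


def piecewise_detrend(signal, segment_len):
    """Detrend consecutive fixed-length segments independently (simplification)."""
    out = []
    for start in range(0, len(signal), segment_len):
        out.extend(detrend_segment(signal[start:start + segment_len]))
    return out


# Synthetic 60-second linear drift, standing in for a slow sensor trend.
drift = [0.01 * t for t in range(SAMPLE_RATE * 60)]
smoothed = moving_average(drift, WINDOW)
residual = piecewise_detrend(drift, SAMPLE_RATE * SEGMENT_SECONDS)
print(max(abs(r) for r in residual))  # close to zero: the drift is removed
```

On a purely linear drift the residual is numerically zero, which is the behaviour wanted from a detrending step: slow trends are removed so that the remaining fluctuations can be attributed to phasic responses.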
Linear regression analyses for each dependent
variable, and linear mixed modelling, tested the
relationship between arousal and presence ratings, as
a function of experimental condition (consistency).
Presence scores from each question were analyzed
as separate variables [72]. Arousal and presence
ratings were also tested for correlation.
Condition (consistency) was not a significant
determinant of presence for either human or object
types. Arousal was not a function of condition in the
predicted direction. As such, the hypothesis was
unsupported. Further, there was no significant
relationship between arousal and presence.
Fig.5 shows all data for all 5 conditions / stimuli (c1–
c5). Note: arousal and presence data have been
overlaid for comparison but are represented by
different Y values. Presence bar (1-7) coloration
denotes that a rating below 4 implies lack of presence.
Fig.5 Presence ratings with arousal data
The null hypothesis that presence and condition are
independent, was tested using the Pearson’s chi-
squared test. This showed that the null hypothesis
could not be rejected, and was in fact supported.
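For readers unfamiliar with the test, the Pearson chi-squared statistic for independence compares observed counts in a contingency table with the counts expected if rows and columns were unrelated. The sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, the two tables are toy counts chosen to show the extremes and are not the study's data, and the paper does not state how ratings were binned for its test.

```python
def chi_squared_independence(table):
    """Return (statistic, degrees_of_freedom) for a table of observed counts.

    Assumes every row total and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Toy example 1: rows are exact multiples of each other, so no association.
print(chi_squared_independence([[10, 20], [20, 40]]))   # (0.0, 1)

# Toy example 2: rows and columns are perfectly associated.
print(chi_squared_independence([[10, 0], [0, 10]]))     # (20.0, 1)
```

With one degree of freedom, the conventional 5% critical value is 3.841, so the first toy table gives no grounds to reject independence while the second clearly does.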
A relationship was found between condition and
arousal (p=0.01894) with an inverse line of fit to the
expected direction (Fig.6).
Fig.6 Arousal as a function of condition for human
stimuli type
Linear mixed modelling took GSR as independent
variable, and introduced presence scores as
potentially random elements in turn, assessing their
relationship to condition using an ANOVA. The
same was done with GSR as a potentially random factor.
No significant results were yielded.
As participants had been observed discriminating
some sound-image correspondence precision during
trials, the SVUP questions relating specifically to
sound quality and localization were regressed to
condition, using a linear model. No statistical
significance was shown (p = 0.30648 for the
question “To what extent were you able to localize
sounds?” and p = 0.86395 for the question “How
much did the sound add to the perceived realism?”).
Fig.7 Presence ratings for human stimuli
Fig.7 shows consistently high presence ratings;
though c3 and c4 offer more variance, c5 offers
least. Looking at variance in responses more
broadly, we see (Fig.8) a greater range for human
stimuli (high informational content) than object,
particularly for presence ratings.
Fig.8 Variance in data by stimulus type
5 Discussion
The decline of arousal with condition type could
reflect participants loosening their schema once
beyond a certain threshold of cross-modal conflict
(revising or relaxing evaluative schema as in the
uncanny valley effect). The displacements, if
exaggerated further, may have overcome this.
Skin conductance measures were more problematic
than anticipated, and may require VR-specific
methodological consideration. Expectation studies
in music center around a corpus of material and
conditioned responses, whereas VR is a truly novel
medium for most people.
The inclusion of only five SVUP measures may have
contributed to the results. Inclusion of a greater
number of measures would have introduced more
samples though also more potentially random factors
(participants conflated visual awareness in ratings
e.g., the realization they were without legs). Overall,
a more robust study design would have mitigated
against such ‘leakage’ into presence ratings.
Presence scores were particularly problematic,
being high throughout trials. Participants used
comparative judgments across trials which may have
impacted absolute ratings, and thus results. It was
also observed that participants sometimes initially
rated stimuli highly, being then unable to rate
subsequent stimuli higher, which they conceded they
would have liked to do. Their novice status may be key here.
Familiarization trials aimed at controlling for
novelty (as well as ‘training’ participants for the
non-individualized BRIRs [48]) may have been
insufficient. Novices could be assumed to have
greater cognitive involvement - being more
concerned with sense-making - in the experience,
which is considered a determinant of spatial
presence [12]. In prior work exploring disparity in
sound-image correspondence, naïve participants
were far less critical than research engineers in
ratings when asked to score their ‘annoyance’ at
different configurations of speaker-screen setups as
a measure of disparity [22]. It may be unsurprising
that experts are sometimes able to create situational
models in circumstances where novices were not
[73]. This demonstrates a capacity to construct a
schema which may require more discrimination than
a novice can apply.
However, attention is a prerequisite for presence
[12]. Freedom of head movement meant, in nearly
all instances, users’ attention was oriented towards
the visual target. The ventriloquist effect may not
depend on deliberate visual attention [18], but sound
cueing may have enabled it, essentially contributing
to its own ‘capture’ by image.
The compound effect of naïve listening and viewing
should not be underestimated. If gains in spatial
sound rendering do not address visual biasing in VR,
and imprecise and/or inconsistent sound-image
correspondence has no bearing on presence, a
useful question might be: for how long? Both in
terms of the individual, and the market.
6 Conclusions
In novel experiences, users may have an elevated
willingness to suspend disbelief. This willingness to
engage and be spatially present, may be the very
obstacle to its assessment. Participants had
experienced neither VR nor dynamic binaural sound,
perhaps tolerating disparity in sound-image cues
whilst still motivated to maintain a plausible
schema. Prior studies, despite encountering high
presence ratings when varying spatial sound
rendering in VEs [74], have made much progress in
assessing how technical rendering correlates to
perceptual measures. Yet research examining human
factors’ impact on presence is still needed [75,76]
and the current study supports this.
Overall, this study concurs with research showing
that spatial presence and perceptual realism do not
affect enjoyment, underlining the need for presence
dimensions to be treated separately during
analyses [7] and perhaps developed further to
separate out the perceptual correlates of spatial
sound from image.
References
[1] Hendrix C, Barfield W. The Sense of Presence
within Auditory Virtual Environments.
Presence Teleoperators Virtual Environ
[2] Larsson P, Vastfjall D, Kleiner M. Better
presence and performance in virtual
environments by improved binaural sound
rendering. Audio Eng. Soc. Conf. 22nd Int.
Conf. Virtual Synth. Entertain. Audio, Audio
Engineering Society; 2002.
[3] Slater M. A note on presence terminology.
Presence Connect 2003;3:15.
[4] Freeman J, Lessiter J. Here, there and
everywhere: the effects of multichannel audio
on presence 2001.
[5] Serafin S, Serafin G. Sound Design to Enhance
Presence in Photorealistic Virtual Reality.
ICAD, 2004.
[6] Rumsey F. Spatial Audio. Taylor & Francis;
[7] Skalski P, Whitbred R. Image versus Sound: A
Comparison of Formal Feature Effects on
Presence and Video Game Enjoyment.
PsychNology J 2010;8:67–84.
[8] Bailenson JN, Blascovich J, Beall AC, Loomis
JM. Interpersonal Distance in Immersive
Virtual Environments. Pers Soc Psychol Bull
[9] Lee S, Kim GJ, Rizzo A, Park H. Formation of
spatial presence: by form or content. Proc. 7th
Annu. Int. Presence Workshop Valencia Spain,
[10] Ravaja N, Saari T, Turpeinen M, Laarni J,
Salminen M, Kivikangas M. Spatial Presence
and Emotions during Video Game Playing:
Does It Matter with Whom You Play? Presence
Teleoperators Virtual Environ 2006;15:381–92.
[11] Meehan M, Insko B, Whitton M, Brooks Jr FP.
Physiological measures of presence in stressful
virtual environments. ACM Trans Graph TOG
[12] Wirth W, Hartmann T, Böcking S, Vorderer P,
Klimmt C, Schramm H, et al. A process model
of the formation of spatial presence
experiences. Media Psychol 2007;9:493–525.
[13] Driver J, Spence C. Attention and the
crossmodal construction of space. Trends Cogn
Sci 1998;2:254–62.
[14] Spence C. Crossmodal correspondences: A
tutorial review. Atten Percept Psychophys
2011;73:971–95. doi:10.3758/s13414-010-
[15] Clark A. Cross-modal cuing and selective
attention. Senses Class Contemp Philos
Perspect Oxf Univ Press Oxf 2010.
[16] Föcker J, Hötting K, Gondan M, Röder B.
Unimodal and Crossmodal Gradients of Spatial
Attention: Evidence from Event-related
Potentials. Brain Topogr 2010;23:1–13.
[17] Alais D, Burr D. The Ventriloquist Effect
Results from Near-Optimal Bimodal
Integration. Curr Biol 2004;14:257–62.
[18] Bertelson P, Vroomen J, De Gelder B, Driver J.
The ventriloquist effect does not depend on the
direction of deliberate visual attention. Percept
Psychophys 2000;62:321–32.
[19] Morein-Zamir S, Soto-Faraco S, Kingstone A.
Auditory capture of vision: examining temporal
ventriloquism. Cogn Brain Res 2003;17:154
[20] Spence C, Driver J. Audiovisual links in
endogenous covert spatial attention. J Exp
Psychol Hum Percept Perform 1996;22:1005–30.
doi:10.1037/0096-1523.22.4.1005.
[21] Driver J, Spence CJ. Spatial synergies between
auditory and visual attention. In: Umiltà C,
Moscovitch M, editors. Atten. Perform. 15
Conscious Nonconscious Inf. Process.,
Cambridge, MA, US: The MIT Press; 1994, p.
[22] Komiyama S. Subjective evaluation of angular
displacement between picture and sound
directions for HDTV sound systems. J Audio
Eng Soc 1989;37:210–4.
[23] Woszczyk W, Bech S, Hansen V. Interaction
between audio-visual factors in a home theater
system: definition of subjective attributes.
Audio Eng. Soc. Conv. 99, Audio Engineering
Society; 1995.
[24] Počta P, Beerends JG. Subjective and Objective
Assessment of Perceived Audio Quality of
Current Digital Audio Broadcasting Systems
and Web-Casting Applications. IEEE Trans
Broadcast 2015;61:407–15.
[25] Hollier MP, Rimell AN, Hands DS, Voelcker
RM. Multi-modal perception. BT Technol J
[26] Mori M, MacDorman KF, Kageki N. The
Uncanny Valley [From the Field]. IEEE Robot
Autom Mag 2012;19:98–100.
[27] Grimshaw M. The audio Uncanny Valley:
Sound, fear and the horror game 2009.
[28] Rumsey F. Sound Field Control. J Audio Eng
Soc 2013;61:1046–50.
[29] Larsson P, Västfjäll D, Kleiner M. Spatial
auditory cues and presence in virtual
environments. Submitt Int J Hum-Comput Stud
[30] Begault DR. 3-D sound for virtual reality and
multimedia 2000.
[31] Larsson P, Västfjäll D, Kleiner M. Effects of
auditory information consistency and room
acoustic cues on presence in virtual
environments. Acoust Sci Technol
[32] Mills AW. On the Minimum Audible Angle. J
Acoust Soc Am 1958;30:237–46.
[33] Birchfield ST, Gangishetty R. Acoustic
localization by interaural level difference.
Acoust. Speech Signal Process. 2005
ProceedingsICASSP05 IEEE Int. Conf. On, vol.
4, IEEE; 2005, p. iv 1109.
[34] Lloyd J. Binaural signal detection- Vector
theory(Vector correlation theory and neural
mechanisms of binaural signal detection in
human auditory system). vol. 2. 1972.
[35] Hartmann WM. Localization of sound in rooms.
J Acoust Soc Am 1983;74:1380–91.
[36] Trahiotis C, Stern RM. Lateralization of bands
of noise: Effects of bandwidth and differences
of interaural time and phase. J Acoust Soc Am
1989;86:1285–93. doi:10.1121/1.398743.
[37] Yost WA, Zhong X. Sound source localization
identification accuracy: bandwidth
dependencies. J Acoust Soc Am
2014;136:2737–46. doi:10.1121/1.4898045.
[38] Dalton P, Hughes RW. Auditory attentional
capture: implicit and explicit approaches.
Psychol Res 2014;78:313–20.
[39] Juslin PN. From everyday emotions to aesthetic
emotions: Towards a unified theory of musical
emotions. Phys Life Rev 2013;10:235–66.
[40] Parmentier FBR. Towards a cognitive model of
distraction by auditory novelty: The role of
involuntary attention capture and semantic
processing. Cognition 2008;109:345–62.
[41] Parmentier FBR, Elsley JV, Ljungberg JK.
Behavioral distraction by auditory novelty is
not only about novelty: The role of the
distracter’s informational value. Cognition
[42] Wallach H, Newman EB, Rosenzweig MR. A
Precedence Effect in Sound Localization. J
Acoust Soc Am 1949;21:468.
[43] Begault D. Perceptual effects of synthetic
reverberation on 3-D audio systems 1992.
92_Perceptual_Effects.pdf (accessed August
28, 2016).
[44] Hoffmann PF, Møller H. Some observations on
sensitivity to HRTF magnitude. J Audio Eng
Soc 2008;56:972–82.
[45] Hofman P, Van Opstal J. Identification of
spectral features as sound localization cues in
the external ear acoustics. Int. Work-Conf.
Artif. Neural Netw., Springer; 1997, p. 1126
[46] Begault DR. Auditory and non-auditory factors
that potentially influence virtual acoustic
imagery. Audio Eng. Soc. Conf. 16th Int. Conf.
Spat. Sound Reprod., Audio Engineering
Society; 1999.
[47] Begault DR, Wenzel EM, Anderson MR. Direct
comparison of the impact of head tracking,
reverberation, and individualized head-related
transfer functions on the spatial perception of a
virtual speech source. J Audio Eng Soc
[48] Mendonça C, Campos G, Dias P, Vieira J,
Ferreira J, Santos J. On the improvement of
localization accuracy with nonindividualized
HRTF-based sounds. J Audio Eng Soc 2012.
[49] Blauert J. Spatial Hearing: The Psychophysics
of Human Sound Localization. MIT Press;
[50] Noble W, Byrne D, Lepage B. Effects on sound
localization of configuration and type of
hearing impairment. J Acoust Soc Am
1994;95:992–1005. doi:10.1121/1.408404.
[51] Abel SM, Hay VH. Sound Localization: the
Interaction of Aging, Hearing Loss and Hearing
Protection. Scand Audiol 1996;25:3–12.
[52] Abel SM, Giguère C, Consoli A, Papsin BC.
The effect of aging on horizontal plane sound
localization. J Acoust Soc Am
2000;108:743–52. doi:10.1121/1.429607.
[53] Dobreva MS, O’Neill WE, Paige GD. Influence
of aging on human sound localization. J
Neurophysiol 2011;105:2471–86.
[54] Neher T, Laugesen S, Jensen NS, Kragelund L.
Can basic auditory and cognitive measures
predict hearing-impaired listeners’ localization
and spatial speech recognition abilities? J
Acoust Soc Am 2011;130:1542–58.
[55] Häusler R, Colburn S, Marr E. Sound
Localization in Subjects with Impaired Hearing.
Acta Otolaryngol (Stockh) 1983;96:1–62.
[56] Lombard M, Ditton T. At the Heart of It All:
The Concept of Presence. J Comput-Mediat
Commun 1997;3:00. doi:10.1111/j.1083-
[57] Schubert TW. A New Conception of Spatial
Presence: Once Again, with Feeling. Commun
Theory 2009;19:161–87. doi:10.1111/j.1468-
[58] Jerome C, Darnell R, Oakley B, Pepe A. The
Effects of Presence and Time of Exposure on
Simulator Sickness. Proc Hum Factors Ergon
Soc Annu Meet 2005;49:2258–62.
[59] Kim T, Biocca F. Telepresence via Television:
Two Dimensions of Telepresence May Have
Different Connections to Memory and
Persuasion. J Comput-Mediat Commun
1997;3:00. doi:10.1111/j.1083-
[60] Västfjäll D, Larsson P, Kleiner M.
Development and validation of the Swedish
viewer-user presence questionnaire (SVUP)
[61] Dillon C, Keogh E, Freeman J. “It’s been
emotional”: Affect, physiology, and presence.
Proc. Fifth Annu. Int. Workshop Presence Porto
Port., 2002.
[62] Slater M, Brogni A, Steed A. Physiological
responses to breaks in presence: A pilot study.
Presence 2003 6th Annu. Int. Workshop
Presence, vol. 157, Citeseer; 2003.
[63] Figner B, Murphy RO. Using skin conductance
in judgment and decision making research.
Handb. Process Tracing Methods Decis. Res.,
Psychology Press; 2011, p. 163–84.
[64] Dillon C, Keogh E, Freeman J, Davidoff J.
Aroused and immersed: the psychophysiology
of presence. Proc. 3rd Int. Workshop Presence
Delft Univ. Technol. Delft Neth., 2000, p. 27–8.
[65] Wiederhold BK, Jang DP, Kaneda M, Cabral I,
Lurie Y, May T, et al. An investigation into
physiological responses in virtual
environments: an objective measurement of
presence. Cyberpsychology Mind Cogn Soc
Internet Age 2001.
[66] Pike C, Melchior F, Tew T. Descriptive
Analysis of Binaural Rendering with Virtual
Loudspeakers Using a Rate-All-That-Apply
Approach, Aalborg, Denmark: 2016.
[67] MultiSpeakerBRIR - Sofaconventions n.d.
ex.php/MultiSpeakerBRIR (accessed August
29, 2016).
[68] Shotton M, Pike C, Melchior F. A Motorised
Telescope Mount as a Computer-Controlled
Rotational Platform for Dummy Head
Measurements. 136th AES Conv., Berlin,
Heidelberg: 2014.
[69] Nixon T, Bonney A, Melchior F. A Reference
Listening Room for 3D Audio Research. Int.
Conf. Spat. Audio, 2015.
[70] Boucsein W. Electrodermal Activity. Springer
Science & Business Media; 2012.
[71] Salimpoor VN, Benovoy M, Longo G,
Cooperstock JR, Zatorre RJ. The Rewarding
Aspects of Music Listening Are Related to
Degree of Emotional Arousal. PLOS ONE
[72] Lessiter J, Freeman J, Keogh E, Davidoff J. A
Cross-Media Presence Questionnaire: The ITC-
Sense of Presence Inventory. Presence
Teleoperators Virtual Environ 2001;10:282–97.
[73] Morrow D, Leirer V, Altieri P, Fitzsimmons C.
When expertise reduces age differences in
performance. Psychol Aging 1994;9:134–48.
[74] Larsson P, Västfjäll D, Olsson P, Kleiner M.
When what you hear is what you see: Presence
and auditory-visual integration in virtual
environments. Proc. 10th Annu. Int. Workshop
Presence, 2007, p. 118.
[75] Jurnet IA, Beciu CC, Maldonado JG. Individual
differences in the sense of presence. Proc. 8th
Int. Workshop Presence “Presence 2005”
Univ. Coll. Lond., Citeseer; 2005.
[76] Alsina-Jurnet I, Gutiérrez-Maldonado J.
Influence of personality and individual abilities
on the sense of presence experienced in anxiety
triggering virtual environments. Int J Hum-Comput
Stud 2010;68:788–801.