
Foregrounding Expectation: Spatial Sound-image Composition in Virtual Environments


Abstract

Spatial treatment of diegetic sound in virtual reality (VR) is generally moderate, conceding to environmental realism, particularly for cinematic VR. The affordances of the medium far exceed this strategy, but require consideration and skills in areas which may be distributed across divergent specialisms in the production ensemble. Strategies which unify the conceptual sound-image agenda of projects are useful for practical as well as theoretical reasons, with VR being a multi-disciplinary undertaking. This article examines the contemporary sound-making context for VR, and suggests a potential framework for spatial sound design (SSD) in virtual environments. It suggests a means to elicit sound-image disjunctures, whilst maintaining a sense of immersion. It does not present a fully developed framework, offering instead an experimental outline. Utilising knowledge of perceptual and generative processes, this approach aims to enable SSD practitioners to experiment with sound-image correspondences systematically, reducing the additional time resource needed and the risk of adversely impacting user experience, whilst balancing predictability and complexity.
Keywords: virtual reality; Markov Chain; spatial sound design
The New Soundtrack 7.2 (2017): 95–110
DOI: 10.3366/sound.2017.0102
© Edinburgh University Press and Angela McArthur & Stefano Kalonaris
Current levels of investment and interest in virtual reality (VR)
technologies suggest that the medium is being taken very seriously. That
it is ‘market ready’ is the result of converging developments in computing
capability, component size, and manufacturing (thus pricing), together
with a steady trend toward immersive experiences (Gartner 2016).
The temporary nihilism afforded by vivid, multi-sensory envelopment
suits our over-stimulated minds, releasing us from lingering anxieties.
And quickly. We can be transported into a substitute corporeal world, and
can play inhabitant, without the inconvenience of having to physically
prepare, transition, or consider the consequences of our actions in that
world. In the panoptic information age, where all is knowable, all is
surveilled, such abandon may be welcomed.
That this medium might be considered the most ‘complete’ work of art
(or Gesamtkunstwerk) isn’t mere hyperbole; it alerts us to the need for
(if not the dearth of current) research in VR (Biocca 1999; Davis 2016),
as well as to the importance such research could have for practice, to elicit
reflexivity whilst forging ahead with production as a form of epistemology.
We risk much more than a low return on investment with VR.
Having reduced the sense of technological mediation (paradoxically, by
employing more and more complex technology), VR engages a fuller range
of our sensoria; it interfaces more directly with our perceptual and
cognitive processes than did its predecessor technologies. This rendered
reality - contrived, constructed and codified - is less distinguishable from
direct experience, particularly for younger users, which heightens the
ethical urgency for us to understand what is at stake (Madary and
Metzinger 2016). It is this potency that allures. Sound-image
practitioners, drawn as we are to its unique affordances, would do well
to cast our minds back to earlier incarnations of sound-image works, to
perhaps identify the cyclical elements at work in what appear as wholly
original concerns. Eisler’s (1947) critique of the Gesamtkunstwerk is
particularly incisive in discussing the problem of ‘unifying’ film music
(which he saw as antiquated, even bourgeois) with film, a comparatively
more materially exciting development at the time.
VR has an impressive list of potentially concerning effects associated
with its use, including depersonalisation, cyber-sickness (a form of motion
sickness), after-effects (disturbed locomotion, changes in postural control,
perceptual-motor disturbances, flashbacks, drowsiness, fatigue) and
illusions of embodiment whose effects on behaviour and psychology can
endure (Rizzo, Schultheis and Rothbaum 2003; Madary and Metzinger
2016):
Unlike physical environments, virtual environments can be modified
quickly and easily with the goal of influencing behaviour ... The
comprehensive character of VR plus the potential for the global
control of experiential content introduces opportunities for new and
especially powerful forms of both mental and behavioral
manipulation. (Madary and Metzinger 2016)
Although manipulation is a loaded term, it can be extremely useful.
It is this extremity of impact which should alert us to the need to approach
forward movement with the caution of reflexivity. For this, the notion of
criticality, of being enmeshed in the object of critical enquiry (Rogoff 2004)
is germane. VR is a medium which asks us to abdicate our perceptual
apparatus in place of its own. The excitement this educes may be
counterpointed by its physical and financial production and consumption
demands. Yet our fatigue in meeting these demands may keep us further
from criticality.
As practitioners of sound in space for image, what can be thought of
as ‘spatial sound design’ (SSD), we may find ourselves working with
animators, software developers, cinematographers and dramaturges in
one contained project. Not only that, we must ourselves inhabit each
quadrant in the Krebs Cycle of Creativity (Oxman 2016) of artist, scientist,
engineer and designer. Without the skills to implement the technology,
we cannot design the sound; without the skills to design sound we cannot
communicate our expression and risk failing to engage our publics;
without the skills to express artistic concepts we do no more than model
conceptions of realism and perpetuate notional stasis; and without the
skills to model scientifically-informed realism we produce unintentional
glitches, arbitrarily converging (or failing to converge with) principles of
auditory perception in a haphazard manner.
Even with skills in all four quadrants, production may be somewhat
chaotic. All VR production exists within a fragmented ecosystem without
end-to-end solutions. We must practically and conceptually navigate
Krebs’ four quadrants - under pressure, trouble-shooting, hacking
solutions, learning new and changing technologies - whilst attending to
our aesthetic sensibilities.
Modelling realism is a requisite step, if we are deliberately to move
beyond it. This article focuses on designed diegetic sound (a problematic
term which here is further troubled through its interchangeability with
‘environmental sound’) and the artistic impulse to criticality, as an ongoing
act of (personal or collective) interrogation. It is particularly appropriate
for the cinematic content, which is situated at the ‘passive’ end of the
passive-interactive continuum for VR. Cinematic work is least likely to
recover from the compound effect of production-based errors of judgement
which may, for screen-based media, have been remediated in
post-production. Sound design must inhabit this high-pressure,
experiment-intolerant setting. It is the claim of this article that sound is
deft to the task.
Space and time are the immanent contingencies of immersive
environments. They decree both its opportunities, and its challenges.
Like other forms of VR, cinematic VR contends with users’ freedom of
head movement, which opposes the desire to direct their attention. Sound
responds by cueing attention effortlessly (McArthur 2016). VR faces
an apparent conflict of interaction versus emotional engagement. Sound
can reflect both states simultaneously. Cinematic VR is limited by its
(generally) monoscopic format which reduces it to a flattened spherical
image. Sound is presented to the ears in a form (sound waves) which
retains its original dimensionality. Not only that. The subjective and
conscious sense of presence (‘being there’) is enhanced by sound (see
Hendrix and Barfield 1996; Serafin and Serafin 2004). It positively affects
attentional resourcing and task performance (Driver and Spence 1994;
Spence and Driver 1996).
The efficacy of these effects attests to the fact that audition in immersive
settings, where the user is ego-centrically situated, serves an adaptive
function – one which helps us survive in our environment. This at times
enjoys a primacy which any intellectual, emotional, aesthetic or narrative
function cannot override. Our more sophisticated cognitive interpretations
may well be bypassed in favour of a reductive reflex response: an
answering of the ‘what’ and ‘where’ of events, to help us make sense of, and
respond to, the environment. Sound is always related to one’s own ‘being
there’ (Blesser and Salter 2009).
As such, where sound may have previously ‘added value’ to image (see
Chion 1994) for screen-based media, its value in VR production is being
reconfigured, its contribution recognised, at least by practitioners:
Directors of VR/AR are now professing that sound is playing a more
important role than ever and that it makes up for as much as 60% of the
overall experience. (Bošnjak 2016)
Its elevated impact can be achieved with relatively low computational
cost. Google Cardboard’s spatial audio computes audio in real-time,
yet is implemented on mobile devices whilst ‘... most of the processing takes
place outside of the primary CPU [central processing unit]’ (Martz 2016).
Dynamic binaural synthesis, utilising head-related transfer functions
(HRTFs), virtual loudspeakers and headset orientation data, provides
a compelling experience which can be delivered (minimally) over
standard headphones, a smartphone and cardboard headset. This affords
a sounding world where distance, location and environmental cues remain
independent when we move, and it serves to reinforce presence (Larsson,
Västfjäll and Kleiner 2002, 2005; Begault 2000) which in turn can ‘uplift’
potentially presence-breaking features of image (Clark 2005).
Advances in spatial sound rendering and reproduction techniques
mean not only that spatial audio is possible, but extremely affordable.
The free-to-low pricing of spatial audio plugins, room impulse responses,
games engines and multi-channel digital audio workstations (DAWs)
mean that amateur practitioners can contribute to this fledgling field.
In many cases monaural or stereophonic recordings are sufficient and
ambisonic (full sphere surround sound) recordings are not a vital
necessity.

1. Cockos’ Reaper affords up to 64 channels of audio per track.

That audio achieves so much with relatively little in immersive settings
should not result in complacency. If auditory discrimination seems inferior
to vision, the question we might ask is ‘for how long?’ With exposure,
we adapt quickly to novel experience, and require more sophisticated and
complex offerings to sustain our interest. For the SSD, anticipating
the aggregate effects of sound-image composition on publics, and
understanding the perceptual correlates of their sound design, is
worthwhile. It provides a foundation upon which to navigate the
technological and perceptual terrain they face.
The challenges outlined may produce interesting production constraints
and outcomes, but they are not self-imposed conceptual limitations,
designed for artistic concerns about, or scientific understanding of, process.
That production methods contain a multitude of such operational
obstacles can be enlivening. It can also result in us stopping short
at problem-solving and reproduction. Content production for VR is
costly and cumbersome, neither of which encourage experimentation.
Conventions imported from film and game sound design are
commonplace. Diegetic sound is mapped, one-to-one, to its visual
source. Non-diegetic sound is placed ‘in head’ stereophonically. This is
in part a result of time constraints, in part acceptable (if reductive) practice
based on widespread contemporary headphone listening practices. It does,
however, anchor sound to perform as an effect of image, originating from it
rather than extending to it.
In real-world discriminations, auditory perception often precedes visual
perception; we may, for example, hear distant animal calls significantly
earlier than we see the caller(s). In enclosed environments sound sources
can be visually occluded whilst remaining audible. In some cases, sound
is unbound from directly answering the questions of where and what,
and trace ambiguities of the source endure. Need sound be dialectically
neutered by image? In Altman’s essay, ‘Moving Lips: Cinema as
Ventriloquism’, the role which sound plays in completing and
reinforcing image is discussed, as well as its less well documented role of
using the image to mask its own actions. In this latter view, neither sound
nor image is redundant; the two are ‘locked in a dialectic where each is
alternately master and slave to the other’ (1980: 79). This idea develops
Eisler’s (1947) earlier proposal that sound provides a dialectic counterpoint
to image, revealing, providing an opportunity for criticality (or at least
reducing redundancy). Examples of such a dialectic are emerging. Gattai
Games’ Stifled (2017) uses the player character’s sounds and their actual
microphone input to propagate sound in the virtual environment.
These in turn produce the visual component of the world, as a form of
‘echolocation’. Too little sounding, and the world remains largely hidden,
too much and antagonists in the game will locate the player. Middleton
and Spinney’s Notes on Blindness (2016), which imaginatively represents
the sensory and psychological experience of losing sight, won the
Storyscapes Award at Tribeca Film Festival and the Alternate Realities
VR Award at Sheffield Doc/Fest.
Such cases refute ideas that immersion or presence would be at risk
if virtual environments (VEs) deviated from ‘objective’ simulation of
real-world acoustics. To uphold the necessity of realism ignores the willing
suspension of disbelief, for example the ‘Ventriloquist Effect’ (Alais and
Burr 2004; Spence and Driver 1996, 2000) where conflicting information
from different modalities is resolved to favour consistency over accuracy.
It discounts the ‘uncanny valley’ phenomenon (Mori 2012) where artificial
renderings that come close to realism are uncannier (invoking different
criteria for evaluation) than those less realistic, for sound as for image
(Grimshaw 2009; Rumsey 2013). It casts no criticality upon the symbiotic
evolution of signified and signifier, where codification can be established
more by media than real-world experience and acts upon our memory as if
concrete (Chion 1994). We may have a clear and definite representation
for the sound of a gunshot without ever hearing its referent. These
renderings affect our expectations and sit alongside unmediated material
from our personal archives. Indeed, the rise of hyperrealism might suggest
that rendered realism induces a presence which reality very often does not
(Lennox 2004).
Perhaps then, audition is more active, more selective in its own
rendering of a coherent perceptual experience, than may be apparent.
Veridicality between ‘real’ and ‘perceived’ might be called into question,
and rather than answer its call, we might loosen our grip on the impulse
to extend this close coupling to virtual worlds. Our in-built tendency
of resolving potentially conflicting information, even in complex
settings, to produce consistency in perception (Battaglia, Jacobs &
Aslin 2003), employs both voluntary and involuntary processing. This is
cognitively ‘expensive’. Yet with increasing exposure, we favour more
sophisticated content, precisely because we would otherwise become
bored. We want to be tested by content: not so much that it overwhelms,
but enough that it sustains interest (for an explanation of this process
at work in music, see North and Hargreaves 1995; Orr and Ohlsson 2005).
Such active perceptual composition, which draws upon prior exposure,
memory (actual or mediated), and personally distinct interpretations
and subjectivities, indicates that the user is explicitly co-authoring
the experience within a virtual environment (VE), as they would in a
real environment. It may be these very possibilities for ‘enactment’ - the
‘human ability to turn complex patterns into units which can be dealt
with in terms of actions’ (Leman 2016: 26) - which VR is positioned to
exploit.
VR is often described as creating experiences through which stories can be
understood, rather than more directly narrating them. A virtual space can
be seen as the latent opportunity, constraining and enabling certain
movements and interactions. ‘Place’ according to Dourish (2006) is the
style created and later decoded. However, this does not suggest the level of
co-authorship which VEs can offer, given the generative processes which
can be designed into them. VR’s enactment potential benefits from our
physiological response to it. Its potential hazards (for example, motion
sickness) demonstrate that we do not differentiate well between virtual and
real experience. We actively (de)compose its soundscape. Sound tells
us about its source, the environment, and ultimately ourselves within
this context. Drawing from Gibson’s (1982) ecological psychology, the
environment/perceiver relationship can be foregrounded to update the idea
that ‘place’ is simply decoded. Instead it becomes an inter-subjective and
evolving construction. The environment provides information via events,
from which the perceiver’s understanding and behaviour arises. The events
are dynamic, shifting and multimodal. They provide opportunities for
action. Perception is understood in relation to this action, resulting in an
acquired environmental responsiveness.
The main function of the human auditory system is to construct a sonic
percept that is useful for further prediction of the possible causes of that
percept, so that it can be handled as an action-functional object in
interaction. (Leman 2016: 78)
This perspective posits information as structure, and structure as
information. It does not separate form and meaning. It acknowledges
user agency. Every activity consisting of perception and operation imprints
meaning on an otherwise meaningless object (von Uexküll 1998) thus
turning it into a subject-related carrier of meaning.
What a thing is, and what it means are not separate, the former being
physical and the latter mental as we are accustomed to believe.
(Gibson 1982: 408)
Such an outlook emphasises inherently meaningful affordances rather than
abstraction and representation, which serve to separate referent from
concept. Abstraction effectively removes the relations between the thing
and its meaning, relations which have substance in themselves (Lennox
2004). Meaning could thus be co-constituted by the user and the
environment as a generative, moment-by-moment singularity. Pierre
Schaeffer’s acousmatic discourses prove helpful here, having extensively
dealt with the ecological concerns of sound in relation to image.
In ecological listening, the phenomenology of audition is underlined and
sound does not rely on internal codification. This echoes the acousmatic
notion of ’reduced’ listening, which asks us to suspend causal or anecdotal
sound significations and interpretations based on a sound’s original
context, concentrating instead on its material features. A virtual world
provides a novel environmental context through which sound could be
phenomenologically foregrounded, helping SSD as an artistically critical
practice by ‘capturing and reflecting on the otherwise ephemeral or
transitory nature of musical sound’ (Godøy 2006: 149). Reduced listening
in effect interrupts us, potentiating the perception of new semiotic
structures; a task which users in VR, whether novice or experienced, orient
towards (McArthur 2016). For Schaeffer, the move away from
representation, towards an autonomous, intentional unit of sound – the
‘sonorous object’ (objet sonore) – allows our consciousness to play a
constitutional role in listening (Schaeffer 1966). Yet with accumulated
exposure, events become increasingly invariant and schema more
automatic. Seen as such, SSD as critical practice echoes Schaeffer’s
aspirations for musique concrète (music which starts with the concrete and
moves to abstracted values, in contrast to classical music, which starts from
abstract notation and moves to performance):
If I gather together fragments of noises, cries of animals, the
modulated sound of machines, I myself also strive to articulate them
like words of a language that I would practise without even
understanding and without ever having learned it [...] So this is
what art is: a translation whose exactness is periodically monitored
by experiment; establishing by groping around, rigorous
correspondences between man and the world... (Chion & Reibel
1976: 47)
SSD – like the objet sonore – becomes autonomous, inasmuch as it
does not require ‘correct’ decoding for its realisation, and seeks not a
specific affect. Rather, by at times avoiding parallel sound-image
correspondence it disrupts the automata of our schema, jolting user and
perhaps also practitioner, into critical awareness. That is its affordance.
We have a means of circumventing the biases of prior experience which
prevent us from appreciating the truly novel:
When faced with a totally new situation, we tend always to attach
ourselves to the objects, to the flavor of the most recent past. We look
at the present through a rear-view mirror. We march backwards into
the future. (McLuhan and Fiore 1967: 74–75)
How might such affordances be actualised? A generative process –
systematic yet unpredictable enough to traverse the threshold of
perception – is now proposed.
Introducing discontinuities in experiential events which neither
overwhelm nor bore in their level of predictability is the aim of this
framework, ensuring relations between signified and signifier remain in
flux. To achieve this, a systematic approach is useful. For the SSD
practitioner, a systematic procedure may encourage experimentation,
reflexive comparison, or may impose some limitations and regularities in
an ocean of possibilities. It may not be obviously systematic to the user.
An impression of randomness and complexity in sound-image composition
can contribute to the illusion of an emergent reality. This ‘impression’ may
vary between users, and for the same user over time. The suspension and
violation of expectation can be manipulated (in this case through
disjunctures in sound-image composition) to create instability, and
contribute to the impression of heterogeneity. The fulfilment of
expectation meanwhile creates an impression of stability. We may
stretch a user so that their existing constructs break down,
assimilate new information or require restructuring. A user may resist the
experience of a distant avatar’s voice appearing to sound at close range,
intimate and hushed. But if this sound-image correspondence is reliable,
they will adapt to it. Such adaptation is an in-built fitness function, to have
us respond quickly and accurately to our environment. Perceptual or
semantic congruence is learnt:
... the logic of the environment is in the environment, the essence of
adaptive behaviour is that a perceiver can learn the new set of
contextual rules just as one might learn an entirely new language
without the aid of a translator. (Lennox 2004: 233)
Adaptations that prove most beneficial in predicting future events are
preserved and reinforced. Complex signs may need to be learnt, but can be
learnt without instruction. Even arbitrary relations between signs and
referents can be understood with limited exposure, leading to the
formation of expectations – internalised as a system of ‘transitional
probabilities.’ This is a tacit awareness of the statistical regularities in
sequences of events (Saffran et al. 1999), a kind of embodied syntax.
Meanings arise from the fissure between the syntax and the user’s
awareness of it, in its fullness. Enactment could help catalyse this process
in a VE.
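The kind of tacit statistical learning described by Saffran et al. can be made concrete by counting: an observer exposed to a sequence of events internalises ‘transitional probabilities’ that are, in effect, normalised transition counts. A minimal sketch, with an invented event sequence:

```python
from collections import Counter, defaultdict

def estimate_transitions(sequence):
    """Estimate transitional probabilities from an observed event sequence."""
    counts = defaultdict(Counter)
    # Count how often each event is immediately followed by each other event.
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    # Normalise each row of counts into a probability distribution.
    return {state: {t: c / sum(followers.values())
                    for t, c in followers.items()}
            for state, followers in counts.items()}

model = estimate_transitions(list("ABABCABABC"))
# In this invented sequence, 'A' is always followed by 'B' (probability 1.0),
# while 'B' is followed by 'A' and 'C' equally often (0.5 each).
```

No instruction is involved: the regularities are recovered from exposure alone, which is precisely the adaptive mechanism the framework relies upon.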
With the passing of time, our sense of environmental affordances and
possible future events, is reduced in scope. Why is this? With sparse
information about a structure, more is potential. Information provides
definition, edges, a sense of ‘fit’. With temporally and causally contingent
events, our expectations are guided (and indeed guide our perception in
turn) as to what will happen next.
The ability to imagine causation based on evidence is a process of
induction. This involves going back and forth between the observed state
of the world (posterior) and what caused it (prior). When one’s
expectations are repeatedly not met, a process of active inference may
ensue, and new evidence may be incorporated into a revised belief about
the prior. This process may continue recursively until a reliable
correspondence is established.
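The back-and-forth between posterior and prior described above can be caricatured with a Beta-Bernoulli update, in which each met or violated expectation revises a belief in the reliability of a sound-image correspondence. This is an illustrative analogy only, not a model the article proposes, and all numbers are arbitrary:

```python
# Beta-Bernoulli belief update: start from a Beta(alpha, beta) prior over
# the reliability of a correspondence, then revise it with each met (1)
# or violated (0) expectation.
def update_belief(alpha, beta, observations):
    for met in observations:
        if met:
            alpha += 1
        else:
            beta += 1
    return alpha / (alpha + beta)  # posterior mean reliability

# A user starts maximally uncertain (Beta(1, 1)) and then finds the same
# correspondence met eight times out of ten:
belief = update_belief(1, 1, [1, 1, 1, 0, 1, 1, 0, 1, 1, 1])
# posterior mean reliability = 9 / 12 = 0.75
```

Repeated violations drive the mean down until the revised prior accommodates the new regularity, mirroring the recursive process of active inference sketched in the text.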
To purposefully disrupt expectations about sound-image
correspondences, it is important that the SSD practitioner presents a
bespoke, systematic distribution of sound-image events. The user in turn
can (and will) then form increasingly accurate inferences about this
distribution (prior). This process can help induce the user into the
ecologies of the virtual world, and its enactment possibilities (its ‘space’)
through which their experience can arise (‘place’). A specific framework for
generating such sound-image distributions, might employ the formal
architecture of Markov Chains.
A Markov Chain (MC) is a generative process developed in probability
theory, used for a myriad of tasks including algorithmic composition
of chorales in the style of J. S. Bach (Allan and Williams 2004) and
jazz idiomatic musical phrases (Pachet and Roy 2001). An MC is described
as a system comprising a set of states and the probabilities of moving from
a given state to another. It is thus possible to predict, from an initial
position in the system, the next state in a sequence, having observed a
finite number of previous states. As a concrete example, let E be the set of
sound-image events {A, B, C}. The probabilities associated with the
transitions from any of these events can be expressed in tabular form,
normally referred to as a transition matrix. Table 1 illustrates an example of
such a matrix.
To calculate the probability of the system being at C after two steps,
having started at A, one could sum over all the conditional probabilities
over all possible paths. In this example, this would be equal to the
probability of the system self-transitioning from A to A, multiplied by
the probability of going from A to C in the second step ((A→A)(A→C)),
plus the probability of moving from A to B in one step multiplied by the
probability of going from B to C in the second step ((A→B)(B→C)),
plus the probability of going from A to C in one step times the
probability of self-transitioning from C to C in the second step
((A→C)(C→C)). Using the values in the transition matrix in Table 1,
one would obtain the corresponding two-step probability.
Figure 1: An expectation loop.
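This two-step computation can be sketched in a few lines. The transition probabilities below are invented for illustration; they are not the values of Table 1:

```python
# Hypothetical transition matrix over the sound-image events {A, B, C}.
# Each row gives the probabilities of moving from that event to every
# event, and must sum to 1. These values are illustrative only.
events = ["A", "B", "C"]
P = {
    "A": {"A": 0.2, "B": 0.5, "C": 0.3},
    "B": {"A": 0.4, "B": 0.4, "C": 0.2},
    "C": {"A": 0.1, "B": 0.3, "C": 0.6},
}

# Probability of being at C after two steps, having started at A:
# the sum over every intermediate state s of (A -> s)(s -> C).
p_two_step = sum(P["A"][s] * P[s]["C"] for s in events)
# 0.2*0.3 + 0.5*0.2 + 0.3*0.6 ≈ 0.34
```

Longer horizons follow the same pattern: the n-step probabilities are the entries of the transition matrix raised to the n-th power.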
The MC can generate sequences of sound-image events according to
the distribution desired by the SSD practitioner. However, it would be
unreasonable to assume that the average practitioner would engage in the
task of manually specifying numerical values for each transition. This is
particularly true if we consider how many sound-image events a VR
experience might involve, and the number of features each event could have.
An alternative to such a labour-intensive task would be to have the SSD
practitioner specify levels of salience, whereby events would be ranked
hierarchically. This could be achieved using an ordinal ranking of the SSD
practitioner’s preferences over the events. In the above example, one could
express these as transitive preferences (e.g. B more probable than A more
probable than C, in the following manner: B ≻ A ≻ C).
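One possible conversion from such a ranking to transition weights can be sketched as follows. The article leaves the mapping step open (suggesting, for instance, sliders), so the linear rank-to-weight rule here is an assumption, and the resulting distribution ignores the previous state (a rank-one special case of the MC):

```python
# Convert an ordinal preference ranking into a probability distribution
# by assigning descending integer weights and normalising. The linear
# weighting is an illustrative assumption, not the article's prescription.
def ranking_to_probabilities(ranking):
    """ranking: states ordered most preferred first, e.g. ['B', 'A', 'C']."""
    n = len(ranking)
    weights = {state: n - i for i, state in enumerate(ranking)}
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}

probs = ranking_to_probabilities(["B", "A", "C"])
# B receives weight 3, A weight 2, C weight 1, giving
# probabilities of 1/2, 1/3 and 1/6 respectively.
```

A slider-driven interface could replace the integer weights with practitioner-chosen magnitudes, leaving the normalisation step unchanged.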
This process could be applied to different levels/layers, for example to
parameters of the sound in a sound-image event. In so doing, a nested
hierarchy of salience would be created. Let us assume that the SSD
practitioner has also expressed an ordinal ranking of preferences regarding
the spatial position of the events. Concretely, let us suppose that there are
three possible locations, {pos1, pos2, pos3}. If the SSD practitioner specified
the following ranking: B ≻ A ≻ C for sound events and pos3 ≻ pos1 ≻ pos2
for the locations of the events, the MC algorithm would consequently be
applied to this two-layered hierarchical structure. To do this, an arbitrary
conversion/mapping step, which takes the preference rankings and
translates them to a set of weights to be assigned to the transition
probabilities, would be necessary. This task could be achieved by a simple
set of sliders denoting the relative magnitudes for the rankings of features
of interest. A typical output of the MC could be the following:

(B, pos3), (A, pos3), (B, pos1), ...

Table 1: An example of a transition matrix.
The above reads: sound event B at location 3, followed by sound event A at
location 3, followed by sound event B at location 1, and so on. The SSD
practitioner would be able to query the next event in the sequence one at a
time, as well as requesting a sequence of specified length which can be
arbitrarily long. This process can be extended to other features, such as
duration of sound events, or sound synthesis parameters. Figure 2
illustrates the structural flow of the process:
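The generator step can be sketched as sampling, at each tick, a sound event and a location from their respective ranked distributions. The weights below follow the B ≻ A ≻ C and pos3 ≻ pos1 ≻ pos2 example under an assumed linear rank-to-weight mapping, and the two layers are sampled independently, which the article does not mandate:

```python
import random

# Weighted distributions for the two layers, derived (hypothetically)
# from the rankings B > A > C and pos3 > pos1 > pos2 via weights 3:2:1.
event_probs = {"B": 3, "A": 2, "C": 1}
pos_probs = {"pos3": 3, "pos1": 2, "pos2": 1}

def generate(length, seed=None):
    """Sample a sequence of (sound event, location) pairs."""
    rng = random.Random(seed)  # seedable, so runs are reproducible
    events, e_weights = zip(*event_probs.items())
    positions, p_weights = zip(*pos_probs.items())
    return [(rng.choices(events, e_weights)[0],
             rng.choices(positions, p_weights)[0])
            for _ in range(length)]

sequence = generate(5, seed=1)
# A run yields pairs such as ('B', 'pos3'), ('A', 'pos3'), ('B', 'pos1'),
# in the spirit of the reading given in the text.
```

The SSD practitioner could equally query one pair at a time, or extend the tuples with further ranked layers such as event duration or synthesis parameters.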
What are the advantages of using this framework, an abstract system
which could be realised as a DAW plugin, a standalone software
application or even a physical device? Many spatial sound recording,
rendering, and playback technologies already inform the design process.
This approach serves as an addition to the toolbox, one which can produce
automated yet interesting results, which do not risk ‘breaking’ presence,
nor burden the SSD with manual programming. Its systematic and
internally coherent affordances provide an opportunity for experimentation
within structure, potentially eliciting reflexivity in user and practitioner.
The framework does not depend on conscious expectations, though the
tacit knowledge which it can produce might enable new ways of designing
sound in space.
Spatial sound design straddles a fascinating, liminal expanse. McLuhan’s
(2011) questioning of the growing divide between art and science (arising
from increasingly human-made environments) may be challenged by VR,
whose practitioners draw from all four quadrants of Oxman’s (2016) Krebs
Cycle of Creativity. McLuhan may be proved right if, in time, specialist
divisions in labour practices arise. For now, the terrain is relatively
uncharted, and interwoven.
In this framework, the virtual space is composed from aesthetic sound-
image distributions the SSD practitioner engineers, but which exceed her
own predictive ability. Generative, novel environmental structures, and
opportunities for response, are forged to create a virtual ‘place’. With space
as a compositional device, diegetic sound becomes structural. Meanings
contingent on VR’s spatio-temporal nature may emerge, and may even
resist the neutering effects of convention.
If sound is to perform as more than an anchor to image, an
enhancement of presence, or an extended margin for visual error-
correction, it may need to uphold itself not against image, but against
its own representational inheritance. If, as Altman (1980) describes, sound
asks the questions and image merely answers them, the examination of
how sound articulates such questions is judicious.
Figure 2: Flow diagram.
Angela McArthur and Stefano Kalonaris 106
If we are to ensure that
our beguilement with this technology does not anesthetise us to its potent
and potentially detrimental effects, we must retain critical reflexivity in our
engagement with it. We must expose its hidden environments – its
subliminal forces and their effects (McLuhan and Fiore 1967) – whilst we
ourselves remain enmeshed within them.
We can consider spatial sound design as an autonomous, generative
aim; something capable of producing new methods of enquiry, new kinds
of knowledge and meaning-making. This framework aims to tackle some
of its practical and aesthetic concerns whilst affording a reflexive structure.
Unbound from, yet aware of, our past, we can put technological mediation
to work. Let us use such tools for better imaginings.
The authors would like to acknowledge support from the EPSRC and
AHRC Centre for Doctoral Training in Media and Arts Technology
through Queen Mary, University of London, the Department for
Employment and Learning NI and the Sonic Arts Research Centre,
Queen’s University Belfast.
Alais, D. and D. Burr (2004), ‘The Ventriloquist Effect Results from
Near-Optimal Bimodal Integration’ in Current Biology, vol. 14,
pp. 257–62.
Allan, M. and C. Williams (2004), ‘Harmonising Chorales by
Probabilistic Inference’, in Proceedings of Advances in Neural
Information Processing Systems 17, December 13–18, Vancouver, Canada.
Altman, R. (1980), ‘Moving Lips: Cinema as Ventriloquism’ in Yale French
Studies, No. 60, Cinema/Sound, pp. 67–79.
Battaglia, P. W., R. A. Jacobs and R. N. Aslin (2003), ‘Bayesian
integration of visual and auditory signals for spatial localization’ in
Journal of Optical Society of America, vol. 20, pp. 1391–7.
Biocca, F. (1999), ‘Lab probes long-term effects of exposure to virtual
reality’, interview by Johnson, C. for EE Times,
http://www.eetimes.com/document.asp?doc_id=1138729, accessed 7 December 2016.
Blesser, B. and L. R. Salter (2009), Spaces Speak, Are You Listening?
Experiencing Aural Architecture, Cambridge, MA: MIT Press.
ˇnjak, D. (2016), ‘How 3D Sound Makes Virtual Reality More
Real’, interview by D. J. Pangburn for The Creators Project Blog,
reality-more-real, accessed 22 March 2017.
Chion, Michel (1994), Audio-Vision: Sound on Screen, trans. Claudia
Gorbman, New York: Columbia University Press.
Chion, M. and G. Reibel (1976), Les musiques électroacoustiques,
Aix-en-Provence: Ina / Edisud.
Clark, Austen (2005), ‘Cross-modal cuing and selective attention’ in
Proceedings of Individuating the Senses, Glasgow, Scotland: University of Glasgow.
Davis, Nicola (2016), ‘Long-term effects of virtual reality use need more
research, say scientists’, Guardian Online,
https://www.theguardian.com/technology/2016/mar/19/long-term-effects-of-virtual-reality-use-need-more-research-say-scientists,
accessed 4 December 2016.
Dourish, Paul (2006), ‘Re-space-ing place: place and space ten years on’
in Proceedings of Computer Supported Cooperative Work, Banff, Canada,
November 2006, pp. 299–308.
Driver, J. and C. Spence (1994), ‘Spatial synergies between auditory
and visual attention’ in C. Umiltà and M. Moscovitch (eds), Attention
and Performance XV: Conscious and Nonconscious Information Processing,
Cambridge, MA, USA: The MIT Press, pp. 311–31.
Eisler, Hanns (1947), Composing for the Films, London: Dennis Dobson.
Gartner (2016), ‘Hype Cycle for Emerging Technologies, 2016’, https://, accessed 6 February 2017.
Gattai Games (2017), ‘Stifled’,, accessed
6 April 2017.
Gibson, James (1982), Reasons for Realism: Selected Essays of James J.
Gibson, E. Reed and R. Jones (eds), New Jersey and London: Lawrence
Erlbaum Associates.
Godøy, Rolf Inge (2006), ‘Gestural-Sonorous Objects: embodied
extensions of Schaeffer’s conceptual apparatus’ in Organised Sound,
vol. 11, pp. 149–157.
Grimshaw, Mark (2009), ‘The audio Uncanny Valley: Sound, fear and the
horror game’ in Proceedings of Audio Mostly, 2009, pp. 21–26.
Hendrix, C. and W. Barfield (1996), ‘The Sense of Presence within
Auditory Virtual Environments’ in Presence, Teleoperators & Virtual
Environments, vol. 5, pp. 290–301.
Larsson, P., D. Västfjäll and M. Kleiner (2002), ‘Better presence
and performance in virtual environments by improved binaural
sound rendering’ in Audio Engineering Society 22nd
International Conference: Virtual, Synthetic and Entertainment Audio,
Larsson, P., D. Västfjäll and M. Kleiner (2005), ‘Spatial auditory cues and
presence in virtual environments’, submitted to International Journal of
Human-Computer Studies, 2005.
Leman, Marc (2016), The Expressive Moment: How Interaction
(with Music) Shapes Human Empowerment, Cambridge, MA:
MIT Press.
Lennox, Peter (2004), ‘The Philosophy of Perception in Artificial
Auditory Environments: Spatial Sound and Music’, Thesis for PhD,
University of York.
Madary, M. and T. Metzinger (2016), ‘Real Virtuality: A Code of Ethical
Conduct. Recommendations for Good Scientific Practice and the
Consumers of VR-Technology’ in Frontiers in Robotics and AI, 3:3,
February 2016.
Martz, Nathan (2016), ‘Spatial audio comes to the Cardboard SDK’,
cardboard-sdk.html, accessed 22 March 2017.
McArthur, Angela (2016), ‘Disparity in horizontal correspondence of
sound and source positioning: The impact on spatial presence for
cinematic VR’, in Audio Engineering Society International Conference on
Audio for Virtual and Augmented Reality, September 2016, AES.
McLuhan, M. and Q. Fiore (1967), The Medium is the Message:
An Inventory of Effects, New York, London, Toronto: Bantam Books.
McLuhan, M. and E. McLuhan (2011), Media and Formal Cause,
Houston, Texas: NeoPoiesis Press.
Middleton, P. and J. Spinney (2016), Notes on Blindness: Into Darkness,
VR project, France, UK: Ex Nihilo, ARTE France, AudioGaming,
Archer’s Mark,, accessed
12 April 2017.
Mori, Masahiro (2012), ‘The Uncanny Valley’ in IEEE Robotics &
Automation Magazine, vol. 19, pp. 98–100.
North, A. and D. Hargreaves (1995), ‘Subjective complexity, familiarity,
and liking for popular music’ in Psychomusicology: A Journal of Research
in Music Cognition, vol. 14, no. 1–2, p. 77.
Orr, M. and S. Ohlsson (2005), ‘Relationship between complexity and
liking as a function of expertise’ in Music Perception: An Interdisciplinary
Journal, 22(4), pp. 583–611.
Oxman, Neri (2016), ‘Krebs Cycle of Creativity’ in Joichi Ito, Design and
Science: Can design advance science, and can science advance design?, accessed 13 March 2017.
Pachet, F. and P. Roy (2001), ‘Musical harmonization with constraints:
A survey’ in Constraints, 6(1), pp. 7–19.
Rizzo, A., M. Schultheis and B. Rothbaum (2003), ‘Ethical issues for the
use of virtual reality in the psychological sciences’ in Ethical issues in
clinical neuropsychology, p. 243.
Rogoff, Irit (2004), ‘What is a theorist?’ in Was ist ein Kunstler, Munich:
Wilhelm Fink.
Rumsey, Francis (2013), ‘Sound Field Control’ in Journal of Audio
Engineering Society, vol. 61 pp. 1046–50.
Saffran, Jenny R., K. Johnson, R. Aslin and E. Newport (1999), ‘Statistical
Learning of Tone Sequences by Human Infants and Adults’ in
Cognition, vol. 70, pp. 27–52.
Schaeffer, Pierre (1966), Traité des objets musicaux, Paris: Éditions du Seuil.
Schaeffer, Pierre (1998), Solfège de l’objet sonore, Paris: INA/GRM.
Serafin, S. and G. Serafin (2004), ‘Sound Design to Enhance Presence in
Photorealistic Virtual Reality’ in Proceedings of ICAD 04-Tenth Meeting
of the International Conference on Auditory Display, Sydney, Australia,
July 2004.
Spence, C. and J. Driver (1996), ‘Audiovisual links in endogenous covert
spatial attention’ in Journal of Experimental Psychology: Human
Perception and Performance, vol. 22, pp. 1005–30.
Spence, C. and J. Driver (2000), ‘Attracting attention to the illusory
location of a sound: reflexive cross-modal orienting and ventriloquism’
in Neuroreport, vol. 11, pp. 2057–61.
von Uexküll, Jakob (1998), ‘The Theory of Meaning’ in Semiotica, 42–1,
pp. 25–82.
Angela McArthur is a PhD student on the Media & Arts Technology
programme at Queen Mary University of London. Her research focuses on
the aesthetics of spatial sound for immersive environments, developing
experimental approaches to its design. Her work synthesises technical and
creative practice with auditory perception research. She has worked with
the BBC R&D lab, the V&A, and is currently in post-production with
several Soho-based production partners on her own cinematic VR film,
which employs audio interactivity, and aims to embody and further
develop her conceptual research.
Stefano Kalonaris is currently a PhD candidate at the Sonic Arts Research
Centre, Queen’s University, investigating fruitful intersections between
Game Theory, Probabilistic Graphical Models and Free Improvisation.
He is both a seasoned performer and an avid researcher of myriad musical
styles and cultures. Stefano integrates the use of software, sensors and
electronics into his improvisational practice/work and is particularly keen
on heteronomic music models that mix improvisational and compositional
perspectives. He has performed at many renowned national and international
venues and festivals, including Sonorities, STEIM, the London Jazz
Festival and Café