Human–Computer Interaction Series

Sonic Interactions in Virtual Environments

Michele Geronazzo · Stefania Serafin, Editors
Human–Computer Interaction Series
Editors-in-Chief
Jean Vanderdonckt, Louvain School of Management, Université catholique de
Louvain, Louvain-La-Neuve, Belgium
Q. Vera Liao, Microsoft Research Canada, Montréal, Canada
The Human–Computer Interaction Series, launched in 2004, publishes books that
advance the science and technology of developing systems which are effective and
satisfying for people in a wide variety of contexts. Titles focus on theoretical
perspectives (such as formal approaches drawn from a variety of behavioural
sciences), practical approaches (such as techniques for effectively integrating user
needs in system development), and social issues (such as the determinants of utility,
usability and acceptability).
HCI is a multidisciplinary field and focuses on the human aspects in the
development of computer technology. As technology becomes increasingly more
pervasive the need to take a human-centred approach in the design and
development of computer-based systems becomes ever more important.
Titles published within the Human–Computer Interaction Series are included in
Thomson Reuters’ Book Citation Index, The DBLP Computer Science
Bibliography and The HCI Bibliography.
Michele Geronazzo · Stefania Serafin
Editors
Sonic Interactions
in Virtual Environments
Editors
Michele Geronazzo
Department of Engineering
and Management
University of Padova
Padova, Italy
Dyson School of Design Engineering
Imperial College London
London, UK
Department of Humanities and Cultural
Heritage
University of Udine
Udine, Italy
Stefania Serafin
Aalborg University
København SV, Denmark
Nordic SMC University Hub, NordForsk, Stensberggata 27, 0170 Oslo, Norway represented
by: Hans Jørgen Andersen, Professor
Università degli Studi di Udine, Dipartimento di Studi Umanistici e del Patrimonio Culturale,
Vicolo Florio n. 2/B, 33100 Udine, Italy represented by: Prof. Andrea Zannini, Department
Director
EU project SONICOM, funded by the Horizon 2020 research and innovation programme
(grant agreement No 101017743), represented by: Dr. Lorenzo Picinali, Reader
ISSN 1571-5035 ISSN 2524-4477 (electronic)
Human–Computer Interaction Series
ISBN 978-3-031-04020-7 ISBN 978-3-031-04021-4 (eBook)
https://doi.org/10.1007/978-3-031-04021-4
© The Editor(s) (if applicable) and The Author(s) 2023. The book is an open access publication.
Open Access This book is licensed under the terms of the Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribu-
tion and reproduction in any medium or format, as long as you give appropriate credit to the original
author(s) and the source, provide a link to the Creative Commons license and indicate if changes were
made.
The images or other third party material in this book are included in the book’s Creative Commons license,
unless indicated otherwise in a credit line to the material. If material is not included in the book’s Creative
Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Between stimulus and response there is a
space. In that space is our power to choose
our response. In our response lies our growth
and our freedom.
—Viktor E. Frankl
Preface
Sonic Interaction Design (SID) is the study and exploitation of sound as one of the
principal channels conveying information, meaning, esthetic, and emotional qualities
in interactive contexts. The field of Sonic Interactions in Virtual Environments
(SIVE) extends SID to immersive media, i.e., virtual/augmented/mixed reality (XR).
Considering the virtuality continuum, this book mainly focuses on virtual reality (VR),
while occasionally addressing mixed and hybrid reality settings.
The basic and most obvious assumption that motivates this volume is that it is hard
to live in a world without sound, and the same holds for virtual environments (VEs). VR
without plausible and convincing sounds feels unnatural to users. Auditory infor-
mation is a powerful omnidirectional source of learning for our interaction in real
and virtual environments. The good news brought by this book is that VR finally
sounds plausible. Advances in several fields are now able to provide an immersive
listening experience that is perceptually indistinguishable from reality which means
that immersive sounds could make interaction intrinsically natural. Auralization and
spatial audio technologies play a fundamental role in providing immersion and pres-
ence in VR applications at an unprecedented level. The combination of recent devel-
opments in VR headsets and earables further strengthens the perceptual validity of
multimodal virtual environments and experiences.
We can therefore promote a true audio-centered and audio-first design for VR
with levels of realism and immersiveness that can even surpass the visual counterpart.
Visuals, although rightly emphasized by many studies and products, are often not very
effectively enhanced and strengthened by sound. The final result is a weakening of
multisensory integration and the corresponding VR potentials that strongly determine
the quality and durability of the experience.
The editors would like to identify two starting points in the past 10 years that have
given rise to, and raised awareness of, the SIVE research area and its studies. The first episode
is symbolic: we would like to anecdotally bring back from our memories the first
meeting between us, the two editors of the book. The year was 2011, exactly 10 years
ago. Michele had recently started his Ph.D. at the Sound and Music Computing
Group of the Department of Information Engineering at the University of Padua,
under the supervision of Dr. Avanzini. The Italian Association of Musical Informatics
(AIMI) organized the workshop “Sound and Music Computing for Human-Computer
Interaction” at the ninth edition of the Biannual Conference of the Italian ACM
SIGCHI Chapter (CHItaly) in beautiful Alghero, Sardinia, in early September.
A great period for the seaside.
Michele was asked to write his first conference paper, entitled "Customized 3D
Sound for Innovative Interaction Design," to be presented at the workshop: an article
with a high-sounding title that promised a lot but delivered little, in short, an article of
which not to be proud. On the other hand, it contained some valuable references to the
egocentric audio perspective that would later be formalized in the introductory chapter of
this book. However, the reason why we tell this anecdote is that at the Ph.D. student
Michele Geronazzo's first presentation at a scientific conference, the very small
audience included Dr. Stefania Serafin. Ten years ago, we began to discuss
issues that connected sonic interaction design with immersive 3D audio in VR. The
AIMI president of that time failed to get the workshop’s contributions included in
the official ACM CHItaly proceedings despite a regular peer-review process. The
poor Ph.D. student Michele found himself without an official publication, at his first
conference, in an unknown scientific community. We like to think that at that event,
and with that meeting, something much more relevant and impactful started: SIVE.
We are here to give it a shape in this book edited and structured together.
Another temporal coincidence brings us to connect this story with the second and
official starting point of this adventure. Michele’s unpublished conference paper was
finally published within his doctoral thesis, defended in 2014, the year in which the
IEEE Virtual Reality workshop series “Sonic Interactions in Virtual Environments
(SIVE)" started (https://sive.create.aau.dk/). The mission of IEEE VR SIVE was to
increase among the virtual reality community and junior researchers the awareness
of the importance of sonic elements when designing immersive XR environments.
However, we can also identify a certain degree of reciprocity when considering
the fragmented nature and specificity of those studies aimed at developing immersive
XR environments for sound and music. First, we therefore refer to our beloved
Sound and Music Computing (SMC) network, and then we consider the Interna-
tional Community for Auditory Display (ICAD), the Audio Engineering Society
(AES), and the communities linked to the International Conference on New Inter-
faces for Musical Expression (NIME), the Digital Audio Effects (DAFX), and the
Sonic Interaction Design COST Action (COST-SID IC601, ended in 2012). All these
communities address aspects of the SIVE topics according to their specificities. No
institutional or contextual references collecting technological developments, best
practices, and creative efforts related to the peculiarities of immersive VEs existed
before the SIVE workshop. The book follows a similar philosophy, trying to give
an exhaustive view of those multidisciplinary topics already mentioned in our two
recent reviews.1 It features state-of-the-art research on real-time auralization, sonic
interaction design in VR, quality of the experience in multimodal environments, and
applications.

1 S. Serafin, M. Geronazzo, N. C. Nilsson, C. Erkut, and R. Nordahl, "Sonic interactions in virtual
reality: state of the art, current challenges and future directions," IEEE Computer Graphics and
Applications, vol. 38, no. 2, pp. 31–43, 2018.
S. Serafin et al., "Reflections from five years of Sonic Interactions in Virtual Environments
workshops," Journal of New Music Research, vol. 49, no. 1, pp. 24–34, Jan. 2020.

We aim to provide an organized starting point on which to develop
a new generation of immersive experiences and applications. Since the editors are
aware of the rapid social transformation driven by the acceleration in the development
of digital technologies, all chapters should be read as entry points. Future scenarios
and solutions will necessarily evolve by combining emerging research areas such as
artificial intelligence, ubiquitous and pervasive computing, quantum technologies,
as well as continuous discoveries in the neuroscientific field and anthropological
reflections on the authenticity of the experience in VR.
For this reason, contributing authors and editors include interdisciplinary experts
from the fields of computer science, engineering, acoustics, psychology, design,
humanities, and beyond, so that we can give the reader a broad view and a clear
introduction to the state-of-the-art technologies and design principles, and to the
challenges that might be awaiting us in the future.
Through an overview of emerging topics, theories, methods, tools, and practices
in sonic interactions in virtual environments research, the book aims to establish the
basis for further development of this new research area. The authors were invited to
contribute to specific topics according to their well-known expertise. They followed
a predefined structure outlined by the editors.
The book is divided into four parts:
Part I, Introduction: this theoretical part frames the background and the key
themes in SIVE. The editors address several phenomenological foundational issues
intending to shape a new research field from an archipelago of studies scattered in
different research communities.
Part II, Interactive and Immersive Audio: we cover the system requirement
part with four chapters introducing and analyzing audio-related technological aspects
and challenges. With some overlaps and connections, the four chapters deal with the
plausibility of an immersive rendering able to tackle the computational burden. To do
so, we deal with methods and algorithms for real-time rendering considering sound
production, propagation, and spatialization, respectively. Finally, the reproduction
and evaluation phase closes the development loop of new audio technologies.
Part III, Sonic Interactions: a sonic interaction design part devoted to emphasizing
the peculiar aspects of sound in immersive media. In particular, spatial interactions
are important where we would like to produce and transform ideas and actions to
create meaning with VR, and the virtual auditory space is an information container
that can be shaped by users. As VR systems enter people's lives, manufacturers,
developers, and creators should carefully consider an embodied experience ready to
share a common space with peers, collaboratively.
Part IV, Sonic Experiences: the last part focuses on multimodal integration
for sonic experiences in VR with the help of several case studies. Starting from a
literature review of multimodal experiments and experiences with sound, this last
part offers some reflections on the concept of audio-visual immersion and audio-
haptic integration able to form our ecology of everyday or musical sounds. Finally,
the potential of VR to transport artists and spectators into a world of imagination and
unprecedented expression is taken as an exemplar of what multimodal and immersive
experiences can elicit in terms of emotional and rational engagement.
In the following, a summary for each chapter is provided to help the reader to
follow the proposed narrative structure.
Part I
Chapter 1 illustrates the editors' vision of the SIVE research field. The main concept
introduced here is the egocentric audio perspective in a technologically mediated
environment. The listeners should be entangled with their auditory digital twins in
a participatory and enacted exploration for sense-making characterized by a person-
alized and multisensory first-person spatial reference frame. Intra-actions between
humans and non-human agents/actors dynamically and fluidly determine immersion
and coherence of the experience, participatively. SID aims to facilitate the diffraction
of knowledge in different tasks and contexts.
Part II
Chapter 2 addresses the first building block of SIVE, i.e., the modeling and synthesis
of sound sources, focusing on procedural approaches. Special emphasis is placed on
physics-based sound synthesis methods and their potential for improved interactivity
concerning the sense of presence and embodiment of a user in a virtual environment.
In Chap. 3, critical challenges in auralization systems in virtual reality and games
are identified, including progressing from modeling enclosures to complex, general
scenes such as a city block with both indoor and outdoor areas. The authors provide
a general overview of real-time auralization systems, their historical design and
motivations, and how novel systems have been designed to tackle the new challenges.
Chapter 4 deals with the concepts of adaptation in a binaural audio context, consid-
ering first the adaptation of the rendering system to the acoustic and perceptual prop-
erties of the user, and second the adaptation of the user to the rendering quality of the
system. The authors introduce the topics of head-related transfer function (HRTF)
selection (system-to-user adaptation) and HRTF accommodation (user-to-system
adaptation).
Finally, Chap. 5 concludes the second part of the book by introducing audio
reproduction techniques for virtual reality, the concepts of audio quality, and quality
of the experience in VR.
Part III
Chapter 6 opens the third part of the book devoted to SID within virtual environments.
In particular, it deals with space, a fundamental feature of VR systems, and more
generally, human experience. In this chapter, the authors propose a typology of VR
interactive audio systems, focusing on the function of systems and the role of space
in their design. Spatial categories are proposed to be able to analyze the role of space
within existing interactive audio VR products.
Chapter 7 promotes the following great opportunities offered by VR systems: to
bring experiences, technologies, and users’ physical and experiential bodies (soma)
together, and to study and teach these open-ended relationships of enaction and
meaning-making in the framework of soma design. In this chapter, the authors
introduce soma design and focus on design exemplars that come from physical
rehabilitation applied to sonic interaction strategies.
Then, Chap. 8 investigates how to design the user experience without being detri-
mental to the creative output, and how to design spatial configurations to support
both individual creativity and collaboration. The authors examine user experience
design for collaborative music-making in shared virtual environments, giving design
implications for the auditory information and the collaborative facilitation.
Finally, Chap. 9 explores the possibilities in content creation like spatial music
mixing, be it in virtual spaces or for surround sound in film and music, offered
by the development of VR systems and multimodal simulations. The authors present
some design aspects for mixing in VR, investigating existing virtual music mixing
products, and creating a framework for a virtual spatial-music mixing tool.
Part IV
Chapter 10 helps the reader to understand how sound enhances, substitutes, or modi-
fies the way we perceive and interact with the world. This is an important element
when designing interactive multimodal experiences. In this chapter, Stefania presents
an overview of sound in a multimodal context, ranging from basic experiments in
multimodal perception to more advanced interactive experiences.
Chapter 11 focuses on audiovisual experiences, by discussing the idea of immer-
sion, and by providing an experimental paradigm that can be used for assessing
immersion. The authors highlight the factors that can influence immersion and they
differentiate immersion from the quality of experience (QoE). The theoretical impli-
cations for conducting experiments on these aspects are presented, and the authors
provide a case study for subjective evaluation after assessing the merits and demerits
of subjective and objective measures.
Chapter 12 focuses on audio-haptic experiences and is concerned with haptic
augmentations that affect auditory perception, for example, how
different vibrotactile cues may affect the perceived sound quality. The authors
review the results of different experiments showing that the auditory and somatosen-
sory channels together can produce constructive effects resulting in a measurable
perceptual enhancement.
Finally, Chap. 13 examines the special case of virtual music experiences, with
particular emphasis on the performance with Immersive Virtual Musical Instruments
(IVMI) and the relation between musicians and spectators. The authors assess in
detail the several technical and conceptual challenges linked to the composition
of IVMI performances on stage (i.e., their scenography), providing a new critical
perspective.
We hope the reader finds this book informative and useful for both research and
practice with sound.
Udine, Copenhagen
September 2021
Michele Geronazzo
Stefania Serafin
Acknowledgements
We would like to thank all authors and people involved in the book for their time
and effort. In particular, the co-organizers of the IEEE Virtual Reality Workshop
SIVE participated, in different ways, in this book project. Special thanks to Helen
Desmond (Springer Computer Science Editor) and the Springer team for allowing
us to prepare this volume.
Michele: I take my place; I have been creating my place in SIVE. This book
project closes my first 10 years of academic activities. It has allowed me to reflect
on my path and interdisciplinary education, challenging my knowledge extraction
process. I went through the three "HCI waves" at my own pace: ergonomics and
engineering in Padova, psychology and cognition in Verona, embodied design and
UX in Copenhagen. I found recognition and maturity in London, and, finally, Udine
gave me the time to find my identity. I would like to thank all my mentors and peers
who made me grow in the research jungle, on numerous occasions and at different
moments in my life. A big thanks go to my beloved family who supports me and
always brings me back down to earth.
New challenges in VR are on the horizon, and I am ready to make our audio
perspective resonate!
Stefania: This year I was recognized with the Danish Sound Award for being
pivotal in securing Denmark's role as a leader in fields such as sonic interaction
design and sound and music computing, and for developing the role of sound in
international virtual reality research. This award and this book would not have been possible
without the wonderful colleagues and students of the Multisensory Experience Lab
at Aalborg University in Copenhagen, who keep me motivated and inspired on a daily
basis. It was a pleasure to host Michele as a postdoc in the lab for 2 years, and this
book is a result of that. The lab is my second family, which wonderfully complements
my first beloved family, both in Italy and in Denmark, whom I thank with all my heart
for everything they mean to me.
List of expert readers: we thank the following researchers for being our first critical
readers, providing valuable comments on specific chapters within their areas of
expertise:
Roberto Barumerli, Acoustics Research Institute, Vienna, Austria
Braxton Boren, American University, Washington, DC, US
Enzo De Sena, University of Surrey, Guildford, Surrey, UK
Michele Ducceschi, University of Bologna, Bologna, Italy
Isaac Engel, Imperial College London, London, UK
Floriana Ferro, University of Udine, Udine, Italy
Amalia de Götzen, Aalborg University Copenhagen, Copenhagen, Denmark
Marcella Mandanici, Music Conservatory “Luca Marenzio”, Brescia, Italy
Raul Masu, Universidade NOVA de Lisboa, Lisbon, Portugal
Catarina Mendonça, University of Azores, Ponta Delgada, Portugal
Fabio Morreale, University of Auckland, Auckland, New Zealand
Niels Christian Nilsson, Aalborg University Copenhagen, Copenhagen, Denmark
Dan Overholt, Aalborg University Copenhagen, Copenhagen, Denmark
Archontis Politis, Tampere University, Tampere, Finland
Sebastian Prepelita, Facebook Reality Labs, Redmond, WA, US
Giorgio Presti, University of Milano, Milano, Italy
Davide Rocchesso, University of Palermo, Palermo, Italy
Lauri Savioja, Aalto University, Espoo, Finland
Bernhard Seeber, Technische Universität München, Munich, Germany
Ana Tajadura-Jiménez, University College London, London, UK
Maarten Van Walstijn, Queen’s University Belfast, Belfast, UK
Silvin Willemsen, Aalborg University Copenhagen, Copenhagen, Denmark
Open Access Funding
This book project was supported by
•the Nordic Sound and Music Computing Network (NordicSMC)—University hub
by Nordforsk (Norway),
•the Department of Humanities and Cultural Heritage—University of Udine (Italy)
with the recognition of “Department of Excellence” by the Ministry of Education,
University and Research (MIUR) of Italy
which covered the majority of the open-access publishing costs. We are grateful to the
EU project SONICOM (grant number: 101017743, RIA action of Horizon 2020) for
its sponsorship that contributes to the full transition to open access.
Contents
Part I Introduction
1 Sonic Interactions in Virtual Environments:
The Egocentric Audio Perspective of the Digital Twin ............. 3
Michele Geronazzo and Stefania Serafin
Part II Interactive and Immersive Audio
2 Procedural Modeling of Interactive Sound Sources in Virtual
Reality ....................................................... 49
Federico Avanzini
3 Interactive and Immersive Auralization ......................... 77
Nikunj Raghuvanshi and Hannes Gamper
4 System-to-User and User-to-System Adaptations in Binaural
Audio ........................................................ 115
Lorenzo Picinali and Brian F. G. Katz
5 Audio Quality Assessment for Virtual Reality .................... 145
Fabian Brinkmann and Stefan Weinzierl
Part III Sonic Interactions
6 Spatial Design Considerations for Interactive Audio in Virtual
Reality ....................................................... 181
Thomas Deacon and Mathieu Barthet
7 Embodied and Sonic Interactions in Virtual Environments:
Tactics and Exemplars ......................................... 219
Sophus Béneé Olsen, Emil Rosenlund Høeg, and Cumhur Erkut
8 Supporting Sonic Interaction in Creative, Shared Virtual
Environments ................................................. 237
Liang Men and Nick Bryan-Kinns
9 Spatial Audio Mixing in Virtual Reality ......................... 269
Anders Riddershom Bargum, Oddur Ingi Kristjánsson,
Péter Babó, Rasmus Eske Waage Nielsen,
Simon Rostami Mosen, and Stefania Serafin
Part IV Sonic Experiences
10 Audio in Multisensory Interactions: From Experiments
to Experiences ................................................ 305
Stefania Serafin
11 Immersion in Audiovisual Experiences .......................... 319
Sarvesh R. Agrawal and Søren Bech
12 Augmenting Sonic Experiences Through Haptic Feedback ........ 353
Federico Fontana, Hanna Järveläinen, and Stefano Papetti
13 From the Lab to the Stage: Practical Considerations
on Designing Performances with Immersive Virtual Musical
Instruments .................................................. 383
Victor Zappi, Dario Mazzanti, and Florent Berthaut
Index ............................................................. 425
Editors and Contributors
About the Editors
Michele Geronazzo Ph.D., is an Associate Professor
at the University of Padova—Dept. of Management
and Engineering, and part of the coordination unit of
the EU-H2020 project SONICOM at Imperial College
London. He received his M.S. degree in Computer
Engineering (2009) and his Ph.D. degree in Informa-
tion & Communication Technologies (2014) from the
University of Padova. Between 2014 and 2021, he
worked as an Assistant Professor in Digital Media at
the University of Udine and a postdoctoral researcher at
Imperial College London, Aalborg University, and the
University of Verona in the fields of neurosciences and
simulations of complex human–machine systems. His
main research interests involve binaural spatial audio
modeling and synthesis, virtual/augmented reality, and
sound in human–computer interaction.
He is an IEEE Senior Member and part of the orga-
nizing committee of the IEEE VR Workshop on Sonic
Interactions for Virtual Environments since 2015 (chair
of the 2018 and 2020 editions). From September 2019,
he has been appointed as an Editorial Board member for
Frontiers in Virtual Reality, and he served as guest editor
for Wireless Communications and Mobile Computing
(John Wiley & Sons and Hindawi publishers, 2019). He
is a co-recipient of six best paper/poster awards and co-
author of more than 70 scientific publications. In 2015,
his Ph.D. thesis was honored by the Acoustic Society of
Italy (AIA) with the “G. Sarcedote” award.
Stefania Serafin is a Professor of Sonic interaction
design at Aalborg University in Copenhagen and the
leader of the Multisensory Experience Lab together with
Rolf Nordahl. She was previously appointed as Asso-
ciate Professor (2006–2013) and Assistant Professor
(2003–2006) at the same university. She has been
visiting researcher at the University of Cambridge and
KTH in Stockholm (2003) and visiting professor at
the University of Virginia (2002). Since 2014, she
has been the President of the Sound and Music Computing
association and, since 2018, the Project Leader of the
Nordic Sound and Music Computing network supported
by Nordforsk. She has been a part of the organizing
committee of the IEEE VR Workshop on Sonic Interac-
tions for Virtual Environments since the first edition. She
is also the coordinator of the Sound and Music Computing
Master's programme at Aalborg University. She received her Ph.D.
entitled “The sound of friction: computer models, playa-
bility and musical applications” from Stanford Univer-
sity in 2004, supervised by Professor Julius Smith III.
She is the co-author of more than 300 papers in the
fields of sound and music computing, sound for virtual
and augmented reality, sonic interaction design, and new
interfaces for musical expression.
Contributors
Agrawal Sarvesh R. Bang & Olufsen, Struer, Denmark;
Technical University of Denmark, Department of Photonics Engineering, Lyngby,
Denmark
Avanzini Federico Laboratory of Music Informatics, Department of Computer
Science, University of Milano, Milano, Italy
Babó Péter Department of Architecture, Design, and Media Technology, Aalborg
University Copenhagen, Copenhagen, Denmark
Barthet Mathieu Centre for Digital Music, Queen Mary University of London,
London, United Kingdom
Bech Søren Bang & Olufsen, Struer, Denmark;
Department of Electronic Systems, Aalborg University, Aalborg, Denmark
Berthaut Florent University of Lille, Lille, France
Brinkmann Fabian Audio Communication Group, Technical University of Berlin,
Berlin, Germany
Bryan-Kinns Nick Queen Mary University of London, London, United Kingdom
Deacon Thomas Media and Arts Technology CDT, Queen Mary University of
London, London, United Kingdom
Erkut Cumhur Multisensory Experience Lab, Aalborg University Copenhagen,
Copenhagen, Denmark
Eske Waage Nielsen Rasmus Department of Architecture, Design, and Media
Technology, Aalborg University Copenhagen, Copenhagen, Denmark
Fontana Federico Department of Mathematics, Computer Science and Physics,
University of Udine, Udine, Italy
Gamper Hannes Microsoft Research, Redmond, USA
Geronazzo Michele Department of Engineering and Management, University of
Padova, Padova, Italy;
Dyson School of Design Engineering, Imperial College London, London, UK;
Department of Humanities and Cultural Heritage, University of Udine, Udine, Italy
Høeg Emil Rosenlund Multisensory Experience Lab, Aalborg University Copen-
hagen, Copenhagen, Denmark
Ingi Kristjánsson Oddur Department of Architecture, Design, and Media Tech-
nology, Aalborg University Copenhagen, Copenhagen, Denmark
Järveläinen Hanna Institute for Computer Music and Sound Technology, Zurich
University of the Arts, Zurich, Switzerland
Katz Brian F. G. Sorbonne Université, CNRS, UMR 7190, Institut Jean Le Rond
d’Alembert, Paris, France
Mazzanti Dario Independent researcher, Genoa, Italy
Men Liang Liverpool John Moores University, Liverpool, United Kingdom
Olsen Sophus Béneé Multisensory Experience Lab, Aalborg University Copen-
hagen, Copenhagen, Denmark
Papetti Stefano Institute for Computer Music and Sound Technology, Zurich
University of the Arts, Zurich, Switzerland
Picinali Lorenzo Imperial College London, London, UK
Raghuvanshi Nikunj Microsoft Research, Redmond, USA
Riddershom Bargum Anders Department of Architecture, Design, and Media
Technology, Aalborg University Copenhagen, Copenhagen, Denmark
Rostami Mosen Simon Department of Architecture, Design, and Media Tech-
nology, Aalborg University Copenhagen, Copenhagen, Denmark
Serafin Stefania Department of Architecture, Design, and Media Technology,
Aalborg University Copenhagen, Copenhagen, Denmark
Weinzierl Stefan Audio Communication Group, Technical University of Berlin,
Berlin, Germany
Zappi Victor Northeastern University, Boston, MA, United States
Part I
Introduction
Chapter 1
Sonic Interactions in Virtual
Environments: The Egocentric Audio
Perspective of the Digital Twin
Michele Geronazzo and Stefania Serafin
Abstract The relationships between the listener, physical world, and virtual envi-
ronment (VE) should not only inspire the design of natural multimodal interfaces
but should be discovered to make sense of the mediating action of VR technologies.
This chapter aims to transform an archipelago of studies related to sonic interactions
in virtual environments (SIVE) into a research field equipped with a first theoret-
ical framework with an inclusive vision of the challenges to come: the egocentric
perspective of the auditory digital twin. In a VE with immersive audio technolo-
gies implemented, the role of VR simulations must be enacted by a participatory
exploration of sense-making in a network of human and non-human agents, called
actors. The guardian of such locus of agency is the auditory digital twin that fosters
intra-actions between humans and technology, dynamically and fluidly redefining
all those configurations that are crucial for an immersive and coherent experience.
The idea of entanglement theory is here articulated mainly in an egocentric spatial
perspective related to emerging knowledge of the listener's perceptual capabilities.
This is an actively transformative relation with the digital twin potentials to create
movement, transparency, and provocative activities in VEs. The chapter contains an
original theoretical perspective complemented by several bibliographical references
and links to the other book chapters that have contributed significantly to the proposal
presented here.
M. Geronazzo (✉)
Department of Engineering and Management, University of Padova, Padova, Italy
e-mail: michele.geronazzo@unipd.it
Dyson School of Design Engineering, Imperial College London, London, UK
Department of Humanities and Cultural Heritage, University of Udine, Udine, Italy
S. Serafin
Department of Architecture, Design, and Media Technology, Aalborg University Copenhagen,
Copenhagen, Denmark
e-mail: sts@create.aau.dk
© The Author(s) 2023
M. Geronazzo and S. Serafin (eds.), Sonic Interactions in Virtual Environments,
Human–Computer Interaction Series, https://doi.org/10.1007/978-3-031-04021-4_1
1.1 Introduction
Our daily auditory experience is characterized by immersion from the very beginning
of our life inside the womb, actively listening to sounds surrounding us from different
positions in space. Auditory information takes the form of a binaural continuous
stream of messages to the left and right ears, conveying a compact representation of
the omnidirectional source of learning for our existence [19,48]. Both temporal and
spatial activity of sounds of interest (e.g., dialogues, alarms, etc.) allow us to localize
and encode the contextual information and intentions of our social interaction [1].
The hypothesis that our daily listening experience of sounding objects with cer-
tain physical characteristics dynamically shapes the acoustic features by which we
ascribe meaning to our auditory world is supported by one of the key concepts
in Husserl’s phenomenology “Meaning-bestowal” (“Sinngebung” in German [73])
and by studies in ecological acoustics such as [48,54,96]. In particular, the idea of
acoustic invariant as a complex pattern of change for a real-world sound interaction is
strongly related to human perceptual learning and a socio-cultural mediation dictated
by the real world. For some surveys of classical studies on the topic of ecological
acoustics refer to [112].
From this perspective, acoustic invariants are learned on an individual basis
through experiential learning. Hence, there is the need to trace their development
over multiple experiences and to formalize a common ground for a dynamic expan-
sion of individual knowledge. Any emerging understanding should be transferred to
a technological system able to provide an immersive and interactive simulation of
a sonic virtual environment (VE). Such a process must be adaptive and dynamic to
ensure a level of coupling between user and technology in such a way that the active
listening experience is considered authentic.
Immersive virtual reality (here we generically referred to as VR) technologies
allow immense flexibility and increasing possibilities for the creation of VEs with
relationships or interactions that might be ontologically relevant even if radically
different from the physical world. This can be evident by referring to the distinction
between naturalistic and magical interactions, where the latter can be considered
observable system configurations in the domain of artificial illusions, incredibly
expanding the spectrum of possible digital experiences [13,127].
One of the main research topics in the VR and multimedia communities is ren-
dering. For decades, computer-aided design applications have favored—in the first
place—the development of computer graphics algorithms. Some of these approaches,
e.g., geometric ray-tracing methods, have been adapted to model sound propagation
in complex VEs (see Chap. 3 for more details). However, there has been a clear
tendency to prioritize resources and research on the visual side of virtual reality,
confining auditory information to a secondary and ancillary role [158]. Although
sound is an essential component of the grammar of digital immersion, relatively
little compared to the visual side of things has been done to investigate the role of
auditory space and environments. Nowadays, there is increasing consensus toward
the essential contribution of spatial sound, also in VR simulations [9,102,145].
Technologies for spatial audio rendering are now able to convey perceptually plau-
sible simulations with stimuli that are reconstructed from real-life recordings [18] or
historical archives, as for the Cathédrale Notre-Dame de Paris before and after the
2019 fire [79], getting closer to a virtual version indistinguishable from the natural
reality [77]. This is made possible by a high level of personalization in modeling
user morphology and acoustic transformations caused by the human body interact-
ing with the sound field generated in room acoustic computer simulations [17,78,
114].
Nowadays, the boundary between technology and humans has increasingly
blurred thanks to recent developments in research areas such as virtual and aug-
mented reality, artificial intelligence, cyber-physical systems, and neuro-implants.
It is not possible to easily distinguish where the human ends and the technology
begins. For this reason, we embrace the idea of [10] who sees technology as a lens
for the understanding of what it means to be human in a changing world. We can
therefore consider the phenomenal transparency [94] where technology takes on the
role of a transparent mediator for self-knowledge. According to Loomis [88], the
phenomenology of presence between physical and virtual environments places the
internal listener representation created by the spatial senses and the brain on the same
level. Human-technology-reality relations are thus created by enactivity that allows
a fluid and dynamic entanglement of all the involved actors.
In this chapter, we initially adopt Slater’s definition of presence for an immersive
VR system [135] embracing the recent revision by Skarbez [134]. The concepts of
plausibility illusion and place illusion are central to capturing the subjective internal
states. While the plausibility illusion determines the overall credibility of a VE in
terms of subjective expectations, the place illusion establishes the quality of having
sensations of being in a real place. They are both fundamental in providing credibility
to a digital simulation based on individual experience and expectations concerning
an internal frame of reference for scenes, environments, and events.1
We propose a theoretical framework for the new field of study, namely Sonic Inter-
actions in Virtual Environments (SIVE). We suggest from now on a unified reading
of this chapter with references and integrations from all chapters of the correspond-
ing book [49]. Each chapter provides state-of-the-art challenges and case studies
for specific SIVE-related topics curated by internationally renowned scientists and
their collaborators. The provided point of view focuses on the relations between real
auditory experience and technologically mediated experiences in immersive VR.
The first is characterized by individuality to confer immersiveness within a physical
world. It is important to emphasize the omnidirectionally of auditory information
that allows the listener to collect both the whole and the parts at 360◦. The indi-
vidualized auditory signals are the result of the acoustic transformations made by
the head, ear, and torso of the listener that act as a spatial fingerprint for a complex
spatio-temporal signal. Familiarity, and therefore previous experience with sounds,
shape spatial localization capabilities with high intersubjectivity. Finally, studies on
1 For a dedicated discussion on the basic notions related to presence, please refer also to Chap. 11
in this volume.
neural plasticity of the human brain confirm continuous adaptability of listening with
impaired physiological functions, e.g., a hearing loss, and with electrical stimulation,
e.g., via cochlear implants [82].
The mediated VR experience is often characterized by the user’s digital coun-
terpart called avatar. It allows the creation of an embodied and situated experience
in digital VEs. The scientific literature supports the idea that the manipulation of
VR simulations can induce changes at the cognitive level [124], such as in educa-
tional [34] and therapeutic [106] positive effects. The ability of VR technologies to
mediate within the immersive environment in embodied and situated relations gives
immersive technologies the opportunities to change one’s self [151].
For these reasons, we believe it is time to coin, at the terminological level, a new
perspective that relates the two listening experiences (i.e., real and virtual), called
egocentric audio perspective. In particular, we refer to the term audio to identify
an auditory sensory component, implicitly recalling those technologies capable of
immersive and interactive rendering. The term egocentric refers to the perceptual
reference system for the acquisition of multisensory information in immersive VR
technologies as well as the sense of subjectivity and perceptual/cognitive individ-
uality that shape the self, identity, or consciousness. In accordance with Husserl’s
phenomenology, the human body can be philosophically defined as a “Leib”, a living
body, and a “Nullpunkt”, a zero-point of reference and orientation [73].
This perspective aims to extend the discipline of Sonic Interaction Design [44] by
taking into account not only the importance of sound as the main channel conveying
information, meaning, aesthetic, and emotional qualities, but rather an egocentric per-
spective of entanglement between the perceiving subject and the computer simulating
the perceived environment. In the first instance, this can be described by processes of
personalization, adaptation, and mutual relations to maintain the immersive illusion.
However in this chapter, we will try to argue that it is much more than that. We hope
that our vision will guide the development of new immersive audio technologies and
conscious use of sound design within VEs.
The starting point of this theoretical framework is an ecologically egocentric per-
spective. The foundational phenomenological assumption considers a self-propelled
entity with agency and intentionality [47]. It can interact with the VE being aware
of its activities in a three-dimensional space. The active immersion in a simulated
acoustic field provides it meaningful experiences through sound.
Therefore, it is important to introduce a terminological characterization of what
is the listener, not a user in this context, as a human being with prior experience
and subjective auditory perception. A closely related entity is the auditory digital
twin, which differs from the most common avatar. The idea of an avatar within a
digital simulation co-located with objects, places, and other avatars [126] requires a
user taking control of any form of virtual bodies which might be noticeably different
from that of the listener's physical body. On the other hand, the digital twin
cannot disregard an egocentric perspective of the listener for whom it is created. This
means that the relations with the VEs should consider personalization techniques on
the virtual body closely linked to the listener’s biological body. This mediation is
essential for the interactions between the listener and all the diegetic sounds, whether
they are produced by the avatar’s gestures or by sound sources in the VE.
In such a context, immersiveness is a dynamic relationship between physical
and meaningful actions by the listener in the VE. Specifically, performing bodily
practices such as walking, sitting, talking, grasping, etc. provides meaning to
virtual places, objects, and avatars [59]. Accordingly, the sense of embodiment can
be considered a subjective internal feeling which is an expression of the relationship
between one’s self and such VE. In this regard, Kilteni et al. [80] identified the
sense of embodiment for an artificial body (i.e., avatar) in the mediation between the
avatar’s properties and their processing by the user’s biological properties.
We now introduce the technological mediation in the form of an auditory digital
twin which is a guardian and facilitator of (i) the sense of self-location, (ii) plau-
sibility, (iii) body ownership, and (iv) agency for the listener. In the first instance,
a performative view might make us see realities as "a doing", enacting practical
actions [6,104]. Similarly, the listener and the avatar cannot be considered fixed
and independent interacting entities, but constituent parts of emergent, multiple and
dynamic phenomena resulting from entangled social, cognitive, and perceptual ele-
ments. This intra-systemic action of entangled elements dynamically constructs
identities and properties of the immersive listening experience. The illusory perma-
nence of auditory immersion lies in the boundaries between situationally entangled
elements in fluid and dynamic situations. They can be seen as confrontations occur-
ring exactly in the auditory digital twin that facilitates the phenomenon. The auditory
digital twin is the meeting and shared place between the listener and a virtual body
identity, communicating in a non-discursive (performative) way according to the
quality level of the digital simulation.
In an immersive VE, the listeners cannot exist without their auditory digital twin
and vice versa. Through the digital twin characterization, the acoustic signals gen-
erated by the VE are filtered exclusively for the listeners, according to their ability
to extract meaningful information. It is worthwhile to mention the participatory
nature of such an entanglement process between listener and digital twin, as a joint
exploration of the listener's attentional process in selecting meaningful information,
e.g., the cocktail party effect [20]. We might speculate by considering a simulation
that interacts within the digital twin to provide the best pattern or to discover it in
order to attract the listener’s attention. The decision-making process will then be the
result of intra-action in and of the auditory digital twin.
This chapter has three main sections. Section 1.2 gathers the different souls that
characterize the research and artistic works in SIVE. Section 1.3 holds a central
position by defining the constitutive elements of our proposed egocentric audio per-
spective in SIVE: spatial centrality and entanglement between human and computer
in the digital twin. In Sect. 1.4, we attempt incorporating this theoretical framework
by adapting Milgram and Kishino’s well-known taxonomy for VR [95], with an
audio-first perspective. Finally, Sect. 1.5 concludes this chapter by encouraging a
new starting point for SIVE. We suggest an inclusive approach to the next paradigm
shift in the field of human-computer interaction (HCI) discipline.
1.2 SIVE: From an Archipelago to a Research Field
This chapter aims to provide an interpretation of an archipelago of research from
different communities such as:
•Sound and Music Computing (SMC) network, a point of convergence for different
research disciplines mainly related to digital processing of musical information.2
•International Community for Auditory Display (ICAD), a point of convergence
for different areas of research with digital processing of non-musical audio infor-
mation and the idea of sonification in common.3
•The Audio Engineering Society (AES), the main community for institutions and
companies devoted to the world of audio technologies.4
•The research community gathered by the International Conference on New Inter-
faces for Musical Expression (NIME), devoted to interactions with new interfaces
with the aim at facilitating the human creative process.5
•The Digital Audio Effects community (DAFX) aiming at designing technological-
based simulations of sonic phenomena.6
We employ here the metaphor of an archipelago because it describes well a context
in which all these communities address aspects of VR according to their specificities,
influencing each other. After all, they share the same "waters". They are relatively
close to each other, yet at the same time feel distant from a VR community, like
the islands of an archipelago in the open sea. Thus, we affirm the need to unify
the fragmentation and specificity of those studies and to fill the gap with their visual
counterparts in developing immersive VR environments for sound and music.
To achieve this goal, the editors have pursued the following spontaneous path that is
characterized by three main steps.
1. The first review article related to SIVE topics, dating back to 2018 [128], focused
on the technological components characterizing an immersive potential for inter-
active sound environments. In that work, the editors and their collaborators pro-
duced a first compact survey including sound synthesis, propagation, rendering,
and reproduction with a focus on the ongoing development of headphone tech-
nologies.
2. Two years later, we published a second review paper together with all the organiz-
ers of the past five editions of the IEEE Virtual Reality’s SIVE workshop [129].
In this paper, we analyzed the contributions presented at the various editions
highlighting the emerging aspects of interaction design, presence, and evalua-
tion. An inductive approach was adopted, supported by a posteriori analysis of
the characterizing categories of SIVE so far.
2 https://smcnetwork.org/
3 https://icad.org/
4 https://www.aes.org/
5 https://nime.org
6 https://www.dafx.de/
Fig. 1.1 The SIVE inverse pyramid, with three layers from the tip upwards: immersive
audio (numerical simulations, digital signal processing, acoustics), sonic interactions
(human-computer interaction, psychology, movement analysis), and multimodality and
applications (computer vision and graphics, immersive VR/AR technologies). Arrows
indicate high-level relational hierarchies
3. Finally, this book and, in particular, this chapter want to raise the bar further
with an organic and structured narrative of an emerging discipline. We aim to
provide a theoretical framework for interpreting and accompanying the evolution
of SIVE, focusing on the close relationship between physically real and virtual
auditory experiences described in terms of immersive, coherent, and entangled
features.
This chapter is the result of the convergence of two complementary analytical
strategies: (i) a top-down approach describing the structure given by the editors to
the book originated from the studies experienced by the editors themselves, and
(ii) a bottom-up approach drawing on the knowledgeable insights of the contributing
authors of this book on several specialist and interdisciplinary aspects. Consequently,
we will constantly refer to these chapters in an attempt to provide a unified and long-
term vision for SIVE.
Our proposal for the definition of a new research field starts from a simple layer
structure without claiming to be exhaustive. The graphical representation in Fig. 1.1
is capable of giving an overview and a rough inter-relation of the multidisciplinarity
involved in SIVE. We suggest a hierarchical structure for the various disciplines in
the form of an inverted pyramid representation. SIVE research can be conceptually
organized in three levels:
(i) Immersive audio concerns the computational aspects of the acoustical-space
properties of technologies. It involves the study of acoustic aspects, the psychoacoustic,
computational, and algorithmic representation of auditory information, and
the development of enabling audio technologies;
(ii) Sonic interaction refers to human-computer interplay through auditory feed-
back in 3D environments. It comprises the study of vibroacoustic information
and its interaction with the user to provide abstract meanings, specific indicators
of the state for a process or activity in interactive contexts;
(iii) The integration of immersive audio in multimodal VR/AR systems impacts
different application domains. This third and final level collects all the studies
regarding the integration of virtual environments in different application domains
such as rehabilitation, health, psychology, music, to name but a few.
The immersive audio layer is a strongly characterizing element of SIVE. For such
a reason, it is placed as the tip of the inverse pyramid, where all SIVE development
opportunities originate. In other words, SIVE cannot exist without sound spatializa-
tion technologies, and the research built upon them is intrinsically conditioned by the
level of technological development (for more arguments on this issue see Sect. 1.3.2).
In particular, spatial audio rendering through headphones involves the computa-
tion of binaural room impulse responses (BRIRs) to capture/render sound sources in
space (see Fig. 1.2). BRIRs can be separated into two distinct components: the room
impulse response (RIR), which defines room acoustic properties, and the head-related
impulse response (HRIR) or head-related transfer function (HRTF, i.e., the HRIR in
the frequency domain), which acoustically describes the individual contributions of
the listener’s head, pinna, torso, and shoulders. The former describes the acoustic
space and environment, while the latter prepares this information into perceptually
relevant spatial acoustic cues for the auditory system, taking advantage of the flex-
ibility of immersive binaural synthesis through headphones and state-of-the-art
consumer head-mounted displays (HMDs) for VR. The perceptually coherent aural-
ization with lifelike acoustic phenomena, taking into account the effects of near-field
acoustics and listener specificity in user and headphones acoustics, is a key techno-
logical matter here [11,21,68].
Fig. 1.2 High-level acoustic components for immersive audio with a focus on spatial room
acoustics and headphone reproduction: the binaural room impulse response (BRIR) combines
the listener's body (head-related impulse response, HRIR) with the room acoustics (spatial
room impulse response, SRIR), followed by headphone reproduction (headphone impulse
response, HpIR)
The visual component of spatial immersion is so evident that it may seem that the
sensation of immersion is exclusively dependent on it, but the aural aspect has as much
or even more relevance. We can simulate an interactive listening experience within
VR using standard components such as headsets, digital signal processors (DSPs),
inertial sensors, and handheld controllers. Immersive audio technologies have the
potential to revolutionize the way we interact socially within VR environments and
applications. Users can navigate immersive content employing head motions and
translations in 3D space with 6 degrees of freedom (DoF). When immersive audi-
tory feedback is provided in an ecologically valid interactive multisensory experience, a perceptually plausible scheme for developing sonic interactions becomes practically convenient [128], while remaining efficient in terms of computational power, memory, and latency (refer to Chap. 3 for further details). The trade-off between accuracy and plausibility
is complex and finding algorithms that can parameterize sound rendering remains
challenging [62]. The creation of an immersive sonic experience requires
•Action sounds: sounds produced by the listener that change with their movements,
•Environmental sounds: sounds produced by objects in the environment, referred
to as soundscapes,
•Sound propagation: acoustic simulation of the space, i.e., room acoustics,
•Binaural rendering: user-specific acoustics that provides for auditory localization.
These are the key elements of virtual acoustics and auralization [153] at the basis of auditory feedback design, which draws on user attention and enhances the sensation of place and space in virtual reality scenarios [102].
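As a structural sketch only (function and parameter names are illustrative, not taken from the chapter), the four elements can be thought of as stages of a per-frame rendering chain:

```python
def render_frame(action_sounds, environmental_sounds, propagate, binauralize):
    """Illustrative per-frame chain: gather sources, simulate the room, render binaurally.

    action_sounds / environmental_sounds: lists of mono signal blocks for this frame
    propagate:   callable simulating sound propagation (room acoustics) on the dry sources
    binauralize: callable applying the (ideally personalized) HRTF-based rendering
    """
    dry_sources = list(action_sounds) + list(environmental_sounds)  # listener- and world-driven sounds
    wet_scene = propagate(dry_sources)        # sound propagation: room acoustic simulation
    return binauralize(wet_scene)             # binaural rendering: user-specific spatial cues

# Hypothetical usage with trivial stand-ins for the two processing stages
frame = render_frame(
    action_sounds=[[0.1, 0.2]],               # e.g., a footstep block
    environmental_sounds=[[0.0, 0.05]],       # e.g., a soundscape block
    propagate=lambda sources: [sum(s) for s in zip(*sources)],   # naive mix as a placeholder
    binauralize=lambda mono: [(x, x) for x in mono],             # placeholder dual-mono "binaural"
)
print(frame)
```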
The two upper layers of the SIVE inverse pyramid, i.e., sonic interactions and
multimodal experiences, are not clearly distinguishable and we propose the following
interpretation: we differentiate the interaction from the experience layer when we
intend to extrapolate design rules for the sonic component that carry different meanings for the designer, the system, the users, etc. In both cases, embodiment and proprioception are essential, naturally supporting multimodality in VR presence. This leads to a certain difficulty in generalization, which is well accounted for by our egocentric audio perspective. In our proposed theoretical framework, the hierarchies initially identified can change dynamically.
Ernst and Bülthoff’s theory [41] suggests how our brain combines and merges
different sources of sensory information. The authors described two main strategies:
sensory combination and integration. The former aims at maximizing the information extracted from each modality in a non-redundant manner; the latter aims at finding congruence and reducing variability in the redundant sensory information in search of greater perceptual reliability. Both strategies consider a bottom-up
approach to sensory integration. In particular, the concept of dominance is associ-
ated with perceptual reliability from each specific sensory modality given the specific
stimulus. This means that the main research challenge for SIVE is not only to foster
research aimed at understanding how humans process information from different
sensory channels (psychophysics and neuroscience domains), but especially how
multimodal VEs should distribute the information load to obtain the best experi-
ence for each individual. Accordingly, we assume that each listener has personal
optimization strategies to extract meaning from redundant sensory information dis-
tributions. The VR technology can improve if and only if it can have a sort of dialogue
with the listener to understand such a natural mixture of information.
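In its standard maximum-likelihood formulation, commonly used to operationalize Ernst and Bülthoff's integration strategy (the notation below is ours, not the chapter's), the combined estimate weights each modality by its reliability:

```latex
\hat{s} \;=\; \sum_i w_i \,\hat{s}_i,
\qquad
w_i \;=\; \frac{1/\sigma_i^{2}}{\sum_j 1/\sigma_j^{2}},
\qquad
\sigma_{\text{int}}^{2} \;=\; \Big(\sum_j 1/\sigma_j^{2}\Big)^{-1} \;\le\; \min_j \sigma_j^{2},
```

where the estimate and variance of modality i could be, for example, auditory and visual location estimates; the dominant modality is simply the most reliable one for the given stimulus, and the integrated estimate is never less reliable than the best single cue.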
The design process of multimodal VEs must also constantly take into account the limitations, i.e., the characterization, of VR technologies, with the aim of creating real-time interactions with the listener. According to Pai [108], interaction models
can be described as a trade-off between accuracy and responsiveness. Increasing the
descriptive power and thus the accuracy of a model for a certain phenomenon leads to
processing more information before providing an output in response to a parametric
configuration. It comes at the price of higher latency for the system. For multisensory
models that should synchronize different sensory channels, this is crucial and has to
be carefully balanced with many other concurrent goals.
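A minimal sketch of this trade-off (model names, accuracy scores, and latency figures are invented for illustration) simply picks the most accurate interaction model whose predicted processing time still meets the rendering deadline:

```python
# Hypothetical catalogue of interaction models: (name, relative accuracy, latency in ms)
MODELS = [
    ("modal_synthesis_full", 0.95, 18.0),
    ("modal_synthesis_reduced", 0.85, 7.0),
    ("sample_playback", 0.60, 1.5),
]

def pick_model(deadline_ms: float):
    """Return the most accurate model whose latency fits the rendering deadline."""
    feasible = [m for m in MODELS if m[2] <= deadline_ms]
    return max(feasible, key=lambda m: m[1]) if feasible else MODELS[-1]

# At a 10 ms audio buffer the reduced model wins; at 20 ms the full model does.
print(pick_model(10.0)[0], pick_model(20.0)[0])
```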
Understanding interactions between humans and their everyday physical world should not only inspire the design of natural multimodal interfaces but should be directly incorporated into VE models and simulation algorithms. This message is strongly
supported by Chap. 10 and our theoretical framework fully integrates this vision by
trying to further extend this perspective to non-human agents. The role of the digital
simulation and the computer behind it is participation and discovery for the listener.
They constitute a complex system whose interactions contribute to the dynamic def-
inition of non-linear narratives and causal relationships that are crucial for immer-
sive experiences. The application contexts of the interactive simulations inform the trade-off between model accuracy and responsiveness. Hence, knowledge of the listener's perceptual-cognitive capabilities emerges through active transformations in multimodal digital VR experiences.
1.3 Egocentric Audio
A large body of research in computational acoustics focused on the technical chal-
lenges of quantitative accuracy characterizing engineering applications, simulations
for acoustic design, and treatment in concert halls. Such simulations are very expen-
sive in terms of computational resources and memory, so it is not surprising that
the central role of perception in rendering has gradually come into play. The search is for lower bounds at which perceptually authentic audio-visual renderings can still be achieved (see Chap. 5 for a more detailed discussion). Continuous knowledge exchange between psychophysical research and the development of interactive algorithms allows new hypotheses to be tested and responsive VR solutions to be proposed. It is worthwhile to mention the topic of artificial reverberation and the modeling of reverberation time, which aim to provide a sense of presence through the main spatial qualities of a room, e.g., its size [83,147].
In the context of SIVE, we could review and adapt the three paradigm shifts, or
“waves” in HCI mentioned by Harrison [64], which still coexist and are at the center
of research agendas for different scientific communities. The first wave considers
the optimization of interaction in terms of the human factor in an engineered system.
We could mention as an example the ergonomic, but generic, "one fits all" solutions of dummy heads and binaural microphones for capturing acoustic scenes [110]. The
second wave introduces a connection between man and machine in terms of infor-
mation exchange, looking for similarities and common ground in decision-making
processes, e.g., memory and cognition. The structural inclusion of non-linearities
and auditory Just-Noticeable Differences (JNDs) to determine the amount of infor-
mation to be encoded for gesture sonification is an example of this direction [38].
Finally, the third paradigm shift considers interaction as a situated, embodied, and
social experience, characterized by emotions and complex relations encountered in
everyday life. We could place here many of the case studies collected in this volume
(Parts III and IV). In this regard, the extracted patterns or best practices are often very specific to each study and listener group, e.g., musicians vs. non-musicians (Chap. 9).
From developments in phenomenological [93] and, more recently,
post-phenomenological thinking [74,150], we will therefore develop the egocen-
tric audio perspective. The key principle is the shift from interaction between defined objects to intra-action within a phenomenon whose main actors are human
and non-human agents. Boundaries between actors are fluidly determined, similarly
to the Gibsonian ecological theory of perception [54,55]. Even though this is a
shift from an anthropocentric and user-centered view toward a system of enactive
relations and associations in the immersive world of sounds, we chose the term ego-
centric to emphasize the spatial anchoring between humans and technology in the constitution of self-knowledge.
It is also useful to refer to the concept of ambiguity by the philosopher Maurice Merleau-Ponty, for whom all experiences are ambiguous: they are composed not of things with a defined, identifiable essence, but rather of open or flexible styles or patterns of interaction and development [93,123]. Starting from an
egocentric spatial perspective of immersive VR, the learning and transformation
processes of the listeners occur when their attention is guided toward external vir-
tual sounds, e.g., the out-of-the-head and externalized stimuli. This allows them to
achieve meaningful discoveries also for their auditory digital twins. Accordingly, the
experience mediated by a non-self, i.e., auditory simulation of VEs, is shaped (i) by
the past experience of the listener and the digital twin indistinctly acquired from a
physical or cybernetic world in a constructivist sense, (ii) by the physical-acoustic
imprinting induced or simulated by the body, head and ears, and (iii) by active
and adaptive processes of perceptual re-learning [57,160] induced by a symbiosis
with technology. Figure 1.3 schematizes and simplifies this relationship between
man-technology-world from which the listener acquires meaning. As pointed out by
Vindenes and Wasson [151], experiences are mediated in a situated way by the subjectivity of the listener, who constitutes herself in relation to the objectivity of the VE. Placing the physical and virtual worlds at the same level yields similar internal representations for the listener and her digital twin, allowing us to promote the transformative role of VR experiences for a human-reality relationship altered after exposure.
Fig. 1.3 Technological mediation of the auditory digital twin (adapted from Hauser et al. [66]). [Figure: the listener-auditory digital twin (subjectivity) and the real/virtual environment (objectivity) are related through perception/experience and actions/practices.]
The core of our framework is an ideal auditory digital twin: an essential mediator
and existential mirror for an egocentric audio perspective. Technology is the mediator
of this intentional relationship co-constituting both the listener and her being in the
world. From this post-phenomenological perspective of SIVE, we are interested in
understanding how the VE relates to the listener and what is the meaning of the VEs
for the listener, at the same time. Our main goal is to characterize the mediating
action between the listener and the VE by an auditory digital twin. This guardian can
reveal the listener’s ongoing reconfiguration through the human-world relationship
occurring outside the VR experience.
In the remainder of this chapter, we will motivate the opportunity to refer to this non-human entity, other than the self yet aspiring to be the mediator for the self. This first philosophical excursus of a hermeneutical nature allows us to take a forward-looking view of the SIVE discipline, framing the current state of the art while also including the rapid technological developments and ethical challenges due to the digital transformation.
1.3.1 Spatial Centrality
The three-dimensionality of the action space is one of the founding characteristics of
immersive VE. Considering such space of transmission, propagation, and reception
of virtually simulated sounds, sonic experiences can assume different meanings and
open up to many opportunities.
Immersive audio in VR can be reproduced both through headphones and loud-
speaker arrays determining a differentiation between listener- and loudspeaker-
centric perspectives. The latter seems to decentralize the listener role in favor of
a strong correlation between virtual and physical (playback) space. In particular,
sound in VEs is decoded for the specific loudspeaker arrangements in the physi-
cal world (for a summary of the playback systems refer to Chap. 5). This setup
allows the coexistence of several listeners in the controlled playback space, depend-
ing on the so-called sweet spot. However, the VE and the listener-avatar mapping are intrinsically egocentric and multisensory, subordinating a loudspeaker-centric perspective for the simulation of the auditory field to a listener-centric one. Let us try
to clarify this idea with a practical example: head movements and the navigation
system, e.g., redirected walking [101], determine the spatial reference changes for
the real/virtual environment mapping corresponding to the listener’s dynamic explo-
ration. The tracking system could trigger certain algorithmic decisions to maintain
the place and plausibility illusions of the immersive audio experience.
1.3.1.1 First Person Point of View
In this theoretical framework, we focus on the listener’s perspective, where sound is
generated from the first-person point of view (generally referred to as 1PP). Virtual
sounds are shaped by spatial hearing models: auralization takes into account the
individual everyday listening experience both in physical-acoustic and non-acoustic
terms. Contextual information relates the spatial positions of sound events and objects to the avatar's virtual body, creating a sense of proximity and meaningful relations for the listener.
It is relevant to stress the connection between the egocentric audio perspective and the research field of egocentric vision, which has a history of more than twenty years. The latter is a subfield of computer vision that involves the analysis of images and videos captured by wearable cameras, e.g., Narrative Clips7 and GoPro8, considering
an approximation of the visual field due to a 1PP. From this source of information,
spatio-temporal visual features can be extracted to conduct various types of recogni-
tion tasks, e.g., of objects or activities [100], and analysis of social interactions [2].
The egocentric audio perspective originates from the same 1PP in which both space
and time of events play a fundamental role in the analysis and synthesis of sonic inter-
actions. Furthermore, we stress the idea that all hypotheses and evaluations in both
egocentric vision and audition are individually shaped around a human actor. How-
ever, our vision does not focus exclusively on the analysis of the listener behaviors
but includes generative aspects thanks to the technological mediation of the spatial
relations between humans and VEs (these aspects will be extensively discussed in
Sect. 1.3.2).
Using a simplification adopted in Chap. 2 concerning the work by Stockburger [140] on sounds in video games, we can distinguish two categories for sound
effects: (i) those related to the avatar’s movements and actions (e.g., footsteps, knock-
ing on a door, clothing noises, etc.) and (ii) the remaining effects produced by the VE.
In this simple distinction, it is important to note that all events are echoic, i.e., they
produce delays and resonances imprinted by the spatial arrangements of the avatar-
VE configurations depending on the acoustical characteristics of the simulated space.
7 http://getnarrative.com/
8 https://gopro.com/
Moreover, all events should be interpreted by the listener’s memory which is shaped
by the natural everyday reality.
Finally, it is worth noticing that the egocentric 1PP poses novel challenges in the field of cinematic VR narration or, more generally, of storytelling in VR. Gödde et al. [56] identified immersive audio as an essential element able to direct attention to events/objects outside the field of view. The distinction between the active role
of the listener interacting with the narrative or passive role as an observer raises
interesting questions about the spatial and temporal positioning of scenic elements.
The balance between environment, action, and narration is delicate. Citing Gödde and
collaborators, one “can only follow a narrative sufficiently when temporal and spatial
story density are aligned with each other”. Hence, the spatio-temporal alignment of
sound is crucial.
For most researchers interested in sound, from the neurological to the aesthetic-
communicative level, it is clear that while the visual object exists primarily in space,
the auditory stimulus occurs in time. Therefore, it is not surprising that in order to
speak of spatial centrality in audio we need to consider presence, the central attribute
for a VR experience. In his support of a representational view of it, Loomis [88]
cites two scientists with two opposite opinions: William Warren and Pavel Zahorik, the former an expert in visual VR and the latter in acoustic VR. The former supports a rationalist view of representational realism and direct perception [154], while the latter supports the ecological perspective of fluidity in perception-action [159].9
The second perspective supports the concept of enaction such that it is impossible to
separate perception from action in a systematic way. Perception is inherently active
and reflexive in the self. Recalling Varela, another leading supporter of this perspec-
tive [148], experience does not happen within the listener but is instead enacted by the
listener by exploring the environment. Accordingly, we consider an embodied, envi-
ronmentally situated perceiver where sensory and motor processes are inseparable
from the exploratory action in space. At first glance, such a view restricts experiences to only those generated by specific motor skills, which are in turn induced by biological, psychological, and cultural contexts. However, this is generally not true in a digital-twin-driven VE (see Sect. 1.4.3).
1.3.1.2 Binaural Hearing
The geometric and material features of the environment are constituent elements of
the virtual world that must be simulated in a plausible way for that specific listener.
First of all, the listener-environment coupling is unavoidable and must guarantee sound localization performance good enough to maintain immersiveness. In particular, it has to avoid inside-the-head spatial collapse, i.e., when the virtual sound stimuli are perceived inside the head, a condition opposite to the natural listening experience of outside-the-head localization for surrounding sound sources, also
9 Atherton and Wang [4] recently developed a similar viewpoint comparison and proposed a set of design principles for VR, born from the contrast between "doing vs. being".
called externalization [131]. Externalization can be considered a necessary but not sufficient condition for the place illusion, i.e., being immersed in that virtual acoustic space. In a recent review of the literature on this topic, Best et al. [8] suggest that
ambient reverberation and sensorimotor contingencies are key indicators for elicit-
ing a sense of externalization, whereas HRTF personalization and consistent visual
information may reinforce the illusion under specific circumstances. However, the
intra-action between these factors is so complex that no univocal priority princi-
ples can be applied. Accordingly, we should explore dynamic relations depending
on specific links between evolving states of the listener-VE system during the VR
experiences. Moreover, huge individual differences in the perception of externalization require in-depth exploration of several individual factors, such as monaural and binaural HRTF spectral features and temporal processes of adaptation [27,65,146].
Binaural audio and spatial hearing have been well-established research fields for
more than 100 years and have received relevant contributions from information and
communications technologies (ICT) and in particular from digital signal process-
ing. Progress in digital simulations has made it possible to replicate with increasing
accuracy the acoustic transformation by the body of a specific listener with very high
spatial resolution up to sub-millimeter grids for the outer ear [113,114]. This pro-
cess generates acoustically personalized HRTFs so that the rendering of immersive
audio matches the listener’s acoustic characterization (System-to-User adaptation
in Chap. 4). On the opposite side, the VE can train and guide the listener in a pro-
cess of User-to-System adaptation by designing ad-hoc procedures for continuous
interaction with the VE to induce a persistent recalibration of the auditory sys-
tem to non-individual HRTFs.10 These two approaches can be considered two poles
between which one can define several mixed solutions. This dualism is brilliantly
exposed and analyzed in Chap. 4.
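As an illustration of the System-to-User pole only, the sketch below selects a non-individual HRTF set from a database by minimizing a log-spectral distance to whatever reference response is available; the selection criterion and database are assumptions, and the procedures discussed in Chap. 4 may instead rely on perceptual ratings, anthropometry, or adaptation protocols.

```python
import numpy as np

def log_spectral_distance(h_ref: np.ndarray, h_cand: np.ndarray) -> float:
    """RMS difference between magnitude responses in dB (single direction, single ear)."""
    ref_db = 20 * np.log10(np.abs(np.fft.rfft(h_ref)) + 1e-12)
    cand_db = 20 * np.log10(np.abs(np.fft.rfft(h_cand)) + 1e-12)
    return float(np.sqrt(np.mean((ref_db - cand_db) ** 2)))

def select_hrtf(reference_hrir: np.ndarray, database: dict) -> str:
    """Pick the database entry closest to the reference HRIR (a System-to-User adaptation proxy)."""
    return min(database, key=lambda name: log_spectral_distance(reference_hrir, database[name]))

# Hypothetical database of measured HRIRs for one direction
rng = np.random.default_rng(0)
db = {f"subject_{i:03d}": rng.standard_normal(256) for i in range(5)}
print(select_hrtf(rng.standard_normal(256), db))
```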
1.3.1.3 Quality of the Mediated Experience
Since our theoretical framework aims to go beyond user-centricity, we approach the space issue from different perspectives, that of the user and that of the technology. However, all points of view remain ecologically anchored to the egocentric 1PP of the listener, giving rise to a fundamental question: how can we obtain high-quality sonic interactions for a specific listener-technology relation? In principle,
many quality assessment procedures might be applied to immersive VR systems.
However, there is no adequately in-depth knowledge of the technical-psychological-
cognitive relationship regarding spatial hearing and multisensory integration pro-
cesses linked to plausibility and technological mediation.
On the other hand, a good level of standardization has been achieved for the per-
ceptual evaluation of audio systems. For instance, the ITU recommendations focus
on the technical properties of the system and signal processing algorithms. Chapter 5
introduces the Basic Audio Qualities used for telecommunications and audio codecs,
10 The HRTF selection process can potentially result from a random choice [139].
commonly adopted in the evaluation of spatial audio reproduction systems. In addition, the evaluation of listening experience quality, called the Overall Listening Experience [125], is also introduced, considering not only the system's technical performance but also listeners' expectations, personality, and current state.
All these factors influence the listening of specific audio content. A related measure
can be the level of audio detail (LOAD) [39] that attempts to manage the available
computational power, the variation of spatio-temporal auditory resolution in com-
plex scenes, and the perceptual outcome expected by the listener, in a dynamically
adaptive way.
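A minimal sketch in the spirit of such adaptive allocation (the greedy heuristic and the budget figures are our assumptions, not the LOAD metric itself) spends the available rendering budget on the perceptually most important sources first and clusters the remainder:

```python
def allocate_detail(sources, budget):
    """sources: list of (name, perceptual_importance, cost); budget: available rendering cost.

    Returns (fully_rendered, clustered) source names, greedily spending the budget
    on the perceptually most important sources first.
    """
    full, clustered, spent = [], [], 0.0
    for name, importance, cost in sorted(sources, key=lambda s: -s[1]):
        if spent + cost <= budget:
            full.append(name)
            spent += cost
        else:
            clustered.append(name)  # e.g., mixed into a cheap ambience bed
    return full, clustered

scene = [("talker_front", 0.9, 3.0), ("footsteps_self", 0.8, 2.0),
         ("hvac_hum", 0.2, 1.0), ("crowd_far", 0.3, 4.0)]
print(allocate_detail(scene, budget=6.0))
```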
Chapter 2 provides an original discussion on audio "quality scaling" in VR simu-
lations, drawing the following conclusion: there is neither an unambiguous definition
nor established models for such issues. It suggests that understanding the listener-
simulation-playback relations is an open challenge, extremely relevant to SIVE. In
general, the most commonly used approach is the differential diagnosis, allowing
the qualities of VR systems to emerge from different quantitative and qualitative
measurements. Several taxonomies for audio qualities or sound spatialization have
given rise to several attribute collections, e.g., semantic analysis of expert surveys
and expert focus groups (see Chap. 5on this). It is worthwhile to mention that a sub-
stantial body of research in VR is devoted to explore the connections between VR
properties such as authenticity, immersion, sense of presence and neurophysiological
measurements, e.g., electroencephalogram, electromyography, electrocardiogram,
and behavioral measurements, e.g., reaction time, kinematic analysis.
To summarize, this differentiation tries to capture all those factors that lead to a
high level of presence: sensory plausibility, naturalness in the interactions, meaning
and relevance of the scene, etc. Moreover, the sense of presence in a VR will remain
limited if the experience is irrelevant to the listener. If the listener-environment rela-
tion is weak, the mediating action of the immersive technology might result in a
break in presence that can hardly be restored after a pause [136]. These cognitive
illusions depend, for example, on the level of hearing training, familiarity with a
stimulus/sound environment. All these aspects reinforce the term egocentric again,
grounding auditory information to a reference system that is naturally processed and
interpreted in 1PP. However, SIVE challenges go far beyond two opposing points
of view, i.e., user-centered and technology-centered. In this chapter, we offer a first
attempt at a systemic interpretation of the phenomenon.
1.3.2 Entanglement HCI
Heidegger’s phenomenology aims to overcome mind-body dualism by introducing
the notion of “Dasein” which requires an embodied mind to be in the world [67].
The concept of embodiment became central to the third wave of HCI, e.g., in rela-
tion to mobile and tangible user interfaces [64]. More recently, the bodily element
has been incorporated into the theoretical framework of somaesthetics to explain
aesthetic experiences of interaction and into design principles for bodily interac-
tion [71]. Designers are encouraged to participate with their lived, sentient, subjec-
tive, purposive bodies in the process of creating human-computer interactions, either
by improving their design skills and sensibilities, or by providing an added value
of aesthetic pleasure, lasting satisfaction, and enjoyment to users. These elements
are summarized in Chap. 7, which provides a useful distinction of perspectives for interaction design: the first-person, second-person, and third-person design perspectives. The third-person perspective corresponds to an observer approach to design, relying on common practices such as interview administration, subjective evaluations, and the analysis of data acquired from a variety of sensors. The second-person perspective corresponds to the user-centered and co-design approach, mediating between the user's perspective and the designer's attempt to step into the shoes of someone else. On the other hand, soma design principles embrace a first-person, we would argue egocentric, perspective even for designers, who are actively involved with their bodies during each step of the interaction design process of an artifact or simulation. They explicitly become actors themselves, with the result of shaping a felt and lived experience for other actors.
In the movement computing work by Loke and Robertson [87], the authors intro-
duced another perspective distinction relevant here. The mover (first-person perspec-
tive) and the observer (third-person perspective) are explicitly joined by the machine
perspective. The role of technology is pivotal for the interactions with digital move-
ment information and, in particular, for the process of attributing meaning based on
user input. This perspective requires mapping data from sensing technologies into
meaningful representations for the observer and the mover. It is worthwhile to note
that machines capture the qualities of movement with considerable losses in terms of spatial, temporal, or range resolution, making it essential to understand the impact of such limitations on interaction design. We need to explore the various perspectives, not in a
mutually exclusive way, but dynamically managing the analysis of the various points
of view in every immersive experience.
According to Verbeek [150], human-world relations are enacted through technol-
ogy. Thus, man and technology constitute themselves as actors in a fluid reconfig-
uration. A practical example in the field of music perception considers a drummer
who changes her latency perception the more she plays the musical instrument [86].
The action of playing the drum changes the relationships that she has with the instru-
ment itself, with the self, and with temporal aspects of the world, e.g., reaction times
and synchronizations.
The recent proposal of a post-phenomenological framework by Vindenes [151]
is based on Verbeek’s concept of technological mediation, which identifies several
human-technology relationships including immersion in smart environments, ambi-
ent intelligence, or persuasive technologies. In particular, for the latter case, VR
plays a central role co-participating within a mixed intentionality between humans
and technology. Accordingly, Verbeek introduced the idea of composite intentionality for cyborgs [149], a cooperation between human and technological intentionality with the aim of revealing a (virtual) reality that can only be experienced through technologies, by making technological intentionalities accessible to human intentionality. We
can argue that the world and the technology become one in the immersive simu-
lation that knows the listeners and actively interacts with them. This configuration
becomes bidirectional: humans are directed toward technology and technology is
directed toward them. Moreover, listeners have the opportunity to access reflective
relationships with themselves through VEs. For example, Osimo et al. provided an experience of the self through virtual body-swapping, i.e., embodied perspective-taking [106]. We must decentralize humans as the sole source of activity and attribute
to the material/technological world an active role in revealing new and unprecedented
relational actions.
This approach opens up new opportunities for the "reflexive intentionality" of human beings about themselves through their active relation with simulations [5]. In this respect, Verbeek [150] classifies the technological influence on humans accord-
ing to two dimensions: visibility and strength. Some mediations can be hidden but
induce strong limitations, while others can be manifest but have a weak impact on
humans. There is a deep entanglement between humans and machines to the extent
that there is no human experience that is not mediated through some kind of technol-
ogy that shapes who we are and what we do in the world. Considering immersive VR technologies, we must speculate on what constitutes a locus of agency: the understanding of the active contribution of each tool to the listener's actions in VEs. Such an infras-
tructure must be enactive and re-interpretive of each actor in each circumstance. In
other words, there is the opportunity of becoming different actors depending on an
active inter-dependence.
At this point, the relevance of recalling the work of Orlikowski [105] is twofold. First, she gave the
name of entanglement theories to those heterogeneous theories that have in common
the recognition of the active inter-dependence between socio-technological-material
configurations with the consequence of promoting studies of man and technology
in a unitary way. Secondly, Orlikowski supported her position with an experimental
example of social VR, Sun Microsystems' Project Wonderland, developed more than a decade ago and nowadays seeming more relevant than ever due to the COVID-19 pandemic. We will analyze a similar case in SIVE, supporting our taxonomy in
Sect. 1.4. In this section, we focus on entanglement theories that are foundational
for our egocentric perspective.
Entanglement is the deep connection between humans and their tools, with relevant repercussions in the field of human-computer interaction. In [45], Frauenberger provided the following interpretative key: we cannot design computers or interactions; we can only work on facilitating certain configurations that enact certain phenomena. Both configurations and phenomena are situated and fluid, but not random. They
are causally connected within hybrid networks in which human and non-human
actors interact. However, it must be made clear that these actors do not possess fixed
representations of their entities, but they exist only in their situated intra-action. This
means that their relations and configurations are dynamically defined by the so-called
agential cuts that draw the boundaries between entities during phenomena. In this
network of associations, each configuration change is equivalent to a newly enacted
phenomenon where new agential cuts are redefined or create new actors. Hence, the
term agency refers to a performative mechanism of boundary definition and consti-
tution of the self. Together with the post-phenomenological notion of technological
mediation, entangled HCI provides a lens able to interpret the increasingly fuzzy
boundaries between humans, machines, and their distribution of agency.
The sonic information from intentional active listening is anchored to an ego-
centric perspective of spatiality that allows the understanding of an acoustic scene
transformed by the listener’s actions/movements. This process can be mathematically
formalized with the active inference approach by Karl Friston and colleagues [46] and
their recent enactive interpretation [115]. Their computational framework quantitatively integrates sensation and prediction through probabilistic generative models, following the so-called free-energy principle, i.e., an optimization problem over a functional of beliefs and expectations. Following this line of thought, both philo-
sophically and mathematically, we argue that immersive audio technologies are capa-
ble of contributing to the listener’s internal representation in both spatial and seman-
tic terms, eliciting a strong sense of presence in VR [12]. Just as we cannot clearly
distinguish between listener and real environment, the more we cannot distinguish
between listener and VE.
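In its usual variational statement (a standard formulation of the principle, not one specific to SIVE), the free energy F upper-bounds the surprise of sensory observations o under a generative model p, given approximate posterior beliefs q over hidden states s:

```latex
F[q, o]
\;=\; \mathbb{E}_{q(s)}\!\big[\ln q(s) - \ln p(o, s)\big]
\;=\; D_{\mathrm{KL}}\!\big[q(s) \,\|\, p(s \mid o)\big] \;-\; \ln p(o)
\;\ge\; -\ln p(o).
```

Perception (updating q) and action (changing o) both minimize F, which is the formal counterpart of the enactive listener-VE coupling invoked above.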
Therefore, sonic interaction design in VEs is an intra-action between technology, concepts, visions, designers, and listeners that produces certain configurations and agential cuts. According to the sociological actor-network theory [28,85], the
network of associations characterizes the ways in which materials join together to
generate themselves. Prior knowledge also becomes an actor in such a network that
shapes, constrains, enables, or promotes certain activities. For example, modeling the
listener’s acoustic contribution with measurements from a dummy head induces a cut
that shapes the use cases and VR experiences. Similarly, agential cuts are performed
based on knowledge from other studies, for instance, on how auditory feedback supports the plausibility of footstep synthesis, or on the strategies employed in defining time windows for synchronous and embodied sensory integration [122]. Moreover,
the physical and design features of the technology also contribute to determining
what is feasible: e.g., the differentiation of playback systems for spatial audio results
in differentiation in the quality of the experience (see Chap. 11).
In the entanglement within the relational network of listener-reality-simulation,
configurations and actors are dynamically defined in a situated and embodied manner.
In the process of configuring and reconfiguring actors, designing various aspects, and operating agential cuts, new knowledge is produced that causally links the enactment of the technological design to the phenomenon created [45]. This knowledge takes several forms; one resides in the technological artifact itself, i.e., in the VR simulation. In a more general sense, we could argue that exploring the
evolution in the network configurations and actors enables an active search for the
egocentrically meaningful experience. In line with this, agency and its responsibilities
are not the prerogative of the listener or the technology but reside in their intra-actions.
1.3.3 Auditory Digital Twin
From entanglement theories, we inherit a series of open questions that guide our reflection on the SIVE research field. Let us consider the immersive VR simulation
as the digital artifact co-defining itself with the listener who experiences it.
How can certain transformative actions and interactions be programmed?
Who/what is the mediator, if any, in the relationship between the physical world and the VE?
How should such a mediator act?
Of particular interest here is Schultze’s interpretation of the avatar [126]: a
dynamic self-representation for the user, a form of situated presence that is variably
implemented. Sometimes the avatar is seen as a separate entity, behaving indepen-
dently of the user. Sometimes the listener inhabits the avatar, merging with it to such
an extent that they feel completely immersed and present in the virtual space. From
this variety of instances, definitions of identity (avatar vs. self), agency (technology
vs. human), and the world (physical vs. virtual) are fluid and enacted depending on the
situation. Moreover, we argue that avatars and listeners know very little about each
other. This consideration strengthens the role of individual experience in determining one tendency over the other (separation vs. union with the avatar), making predictions difficult and interpretations poorly generalizable. Consequently, the user characterization in
human-centered design is somehow included here [76]. However, our view promotes
meaningful human-technology relationships in a bidirectional manner: not only per-
sonalized user experiences, but experiences able to shape who we really want
to be.
The communication between the avatar and the listener, and between the virtual and the physical, is challenging. Considering the avatar as part of a VE configuration, we can formulate
one of the initial questions: if we can handle mediation, where/who is in charge of
that?
Our performative perspective questions the a priori and fixed distinctions of a certain representationalism between avatar and self, technological agency and listener, physical reality and virtuality. These boundaries have to be drawn in situated and
embodied action, which makes them dynamic and temporary. The exploration of how,
when, and why agential cuts define boundaries of identity, agency, and environments
is the core of our theoretical framework.
We want to give a digital form to the philosophical question of the locus of
agency: we envision a meta-environment with technological-digital nature, which
is the guardian, careful observer, and lifeblood for the dialogue and participation of
each actor. Its name is the auditory digital twin. In an egocentric perspective, it
takes shape around the listener, i.e., the natural world that is meaningful to her. Why
twin? Because this term recalls the idea of the deep connection between two different
and distant entities or persons, commonly grounded by similarities, e.g., the DNA or
a close friendship. Although the adjective auditory would seem to restrict our idea
to the sound component, the framework ecologically extends to the multisensory
domain by considering the intrinsic multisensory nature of VR. For these reasons,
we will provide an audio-first perspective, sometimes sacrificing the term auditory
in favor of a more readable and synthetic expression without loss of information, i.e.,
(auditory) digital twin.
Technical aspects of an artifact can be used to recreate a virtualized version or
digital simulation of the artifact itself in the so-called virtual prototyping process [90].
Similarly, perceptual and cognitive aspects might serve to obtain digital replicas of
biological systems, also referred to as a bio-digital twin in the field of personalized
medicine [23]. The real person/machine provides the data that gives shape to the
virtual one. In the case of humans, the process of quantified self [89] supports the
modeling of the virtual digital twin, an algorithmic assistant in decision-making.
Implications of the digital twin paradigm are already envisioned in [40]. They range
from the continuous monitoring of patient health to the management of the agency
in a potentially immortal virtual agent.
In the scientific literature, the most common definition of a digital twin is that of a digital replica. However, we would like to give a distinctive imprint to our idea of the auditory digital twin as a psycho-socio-cultural-material objectified actor-network with agential participation. As depicted in Fig. 1.4, all digitally objec-
tifiable configurations related to listener profile, VE, HW/SW technology, design,
ethical impact, etc. are made available to the digital twin so that it can actively
participate intra-acting with system states.
To understand the central role of the digital twin in SIVE, we provide some practical examples (a minimal configuration sketch follows this list):
•Links to setup configurations—Body movement tracking opens up numerous
opportunities for dynamic rendering and customization of the listener’s acoustic
contribution in harmony between the real and the virtual body, i.e., the avatar’s
body. Real-time monitoring of the motion sensors is crucial to avoid a negative
impact on responsiveness.
•Links to listener configurations—Adaptation and accommodation processes are
strongly situated in the task. Assuming the unavailability of individual HRTF mea-
surements, the best HRTF model requires a dynamic analysis of each task/context
in a mutual learning perspective between the listener and the digital twin.
•Links to environment configurations—The persuasive power of a VE to induce a behavioral change depends on social and cultural resonances within the listener. The
distribution of agency in a music-induced mood has to be analyzed with particular
attention. Again, certain immersive gaming experiences or role-playing may be beneficial for some listeners but should be avoided for others.
•Links to configurations of others—Other entities, e.g., virtual agents or avatars
guided by other listeners, populate VEs. To manage confrontation and sharing
activities, the intra-action between a larger number of digital twins must be con-
sciously encouraged.
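A purely illustrative way to objectify these interconnected configurations as data the twin can inspect (all field names and link descriptions are assumptions, not a specification from this chapter) is sketched below:

```python
# Hypothetical actor-network snapshot the auditory digital twin can inspect and reconfigure.
configurations = {
    "setup":       {"tracker": "6dof_hmd", "motion_latency_ms": 12},
    "listener":    {"hrtf": "selected_nonindividual", "adaptation_state": "recalibrating"},
    "environment": {"scene": "cocktail_party", "rt60_s": 0.4, "social_context": "collaborative"},
    "others":      ["digital_twin_listener_B"],
}

# Interconnections ("links") between configurations: no configuration is independent.
links = [
    ("setup", "listener", "tracking data drives dynamic HRTF rendering"),
    ("listener", "environment", "adaptation depends on task and room acoustics"),
    ("environment", "others", "shared scene mediates confrontation and sharing"),
]

def report(configs, relations):
    """Print each link together with the two configurations it connects."""
    for a, b, why in relations:
        print(f"{a} <-> {b}: {why} | {configs[a]} | {configs[b]}")

report(configurations, links)
```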
All these configurations are not independent but are always interconnected with
each other. Of particular relevance here, we can consider the externalization of sound
sources. The level of externalization depends on customization techniques of the
spatial audio rendering, the acoustic information of the virtual room, the sensory
coherence and synchronicity, and the familiarity with the situation [8]. A coordination action of setup, environment, and listener(s) is needed. The presence in VR experiences will be the result of all these fluid intra-connections.
Fig. 1.4 A schematic representation of the different sound elements needed to create an immersive sonic experience. Colored lines identify the differences compared to the scheme proposed in [128]. In particular, this representation focuses on the central role of the auditory digital twin as a quantifiable locus of agency in an active relationship with all actors of a VR experience. The green arrow identifies the participatory relationship between the listener and the digital twin in its performative formation of individual self-knowledge. [Figure: the auditory digital twin (a psycho-socio-cultural-material objectified actor-network with agential participation) connects system, environment, and listener configurations with action sounds, environmental sounds, sound propagation, and binaural rendering.]
Suchman posed a highly relevant question in [141]: how can we consider all these
configurations in such a way that we can act responsibly and productively with and
through them? To answer, we must deal with the participation issues for all involved
actors.
The egocentric perspective requires us to start from the listener and her experience.
The scientific literature already tells us that memory, comprehension, and human
performance benefit considerably from these VEs, especially in guided or supervised
tasks involving human or digital agents [29]. Let us focus on the series of actions
triggered by an active role of agents. In [31], Collins analyzed the player role in the
audio design of video games. The participatory nature of video games potentially
leads to the creation of additional or completely new meanings compared to those
originally intended by the creators and their storytelling. Hence, there is a change not only in the reception but also in the transmission of auditory information. The player becomes a co-transmitter of information, introducing non-
linearities in the experience that propagate throughout the agents’ chain of activity,
triggering feedback and generating further non-linearities.
In this respect, Frauenberger’s entanglement HCI (Sect. 1.3.2) suggests abandon-
ing a user-centered design of the digital artifact in favor of participatory, speculative,
and agonistic methods with the ultimate goal of obtaining meaningful relationships
and not merely optimized processes relating to the human or the machine pole, or their
interaction. It is useful to briefly recall these methods. The agonistic and adversarial
design employs processes and creates spaces to foster vigorous but polite disputes
involving designers’ participation in order to constructively identify inspiring ele-
ments of friction [36]. On the other hand, the participation in a speculative process
through designing provotypes aims to provoke a discussion about the technological
and cultural future by considering creative, political, and controversial aspects [117].
The more degrees of freedom in the network configurations, the more behaviors
can potentially be stimulated. The relational network should not be rigidly controlled, because its expressive potential can be exploited through its differentiation. In our
opinion, the current immersive audio technologies are struggling to emerge, because
they often introduce static agential cuts, justified by audio quality assessments con-
ducted in a reductionistic way. On the contrary, the main goal of the digital twin is
to favor the participation of all available configurations. Specific configurations and
agential cuts emerge in a speculative, agonistic, and provocative manner so that all
actors can benefit from different attempts following knowledge diffraction [6]. The
learning in such fluid and dynamic evolution from one configuration to another is a
continuous flow of knowledge that informs the digital twin’s activity. In other words,
the digital twin continuously proposes new agential cuts to record and analyze the
overall results. A relevant example in SIVE is the co-determination of the attentional
focus in selecting the meaningful auditory information for a digital twin facing the
cocktail-party effect [20]. The digital twin must be able to guide an active participation with the VE, considering the listener's available knowledge extracted from previously experienced and stored scenarios (and agential cuts).
The continuous intra-action within the digital twin in relation to a shared and
immersive experience is of strong practical relevance within the proposed theoreti-
cal framework. This issue offers concrete possibilities for radically changing the way
we interact socially in the future, by using digital tools equipped with computational
intelligence and artificial intelligence (AI) algorithms able to manage complex sys-
tems [107]. The decision-making phase of intelligent algorithms will improve over
time, thanks to a dynamic identification and classification of configurations and links
in the actor-network. The knowledge can be continuously extracted as a result of com-
putational intra-actions of the human-in-the-loop type where the listener can be seen
as an agent directly involved in the learning phase, step-by-step influencing cost
functions and all other measures [69]. More generally, the reinforcement learning paradigm focuses on long-term goals, defining a formal framework for the interaction between a learning agent and its environment in terms of states, actions, and rewards; hence, no explicit definition of desired behavior may be required [35]. This
process can be accomplished during exposure to a continuous stream of multimodal
information like in the case of lifelong learning [109], or via interactive annotations
and labeling [81].
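A toy sketch of such a human-in-the-loop learning loop (the action set, reward proxy, and epsilon-greedy update are illustrative assumptions, not a proposal from this chapter) could look as follows:

```python
import random

actions = ["boost_target_talker", "attenuate_background", "leave_scene_unchanged"]
q_values = {a: 0.0 for a in actions}   # value estimates for each candidate agential cut
alpha, epsilon = 0.1, 0.2              # learning rate and exploration rate

def listener_feedback(action: str) -> float:
    """Stand-in for the human-in-the-loop reward, e.g., an explicit rating or a behavioral proxy."""
    return {"boost_target_talker": 1.0, "attenuate_background": 0.6,
            "leave_scene_unchanged": 0.1}[action] + random.gauss(0, 0.1)

for step in range(200):
    a = random.choice(actions) if random.random() < epsilon else max(q_values, key=q_values.get)
    r = listener_feedback(a)                  # the listener influences the cost function step-by-step
    q_values[a] += alpha * (r - q_values[a])  # incremental value update

print(max(q_values, key=q_values.get))        # the cut the twin has learned to prefer
```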
1.4 A Taxonomy for SIVE
An important contribution to the design in VEs comes from practice, e.g., professional
reports and testimonials, best practices, or reviews and interpretations of lessons
learned in the industry (see Chap. 6 and [76]). Taking into account all these inputs,
academic studies, new technologies, and commercial user feedback, different com-
munities draw support for their specific users and domains of interest. Within the
SIVE field, there is still much work to be done. There is a lack of recommendations
and design analysis on creating interfaces, interactions, and environments that fully
exploit egocentric sonic information. To unlock such potential, our suggestion is to
start from multi- and interdisciplinary work resulting in these foundational questions: does a development path exist for the SIVE field? Is an ad-hoc theoretical approach necessary? Without going into the details of the epistemological crisis affecting the HCI field, we will try to avoid discussions on what the HCI community calls intermediate knowledge [72], where positivist and constructivist perspectives constantly clash [45]. Examples of intermediate knowledge are
all patterns/best practices proposed for certain aspects of the immersive experience.
There exist several classifications attempting to describe virtual spaces for sound
and music purposes. The recent formulation in [4] distinguished three aspects:
•Immersive audio—the VE should provide the feeling of being surrounded by a
world of sounds.
•Interactive audio—the VE allows the user to influence the virtual world in some
meaningful way.
•Virtual audio—the virtual world must be dynamically simulated.
These aspects have already been extensively discussed in the previous sections. Many of the existing taxonomies for VR [95,134,157] prioritize the system (or simulation) or the user, not their close relationship with the listener. In this section, we propose an
audio-centered taxonomy that does not distinguish between user and system, lis-
tener and simulation. Our theoretical framework uses an egocentric audio perspective
by emphasizing the situated, embodied, enactive dimensions of the listener’s experi-
ences with their different actors involved. An emphasis on the entanglement between
humans and technology assumes that the listener’s internal states are directly inac-
cessible to a non-intrusive and external technology, i.e., focused on exteroceptive
sense [134]. Accordingly, we will motivate the selection of three dimensions able
to describe a technological mediation in VR: immersion,coherence, and entan-
1 Sonic Interactions in Virtual Environments 27
glement. The qualitative description in this section leaves as a future challenge a
quantification of the performative processes introduced here.
Referring to the autobiographical element introduced in the book preface, the first
meeting of the two chapter authors at the ACM CHItaly 2011, the biennial confer-
ence of the Italian HCI community, also has a scientific meaning for the proposed taxonomy. The paper by Geronazzo et al. [50] was presented more than 10 years
ago, as one of the first tasks of the first author’s doctoral program. He attempted to
adapt the virtuality continuum of Milgram and Kishino [95] in the context of spatial
audio personalization technologies for VR/AR. His main motivation was to over-
come his difficulty in fitting the strong acoustic relationship (i.e., HRTF customiza-
tion) between listener and technology into a taxonomy created for visual displays in
1994.
That paper proposes a characterization that uses a simplified two-dimensional
parameter space defined in terms of the degree of immersion (DI) and coordinate
system deviation (CSD) from the physical world. It is a simplification of Milgram's three-dimensional space, summarized in the following:
•Extent of World Knowledge (EWK): knowledge held by the system about virtual
and physical worlds.
•Reproduction Fidelity (RF)—virtual object rendering: quality of the stimuli pre-
sented by the system, in terms of multimodal congruency with their real counter-
part.
•Extent of Presence Metaphor (EPM)—subject sensations: this dimension takes
into account the observer’s sense of presence.
CSD matches EWK with the distinction that a low CSD means a high EWK: the
system knows everything about the material world and can render the synthetic
environment in a unified mixed world. From an ecological perspective, the system
knows and dynamically fosters the overlap between real and virtual. On the other
hand, EPM and RF are not entirely orthogonal and the definition of DI follows this
idea: when a listener is surrounded by a real sound, all his/her body interacts with the
acoustic waves propagating in the environment, i.e., a technology with high presence
can monitor the whole listener’s embodiment and actions (high DI).
Recently Skarbez et al. [134] have proposed a revised version of Milgram’s vir-
tuality continuum, introducing two distinctive elements. First, the consideration of only two of Milgram's three dimensions, similarly to [50]: Immersion and Extent of World Knowledge. In particular, Immersion is exactly based on the same
idea as DI. Second, they introduced a discontinuity in the RF and EPM dimensions
considering the absence of any display at the left side of the spectrum: the physical
world without mediation is inherently different from the highest level of realism
achievable through VR technologies that stimulate exteroceptive senses (i.e., sight,
hearing, touch, smell, and taste). The latter consideration propagates to Immersion.
The rough taxonomy of Geronazzo et al. missed the idea of coherence between simulation and human behavior, which is well identified as the third analytical dimension of Skarbez et al. [134]: coherence. It takes into account both plausibility and
expectation of technological behaviors for the user in cognitive, social, and cultural
terms. However, the three proposed dimensions cannot and do not claim to describe
such a relationship between the user and the system as emphasized by the authors in
their system-centered taxonomy. The work of Skarbez and colleagues is once again
anchored to the distinction between user and system which generates several issues
in framing the intra-actions of actors/factors in VR/AR sonic experiences.
To support the SIVE theoretical framework, we focus on pure VR only. This
means that our discussion will not consider the CSD/EWK dimension assuming that
there are no anchors to the physical world. However, since we are emphasizing the
influence of human-real-world relationships on experience in VE and vice versa, we
have decided not to make the world configurations explicit thus considering them
as a whole with the listener. Extensions to mixed reality will be the object of future studies in a revised version of our theoretical framework.
Starting from the previously identified dimensions of Immersion [50] and Coherence [134], we suggest three top-level categories that need to be addressed through
interdisciplinary design work. A schematic representation can be found in Fig. 1.5.
Immersion: the digital information related to the listener-digital twin relationship
supporting an increasing number of actions in VEs. It measures the technological
level and its enactive potential between listener and auditory digital twin.
Coherence: the digital information related to the digital-twin-VE relationship
that allows the plausible rendering of an increasing number of behaviors in VEs. It
measures the effectiveness of sonic interaction design in VEs.
Entanglement: represents the overall effectiveness of the actor-network and its
agential cuts that are dynamically, individually, and adaptively created. It measures
participation in the locus of agency and its consequent phenomenological description.
Fig. 1.5 Three-dimensional taxonomy for SIVE: (a) Immersion, (b) Coherence, (c) Entanglement, and their relations in (d). [Figure: each axis spans from the absence of virtual environments or agential cuts up to, respectively, support for every listener's action (Immersion), plausibility of every environment (Coherence), and "perfect" agential cuts in every configuration (Entanglement); intermediate levels correspond to limited actions, virtual behaviors that do not match expectations, and agential cuts that are inadequate for the experience.]
The auditory digital twin actively proposes new relations favoring redefinitions in the
agential cuts, i.e., the mutual transformative actions between listener and technology.
To support our proposed taxonomy for SIVE, we introduce a case study on a fictitious and purely theoretical artifact along the lines of Flow [45]. It allows us to articulate the various facets of the framework in a flexible example.
Spritz! is an interactive and immersive VR simulation supported by full-body
tracking, stereoscopic vision, and headphone auralization. It is designed to address the
cocktail-party effect. Human selective attention requires different contributions and levels of perception to support the ability to segregate signals, also referred to as auditory signal analysis [15,20]. When confronted with multiple simultaneous
stimuli (speech or non-linguistic stimuli), it is necessary to segregate relevant auditory
information from concurrent background sounds and to focus the attention on the
source of interest. This action is related to the principles of auditory scene analysis that
require a stream of auditory information filtered and grouped into many perceptually
distinct and coherent auditory objects. In multi-talker situations, auditory object
formation and selection together with attentional allocation contribute to defining a
model of cocktail-party listening [75,132]. The design of Spritz! aims to give shape
to an auditory digital twin able to detect listener intent, i.e., identify the relevance
of a sound compared to other overlapping events. It can instantaneously determine
the attentional balance within an auditory space. Its main goal is to promote the
listener’s well-being through manipulations of the sound scene in a participatory
way respecting the listener’s desires.
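As a purely illustrative aid, and not a specification of Spritz!, the following minimal sketch shows one way an instantaneous attentional balance over concurrent sources could be represented: relevance scores combining gaze alignment and level are normalized with a softmax. The relevance heuristic, the sharpness parameter beta, and all names are our own assumptions.

```python
import numpy as np

def attentional_balance(source_dirs, gaze_dir, levels_db, beta=4.0):
    """Hypothetical attentional weights over concurrent sound sources.

    source_dirs : (N, 3) unit vectors, egocentric direction of each source
    gaze_dir    : (3,) unit vector, current head/gaze direction
    levels_db   : (N,) broadband level of each source in dB SPL
    beta        : sharpness of the attentional focus (assumed parameter)
    """
    # Relevance grows when a source is aligned with the gaze and is louder.
    alignment = source_dirs @ gaze_dir                # cosine similarity in [-1, 1]
    loudness = (levels_db - levels_db.mean()) / 10.0  # crude normalization
    relevance = alignment + 0.5 * loudness
    # A softmax turns relevance into a normalized "attentional balance".
    w = np.exp(beta * relevance)
    return w / w.sum()

# Example: three talkers around the listener, gaze toward the first one.
dirs = np.array([[1, 0, 0], [0, 1, 0], [-1, 0, 0]], dtype=float)
print(attentional_balance(dirs, np.array([1.0, 0, 0]), np.array([65.0, 62.0, 70.0])))
```

In a full system, such weights would of course be driven by the listener-intent detection discussed above rather than by this crude heuristic.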
1.4.1 Immersion
According to Murray [98], the term immersion comes from the physical experi-
ence of being immersed in water. In a psychologically immersive experience, one
aims at experiencing the feeling of being surrounded by a medium that is a reality
other than the physical one, able to capture our attention and all our senses. There-
fore, it has an important element of continuity with our framework by identifying a
mediating action of VR experiences. According to Slater and Wilbur [137], the term
immersion is tightly linked to the technology, the mediator, to elicit the sense of
presence. Technological systems for immersive VR comprise several combinations of
equipment and techniques, such as HMDs, multimodal feedback, high frame rates,
and large tracking areas. Such a heterogeneous arsenal is a complex system of functional
elements that have an immediate impact on the listener's experience. Initially,
technical specifications were reasonably identified as the main constraints for a VR
experience. However, other elements came into consideration with the large-scale diffusion
of VR technologies. The design of VEs became critical in those details that ensure
a plurality of actions with virtual objects, the surrounding virtual world, and their
representations. As discussed in [30], the effects of all these components are highly
interconnected with each other. Moreover, the absence or misuse of any of them can
produce immediate disruptions in the sense of presence or cybersickness [33], such
as low headset quality [16] or unfiltered noise caused by sound sources external to
the VR setup [136].
The strong connection between immersion and equipment means that different
VR solutions hold an intrinsic level of immersion regardless of the actual applications
performed with them [120]. This is evident when considering basic audio quality vs.
quality of the listening experience. For instance, projected screens offer
designers of VEs the opportunity to combine real and virtual elements in the tracked
area (Chap. 13 offers an interesting reflection on artistic performances mediated
by VR/AR technologies). However, the overall sense of presence experienced by
the listener depends on the specific combination of the HW/SW setup. Such setups
support a certain type of action within the VE. The Immersion “I” dimension takes
into account these features as the starting point of an enactive potential for the auditory
digital twin. Such a potential intrinsically limits the development and creation of new
actions.
Furthermore, the enactive egocentric perspective of Sect. 1.3.1 provides a solid
theoretical framework for considering the importance of ecologically valid auditory
information in eliciting a sense of presence in a VR-mediated experience. First of all,
it should be mentioned that there is a lack of research related to the effects of inter-
active sound on the sense of body ownership and agency (refer to the discussion in
Chap. 2). The vast majority of studies addressing presence from an auditory perspec-
tive focus on place illusion and spatial attributes. This should not come as a surprise,
since many of these binaural attributes are perceived by applying sensory-motor
contingencies and embodied multisensory integrations. A simple example in spatial
audio technologies is the importance of head-movement data acquired by
three degree-of-freedom head-trackers, allowing listeners to exploit binaural cues for
resolving the so-called front-back confusion [22]. However, computational models
for binaural cues are usually parameterized by the head radius or circumference, or
the ear positions [52]. This example suggests that synchronization and plausible interac-
tive variations, i.e., occurring in reaction to the digital twin’s gestures in coherence
with sensorimotor contingencies, can positively influence the sense of agency. In
addition, other studies demonstrate how the sound of action and an active explo-
ration can support haptic sensations and vice versa in a co-located and simultaneous
manner. For instance, Chap. 12 analyzes the impact of sound in an audio-tactile
identification of everyday materials from a bouncing ball.
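To make the role of such anthropometric parameters concrete, the following minimal sketch computes interaural time differences with the classic spherical-head (Woodworth) approximation, parameterized only by the head radius; it is a textbook simplification for the frontal horizontal plane, not the personalization approach of [52].

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, at roughly 20 °C

def woodworth_itd(azimuth_deg, head_radius=0.0875):
    """Interaural time difference (s) for a spherical head of given radius.

    Classic Woodworth approximation, valid for source azimuths in the
    frontal horizontal plane (-90°..+90°, 0° = straight ahead).
    """
    theta = np.radians(azimuth_deg)
    return (head_radius / SPEED_OF_SOUND) * (theta + np.sin(theta))

# A larger head radius stretches the ITD range the binaural renderer must cover.
for r in (0.07, 0.0875, 0.10):
    print(f"radius {r * 100:.1f} cm -> max ITD {woodworth_itd(90.0, r) * 1e6:.0f} µs")
```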
Regarding spatial hearing, there is a huge differentiation in accuracy between
more reliable (experienced) and less reliable (naive) listeners [3,51]. More generally, the
distinction between categories of listeners is still challenging and is made based
on several factors such as multisensory calibration and integration (see Chap. 12
for audio-haptics), familiarity with immersive/spatial audio technologies, musical
background [152], or audio mixing experience (as in Chap. 9), etc. Both acoustic
(i.e., acoustic transformations of the body) and non-acoustic (i.e., everything else)
factors are highly individual and depend on the relationship between the listener and
the real world, which is mediated by technology in a general sense (not only in the
digital domain, e.g., games, musical instruments, etc.).
All objectifiable information regarding the listener constitutes known configurations.
For example, bottom-up approaches for modeling psychophysical phenomena of
spatial hearing and multisensory integration fall into this category. Such knowledge
has to be integrated into the immersive system, explicitly contributing to the actor-
network managed by the digital twin.
Coming back to our Spritz! simulation, the level of I is expected to be high due
to state-of-the-art technological components. The digital twin can recognize and
manage several full-body skeletal configurations as well as near-field acoustics algo-
rithms that take into account the acoustic coupling of the main joints such as the head
and shoulders. This last aspect is usually largely underestimated in virtual acoustics
systems [17]. The customization based on anthropometry allows the digital twin to
guide the acoustic rendering of movements considering head tilt and torso shadow-
ing in real-time. Furthermore, binaural and spectral cues might be personalized and
weighted according to the listener’s level of uncertainty, allowing the digital twin to
predict which sound sources are most likely to be segregated based on an egocentric
direction-of-arrival perspective.
The contribution of the I dimension can be summarized as follows: I is the digital
information related to the listener-digital twin relationship limited by a specific
technological setup. The support of an increasing number of actions in VEs is a
consequence of technological improvements (both HW/SW) and/or an increasing
objectification of the listener's configurations. Considering the idea of immersive
potential of Chap. 11, limitations in enaction determine which changes are significant
after technological manipulation. The level of reconfigurability within the digital
twin accounts for the constant dialogue with the listener to explore her state and
tendency toward immersion at every moment of the experience (see also Sect. 1.4.3).
1.4.2 Coherence
The VR simulation must be able to make the digital twin freely interact with the
VE, eliciting a plausible experience for the listener who is always aware of the
mediated nature of the experience. In other words, the interaction design must support
functional and plausible actions, the 'doing' in [43]. This means that possible
configurations of the technical setup and the listener (the objectification in the digital-
twin, see Sect. 1.4.1) constitute the enactive potential of immersion and must be
balanced within the sonic interactions.
In this section, we focus on the coherence of the digital-twin/environment rela-
tionship. On the other hand, Sect. 1.4.3 provides an interpretation of the dialogue
with the immersion dimension.
VE simulations can create fictional worlds, exploiting opportunities for both nat-
uralistic and magical interactions [13]. Designers can experiment with defining rules
that only apply in the virtual domain, such as scale, perspective, and time. The philo-
sophical discussion of the dualism “Doing vs. Being” in [4] provides interesting
insights into our egocentric auditory perspective: simulation can have different lev-
els of interactivity suggesting different action spaces for the digital twin in the virtual
worlds.
Interacting with the VE, avatar included, consists of altering the states of 3D
elements that have been created at different levels of proximity: the virtual body
(i.e., avatar), the foreground (i.e., peripersonal object manipulation space), and
the background (i.e., extra-personal virtual world space). Existing research on 3D
interaction focuses on the spatial aspects of the following main categories: selec-
tion, manipulation, navigation, and application control (the latter involving menus
and other VE configuration widgets). Selection techniques allow users to indicate an
object or a group of objects. According to the classification of Bowman et al. [14],
one can consider selection techniques based on object indication (occlusion, object
touch, pointing, indirect selection), activation method (event, gesture, voice com-
mand), and feedback type (text, acoustic, visual, force/tactile). Manipulation tech-
niques allow the digital twin to modify all virtual object configurations that are
made accessible to it: e.g., the spatial transformation of objects, i.e., roto-translation
and scaling, surface properties such as material texture and acoustic properties, or
3D shape and structure manipulations. For the variety of interaction metaphors for
selection, we refer to a recent review in [92]. Finally, navigation techniques allow
digital twins to move within the VE to explore areas and virtual worlds. Typical
movements include walking and virtual transportation, including flight experiences.
In particular, walking is fundamental to humans, and supporting natural locomo-
tion is not always feasible in a limited tracked space. Accordingly, there are other
interaction metaphors such as walk-in-place [42], teleportation, or semi-automatic
movements between control points [61]. It is worthwhile to mention self-motion
illusions: in circular vection [116], moving sounds surrounding the listeners facilitate
the perception of being in motion when in fact they are not. For spatial design
considerations in sonic interactions, Chap. 6 provides a comprehensive analysis and
a typology of VR interactive audio systems.
These configurations must be plausible and the digital twin should support a
dynamic transition from one to another. This is crucial to avoid irreparable breaks
in presence. Therefore, coherence “C” describes the degrees of freedom introduced
by the sonic interaction design in VEs based on the active dialogue between the
digital-twin and the VE, established experience after experience.
In this section, we are particularly interested in the plausibility illusion determined
by the overall credibility of a VE concerning subjective expectations. It is not only a
coherence between external events not directly caused by listeners but an objective
feature of the VE [134]. Its reconfigurability includes an internal logical coherence
and a behavioral consistency considering prior knowledge. Sound conveys eco-
logical information relevant to the expectation toward VE behaviors compared to
the listener’s everyday experience: embodied, and situated in a socio-cultural con-
text. The environment configurations (avatars and virtual worlds) intertwine with the
known listener configurations held in the digital twin. Once again, the digital twin
has a central and active role following an egocentric audio perspective (see Fig. 1.4
for this foundational idea). Dimension C advocates a top-down approach to interactions,
constituted of cognitive and socio-cultural influences based on the listener's real
life.
Moreover, coherence does not presuppose physical realism. It fosters interactions
in coherent virtual magic worlds. The dynamic dialogue between VE and digital
twin makes this possible. For example, let's consider a cartoon world where simplified
descriptions of sound phenomena exaggerate certain features [118]. It may be plausible
as long as it conveys relevant ecological information. Audio procedural models
are based on simplifications in the properties and behavior of the corresponding
real object, i.e., simplified configurations. Such parameterization can be informed by
auditory perception and cognition, maintaining the ecological validity of a fictional sonic
world while reinforcing the listener’s sense of agency. Digital information regarding
the relationship between the digital twin and VE allows the creation of an increasing
number of plausible behaviors in VR.
Considering once again the distinction among avatar, peri- and extra-personal
spaces, neurophysiological research on body ownership and multisensory integration
suggests the existence of a fluid boundary in the space perceived by subjects [60].
It is worth noting that the neuronal activity sensitive to the appearance of stimuli
within the personal space is multisensory in nature and involves neurons located in
the frontoparietal area. In this area, neuronal activity is related to action preplanning
particularly for reacting to potential threats [130] and elicits defensive movements
when stimulated [32]; these multimodal neurons combine somatosensory with body
position information [58]. Bufacchi and Iannetti [24] suggested that the personal
space should be described as a series of action fields that spatially and dynamically
define possible responses and create contact-prediction functions with objects. Such
fields may vary in location and size, depending on the body interaction within the
environment and its actual and predicted location. Space is also modulated in response
to external stimuli and internal states of the subject, defining a relationship between
listener, environment, and tools [119].
Of particular interest for our framework are modulations due to proxemics.
The term was introduced by Hall [63] and concerns implicit social rules of interpersonal
distance among people that convey different social meanings. Cooperation
in a socially shared interpersonal space [144] requires supporting the transition from
individual to collaborative spaces [142]. In Chap. 8, the design of sound intensity (or
sound attenuation) as a function of the proximity to a sound source is addressed.
Different configurations of personal and public spaces were tested in a shared VE
for collaborative music composition. Interestingly, rigid boundaries in the transition
between spaces forced listeners to keep a social distance and isolate themselves, with a negative
impact on the collaborative aspects of the composition process. Therefore, the
separation between public and personal space should be fluid rather than rigid. The
VE should be configurable in the social aspects that emerge from the strong inter-
connection between configurations made available to the digital twin, increasing the
fluidity and better supporting collaboration in shared experiences.
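A minimal sketch of such a fluid separation is given below, assuming a hypothetical personal radius, transition width, and inverse-distance rolloff; it only illustrates the idea of replacing a rigid boundary with a smooth crossfade and is not the design evaluated in Chap. 8.

```python
import numpy as np

def smooth_gain(distance, personal_radius=1.2, transition_width=0.8, rolloff=1.0):
    """Source gain as a function of listener-source distance (meters).

    Inside the personal radius the source is heard at full level; beyond it,
    an inverse-distance rolloff applies. A smoothstep over `transition_width`
    replaces a rigid boundary, so the personal/public transition is fluid.
    All radii and the rolloff exponent are hypothetical design parameters.
    """
    d = np.asarray(distance, dtype=float)
    # 0 inside the personal space, 1 well into the public space.
    t = np.clip((d - personal_radius) / transition_width, 0.0, 1.0)
    blend = t * t * (3.0 - 2.0 * t)  # smoothstep
    far_gain = (personal_radius / np.maximum(d, personal_radius)) ** rolloff
    return (1.0 - blend) + blend * far_gain

print(smooth_gain([0.5, 1.2, 1.6, 2.0, 4.0]))
```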
In Spritz!, we should identify the VE's abilities in shaping the simulation within the
digital twin. First, Spritz! has multiple configurations accounting for different strategies
for the level of audio detail. The radial distance with an egocentric reference can
drive the dynamic definition of three partially overlapping levels of detail associated
with proximity profiles: avatar, personal and public. The avatar’s movement sounds
are rendered through procedural approaches with individualized configurations based
on listener acoustics; in the personal space, Spritz! can manipulate sound behavior
with simplified models taking into account security and privacy levels required by
the situated and embodied states of the digital twin. Finally, sounds in the public
space can be clustered, grouped, or attenuated by implementing plausible statistical
behavior, e.g., using audio impostor replacement such as audio samples.
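The sketch below illustrates, under assumed radii and strategy names, how such proximity profiles could map an egocentric radial distance to an audio level-of-detail strategy; the partial overlap between profiles (e.g., crossfading near the boundaries) is omitted for brevity.

```python
from dataclasses import dataclass
from enum import Enum, auto

class RenderStrategy(Enum):
    PROCEDURAL_INDIVIDUALIZED = auto()  # avatar sounds, listener-specific acoustics
    SIMPLIFIED_MODEL = auto()           # personal space, simplified physical models
    CLUSTERED_IMPOSTOR = auto()         # public space, clustered/attenuated samples

@dataclass
class ProximityProfiles:
    avatar_radius: float = 0.4     # hypothetical radii in meters
    personal_radius: float = 1.5

    def strategy_for(self, distance: float) -> RenderStrategy:
        """Pick an audio level of detail from the egocentric radial distance."""
        if distance <= self.avatar_radius:
            return RenderStrategy.PROCEDURAL_INDIVIDUALIZED
        if distance <= self.personal_radius:
            return RenderStrategy.SIMPLIFIED_MODEL
        return RenderStrategy.CLUSTERED_IMPOSTOR

profiles = ProximityProfiles()
for d in (0.2, 1.0, 6.0):
    print(d, profiles.strategy_for(d).name)
```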
The Spritz! environment should facilitate resolution of the cocktail-party problem
in crowded situations. Accordingly, it should be able to apply noise suppression
to negligible information in the public space or, vice versa, to apply audio enhancements
supporting attentive focus. This dynamic connection between VE and digital
twin should be able to maintain coherence in the induced behaviors, supporting the
plausibility of actions while bending the space around the listener.
A meaningful manipulation of virtual spaces is crucial and creative. Since SIVE
naturally includes research in music composition, a VE must foster the development
of individual or collaborative creative ideas through dynamic control of its
configurations within and by the digital twin. In particular, the results in Chap. 8 support
VE spatial design as the creation of "magical" exploratory opportunities, adding
original dynamics to collaborative work in VEs. The digital twin has a pivotal role
in such space modulations that allow tracing boundaries performatively and eliciting
internal emotional states following the listener/composer’s expectations.
1.4.3 Entanglement
The listener’s susceptibility to immersive VE experiences is usually determined by
administering questionnaires [155,156]. The experimenters’ aim is usually to per-
form a screening test to distinguish who can and will be able to easily immerse in a
VR-mediated situation. Furthermore, this separation is assumed to remain constant
throughout a short-enough experiment. However, the immersive tendency can change
over time due to training, learning, experience, mood changes and personality, etc.
(see Chap. 11 for further details). For such reasons, common recommendations for
VR experiments suggest conducting single experimental sessions. However, studying
the impact of the aforementioned dynamic changes opens up the third and last
dimension of our taxonomy: entanglement, which is the knowledge extraction from
the evolution of an actor-network able to reveal multiple facets of the egocentric
experience in time, space, and intra-actions.
The first step requires describing the available configurations. Starting from the
idea of immersive tendency, VR simulation would benefit from the knowledge of
the listener’s susceptibility toward configurations of setup and environment to mod-
ify or avoid non-significant experiences, e.g., those causing a break in the illusion. In other
words, (quantifiable) listener configurations must be defined, discovered, and actively
explored by the digital twin. For example, the way sound samples are engineered
is very interesting here. A sliding friction sample, e.g., squeaking, rubbing, etc.,
requires a large amount of data and randomization techniques to avoid repetition.
Sounds should be consistent with the listener's expectations in response to complex
and continuous motor actions. For this reason, procedural audio approaches can
tightly connect the sound to such actions.
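As a hedged illustration of this kind of sample engineering, the sketch below draws sliding-friction takes from a small pool while avoiding immediate repetition and applying light pitch and gain jitter; the pool contents and jitter ranges are hypothetical, and a procedural model would instead derive such parameters directly from the tracked motor action.

```python
import random

class FrictionSamplePool:
    """Round-robin sample selection with light randomization to mask repetition.

    `samples` is a list of pre-recorded sliding-friction takes; the pitch and
    gain jitter ranges are hypothetical design values.
    """
    def __init__(self, samples, pitch_jitter=0.06, gain_jitter_db=2.0):
        self.samples = list(samples)
        self.pitch_jitter = pitch_jitter
        self.gain_jitter_db = gain_jitter_db
        self.history = []

    def next_event(self):
        # Avoid the most recently used take to reduce audible repetition.
        candidates = [s for s in self.samples if s not in self.history[-1:]]
        choice = random.choice(candidates)
        self.history.append(choice)
        return {
            "sample": choice,
            "pitch_ratio": 1.0 + random.uniform(-self.pitch_jitter, self.pitch_jitter),
            "gain_db": random.uniform(-self.gain_jitter_db, self.gain_jitter_db),
        }

pool = FrictionSamplePool(["squeak_a.wav", "squeak_b.wav", "rub_a.wav"])
print(pool.next_event())
```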
The Entanglement dimension ("E") aims to provide a phenomenological characterization
of actors' evolution and activities based on their performativity and participation
in a locus of agency. We realize the high complexity of such a descriptive
and formal process, but we believe that an attempt at capturing the transformative
potential of VR in mediated experiences is worthwhile for the SIVE
discipline. Of great importance here is the idea of the monad by sociologists Tarde and
Latour [84,143]: "A monad is not a part of a whole, but a point of view on all the
entities taken severally and not as a totality". One can consider a monad as a rela-
tional perspective of each actor, shifting the emphasis from aggregation of the whole
to movement between different points of view. The main purpose of any perspective
is the structural analysis of the network and its configurations and, at a later stage,
to derive knowledge and understanding of its dynamics. The inherently egocentric
local perspective of the locus of agency, i.e., digital twin, is again emphasized as
opposed to a global view. Egocentric networks built around specific nodes such as
the listener configurations can support the exploration of intra-activated dynamics.
Configurations and links can be discovered and/or modified during different mediated
experiences.
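A toy illustration of such an egocentric network, built with the networkx library around a hypothetical listener-configuration node, is sketched below; the node names and the one-hop radius are arbitrary choices for demonstration only.

```python
import networkx as nx

# Toy actor-network: listener configurations, setup elements, and VE objects.
G = nx.Graph()
G.add_edges_from([
    ("listener:head_radius", "setup:hrtf_model"),
    ("setup:hrtf_model", "ve:talker_1"),
    ("setup:hrtf_model", "ve:talker_2"),
    ("listener:attention", "ve:talker_1"),
    ("ve:talker_2", "ve:background_music"),
])

# Egocentric view built around a listener configuration (radius = 1 hop).
ego = nx.ego_graph(G, "listener:attention", radius=1)
print(sorted(ego.nodes()))
```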
The collaboration among actors is vital in integrating different points of view,
creating opportunities for meaningful experiences. In shared VEs (Chap. 8), listen-
ers are co-present with other human participants interacting in an interpersonal way.
Research interests of computer-supported cooperative work can provide interesting
insights into prioritizing collaboration [99]. The choice of collaborative models
fostering the design of active VEs for meaningful and creative experiences is of
particular relevance to entangled SIVE.
Intentionality and gesture support can be achieved through continuous network
reconfiguration. Identifying common goals through inter-actor communication is a
fundamental requirement to increase the digital twin's enactive potential. We argue
that this area of research is absolutely new for SIVE, especially in its collaborative
aspects. Many fundamental and critical questions for SIVE are waiting to be
answered.
Digital transformation promotes ubiquitous and pervasive interconnected data
sets with the opportunity to offer new ways of navigating and extracting knowledge.
Dörk et al. [37] explored the visualization of relational information spaces, incor-
porating both the individual and the whole in a monadic perspective. The authors’
goal was to exploit the rich semantic connections to design new exploration meth-
ods for interconnected elements. There is an increasing interest in more exploratory
forms of information retrieval without specific needs/constraints, sustained by the
desire to learn, play and discover openly [91]. In analogy with these practices, the
digital twin should curiously move between nodes, configurations, and connections,
experimenting with and manipulating the actor-network for sense-making. To encourage
surprising discoveries and interest within experiences, the digital twin should offer
unconventional and appealing views with agency.
The auditory digital twin actively proposes new relationships and encourages
agential cuts under the mutual transformative action between listener and technol-
ogy. In the monadic perspective of the digital twin, the distinctive qualities of each
actor within a VE should emerge in each situated experience. Differentiation among
configurations is not an a priori actor property but it is identified by its uniqueness
in the network. Each actor imprints its particular identity on an ever-changing rela-
tional world. In other words, the digital twin is looking for differences in each actor
by considering different monadic perspectives. VR simulations allow us to take the
point of view of each element thanks to a shared virtual world knowledge.
In the area of AI agents, i.e., non-human entities capable of interacting with ecological
behaviors [109], intelligent algorithms would have predictive potential
over the listener's action program. Their ability to monitor and predict listeners'
behavioral responses could enable the digital twin to determine listeners' expectations
and cognitive and psychological capabilities [25]. Moreover, AI algorithms
could propose exploration paths to the listener within VEs. Therefore, the capabilities
of safely navigating through temporary, transient, and overlapping configurations are
definitely complementary to their predictive power.
In line with the emerging research area called immersive analytics, humans and
AI can support each other in decision-making based on navigation in shared thinking
spaces [133]. Meetings between the listener and her digital twin can take place in
a virtual meta-environment where the configurations and connections of an experience
can be analyzed a posteriori, collaboratively.
AI algorithms implemented in the digital twin could reflect the listener’s traits and
interests. Understanding the listener’s preferences and assessing their impact on the
predictive performance of AI algorithms can help to propose adaptive and customiz-
able systems with a certain level of memory of past VR-mediated experiences [103].
Finally, how can we measure the overall effectiveness of an actor-network and its
agential cuts that the digital twin dynamically, individually, and adaptively creates?
This question corresponds to Latour et al.’s challenge to take into account long-term
features, indicative of a systemic order that might be learned by navigating overlapping
perspectives (monads) [84]. Such an emphasis on navigation gives a unique role
to movement/exploration as a way of experiencing relationships and differences
between configurations. Therefore, we suggest that the digital twin should navigate
along with different and novel perspectives for sense-making. The dynamic relational
quality of each actor’s unique position in network space, i.e., agential cuts, reflects
the exploration potential shaping and creating meaning for the listener.
We argue that the VR-mediated experience is never solitary, considering both
human and non-human actors. Any actor cooperates within a shared VE, e.g., to
give a musical performance (Chap. 13) or to perform spatialized audio mixing (Chap. 9).
Collaboration takes place on a common task, which has a huge impact on the intra-
action dynamics. In addition to the exploratory movements, technological trans-
parency introduced in Sect. 1.3 is a key factor influencing “E” measures. In analogy
with the sense of presence, co-presence [26], i.e., the feeling of sharing a VE with
others, has been shown to strongly depend on avatar appearance and its realism, as
well as on the cooperation level in task completion [111]. Another aspect worth mentioning
here is awareness [7], i.e., the understanding of other actors' actions,
especially with non-human agents. This latter concept strongly relates to trustworthy
AI issues and explainable AI [70].
A further "E" measure in SIVE can be inspired by River and MacTavish's
framework [117]. They proposed to generate low-level prototypes of an artifact from
simplified attributes. The more extreme the change in such attributes, the more likely
the change will be to provoke and reveal hidden assumptions in the design process. In
our taxonomy, we call this the generative potential of explorative movements, network
changes, and technological transparency.
The final example in our fictitious case study Spritz! considers the meaningful
prediction of the listener's intentionality and the understanding of any sources of
interest, e.g., the avatar's gestures or other avatars' actions. Spritz! should be able to
support attentional focus. A virtual ray/cone pointer projected by the avatar through
the VE or a virtual cursor/hand mapped to the listener's body movements might
facilitate the selection of points of interest. Gesture analysis could provide Spritz!
with relevant information for semi-automatic focus support. This scenario opens up the
experimentation and development of “magic” interactions of virtual superhuman
hearing tools such as a dual audio beamformer guided by the avatar’s body [52].
Spritz! should be free to propose novel ways of interaction and exploration within
VEs. This dynamic dialogue can be considered a form of virtual provotyping that
has to guarantee coherence with all available sensorimotor contingencies, having a
positive effect on the listener’s sense of agency in any proposed behavior.
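A minimal sketch of such a cone-based attentional focus is given below: sources whose direction lies within an assumed angular aperture of the pointing direction receive a gain boost, while the others are attenuated. The aperture and gain values are hypothetical design parameters, not those of the superhuman hearing prototype cited above.

```python
import numpy as np

def cone_select(source_dirs, pointing_dir, aperture_deg=20.0, boost_db=6.0, cut_db=-6.0):
    """Per-source gain offsets (dB) for a cone-pointer attentional focus.

    Sources whose direction lies within `aperture_deg` of the pointing
    direction are boosted; the rest are attenuated. All values are
    hypothetical design parameters.
    """
    dirs = np.asarray(source_dirs, dtype=float)
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    p = np.asarray(pointing_dir, dtype=float)
    p = p / np.linalg.norm(p)
    angles = np.degrees(np.arccos(np.clip(dirs @ p, -1.0, 1.0)))
    return np.where(angles <= aperture_deg, boost_db, cut_db)

# Two sources near the pointing direction are boosted, the one behind is cut.
sources = [[1, 0, 0], [0.9, 0.3, 0], [-1, 0, 0]]
print(cone_select(sources, [1, 0, 0]))
```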
1.5 Conclusion
This chapter aims at emphasizing how the SIVE book was born and developed in a
constantly evolving situation in the field of human-computer interaction. We invite
the reader to explore all its chapters with this shared and dynamic tension that we,
as editors, have tried to formalize in what we have called the egocentric perspective
of the auditory digital twin. The co-transformation of man and technology seems to
us a central theme that will surely help us to enter the 4th HCI wave, consciously.
The proposed taxonomy focuses on action, behavior, and sense-making because
we believe it is a meaningful path toward authentic auditory experiences in VR. In
particular, the last aspect, sense-making, turns out to be the most challenging.
The idea of diffraction and the exploration of differences and discoveries require novel
ways of scientific investigation in SIVE. The most crucial aspect might be the level
of personalization that future technologies will need to acquire from listeners.
New paradigms for artificial and immersive interaction between humans and VE will
have to be proposed. The attribution of agency to a digital twin is a network effect
that will have relevant ethical implications, as well as complexity in its analysis.
How much would the listener trust her digital twin? Its intermediary role, some-
times provocative, in search of differences can elicit strong reactions in the listeners.
Will the listener accept and share this perspective? The affective information strongly
links sound to meaning [138], creating empathy between the listener and her digital twin.
This aspect will be carefully considered for its ethical implications.
How can one quantify and classify the various actor networks in the proposed
three dimensions? Surely, this is an open challenge of this first proposed theoreti-
cal framework for SIVE. Visualizing and representing transitions and agential cuts
are relevant issues toward an objective description of any mediation phenomenon.
Creating multiple ontologies in "magical" interaction metaphors allows one to transcend
reality and immerse in unique experiences within VEs. Since VR is not yet able
to fully replicate natural reality and may not be able to do so, its current features
actually allow listeners to do and be things that are impossible in the real world.
This is the very essence of knowledge diffraction: the digital twin should explore
such differences that are impossible to test in the physical world, extracting mean-
ing for the listener. Of particular interest here, the ideas of superhuman powers and
virtual prototyping [52] reflect the human desire to increase one's capabilities. They are
receiving increasing attention thanks to the post-humanism and human enhancement
manifestos [97]. Following this line of thought, Sadeghian and Hassenzahl [121] proposed that
VR designers explore new forms of interaction without necessarily imitating the
physical world. VR's limitations in creating realistic interactions are replaced by a
focus on experiences that are impossible to have in the real world, such as the superhuman
powers of flying, X-ray vision, shape-shifting, super memory, etc. Limitations
obviously arise in differentiating VEs before confusion overwhelms the listener.
Indeed, a balance in ecological and familiar stimulation should guide the creation of
a “safety net” or “comfort zone” for the listener—the digital twin’s exploration of
agonistic and provocative knowledge opportunities without drawbacks.
This chapter aims to shape the SIVE research field, sonic interactions in VEs,
which is now ready to welcome wide-ranging reflections on what might be called
sonic intra-actions in VEs.
References
1. Adavanne, S., Politis, A., Nikunen, J., Virtanen, T.: Sound Event Localization and Detection
of Overlapping Sources Using Convolutional Recurrent Neural Networks. IEEE Journal of
Selected Topics in Signal Processing 13, 34-48 (2019).
2. Alletto, S., Serra, G., Calderara, S., Cucchiara, R.: Understanding social relationships in
egocentric vision. en. Pattern Recognition 48, 4082-4096 (2015).
3. Andéol, G., Simpson, B. D.: Editorial: How, and Why, Does Spatial-Hearing Ability Differ
among Listeners? What is the Role of Learning and Multisensory Interactions? Frontiers in
Neuroscience 10 (2016).
4. Atherton, J., Wang, G.: Doing vs. Being: A philosophy of design for artful VR. Journal of
New Music Research 49, 35-59 (2020).
5. Aydin, C., González Woge, M., Verbeek, P.-P.: Technological Environmentality: Conceptual-
izing Technology as a Mediating Milieu. en. Philosophy & Technology 32, 321-338 (2019).
6. Barad, K.: Meeting the Universe Halfway: Quantum Physics and the Entanglement of Matter
and Meaning en (Duke University Press, 2007).
7. Benford, S., Bowers, J., Fahlén, L. E., Greenhalgh, C.: Managing mutual awareness in col-
laborative virtual environments in Proceedings of the conference on Virtual reality software
and technology (World Scientific Publishing Co., Inc., USA, 1994), 223-236.
8. Best, V., Baumgartner, R., Lavandier, M., Majdak, P., Kopčo, N.: Sound Externalization: A
Review of Recent Research. en. Trends in Hearing 24 (2020).
9. Bharitkar, S., Kyriakakis, C.: Immersive audio signal processing English (Springer, New York,
NY, 2006).
10. Blackwell, A.: Interacting with an inferred world: the challenge of machine learning for
humane computer interaction. en (2015).
11. Boren, B., Geronazzo, M., Brinkmann, F., Choueiri, E.: Coloration metrics for headphone
equalization in Proc. of the 21st Int. Conf. on Auditory Display (ICAD 2015) (Graz, Austria,
2015), 29-34.
12. Bormann, K.: Presence and the Utility of Audio Spatialization. Presence 14, 278-297 (2005).
13. Bowman, D. et al.: 3D User Interfaces: New Directions and Perspectives. Computer Graphics
and Applications, IEEE 28, 20-36 (2008).
14. Bowman, D. A., Hodges, L. F.: Formalizing the Design, Evaluation, and Application of
Interaction Techniques for Immersive Virtual Environments. Journal of Visual Languages &
Computing 10, 37-53 (1999).
15. Bregman, A. S.: Auditory scene analysis: the perceptual organization of sound (MIT Press,
Cambridge, Mass., 1990).
16. Breves, P., Dodel, N.: The influence of cybersickness and the media devices’ mobility on the
persuasive effects of 360° commercials. en. Multimedia Tools and Applications 80, 27299-
27322 (2021).
17. Brinkmann, F., Roden, R., Lindau, A., Weinzierl, S.: Audibility and interpolation of head-
above-torso orientation in binaural technology. IEEE Journal of Selected Topics in Signal
Processing PP, 1-1 (2015).
18. Brinkmann, F., Lindau, A., Weinzierl, S.: On the authenticity of individual dynamic binaural
synthesis. en. The Journal of the Acoustical Society of America 142, 1784-1795 (2017).
19. Broadbent, D. E.: Perception and Communication en (Scientific Book Guild, 1958).
20. Bronkhorst, A. W.: The cocktail-party problem revisited: early processing and selection of
multi-talker speech. Attention, Perception & Psychophysics 77, 1465-1487 (2015).
21. Brungart, D. S.: Near-Field Virtual Audio Displays. Presence 11, 93-106 (2002).
22. Brungart, D. S. et al.: The interaction between head-tracker latency, source duration, and
response time in the localization of virtual sound sources en. In Proc. International Con-
ference on Auditory Display 2004 (2004), 7.
23. Bruynseels, K., Santoni de Sio, F., van den Hoven, J.: Digital Twins in Health Care: Ethical
Implications of an Emerging Engineering Paradigm. Frontiers in Genetics 9, 31 (2018).
24. Bufacchi, R. J., Iannetti, G. D.: An Action Field Theory of Peripersonal Space. Trends in
Cognitive Sciences 22, 1076-1090 (2018).
25. Cadet, L. B., Chainay, H.: Memory of virtual experiences: Role of immersion, emotion and
sense of presence. en. International Journal of Human-Computer Studies 144, 102506 (2020).
26. Casanueva, J., Blake, E.: en. in Virtual Environments 2000 (eds Hansmann, W., Purgathofer,
W., Sillion, F., Mulder, J., van Liere, R.) 85-94 (Springer Vienna, Vienna, 2000).
27. Catic, J., Santurette, S., Buchholz, J. M., Gran, F., Dau, T.: The effect of interaural-level-
difference fluctuations on the externalization of sound. The Journal of the Acoustical Society
of America 134, 1232-1241 (2013).
28. in. Advances in Social Theory and Methodology (RLE Social Theory) (eds Cetina, K. K.,
Cicourel, A. V.) (Routledge, 2014).
29. Understanding learning in virtual worlds en (eds Childs, M., Peachey, A.) (Springer, London,
2013).
30. Cho, D. et al.: The dichotomy of presence elements: the where and what in IEEE Virtual
Reality, 2003. Proceedings. (2003), 273-274.
31. Collins, K. in Essays on Sound and Vision (eds Richardson, J., Hawkins, S.) 263-298 (Helsinki
University Press, Helsinki, 2007).
32. Cooke, D. F., Taylor, C. S. R., Moore, T., Graziano, M. S. A.: Complex movements evoked
by microstimulation of the ventral intraparietal area. Proceedings of the National Academy of
Sciences of the United States of America 100, 6163-6168 (2003).
33. Davis, S., Nesbitt, K., Nalivaiko, E.: A Systematic Review of Cybersickness en. in Proceedings
of the 2014 Conference on Interactive Entertainment - IE2014 (ACM Press, Newcastle, NSW,
Australia, 2014), 1-9.
34. Degli Innocenti, E. et al.: Mobile virtual reality for musical genre learning in primary educa-
tion. Computers & Education 139, 102-117 (2019).
35. Den Hengst, F., Grua, E. M., el Hassouni, A., Hoogendoorn, M.: Reinforcement learning for
personalization: A systematic literature review. en. Data Science 3, 107-147 (2020).
36. DiSalvo, C.: Adversarial Design en (eds Friedman, K., Stolterman, E.) (MIT Press, Cam-
bridge, MA, USA, 2012).
37. Dörk, M., Comber, R., Dade-Robertson, M.: Monadic exploration: seeing the whole through
its parts in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems
(Association for Computing Machinery, New York, NY, USA, 2014), 1535-1544.
38. Dubus, G., Bresin, R.: A Systematic Review of Mapping Strategies for the Sonification of
Physical Quantities. PLoS ONE 8, e82491 (2013).
39. Durr, G., Peixoto, L., Souza, M., Tanoue, R., Reiss, J. D.: Implementation and Evaluation of
Dynamic Level of Audio Detail English. in (Audio Engineering Society, 2015).
40. El Saddik, A.: Digital Twins: The Convergence of Multimedia Technologies. IEEE MultiMe-
dia 25, 87-92 (2018).
41. Ernst, M. O., Bülthoff, H. H.: Merging the senses into a robust percept. Trends in Cognitive
Sciences 8, 162-169 (2004).
42. Feasel, J., Whitton, M. C., Wendt, J. D.: LLCM-WIP: Low-latency, continuous-
motion walking-in-place in 3D User Interfaces, 2008. 3DUI 2008. IEEE Symposium on (IEEE,
2008), 97-104.
43. Flach, J. M., Holden, J. G.: The Reality of Experience: Gibson’s Way. en. Presence: Teleop-
erators and Virtual Environments 7, 90-95 (1998).
44. Franinovic, K., Serafin, S.: Sonic Interaction Design en (MIT Press, 2013).
45. Frauenberger, C.: Entanglement HCI The Next Wave? ACM Transactions on Computer-
Human Interaction 27, 2:1-2:27 (2019).
46. Friston, K., FitzGerald, T., Rigoli, F., Schwartenbeck, P., Pezzulo, G.: Active Inference: A
Process Theory. Neural Computation 29, 1-49 (2017).
47. Gallagher, S., Zahavi, D.: The Phenomenological Mind 3rd ed. (Routledge, London, 2020).
48. Gaver, W. W.: What in the World Do We Hear?: An Ecological Approach to Auditory Event
Perception. Ecological Psychology 5, 1-29 (1993).
49. Sonic Interactions in Virtual Environments (eds Geronazzo, M., Serafin, S.) (Springer Inter-
national Publishing, 2022).
50. Geronazzo, M., Spagnol, S., Avanzini, F.: Customized 3D Sound for Innovative Interaction
Design in Proc. SMC-HCI Work., CHItaly 2011 Conf. (Alghero, Italy, 2011).
51. Geronazzo, M., Spagnol, S., Avanzini, F.: Do we need individual head-related transfer func-
tions for vertical localization? The case study of a spectral notch distance metric. IEEE/ACM
Transactions on Audio, Speech, and Language Processing 26, 1243-1256 (2018).
52. Geronazzo, M., Tissieres, J. Y., Serafin, S.: A Minimal Personalization of Dynamic Binaural
Synthesis with Mixed Structural Modeling and Scattering Delay Networks in Proc. IEEE Int.
Conf. on Acoust. Speech Signal Process. (ICASSP 2020) (Barcelona, Spain, 2020), 411-415.
53. Geronazzo, M., Vieira, L. S., Nilsson, N. C., Udesen, J., Serafin, S.: Superhuman Hearing -
Virtual Prototyping of Artificial Hearing: a Case Study on Interactions and Acoustic Beam-
forming. IEEE Transactions on Visualization and Computer Graphics 26, 1912-1922 (2020).
54. Gibson, E. J., Pick, A. D.: An Ecological Approach to Perceptual Learning and Development
en (Oxford University Press, New York, NY, 2000).
55. Gibson, J. J.: The Ecological Approach to Visual Perception: Classic Edition (Psychology
Press, New York, 2014).
56. Gödde, M., Gabler, F., Siegmund, D., Braun, A.: Cinematic Narration in VR - Rethinking
Film Conventions for 360 Degrees en. in Virtual, Augmented and Mixed Reality: Applica-
tions in Health, Cultural Heritage, and Industry (eds Chen, J. Y., Fragomeni, G.) (Springer
International Publishing, Cham, 2018), 184-201.
57. Goldstone, R. L.: Perceptual Learning. Annual Review of Psychology 49, 585-612 (1998).
58. Graziano, M. S., Yap, G. S., Gross, C. G.: Coding of visual space by premotor neurons. en.
Science 266, 1054-1057 (1994).
59. Graziano, M. S. A., Taylor, C. S. R., Moore, T.: Complex Movements Evoked by Microstim-
ulation of Precentral Cortex. Neuron 34, 841-851 (2002).
60. Grivaz, P.,Blanke, O., Serino, A.: Common and distinct brain regions processing multisensory
bodily signals for peripersonal space and body ownership. NeuroImage 147, 602-618 (2017).
61. Hachet, M., Decle, F., Knodel, S., Guitton, P.: Navidget for Easy 3D Camera Positioning from
2D Inputs in 2008 IEEE Symposium on 3D User Interfaces (2008), 83-89.
62. Hacihabiboglu, H., De Sena, E., Cvetkovic, Z., Johnston, J., Smith III, J. O.: Perceptual Spatial
Audio Recording, Simulation, and Rendering: An overview of spatial-audio techniques
based on psychoacoustics. IEEE Signal Processing Magazine 34, 36-54 (2017).
63. Hall, E. T. et al.: Proxemics [and Comments and Replies]. Current Anthropology 9, 83-108
(1968).
64. Harrison, S., Tatar, D., Sengers, P.: The Three Paradigms of HCI. en, 22 (2007).
65. Hartmann, W. M., Wittenberg, A.: On the externalization of sound images. The Journal of
the Acoustical Society of America 99, 3678-3688 (1996).
66. Hauser, S., Oogjes, D., Wakkary, R., Verbeek, P.-P.: An Annotated Portfolio on Doing Postphe-
nomenology Through Research Products in Proceedings of the 2018 Designing Interactive
Systems Conference (Association for Computing Machinery, New York, NY, USA, 2018),
459-471.
67. Heidegger, M.: Being and Time en (Blackwell, 1967).
68. Hiipakka, M., Kinnari, T., Pulkki, V.: Estimating head-related transfer functions of human sub-
jects from pressure-velocity measurements. The Journal of the Acoustical Society of America
131, 4051-4061 (2012).
69. Holzinger, A.: Interactive machine learning for health informatics: when do we need the
human-in-the-loop? en. Brain Informatics 3, 119-131 (2016).
70. Holzinger, A.: From Machine Learning to Explainable AI in 2018 World Symposium on
Digital Intelligence for Systems and Machines (DISA) (2018), 55-66.
71. Höök, K.: Designing with the Body: Somaesthetic Interaction Design en (MIT Press, 2018).
72. Höök, K., Löwgren, J.: Strong concepts: Intermediate-level knowledge in interaction design
research. ACM Transactions on Computer-Human Interaction 19, 23:1-23:18 (2012).
73. Husserl, E.: Ideas Pertaining to a Pure Phenomenology and to a Phenomenological Philosophy
en (Springer Netherlands, 1982).
74. Ihde, D.: Technology and the Lifeworld: From Garden to Earth Inglese (Indiana Univ Pr,
Bloomington, 1990).
75. Ihlefeld, A., Shinn-Cunningham, B.: Disentangling the effects of spatial cues on selection and
formation of auditory objects. J. Acoust. Soc. Am. 124, 2224-2235 (2008).
76. Jerald, J.: The VR Book: Human-Centered Design for Virtual Reality (Association for Com-
puting Machinery and Morgan & Claypool, New York, NY, USA, 2016).
77. Kanade, T., Rander, P., Narayanan, P. J.: Virtualized reality: constructing virtual worlds from
real scenes. IEEE MultiMedia 4, 34-47 (1997).
78. Katz, B. F. G.: Boundary Element Method Calculation of Individual Head- Related Transfer
Function. I. Rigid Model Calculation. The Journal of Acoustical Society of America 110,
2440-2448 (2001).
79. Katz, B. F. G.,Weber, A.: An Acoustic Survey of the Cathédrale Notre-Dame de Paris before
and after the Fire of 2019. en. Acoustics 2, 791-802 (2020).
80. Kilteni, K., Groten, R., Slater, M.: The Sense of Embodiment in Virtual Reality. Presence 21,
373-387 (2012).
81. Kim, B., Pardo, B.: A Human-in-the-Loop System for Sound Event Detection and Annotation.
ACM Transactions on Interactive Intelligent Systems 8, 13:1-13:23 (2018).
82. Laback, B., Majdak, P.: Binaural jitter improves interaural time-difference sensitivity of
cochlear implantees at high pulse rates. en. Proceedings of the National Academy of Sci-
ences 105, 814-817 (2008).
83. Larsson, P., Västfjäll, D., Kleiner, M.: Effects of auditory information consistency and room
acoustic cues on presence in virtual environments. en. Acoustical Science and Technology
29, 191-194 (2008).
84. Latour, B., Jensen, P., Venturini, T., Grauwin, S., Boullier, D.: ’The whole is always smaller
than its parts’ - a digital test of Gabriel Tardes’ monads. en. The British Journal of Sociology
63, 590-615 (2012).
85. Law, J.: Notes on the theory of the actor-network: Ordering, strategy, and heterogeneity. en.
Systems practice 5, 379-393 (1992).
86. Lester, M., Boley, J.: The effects of latency on live sound monitoring in Proc. 123 Audio
Engin. Soc. Convention (New York, 2007).
87. Loke, L., Robertson, T.: Moving and making strange: An embodied approach to movement-
based interaction design. ACM Transactions on Computer-Human Interaction 20, 7:1-7:25
(2013).
88. Loomis, J. M.: Presence in Virtual Reality and Everyday Life: Immersion within a World of
Representation. en. Presence: Teleoperators and Virtual Environments 25, 169-174 (2016).
89. Lupton, D.: The Quantified Self en (John Wiley & Sons, 2016).
90. Virtual Reality & Augmented Reality in Industry en (eds Ma, D., Gausemeier, J., Fan, X.,
Grafe, M.) (Springer-Verlag, Berlin Heidelberg, 2011).
91. Marchionini, G.: Exploratory search: from finding to understanding. Communications of the
ACM 49, 41-46 (2006).
92. Mendes, D., Caputo, F. M., Giachetti, A., Ferreira, A., Jorge, J.: A Survey on 3D Virtual
Object Manipulation: From the Desktop to Immersive Virtual Environments. en. Computer
Graphics Forum 38, 21-45 (2019).
93. Merleau-Ponty, M.: Phenomenology of Perception 1st edition. Inglese (Routledge, Abingdon,
Oxon ; New York, 2013).
94. Metzinger, T. K.: Why Is Virtual Reality Interesting for Philosophers? Frontiers in Robotics
and AI 5, 101 (2018).
95. Milgram, P., Kishino, F.: A Taxonomy of Mixed Reality Visual Displays. IEICE Transactions on Information and Systems E77-D, 1321-1329 (1994).
96. Młynarski, W., McDermott, J. H.: Ecological origins of perceptual grouping principles in the
auditory system. en. Proceedings of the National Academy of Sciences (2019).
97. Moore, P.: Enhancing Me: The Hope and the Hype of Human Enhancement 1st edition.
English (Wiley, Chichester, England ; Hoboken, NJ, 2008).
98. Murray, J. H.: Hamlet on the Holodeck: The Future of Narrative in Cyberspace Updated
Edition. en (MIT Press, Cambridge, MA, USA, 2017).
99. Nassiri, N., Powell, N., Moore, D.: Human interactions and personal space in collaborative
virtual environments. Virtual Reality 14, 229-240 (2010).
100. Nguyen, T.-H.-C., Nebel, J.-C., Florez-Revuelta, F.: Recognition of Activities of Daily Living
with Egocentric Vision: A Review. en. Sensors 16, 72 (2016).
101. Nilsson, N. C. et al.: 15 Years of Research on Redirected Walking in Immersive Virtual Envi-
ronments. IEEE Computer Graphics and Applications 38, 44-56 (2018).
102. Nordahl, R., Nilsson, N. C.: The Sound of Being There: Presence and Interactive Audio in
Immersive Virtual Reality. en. The Oxford Handbook of Interactive Audio (2014).
103. Ntoutsi, E. et al.: Bias in data-driven artificial intelligence systems-An introductory survey.
en. WIREs Data Mining and Knowledge Discovery 10, e1356 (2020).
104. Nyberg, D.: Computers, Customer Service Operatives and Cyborgs: Intraactions in Call Cen-
tres. en. Organization Studies 30, 1181-1199 (2009).
105. Orlikowski, W. J.: The sociomateriality of organisational life: considering technology in man-
agement research. Cambridge Journal of Economics 34, 125-141 (2010).
106. Osimo, S. A., Pizarro, R., Spanlang, B., Slater, M.: Conversations between self and self as
Sigmund Freud-A virtual body ownership paradigm for self counselling. en. Scientific Reports
5, 13899 (2015).
107. Computational Interaction (eds Oulasvirta, A., Kristensson, P. O., Bi, X., Howes, A.) (Oxford
University Press, Oxford, New York, 2018).
108. Pai, D. K.: Multisensory Interaction: Real and Virtual en. in Robotics Research. The Eleventh
International Symposium (eds Dario, P., Chatila, R.) (Springer, Berlin, Heidelberg, 2005),
489-498.
109. Parisi, G. I., Kemker, R., Part, J. L., Kanan, C., Wermter, S.: Continual lifelong learning with
neural networks: A review. en. Neural Networks 113, 54-71 (2019).
110. Paul, S.: Binaural Recording Technology: A Historical Review and Possible Future Develop-
ments. Acta Acustica united with Acustica 95, 767-788 (2009).
111. Pinho, M. S., Bowman, D. A., Freitas, C. M.: Cooperative object manipulation in immersive
virtual environments: framework and techniques in Proceedings of the ACM symposium on
Virtual reality software and technology (Association for Computing Machinery, New York,
NY, USA, 2002), 171-178.
112. Polotti, P., Rocchesso, D., Editors, D. R.: Sound to Sense , Sense to Sound A State of the Art
in Sound and Music Computing (eds Polotti, P., Rocchesso, D.) (Logos Verlag Berlin, 2008).
113. Prepeliţă, S. T., Gómez Bolaños, J., Geronazzo, M., Mehra, R., Savioja, L.: Pinna-related
transfer functions and lossless wave equation using finite-difference methods: Verification
and asymptotic solution. The Journal of the Acoustical Society of America 146, 3629-3645
(2019).
114. Prepeliţă, S. T., Gómez Bolaños, J., Geronazzo, M., Mehra, R., Savioja, L.: Pinna-related
transfer functions and lossless wave equation using finite-difference methods: Validation with
measurements. The Journal of the Acoustical Society of America 147, 3631-3645 (2020).
115. Ramstead, M. J., Kirchhoff, M. D., Friston, K. J.: A tale of two densities: active inference is
enactive inference. en. Adaptive Behavior 28, 225-239 (2020).
116. Riecke, B. E., Väljamäe, A., Schulte-Pelkum, J.: Moving Sounds Enhance the Visually-
induced Self-motion Illusion (Circular Vection) in Virtual Reality. ACM Trans. Appl. Percept.
6, 7:1-7:27 (2009).
117. River, J., MacTavish, T.: Research through provocation: a structured prototyping tool using
interaction attributes of time, space and information. The Design Journal 20, S3996-S4008
(2017).
118. Rocchesso, D., Bresin, R., Fernstrom, M.: Sounding objects. IEEE MultiMedia 10, 42-52
(2003).
119. Ronga, I.et al.: Seeming confines: Electrophysiological evidence of peripersonal space remap-
ping following tool-use in humans. en. Cortex (2021).
120. Rose, T., Nam, C. S., Chen, K. B.: Immersion of virtual reality for rehabilitation - Review.
en. Applied Ergonomics 69, 153-161 (2018).
121. Sadeghian, S., Hassenzahl, M.: From Limitations to Superpowers: A Design Approach to Bet-
ter Focus on the Possibilities of Virtual Reality to Augment Human Capabilities in Designing
Interactive Systems Conference 2021 (Association for Computing Machinery, New York, NY,
USA, 2021), 180-189.
122. Sankaran, N., Hillis, J., Zannoli, M., Mehra, R.: Perceptual thresholds of spatial audio update
latency in virtual auditory and audiovisual environments.The Journal of the Acoustical Society
of America 140, 3008-3008 (2016).
123. Sapontzis, S. F.: A Note on Merleau-Ponty’s “Ambiguity”. Philosophy and Phenomenological
Research 38, 538-543 (1978).
124. Sauzéon, H. et al.: The use of virtual reality for episodic memory assessment: effects of active
navigation. eng. Experimental Psychology 59, 99-108 (2011).
125. Schoeffler, M., Herre, J.: About the different types of listeners for rating the overall listening
experience in In Proc. of ICMC|SMC|2014 (Athens, 2014), 886-892.
126. Schultze, U.: The Avatar as Sociomaterial Entanglement: A Performative Perspective on
Identity, Agency and World-Making in Virtual Worlds. ICIS 2011 Proceedings (2011).
127. Serafin, S., Erkut, C., Kojs, J., Nilsson, N. C., Nordahl, R.: Virtual Reality Musical Instruments:
State of the Art, Design Principles, and Future Directions. Computer Music Journal 40, 22-40
(2016).
128. Serafin, S., Geronazzo, M., Nilsson, N. C., Erkut, C., Nordahl, R.: Sonic interactions in virtual
reality: state of the art, current challenges and future directions. IEEE Computer Graphics
and Applications 38, 31-43 (2018).
129. Serafin, S. et al.: Reflections from five years of Sonic Interactions in Virtual Environments
workshops. Journal of New Music Research 49, 24-34 (2020).
130. Serino, A.: Peripersonal space (PPS) as a multisensory interface between the individual and
the environment, defining the space of the self. Neuroscience & Biobehavioral Reviews 99,
138-159 (2019).
131. Shilling, R. D., Shinn-Cunningham, B. in Handbook of virtual environments: Design, imple-
mentation, and applications 65-92 (Lawrence Erlbaum Associates Publishers, Mahwah, NJ,
US, 2002).
132. Shinn-Cunningham, B. G., Best, V.: Selective Attention in Normal and Impaired Hearing.
Trends in Amplification 12, 283-299 (2008).
133. Skarbez, R., Polys, N. F., Ogle, J. T., North, C., Bowman, D. A.: Immersive Analytics: Theory
and Research Agenda. English. Frontiers in Robotics and AI 6 (2019).
134. Skarbez, R., Smith, M., Whitton, M. C.: Revisiting Milgram and Kishino’s Reality-Virtuality
Continuum. Frontiers in Virtual Reality 2, 27 (2021).
135. Slater, M.: Place illusion and plausibility can lead to realistic behaviour in immersive virtual
environments. Philosophical Transactions of the Royal Society B: Biological Sciences 364,
3549-3557 (2009).
136. Slater, M., Brogni, A., Steed, A.: Physiological Responses to Breaks in Presence: A Pilot
Study. en. Presence 2003: The 6th annual international workshop on presence 157, 4 (2003).
137. Slater, M., Wilbur, S.: A Framework for Immersive Virtual Environments (FIVE): Specula-
tions on the Role of Presence in Virtual Environments. en. Presence: Teleoperators and Virtual
Environments 6, 603-616 (1997).
138. Stevenson, R. A., James, T. W.: Affective auditory stimuli: Characterization of the Interna-
tional Affective Digitized Sounds (IADS) by discrete emotional categories. Behavior Research
Methods 40, 315-321 (2008).
139. Stitt, P., Picinali, L., Katz, B. F. G.: Auditory Accommodation to Poorly Matched Non-
Individual Spectral Localization Cues Through Active Learning. En. Scientific Reports 9,
1063 (2019).
140. Stockburger, A.: The game environment from an auditory perspective in Proc. Level Up:
Digital Games Research Conference (eds Copier, M., Raessens, J.) (Utrecht, 2003).
141. Suchman, L.: Human/Machine Reconsidered. Cognitive Studies: Bulletin of the Japanese
Cognitive Science Society 5, 1_5-1_13 (1998).
142. Sugimoto, M., Hosoi, K., Hashizume, H.: Caretta: a system for supporting face-to-face collab-
oration by integrating personal and shared spaces in Proceedings of the SIGCHI Conference
on Human Factors in Computing Systems (Association for Computing Machinery, New York,
NY, USA, 2004), 41-48.
143. Tarde, G.: Monadology and Sociology Illustrated edition. English. Trans. By Lorenc, T.
(re.press, 2012).
144. Teneggi, C., Canzoneri, E., di Pellegrino, G., Serino, A.: Social Modulation of Peripersonal
Space Boundaries. Current Biology 23, 406-411 (2013).
145. Tsingos, N., Gallo, E., Drettakis, G.: Perceptual audio rendering of complex virtual environ-
ments. ACM Transactions on Graphics 23, 249-258 (2004).
146. Udesen, J., Piechowiak, T., Gran, F.: The Effect of Vision on Psychoacoustic Testing with
Headphone-Based Virtual Sound. Journal of the Audio Engineering Society 63, 552-561
(2015).
1 Sonic Interactions in Virtual Environments 45
147. Välimäki, V., Parker, J. D., Savioja, L., Smith, J. O., Abel, J. S.: Fifty Years of Artificial
Reverberation. IEEE Transactions on Audio, Speech, and Language Processing 20, 1421-
1448 (2012).
148. Varela, F., Thompson, E., Rosch, E.: The Embodied Mind (MIT Press, Cambridge, MA,
1991).
149. Verbeek, P.-P.: Cyborg intentionality: Rethinking the phenomenology of human-technology
relations. Phenomenology and the Cognitive Sciences 7, 387-395 (2008).
150. Verbeek, P.-P.: Beyond interaction: a short introduction to mediation theory. Interactions 22,
26-31 (2015).
151. Vindenes, J., Wasson, B.: A Postphenomenological Framework for Studying User Experience
of Immersive Virtual Reality. Frontiers in Virtual Reality 2, 40 (2021).
152. Von Berg, M., Steffens, J., Weinzierl, S., Müllensiefen, D.: Assessing room acoustic listening
expertise. The Journal of the Acoustical Society of America 150, 2539-2548 (2021).
153. Vorländer, M.: Virtual Acoustics. Archives of Acoustics 39 (2015).
154. Warren,W. H.: Direct Perception: The View from Here. Philosophical Topics 33, 335-361
(2005).
155. Weibel, D.,Wissmath, B., Mast, F.W.: Immersion in mediated environments: the role of per-
sonality traits. eng. Cyberpsychology, Behavior and Social Networking 13, 251-256 (2010).
156. Witmer, B. G., Singer, M. J.: Measuring Presence in Virtual Environments: A Presence Ques-
tionnaire. Presence: Teleoperators and Virtual Environments 7, 225-240 (1998).
157. Xiangyu Wang, P. S. D.: A user-centered taxonomy for specifying mixed reality systems for
aec industry. ITcon Vol. 16, 493-508 (2011).
158. Zacharov, N.: Sensory Evaluation of Sound en (CRC Press, 2019).
159. Zahorik, P., Jenison, R. L.: Presence as Being-in-the-World. Presence: Teleoperators and
Virtual Environments 7, 78-89 (1998).
160. Zonooz, B., Opstal, A. J.V.: DifferentialAdaptation in Azimuth and Elevation to Acute Monau-
ral Spatial Hearing after Training with Visual Feedback. en. eNeuro 6(2019).
Part II
Interactive and Immersive Audio
Chapter 2
Procedural Modeling of Interactive
Sound Sources in Virtual Reality
Federico Avanzini
Abstract This chapter addresses the first building block of sonic interactions in
virtual environments, i.e., the modeling and synthesis of sound sources. Our main
focus is on procedural approaches, which are still striving to gain recognition in commercial applications and in a sound design workflow that remains firmly grounded in the use of samples and event-based logic. Special emphasis is placed on physics-based sound
synthesis methods and their potential for improved interactivity. The chapter starts
with a discussion of the categories, functions, and affordances of sounds that we listen
to and interact with in real and virtual environments. We then address perceptual
and cognitive aspects, with the aim of emphasizing the relevance of sound source
modeling with respect to the senses of presence and embodiment of a user in a virtual
environment. Next, procedural approaches are presented and compared to sample-
based approaches, in terms of models, methods, and computational costs. Finally,
we analyze the state of the art in current uses of these approaches for Virtual Reality
applications.
2.1 Introduction
Takala and Hahn [86] were possibly the first scholars who proposed a sound rendering
pipeline, in analogy with the image rendering pipeline, aimed at producing an overall
“soundtrack” starting from a description of the objects in an audio-visual scene.
Their pipeline included sound modeling and sound rendering stages, running in
parallel with the image rendering pipeline. Figure 2.1 proposes an updated picture,
which considers several aspects investigated by researchers throughout the last three
decades and may represent a general pipeline for sound simulation in Virtual Reality
(hereinafter, VR).
Much of recent and current research is concerned with aspects related to the
“Propagation” and “Rendering” blocks represented in this figure, as well as the
F. Avanzini (✉)
Laboratory of Music Informatics, Department of Computer Science, University of Milano,
Via G. Celoria 18, IT-20135 Milano, Italy
e-mail: federico.avanzini@di.unimi.it
© The Author(s) 2023
M. Geronazzo and S. Serafin (eds.), Sonic Interactions in Virtual Environments,
Human–Computer Interaction Series, https://doi.org/10.1007/978-3-031-04021-4_2
Fig. 2.1 A general pipeline for sound simulation in Virtual Reality (figure based on [51])
geometrical and material properties of acoustic enclosures in the “Modeling” block.
This chapter focuses instead on the remaining balloon of the “Modeling” block, the
modeling of sound sources.
One obvious motivation for looking into sound source modeling is that all sounds
occurring in a virtual (and in a real) environment originate from some sources,
before propagating into the environment and finally reaching the listener. Secondly,
many of the sonic interactions occurring in a virtual environment are interactions
between the subject’s avatar and sound sources. Here, our definition of interactive is
analogous to the one given by Collins [20] for video-game audio: whereas adaptive
audio generically refers to audio that reacts appropriately to events and changes
occurring in the simulation, interactive audio refers to sound events occurring directly
in reaction to the avatar's gestures (ranging from pressing a button to walking or hitting
objects in the virtual scene).
The current dominant paradigm in VR audio, largely based on sound samples1
triggered by specific events generated by the avatar or the simulation, is minimally
adaptive and interactive. This is the main motivation for looking into procedural
approaches to sound generation.
2.2 What to Model
The first question that should be asked is as follows: what are the sound sources that
need to be modeled in a virtual environment, and how can these be organized into
a coherent and comprehensive taxonomy? Such a taxonomy would provide a useful
tool to analyze in a systematic way the state of the art of the research in this field and
possibly to spot research directions that are still under-explored.
1For the sake of clarity, in this chapter, we use the term “sample” in its commonly accepted meaning
of pre-recorded/pre-processed sound excerpt, rather than that of a single value of a digital signal.
2.2.1 Diegetic Sounds
A first possible and often used distinction can be borrowed from narrative theory.
The term diegesis has been used in film theory to refer to the fictional world of the
film story, and correspondingly the adjective diegetic refers to elements that are part
of the depicted fictional world. By contrast, non-diegetic elements are those which
should be considered non-existent in the fictional world.
As far as sound in particular is concerned, three main categories are traditionally
used in films: speech and dialogue, sound effects, and music [80]. The first two
categories comprise diegetic sounds, while music is a non-diegetic element having
mostly an affective and emotional role, a distinction that may be related to the motto
“Sound effects make it real, music makes you feel” [49].
Several taxonomies for sounds in video-games have been proposed and are typ-
ically based on similar categories [42]. These may be employed in the context of
VR as well, with the additional caveat that VR applications only partly overlap with
video-games. In particular, VR, and immersive VR specifically, may be defined as
“a medium in which people respond with their whole bodies, treating what they
perceive as real” [77]. In light of this definition, in this chapter, we focus on diegetic
sounds, those that “make it real”: in other words, those that contribute most to the
overall sense of the presence of a user within a virtual environment, which we will
discuss in Sect. 2.3.
An interesting example of a taxonomy for sound in games is provided by Stock-
burger [84], who considers five different types of sound objects. Non-diegetic ele-
ments include (i) music, but also (ii) interface sounds, which may sometimes be
included into the diegetic part of the game environment; proper diegetic elements
instead comprise the three categories of (iii) speech and dialogue, (iv) ambience (or
“zone” sounds in Stockburger’s definition), and (v) effects.
Speech and dialogue are very relevant components of a virtual environment; how-
ever, our focus in this chapter is on non-verbal sound. The distinction between ambi-
ence and effect sounds is mainly a perspectival one: the former are background
sounds, connected to locations or zones (understood both as different spatial loca-
tions in an environment and different levels in a game) and having distinct auditory
qualities; the latter are instead foreground sounds other than speech, that are cog-
nitively linked to objects or events, and are therefore perceived as being produced
by such objects and events. Sound-producing objects may be moving or static ele-
ments, may be directly interactable by the avatar or just synchronized to the visual
simulation, or may be even outside the visual field of view.
Stockburger [84] goes on to distinguish effect subcategories, depending on
the elements of the environment they are linked to. His classification is heavily
tailored to games, but serves as an inspiration to further inspect and subdivide effect
sounds. For the purpose of the present discussion, we only make a distinction between
two subcategories: (i) effects linked to the avatar, and (ii) all remaining effects in
the environment. Effects linked to the avatar are related to sounds produced by the
avatar’s movement or object manipulation: footsteps, swishing of an object cutting
Fig. 2.2 Categories and interactivity of diegetic sounds in a virtual environment
through the air, knocking on a wall, clothes, etc. They can also include sounds
produced by the avatar’s own body, such as breathing or scratching. The remaining
effects in the environment may include non-verbal human sounds, sounds produced
by human activities, machine sounds, and so on. A visual summary is provided in
Fig. 2.2. The categories and subcategories identified here can be usefully mapped
into interactive and adaptive sound sources.
2.2.2 Everyday Sounds
An orthogonal approach with respect to the previous one amounts to characterizing
sound sources in terms of the physical mechanisms and events that are associated to
those sources.
Typical lists of audio assets for games or VR include, at the second level of clas-
sification (after the branch between ambience and sound effects), such categories as
footsteps, doors, wind and weather, and cars and engines, with varying degrees of
detail. These categories in fact refer to objects and events that are physically respon-
sible for the corresponding sounds; however, such classifications follow common
practices rather than a standardized taxonomy. A more systematic categorization can
be found in the classic works by Gaver [33,34], who proposed an “ecological” cat-
egorization of everyday sounds (the ecological approach to auditory perception will
be discussed in more detail in Sect. 2.3.2). Gaver derived a tentative map of everyday
sounds, which is shown in Fig. 2.3 and discussed in the remainder of this section.
At the highest level, Gaver’s taxonomy considers three broad classes of sounds:
those involving vibrating solids, liquids, and aerodynamics in sound generation,
respectively. Sounds generated by solid objects have patterns of vibrations structured
by a number of physical attributes: those of the interaction that has produced the
vibration, those of the material of the vibrating objects, and those of the geometry and
configuration of the objects. Sounds involving liquids (e.g., dripping and splashing)
also depend on an initial deformation that is counter-acted by restoring forces in
Fig. 2.3 A taxonomy of everyday sounds that may be present in a virtual environment. Within each class (solids, liquids, and gases), rectangles, rounded rectangles, and ellipses represent basic, patterned, and compound sounds, respectively. Intersections between classes represent hybrid sounds. Figure based on the taxonomy of everyday sounds by Gaver [34, Fig. 7]
the material, but no audible sound is produced by the vibrations of the liquid and
instead the resulting sounds are created by the resonant cavities (bubbles) that form
and oscillate in the liquid. Aerodynamic sounds are caused by the direct modification
of atmospheric pressure differences from some source, such as those created by an
exploding balloon or by the noise of a fan, or even events in which such changes
in pressure transmit energy to objects and set them into vibration (e.g., when wind
passes through a wire).
At the next level, sounds are classified along layers of complexity, defined as
follows. “Basic” sound-producing events are identified for solids, liquids, and gases:
sounds made by vibrating solids may be caused by impacts, scraping, or other inter-
actions; liquid sounds may be caused by discrete drips, or by more continuous splash-
ing, rippling, or pouring events; and aerodynamic sounds may be made by discrete,
sudden changes of pressure (explosions), or by more continuous introductions of
pressure variations (gusts and wind). “Patterned” sounds are situated at a higher level
of complexity, as they are produced through temporal patterning of basic events. As
an example, walking, breaking, bouncing, and so on are all complex events involv-
ing patterns of simpler impacts. Similarly, crumpling or crushing are examples of
patterned deformation sounds. “Compound” sounds occupy the third level of com-
plexity and involve more than one type of basic and patterned events. An example
may be provided by the sound of a door slam, which involves the squeak of scraping
hinges and the impact of the door on its frame, or a complex activity such as writing,
which involves irregular temporal patterns of both impacts and scrapes. Compound
sounds involve mutual constraints on their building components: as an example,
concatenating the creak of a heavy door closing slowly with the slap of a light door
slammed shut would arguably not sound natural.
Finally, Gaver’s taxonomy also considers “hybrid” events, in which two or three
types of material are involved. An example of a hybrid sound involving solids and
liquids is the one produced by raindrops hitting a window glass, which involves
attributes of both liquid and vibrating solid sounds.
A taxonomy such as the one discussed here has at least two very attractive features.
First, it provides a comprehensive framework for classifying any everyday sound
potentially encountered in our world (and thus in a virtual world as well), with a fine
level of detail. Secondly, its hierarchical structure provides a theoretical framework
that can aid not only the sound design process but also the development of sound
design tools. An example of an ecologically inspired software library for procedural
sound design will be discussed in Sect. 2.5.3.
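As a purely illustrative sketch of this second point, the hierarchy of Fig. 2.3 can be encoded as a small data structure that a sound design tool might query. The Python fragment below uses invented names and a deliberately incomplete event list; it is not taken from Gaver's work or from any existing library.

# Hypothetical encoding of the map of everyday sounds of Fig. 2.3: three material
# classes, each with "basic" and "patterned" levels, plus compound and hybrid events.
EVERYDAY_SOUNDS = {
    "solid": {
        "basic": ["impact", "scraping", "rolling", "deformation"],
        "patterned": ["walking", "bouncing", "breaking", "crumpling"],
    },
    "liquid": {
        "basic": ["drip", "splash", "ripple", "pour"],
        "patterned": ["rain", "lapping waves"],
    },
    "aerodynamic": {
        "basic": ["explosion", "gust", "wind"],
        "patterned": ["fire", "combustion engine"],
    },
}
COMPOUND = {"door slam": ["scraping", "impact"], "writing": ["impact", "scraping"]}
HYBRID = {"rain on window": ["liquid", "solid"]}

def lookup(event: str) -> str:
    """Return the class and complexity level of a named sound event."""
    for material, levels in EVERYDAY_SOUNDS.items():
        for level, events in levels.items():
            if event in events:
                return f"{event}: {level} {material} sound"
    if event in COMPOUND:
        return f"{event}: compound sound made of {COMPOUND[event]}"
    if event in HYBRID:
        return f"{event}: hybrid sound involving {HYBRID[event]}"
    return f"{event}: not classified"

if __name__ == "__main__":
    print(lookup("bouncing"))     # patterned solid sound
    print(lookup("door slam"))    # compound sound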
2.3 Perceptual and Cognitive Aspects
In this section, we critically review and discuss some relevant aspects related to
the perception and cognition of sonic interactions and provide links between these
aspects and central concepts of VR, such as the plausibility illusion, the place illusion,
the sense of embodiment, and the sense of agency. Nordahl and Nilsson [57] also
consider how sound production and perception relate to plausibility illusion, place
illusion, and the sense of body ownership, although from a somewhat different angle.
Our main claim is that interactive sound sources in a virtual environment contribute
in particular to the plausibility illusion, the sense of agency, and the sense of body
ownership. In addition, our analysis of perceptual and cognitive aspects provides
requirements and guidelines for the development and the implementation of sound
models.
2.3.1 Latency, Causality, and Multisensory Integration
In any interactive system, latency and its associated jitter have a major perceptual
impact. High latency or jitter may impair the user’s performance or, at least, provide
a frustrating and tiring experience. Perceptually acceptable limits for latency and
jitter in an interactive system should therefore be determined. However, such limits
depend on several factors which are not easily disentangled.
Characterizing latency and jitter in the sound rendering pipeline can be restated as
a problem of perceived synchronization between pairs of events [46], which in turn
may be divided into three categories: (i) an external and an internal temporal pattern
(such as those occurring in a collaborative activity, e.g., music playing, between two
persons in a virtual environment); (ii) pairs of external events (which may or may
not pertain to the same sensory modality, such as pairs of sounds or a visual flash
and a sound); (iii) actions of the user and their effects (e.g., the pressing of a button
and the corresponding feedback sound).
The latter case in particular is tightly connected to the definition of interactive
sound adopted in this chapter. It is inherently a problem of multimodal synchroniza-
tion, as it involves a form of extrinsic (auditory) feedback and a form of intrinsic (tac-
tile, proprioceptive, and kinesthetic) feedback generated by the user’s action [53]. The
complex interaction occurring between these modalities influences their perceived
synchronization (and thus the acceptable latency). High latencies can deteriorate the
quality of the interaction, impair the performance on a given task, and even disrupt the
perceived link of causality between the user’s action and the resulting sonic outcome.
The task at hand also influences the acceptable latency. As an example, it has
been traditionally accepted that music performance is a task requiring extremely low
(≤10 ms) latencies between the player’s actions and the response of a digital musical
instrument [99]. Similarly, it has been shown that even small amounts of jitter can
be detrimental to the perceived quality of the interaction [41]. In this respect, music
provides a good “worst case” and a lower bound for latency in other, non-musical
tasks, where various studies suggest that higher latencies may be acceptable or even imperceptible [43, 93].
The type of interaction must be considered as well. Impulsive interactions (either
musical, such as playing a drum, or non-musical, such as knocking on a door) are
likely to require lower latencies than continuous ones (bowing a violin string, or
accompanying a closing door). As an example, it has been shown that the continuous
interaction involved in playing a theremin allows for relatively high (>30 ms) laten-
cies, despite this being a musical task [54]. Finally, cognitive aspects also play a role:
humans create expectations for the latency between their actions and the resulting
feedback, detect disturbances to such expectations, and compensate for them. A study
on the latency in live musical sound monitoring [48] showed significant discrepan-
cies between different instruments, suggesting that certain players (e.g., pianists) are
more tolerant to latency as they are accustomed to the inherent mechanical latency
of their instrument, while others (e.g., drummers) are less so.
We conclude this section with a hint at the second type of synchronization men-
tioned at the beginning, i.e., that between pairs of external (possibly multimodal)
events. Humans achieve robust perception through both the combination and the
integration of information from multiple sensory modalities: the former strategy
refers to interactions between non-redundant and complementary sensory signals
aimed at disambiguating the sensory estimate, while the latter describes interactions
between redundant signals aimed at reducing the variance in the sensory estimate
and increasing its reliability [28]. The temporal relationships between inputs from
different senses play an important role in multisensory combination and integra-
tion, which can be realized only within a window of synchrony between different
modalities (e.g., auditory and visual, or auditory and haptic feedbacks) where a sin-
gle percept is produced. Many studies [19,83,96] report quantitative results about
“integration windows” between modalities, which can be used as constraints for the
synchronization of the sound simulation pipeline with the visual (and possibly the
haptic) modality. For more details regarding these issues, please refer to Part IV in
this book, and in particular to Ch. 10.
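As a purely illustrative sketch of how such a constraint might be enforced in a rendering loop, the fragment below checks whether an audio event falls inside an assumed audio-visual integration window; the numerical bounds are placeholders, not values taken from the cited studies.

# Assumed window: audio may lead the visuals by up to 30 ms or lag by up to 75 ms.
AV_WINDOW_MS = (-30.0, 75.0)

def within_integration_window(t_visual_ms: float, t_audio_ms: float,
                              window=AV_WINDOW_MS) -> bool:
    """True if the audio-visual offset is likely to fuse into a single percept."""
    offset = t_audio_ms - t_visual_ms      # positive offset = audio lags the visuals
    return window[0] <= offset <= window[1]

if __name__ == "__main__":
    print(within_integration_window(1000.0, 1050.0))   # 50 ms audio lag -> True
    print(within_integration_window(1000.0, 1120.0))   # 120 ms audio lag -> False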
2.3.2 Everyday Listening and the Plausibility Illusion
Human listeners are extremely good at interpreting sounds in terms of the events
that produced them. The patterns of mechanical or aeroacoustic vibrations generated
by sound-producing events depend on (and thus carry information about) contact
forces, duration of contact, time-variations of the interaction, sizes, shapes, materials,
and textures of the involved objects. We are immersed in a landscape of everyday
sounds since the day we are born, and we have learned to extract meaning from this
continuous and omnidirectional flow of information.
Gaver [34] introduced the concept of everyday listening, as opposed to musical
listening. When a listener hears a sound, she might concentrate on attributes like
pitch, loudness, and timbre, or she might notice its masking effect on other sounds.
These are examples of musical listening, meaning that the considered perceptual
dimensions and attributes have to do with the sound itself, and are those used in the
creation of music. On the other hand, the listener might concentrate on the char-
acteristics of the sound source and possibly the surrounding environment. When
hearing an approaching car, she might notice that the engine is powerful, that the
car is approaching quickly from behind, or even that the road is a narrow alley with
echoing walls on each side. This is an example of everyday listening.
The two perceptual processes associated to musical and everyday listening cannot
be completely disentangled and may occur simultaneously. Still, the idea that in
our everyday listening experience the physical characteristics of sound-producing
objects can be linked to the corresponding acoustic features is a powerful one. The
literature of ecological acoustics provides several quantitative results on such links.
The underlying assumption is that the flow of acoustic energy reaching our ears, the
acoustic array, contains specific patterns, or invariants, which the listener exploits
to infer information about the environment and guide her action. These concepts and
terminology originate in the framework of ecological perception, rooted in Gibson’s
works on visual perception in the 1950s [35,55].2
Acoustic invariants associated to sound events may include several attributes of
a vibrating solid, such as its size, shape, and density, as these attributes contribute
differently to characteristics of the resulting sound such as pitch, spectrum, amplitude
envelope, and so on. In patterned sounds (see Sect. 2.2.2), the relevant information
is also carried by the timing of successive events: footstep sounds must occur within
2In this context, the label “ecological” is associated to two main concepts: first, perception is an
achievement of animal-environment systems, not simply animals, or their brains; second, the main
purpose of perception is to guide action, so a theory of perception cannot ignore what animals do.
a range of rates and regularities in order to be perceived as walking; the regularity
in the temporal pattern of a bouncing sound provides information about the shape of
the object (e.g., a sphere versus a cube).
The mapping between physical parameters and acoustic features is in general
many-to-many. A single physical parameter can influence simultaneously many
characteristics of the sound, and different physical parameters influence the same
characteristics in different ways. As an example, changing the size of an object will
scale the sound spectrum, i.e., will change the frequencies of the sound but not their
pattern. On the other hand, changing the object’s shape results in a change in both the
frequencies and their relationships. Acoustic invariants are thus the result of these
complex patterns of change. Surveys of classic studies in ecological acoustics and
acoustic invariants have been provided in previous works [5,36].
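A minimal numerical illustration of this point is given below, using an idealized rectangular membrane rather than any model from the cited studies (the wave speed value is arbitrary): scaling the size leaves the frequency ratios untouched, while changing the aspect ratio, a crude proxy for shape, alters the pattern itself.

import math

def modal_frequencies(Lx, Ly, modes=((1, 1), (2, 1), (1, 2), (2, 2)), c=340.0):
    """Modal frequencies of an ideal rectangular membrane: f ~ sqrt((m/Lx)^2 + (n/Ly)^2)."""
    return [0.5 * c * math.sqrt((m / Lx) ** 2 + (n / Ly) ** 2) for m, n in modes]

def ratios(freqs):
    return [round(f / freqs[0], 3) for f in freqs]

f_ref = modal_frequencies(0.4, 0.3)   # reference object
f_big = modal_frequencies(0.8, 0.6)   # same shape, doubled size: spectrum only scaled
f_new = modal_frequencies(0.4, 0.6)   # different aspect ratio: different "shape"

print(ratios(f_ref), ratios(f_big))   # identical ratios, only absolute frequencies change
print(ratios(f_new))                  # the pattern of partials itself changes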
The above discussion provides a solid theoretical framework to reason on the
importance of ecologically valid acoustic information in eliciting the qualia of pres-
ence [72] in an immersive VR system. Among the many definitions proposed in the
literature, we follow Skarbez et al. [76] in defining presence broadly as “the per-
ceived realness of a mediated or virtual experience”. Slater et al. [77] introduced the
concepts of plausibility illusion and place illusion, to refer to two distinct subjective
internal feelings, both of which contribute to eliciting the sense of presence in a
subject experiencing an immersive VR scenario. This conceptual model of presence
is depicted in Fig. 2.4.3
In this section we are particularly interested in the plausibility illusion, i.e., the
illusion that the scenario being depicted is actually occurring (we will discuss the
place illusion in Sect. 2.3.3 next). This is determined by the overall credibility of a
virtual environment in comparison with subjective expectations. Slater argued that an
important component of the plausibility illusion is “for the virtual reality to provide
correlations between external events not directly caused by the participant and his/her
own sensations” [77]. Skarbez et al. [76] proposed the construct of coherence, an
objective characteristic of a virtual scenario that gives rise to the plausibility illusion
(see Fig. 2.4, right) and depends on the internal logical and behavioral consistency of
the virtual experience, with respect to prior knowledge. Building on these definitions,
we argue that sound will contribute to the plausibility illusion of a virtual scenario as
long as coherence is ensured for the auditory modality, i.e., as long as sound carries
relevant ecological information expected by the user’s everyday listening experience.
It should be noted that coherence makes no assumption about the fidelity of a virtual environment to the real world. Consequently, the plausibility illusion “does
not require physical realism” [77]: several studies show that virtual characters or
objects displayed with low visual fidelity in the virtual environment do not disrupt
the illusion. With regard to the auditory domain, this observation may be related to
the concept of cartoon sounds [69], i.e., simplified descriptions of sounding phe-
nomena with exaggerated features. We argue that cartoon sounds do not disrupt the
3Skarbez et al. [76] consider a third component, the social presence illusion, which we do not
address here.
Fig. 2.4 A conceptual model of presence: cloud boxes represent subjective internal feelings (qualia), circles represent functions affected by individual differences, and rounded rectangles represent objective characteristics of the virtual experience. Figure based on Skarbez [76, Fig. 2]
plausibility illusion as long as they still carry relevant ecological information. This
is fundamentally the same principle exploited in the empirical science of Foley Art
for creating ecologically plausible sound effects [2].
2.3.3 Active Perception, Place Illusion, Embodiment
The “enactive” approach to experience posits that it is not possible to schematically dissociate perception and action and that every kind of perception is intrinsically
active and thoughtful. One of the most influential contributions in this direction is due
to Varela et al. [94]. In the authors’ conception, experience does not occur inside the
perceiver, but rather it is enacted by the perceiver while exploring the environment.
In this view, the subject of mental states is the embodied, environmentally situated
perceiver. The term “embodied” highlights two main points: (i) perception depends
upon the kinds of experience that are generated from specific motor capabilities,
and (ii) these capabilities are themselves embedded in a biological, psychological,
and cultural context. Sensory and motor processes are fundamentally inseparable,
and perception consists in exercising an exploratory skill. As an example [58], the
sensation of softness experienced when holding a sponge consists in being aware
that one can exercise certain skills: one can press the sponge, and it will yield under
the pressure. The experience of the softness of the sponge is characterized by a
variety of such possible patterns of interaction. Sensorimotor dependencies, or con-
tingencies, are the laws that describe these interactions. When a perceiver knows that
he is exercising the sensorimotor contingencies associated with softness, then he is
experiencing the sensation of softness.
Embodied theories of perception provide the ground for discussing further central
concepts for VR, such as immersion, place illusion, sense of embodiment, and their
relation to interactive sound. As depicted in Fig. 2.4 (left), immersion is an objective
property of a VR system. Research has concentrated largely on characteristics such
as latency, rendering frame rate, and tracking [22]. However, immersive systems can
be also characterized in relation to the supported sensorimotor contingencies, which
in turn define a set of valid actions that are perceptually meaningful (for instance,
with a head-mounted display and head-tracking, it is possible to turn your head or
bend forward producing changes in the rendered visual images). When a system
supports sensorimotor contingencies that approximate those of physical reality, it
can give rise to the place illusion, a specific subjective internal feeling which is the
illusion of being located inside the rendered virtual environment, of “being there”
[77]. Whereas the plausibility illusion is based on what a subject perceives in the
virtual environment, the place illusion is based on how she is able to perceive it.
The great majority of studies addressing explicitly the effect of sound on the
place illusion are concerned with spatial attributes: this is not entirely surprising,
since many of these attributes are perceived by exercising specific motor actions
(e.g., rotating the head to perceive the distance or the direction of a sound source
or a reflecting surface). In this respect, directivity is possibly the only sound source
attribute contributing to the place illusion, while other ecological attributes are more
likely to contribute to the plausibility illusion only, as discussed in Sect. 2.3.2. In
accordance with this picture, over the years, various authors [11,38,60] found that
spatialized sound positively influences presence as being there when compared to no-
sound or non-spatialized sound conditions, but does not affect the perceived realism
of the environment. A comprehensive survey up to 2010 is provided by Larsson [47].
The sense of embodiment refers to yet another subjective internal feeling. Specif-
ically, the sense of embodiment in an immersive virtual environment is concerned
with the relationship between one’s self and one’s body, whereas the sense of pres-
ence refers to the relationship between one’s self and the environment (and may
occur even without the sensation of having a body). Kilteni et al. [45] provide a
working definition of a sense of embodiment toward an artificial body, as the sense
that emerges when that artificial body’s properties are processed as if they were the
properties of one’s own biological body. Further, the authors associate it to three main
components: (i) the sense of self-location, (ii) of body ownership, and (iii) of agency,
the latter being investigated as an independent construct by other researchers [17].
The sense of self-location refers to one’s spatial experience of being inside a body,
rather than being inside a world (with or without a body), and is highly determined by
the visuospatial perspective, proprioception, and vestibular signals, as well as tactile
sensations at the border between our body and the environment. The sense of body
ownership refers to one’s self-attribution of an artificial body perceived as the source
of the experienced sensations and emerges as a complex combination of afferent
multisensory information and cognitive processes that may modulate the processing
of sensory stimuli, as demonstrated by the well-known rubber hand illusion [13].
The sense of agency refers to the sense of having global motor control in relation
to one’s own body and has been proposed to result from a comparison between the
predicted and the actual sensory consequences of one’s actions [24]: when the two
match by, for example, the presence of synchronous visuomotor correlations under
active movement, one feels oneself to be the agent of those actions.
The above discussion suggests that interactive sounds occurring directly in reac-
tion to the avatar’s gestures in a virtual scenario, and coherently with the available
sensorimotor contingencies, can positively affect the sense of agency in particular.
One relevant example is provided by footsteps: several studies have addressed the
issue of generating footstep sounds [14, 85, 95], although without assessing their
specific relevance to the sense of agency. Other studies have shown that interactively
generated sound can support haptic sensations, as in the case of impact sounds rein-
forcing or modulating the perceived hardness of an impulsive contact [6], or friction
sounds affecting the perceived effort in dragging an object [4] (refer to Chap. 12
for other audio-haptic case studies). Yet, no attempt was made in these studies to
specifically address the issue of agency.
Even less research seems to have been conducted on the effects of interactive
sound on the sense of body ownership. Footsteps provide a relevant example also
in this case, as the sound of steps can be related to the perceived weight of one’s
own body [85] or that of an avatar [74]. Sikström et al. [73] evaluated the role of
self-produced sounds in participants’ sensation of ownership of virtual wings in an
immersive scenario. A related issue is that of the sound of one’s own voice in a virtual
environment [61].
2.4 Events Versus Processes
Having discussed the perceptual and cognitive aspects involved in interactive sound
generation, we now jump back to the pipeline of Fig. 2.1 and look specifically at the
“source modeling” box.
When creating sound sources in a virtual environment, approaches based on sam-
ple playback are still the most common ones [12], taking advantage of sound design
techniques that have been refined through a long history, and being able to yield
perfect realism, “at least for single sounds triggered only once” [21]. From a com-
pletely different perspective, procedural approaches defer the generation of sound
signals until runtime, when information on the sound-producing event is available and
can be used to yield more interactive sonic results. This section discusses these two
contrasting approaches.
2.4.1 Event-Driven Approaches
Approaches based on sample playback follow an event-driven logic, in which a
specific sound exists as a waveform stored in a file or a table in memory and is
Fig. 2.5 Event-driven logic for VR sound design using samples and audio middleware software
bound to some event occurring in the virtual world. Borrowing an example from
Farnell [31]: if (moves(gate)) play(scrape.wav).
One immediate consequence of this is that the playback and the post-processing
of samples are dissociated from the underlying physics engine and the graphical
rendering engine. In the case of a sound played back once, the length of the sound is
predetermined and thus any timing relationship between auditory and visual elements
must also be predefined. In the case of a looped sound, the endpoint must be explicitly
given, e.g., as a response to a subsequent event. More generally, the playback of
sound is controlled by a finite and small set of states (such as in the case of an elevator
that can be starting, moving, stopping, or stopped). Correspondingly, any event is
bound to a sound “asset”, or to some post-processing of that asset.
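A minimal sketch of this state-bound logic follows, with invented asset names and stub playback callbacks standing in for a game engine or middleware API; it is an illustration of the principle, not an excerpt from any existing tool.

# Hypothetical elevator: each state is bound to a pre-recorded sample (or to silence).
ELEVATOR_SOUNDS = {
    "starting": "elevator_start.wav",
    "moving":   "elevator_loop.wav",   # looped while in this state
    "stopping": "elevator_stop.wav",
    "stopped":  None,                  # silence
}

def on_state_change(new_state: str, play, stop):
    """Trigger the asset bound to the new state; timing is fixed by the file length."""
    stop()                             # end whatever asset is currently playing
    asset = ELEVATOR_SOUNDS[new_state]
    if asset is not None:
        play(asset, loop=(new_state == "moving"))

if __name__ == "__main__":
    # Stub callbacks; a real engine or middleware would provide the actual playback API.
    play = lambda asset, loop=False: print(f"play {asset} (loop={loop})")
    stop = lambda: print("stop current asset")
    for state in ["starting", "moving", "stopping", "stopped"]:
        on_state_change(state, play, stop)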
Current practices of sound design for VR are deeply and firmly rooted in such
event-driven logic, as depicted in Fig. 2.5. One clear example of this is provided by
“audio middleware” [12]: tools that facilitate the work of the
sound designer by reducing programming time and testing the sound design in real
time along with the game engine. The most commonly adopted middleware solutions,
such as FMOD Studio (Firelight Technologies)4and Wwise (Audiokinetic),5largely
follow the traditional paradigm of DAWs (Digital Audio Workstations) and include
GUIs for adding, controlling, and processing samples; linking them to objects, areas,
and events of the virtual environment; and imposing rules for triggering and playback.
One of the main acknowledged limitations of samples is that they are static, and
they are just single, atomic instances of events. The repetitiveness involved in multiple
playbacks of the same sounds has the potential to disrupt many of the perceptual and
cognitive effects discussed in Sect. 2.3, and even to lead to fatigue. Partial remedies
to this problem include the use of multiple samples for the same event, as well as the
4https://www.fmod.com/.
5https://www.audiokinetic.com/products/wwise/.
use of various post-processing operations, the most common being modifications to
pitch, time, dynamics, as well as sound layering and looping [75].
Well-established time-stretching and pitch-shifting algorithms exist; however, the
quality of the processing is in general guaranteed only for relatively small shifting and
stretching factors. Concerning dynamics, typical approaches are based on blending,
cross-fading, and mixing of different samples, similarly to a musical sampler (and
with similar limitations as well). Layering and looping are especially useful for the
construction of ambiences: multiple sounds can be individually looped and played
concurrently to create complex and layered ambiences. Repetitiveness can be reduced
by assigning different lengths to different loops, and immersion can be enhanced by
rendering individual layers at different virtual spatial locations. All this requires
manual operations by the sound designer, such as splitting, cross-fading, and so on.
Further countermeasures to repetition and listener fatigue include the use of tech-
niques based on randomization. These can be applied to many aspects of sound,
including, but not limited to (i) pitch and amplitude variations, (ii) sample selection,
(iii) sample concatenation, (iv) looping, and (v) location of sound sources. As an
example, randomized sample selection amounts to randomly choosing among alternative samples associated with the same event, e.g., a collision: a different sample
is played back at each occurrence of the event, mimicking the differences occurring
due to slightly different contact points and velocities. In randomized concatenation,
different samples are concatenated to build a composite sound in response to a repet-
itive sequence of events, such as in the case of footsteps, weapon sounds, and so on.
Triggering different points with different probabilities can also be used to reduce the
repetitiveness of looped layers in ambience sounds. The audio middleware solutions
mentioned above typically implement several of these techniques.
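As a sketch of two of these techniques, randomized sample selection with avoidance of immediate repetition plus small random pitch and gain offsets, the following Python fragment uses invented file names and parameter ranges; middleware tools expose equivalent controls through their graphical interfaces.

import random

FOOTSTEP_SAMPLES = ["footstep_01.wav", "footstep_02.wav",
                    "footstep_03.wav", "footstep_04.wav"]
_last_sample = None

def trigger_footstep(play):
    """Pick a sample (avoiding immediate repetition) and randomize its playback."""
    global _last_sample
    candidates = [s for s in FOOTSTEP_SAMPLES if s != _last_sample]
    sample = random.choice(candidates)
    _last_sample = sample
    pitch_semitones = random.uniform(-0.5, 0.5)   # slight detuning at each step
    gain_db = random.uniform(-2.0, 0.0)           # slight level variation
    play(sample, pitch_semitones, gain_db)

if __name__ == "__main__":
    play = lambda s, p, g: print(f"{s}  pitch {p:+.2f} st  gain {g:+.1f} dB")
    for _ in range(4):
        trigger_footstep(play)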
Randomization techniques hint at another issue with samples, which is the need
for very large amounts of data. Putting together a large sample library is a slow and
labor-intensive process. Moreover, data need to be stored in memory, possibly in
secondary storage, from which they then have to be prefetched before playback.
2.4.2 Procedural Approaches
Techniques based on the randomization of several sample-processing parameters,
such as those discussed above, are sometimes loosely referred to as procedural in the
sound design practice [75, Chap. 2]. Here, we favor a stricter definition. In Farnell’s
words [30], procedural audio is “sound as a process, rather than sound as data”.
This definition shifts the focus onto the creation of audio assets, as opposed to the
manipulation of existing ones.
Procedural audio is thus synthetic sound, is real time, and most importantly is
created according to a set of programmatic rules and live input. This implies that
procedurally generated sound is synthesized at runtime, when all the needed input
Fig. 2.6 Procedural sound design: a model building and b method analysis stages (figures loosely based on Farnell [30, Figs. 16.4–5])
and contextual information are available, whereas in a sample-based approach, most
of the work is performed offline prior to execution, implying that “many decisions
are made in advance and cast in stone” [31].
The stages involved in the process of procedural sound design may be loosely
based on those of software life-cycle, including (i) requirements analysis, (ii) research
and acquisition, (iii) model building, (iv) method analysis, (v) implementation, (vi)
integration, (vii) test and iteration, and (viii) maintenance. Figure 2.6 provides a
graphical summary of the two central stages, i.e., model building and method analysis.
Building a model (Fig. 2.6a) provides a simplification of the properties and behav-
iors of a real object, which starts from the analysis of sound data (including time-
and/or frequency-domain analysis, extraction of relevant audio features, etc.), as well
as a physical analysis of the involved sound-generating mechanisms, and results into
a set of parametric controls and behaviors. The hierarchy of everyday sounds depicted
in Fig. 2.3 provides a useful reference framework: the model at hand can be positioned
inside this hierarchy. Moreover, following the discussion on everyday listening of
Sect. 2.3.2, the choice of the model parametrization can be informed by the knowl-
edge of relevant acoustic invariants carrying information about sound-generating
objects and events.
The method analysis stage is where the most appropriate sound synthesis and
processing techniques are chosen, starting from a palette of available ones, and based
on the model at hand. Figure 2.6b shows a set of commonly employed sound synthesis
techniques (in Sect. 2.4.3, we will explore physics-based techniques in particular).
As a result of this stage, an implementation plan is produced that includes a set of
techniques and corresponding low-level synthesis parameters, as well as the involved
audio streams.
Based on this discussion, we can identify two main qualities of procedural
approaches with respect to sample playback. The first one is their intrinsic adaptabil-
ity and interactivity (according to the definitions given in Sect. 2.1), which derive
from the deferral of sound generation to runtime, based on programmatic rules and
user input, and yield ever-changing sonic output in response to real-time control.
The second one is flexibility, where a single procedural model can be parametrized
to produce a variety of sound events within a given class of sounds: this contrasts
with sample-based, event-driven approaches, where ever-increasing amounts of data
and assets are needed in order to cope with the needs of complex virtual worlds.
2.4.3 Physics-Based Methods
Looking back at Fig. 2.6b, one of the available paints in the palette of sound synthesis
techniques is that of physics-based methods.
The boundaries between what can be considered physical (or physics-based, or
physics-informed) sound synthesis are somewhat blurry in the scientific literature.
Here, we adopt the definition given by Smith [78] and refer to synthesis techniques
where “ […] there is an explicit representation of the relevant physical state of the
sound source. For example, a string physical model must offer the possibility of
exciting the string at any point along its length. […] All we need is Newton.” The
last claim refers to the idea that physical modeling always starts with a quantitative
description of the sound sources based on Newtonian mechanics. Such description
may be approximate and simplified to various extents, but the above definition pro-
vides an unambiguous—albeit broad—characterization in terms of physical state
access. Resorting to a simple (yet historically relevant [68]) example, we can say
that additive synthesis of bell sounds is not physics-based, as additive sinusoidal
partials only describe the time-frequency characteristics of the sound signal without
any reference to the physical state of the bell. On the other hand, modal synthesis [1]
of the same bell, with modal oscillators tuned to the sound partials, is only apparently
a similar approach: a linear combination of the modes can provide the displacement
and the velocity at any point of the bell, and each modal shape defines to what extent
an external force applied at a given point affects the corresponding mode.
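As an illustration of what physical state access buys in practice, the following Python sketch implements a minimal modal synthesizer with invented modal data (frequencies, decay times, and mode shapes are placeholders, not a calibrated bell model): striking different points excites the same modes with different weights.

import numpy as np

fs = 44100
freqs  = np.array([220.0, 440.0, 587.0, 880.0])   # modal frequencies (Hz), assumed
decays = np.array([1.2, 0.9, 0.6, 0.4])           # decay times (s), assumed
shapes = np.array([[1.0, 0.8, 0.3, 0.1],          # mode shapes sampled at two possible
                   [0.2, 0.9, 1.0, 0.7]])         # excitation points

def strike(point, force=1.0, duration=2.0):
    """Impact response at a given excitation point: sum of decaying modal oscillators."""
    t = np.arange(int(duration * fs)) / fs
    gains = force * shapes[point]                 # coupling of the force to each mode
    y = np.sum(gains[:, None] * np.exp(-t / decays[:, None])
               * np.sin(2 * np.pi * freqs[:, None] * t), axis=0)
    return y / np.max(np.abs(y))                  # normalized output signal

if __name__ == "__main__":
    y0, y1 = strike(point=0), strike(point=1)     # same modes, different excitation points
    print(y0.shape, y1.shape)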
The history of physics-based synthesis is rooted in studies on the acoustics of the
vocal apparatus [44] and of musical instruments [39,40], where numerical models
were initially used for simulation rather than synthesis purposes. Current techniques
are based on several alternative formulations and methods, including ordinary or par-
tial differential equations, equivalent circuit representations, modal representations,
finite-difference and finite-element schemes, and so on [78]. Comprehensive surveys
of physical modeling approaches have been published [79,89]. Although these deal
with musical sound synthesis mostly, much of what has been learned in that domain
can be applied to the physical modeling of any sounding object.
Although physics-based synthesis is sometimes made synonymous with proce-
dural audio, Fig. 2.6b provides a clear picture of the relation between the two. In this
perspective, “procedural audio is more than physical modeling,” [31] and the latter
can be seen as one of the tools at the disposal of the sound designer to reduce a sound
to its behavioral realization. Combining physics-based approaches with knowledge
of auditory perception and cognition often results in procedural models in which the
physical description has been drastically simplified while retaining the ecological
validity of sounds and the realism of the interactions, thus preserving the plausibility
illusion of the resulting sonic world and the sense of agency of the subject (see related
discussions in Sects. 2.3.2 and 2.3.3).
2.4.4 Computational Costs
Event-driven and procedural approaches must be analyzed also in terms of the
involved computational requirements. In case of insufficient resources, excessive
computational costs may introduce artifacts in the rendered sound or, alternatively, may require increasing the overall latency of the rendering up to a point where the
perception of causality and multisensory integration are disrupted (see Sect. 2.3.1).
With reference to Fig. 2.1, it can be stated that one main computational bottleneck
in the sound simulation rendering pipeline [51] is the “per sound source” cost. This
relates in particular to the sound propagation stage (see Chap. 3), as reflections,
scattering, occlusions, Doppler effects, and so on must be computed for each sound
source involved in the simulation. But it also includes the source modeling stage,
with particular reference to the generation of the sound source signals.
Sample playback has a fixed cost, irrespective of the sound being played. More-
over, the cost of playback is very small. However, samples must be loaded in memory
before being played. As a consequence, when a sound is triggered, the playback may
involve a prefetch phase where a soundbank is loaded from the secondary memory.
Moreover, some management of polyphony must be set in place in order to pri-
oritize the playback in case of several simultaneously active sounds. This can use
policies similar to those employed in music synthesizers: typically, sounds falling
below a certain amplitude threshold are dropped, leaving place for other sounds. The
underlying assumption is that louder sounds mask softer ones, so that dropping the
latter has no or minimal perceptual consequences. Although modern architectures
allow for the simultaneous playback of hundreds of audio assets, generating complex
soundscapes may exceed the number of available channels.
On the other hand, procedural sound has variable costs, which depend on the
complexity of the corresponding model and on the employed methods. This is par-
ticularly evident in the case of physics-based techniques: for large-scale, brute-force
approaches, like higher dimensional finite-element or finite-difference methods, real
time is still hard to achieve. On the other hand, techniques like modal synthesis can
be implemented very efficiently, albeit at the cost of reduced flexibility of the models
(e.g., interaction with sounding objects limited to single input-output), which in turn
can have a detrimental effect on the plausibility illusion. Some non-physical methods
are very cheap in terms of computational requirements, as in the case of subtractive
synthesis for generating wind or fire sounds. Section 2.5.1 provides several examples
of procedural methods for various classes of everyday sounds.
Although it is generally true that sample-based methods outperform procedural audio for small numbers of sounds, it has been noted [30] that this is not necessarily the case in the limit of larger numbers: whereas the fixed cost of sample playback results in a computational complexity that grows linearly with the number of rendered sources, the availability of very cheap procedural models means that, for high numbers of sources, the situation may reverse and procedural sound starts to outperform sample-based methods.
2.5 Procedural and Physics-Based Approaches in VR Audio
Given these premises, what is the current development of procedural and physics-
based approaches in audio for VR? In this section, we show that, despite a substantial
amount of research, these approaches are still struggling to gain popularity in real-
world products and practices.
2.5.1 Methods
Far from providing a comprehensive survey of previous literature in the field, which
would go well beyond the scope of this chapter, this section aims at assessing to what
extent the taxonomy of everyday sounds provided in Fig. 2.3 has been covered by
existing procedural approaches. This exercise also serves as a testbed to verify the
generality of that taxonomy. For a recent and broad survey, see Liu and Manocha [51].
Solid sounds are by far the most investigated category. For basic models, modal
synthesis [1] is the dominant approach. There are several works investigating the
use of modal methods for the procedural generation of contact sounds between solid
objects, including the optimization of modal synthesis through specialized numer-
ical schemes and/or perceptual criteria, as in the work by Raghuvanshi et al. [63].
Procedural models of surface textures have been proposed by several scholars [66,
91] and applied to scraping and rolling sounds [64]. Basic interaction forces (impact
and sliding friction) can be modeled with a variety of approaches that range from
qualitative approximations of temporal profiles of impulsive force magnitudes [92]
to the physical simulation of stick-slip phenomena in friction forces [7].
At the next level of complexity, models of patterned solid sounds have also been
widely studied. Stochastic models of crumpling phenomena have been proposed,
with applications to cloth sound synthesis [3], crumpling paper sounds, or sounds
produced by deformations of aggregate materials, such as sand, snow, or gravel [15].
The latter have also been used in the context of walking interactions [81] (see also
Sect. 2.3.3) to simulate the sound of a footstep onto aggregate grounds. Breaking
sounds have been modeled especially with the purpose of synchronizing animations
of brittle fractures produced by a physics engine [59,100].
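A minimal sketch of this idea, treating a patterned sound as a stochastic sequence of basic micro-impacts in the spirit of the crumpling models cited above, is given below; all parameter values are invented for illustration.

import numpy as np

def crumple(duration=2.0, fs=44100, rate=80.0, seed=1):
    """Sum of randomly timed, exponentially decaying noise bursts (micro-impacts)."""
    rng = np.random.default_rng(seed)
    n = int(duration * fs)
    out = np.zeros(n)
    t = 0.0
    while True:
        t += rng.exponential(1.0 / rate)          # Poisson-like inter-event times
        start = int(t * fs)
        if start >= n:
            break
        length = int(0.01 * fs)                   # 10 ms micro-impact
        env = np.exp(-np.arange(length) / (0.002 * fs))
        burst = rng.uniform(0.1, 1.0) * env * rng.standard_normal(length)
        end = min(start + length, n)
        out[start:end] += burst[: end - start]
    return out / (np.max(np.abs(out)) + 1e-9)

if __name__ == "__main__":
    print(crumple(duration=0.5).shape)            # (22050,)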
The category of aerodynamic sounds is less studied. Within the basic level of
complexity, the sound produced by wind includes those resulting from interaction
with large static obstructions, aeolian tones, and cavity tones: these have been pro-
cedurally modeled with techniques ranging from computationally intensive fluid-
dynamics simulations [26] to simple (yet efficient and effective) subtractive schemes
using noisy sources and filters [30]. These can be straightforwardly employed to
construct patterned and compound sonic events, including windy scenes, swinging
objects, and so on [71]. Other basic aeroacoustic events include turbulences, most
notably explosions, which are a key component of more complex sounds such as gun-
shots [37] and fire [18]. Yet another relevant patterned sonic event is that produced
by combustion engines [10].
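As a minimal example of the subtractive family mentioned above (an illustrative sketch, not Farnell's actual patch, with invented parameter ranges), a wind-like texture can be obtained by passing white noise through a slowly modulated one-pole low-pass filter and gain:

import numpy as np

def wind(duration=5.0, fs=44100, seed=0):
    """Filtered-noise wind sketch: a slow random 'gust' control drives brightness and level."""
    rng = np.random.default_rng(seed)
    n = int(duration * fs)
    noise = rng.standard_normal(n)

    ctrl = rng.standard_normal(n)                 # slow control signal ("gustiness")
    for _ in range(4):                            # repeated box-car smoothing
        ctrl = np.convolve(ctrl, np.ones(2048) / 2048, mode="same")
    ctrl = (ctrl - ctrl.min()) / (ctrl.max() - ctrl.min() + 1e-9)

    out = np.zeros(n)
    y = 0.0
    for i in range(n):                            # time-varying one-pole low-pass
        cutoff = 200.0 + 1500.0 * ctrl[i]         # Hz, assumed range
        a = np.exp(-2.0 * np.pi * cutoff / fs)
        y = (1.0 - a) * noise[i] + a * y
        out[i] = (0.2 + 0.8 * ctrl[i]) * y
    return out / (np.max(np.abs(out)) + 1e-9)

if __name__ == "__main__":
    print(wind(duration=1.0).shape)               # (44100,)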
Liquid sounds appear to be the least addressed category. Basic procedural models
include sounds produced by drops in a liquid [90] or by pouring a liquid [65], whereas
patterned and compound sonic events have more often been simulated using concatenative approaches relying on the output of the graphical procedural simulation [98].
A relevant example of hybrid solid-liquid sounds is that of rain [50].
2.5.2 Optimizations
We have provided in Sect. 2.4.4 a general discussion on computational costs asso-
ciated to procedural approaches, in comparison to sample-based methods. Since the
former typically results in higher “per sound source” costs than the latter, various
studies have proposed strategies for reducing the load of complex procedural audio
scenes in virtual environments.
One attractive feature of procedural sound in terms of computational complexity is
the possibility of dynamically adapting the level of detail (LOD) of the synthesized
audio. The concept of LOD is a long-established one in computer graphics and
encompasses various optimization techniques for decreasing the complexity of 3D
object rendering [52]. The general goal of LOD techniques is to increase the rendering
speed by reducing details while minimizing the perceived degradation of quality.
Most commonly, the LOD is varied as a function of the distance from the camera,
but other metrics can be used, including size, speed of motion, priority, and so on.
Reducing the LOD may be achieved by simplifying the 3D object mesh, or by
using impostors (i.e., replacing mesh-based with image-based rendering), and other
approaches can be used to dynamically control the LOD of landscape rendering,
crowd simulation, and so on.
Similar ideas may be applied to procedural sound, achieving further reductions
of computational costs for complex sound scenes with respect to sample playback.
However, very few studies explored the concept of LOD in the auditory domain,
and there is not even a commonly accepted definition in the related literature: some
scholars have coined the term Sound Level Of Detail (SLOD) [70], while others use
Level Of Audio Detail (LOAD) [27], both generically referring to varying the sound
resolution according to the required perceived precision. Here, we stick to the latter
term (LOAD), since it seems to be more frequently adopted in the recent literature.
Fig. 2.7 Example of dynamic LOAD based on the radial distance from the listener, where levels
of detail are associated with three overlapping proximity profiles. Figure partly based on Schwarz
et al. [70, Fig. 3]
Strategies for dynamic LOAD can be partly derived from graphics. Simple
approaches amount to fading out and turning off distant sounds based on radial distance
or zoning. Depending on their distance, sound sources may also be clustered or
activated according to some predefined behavior. Techniques based on impostors can be
used as well: as an example, when rendering the sound of a crowd, individual sounds
emitted by several characters can be replaced by a global sample-based ambience
sound. However, one should be aware of the differences between visual and audi-
tory perception and exploit the peculiarities of the latter to develop more advanced
strategies for dynamic LOAD. Figure 2.7 depicts an example of a dynamic LOAD
strategy based on radial distance, in which levels of detail are associated with three
overlapping proximity profiles around the listener (foreground, middle ground, and
background): sounds in the foreground are rendered individually through procedural
approaches; those that fall into the middle ground can be rendered through some
simplifying approaches (clustering, grouping, and statistical behaviors); and finally,
sounds in the background may be substituted by audio impostors such as audio files.
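A minimal sketch of such a proximity-based policy is given below: each source receives smoothly overlapping foreground, middle-ground, and background weights that a renderer could use to crossfade between individual procedural synthesis, grouped or statistical rendering, and sample-based impostors. The crossover distances and fade width are hypothetical and would need per-scene tuning.

```python
def load_weights(distance, fg_max=10.0, mg_max=40.0, fade=5.0):
    """Return (foreground, middle_ground, background) weights in [0, 1].

    Sources closer than fg_max are rendered procedurally, those beyond mg_max
    by an audio impostor, with linear crossfades of width `fade` in between.
    """
    def ramp_down(d, edge):      # 1 below edge, 0 above edge + fade
        return min(1.0, max(0.0, (edge + fade - d) / fade))

    fg = ramp_down(distance, fg_max)
    mg = ramp_down(distance, mg_max) - fg
    bg = 1.0 - fg - mg
    return fg, max(0.0, mg), max(0.0, bg)

# Example: a source 12 m away sits in the foreground/middle-ground crossfade.
print(load_weights(12.0))   # e.g. (0.6, 0.4, 0.0)
```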
Pioneering work in this direction was carried out by Fouad et al. [32], although
the authors did not explicitly refer to the concept of LOD. This work proposes a
set of “perceptually based scheduling algorithms” that allow a scheduler to assign
execution time to each sound in the scene while minimizing a perceptually motivated
error metric. In particular, sounds are prioritized depending on the listener’s gaze,
their loudness, and their age. Tsingos and coworkers [56,88] proposed
an approach to reduce the number of (sample-based) sound sources in a complex
scenario, by combining perceptual culling and perceptual clustering. The culling
stage removes perceptually inaudible sources based on a global masking model, while
the clustering stage groups the remaining sound sources into a predefined number of
clusters: as a result, a representative point source is constructed for each cluster and
a set of equivalent source signals is generated. Schwarz et al. [70] proposed a design
with three LOADs based on proximity and smooth transitions between proximity
levels, very much like those depicted in Fig. 2.7: (i) foreground, i.e., individually
driven sound events (e.g., individual raindrops on tree leaves); (ii) middle ground,
i.e., group-driven sound events, at the point where individual events cannot be isolated
and can be replaced by stochastic behaviors; (iii) background, i.e., sound sources
that are further away and can be rendered by audio impostors such as audio files
or dynamic mixing of groups of procedural impostors. More recently, Dall’Avanzi
et al. [23] analyzed the effect of soundscapes with two applied LOADs on players’
immersion. Two groups of participants played two different versions of the
same game, and their immersion was measured through two questionnaires.
However, the results in this case showed no notable difference between the two
groups.
Other researchers proposed or evaluated LOAD techniques specifically tailored
to certain synthesis methods. Raghuvanshi et al. [63] addressed modal synthesis and
investigated various perceptually motivated techniques for improving the efficiency
of the synthesis. These include a “quality scaling” technique that effectively controls
the dynamic LOAD: briefly, in a scene involving many sounding objects, the number
of modes assigned to individual objects scales with each object’s location from foreground
to background, without significant losses in perceived quality. Durr et al. [27] evalu-
ated through subjective tests various procedural models of sound sources with three
applied LOADs. Specifically, three procedural models proposed by Farnell [30] (see
also Sect. 2.5.1) were chosen for investigation: (i) fire sounds employ subtractive syn-
thesis to generate and combine hissing, crackling, and lapping features; (ii) bubble
sounds use a form of additive synthesis with frequency- and amplitude-controlled
sinusoidal components representing single bubbles; (iii) wind sounds are again pro-
duced using subtractive synthesis (amplitude-modulated noise and various filtering
elements to represent different wind effects). A different approach to applying LOAD
was implemented for each model. Correspondingly, listening tests provided different
results for each model in terms of perceived quality at different LOADs.
The reader interested in further discussion about audio quality should also refer
to Chap. 5.
2.5.3 Tools
In spite of all the valuable research results produced so far, there is still a lack of
software tools that assist the sound designer in using procedural approaches.
Designers working with procedural audio use a variety of audio program-
ming environments. Popular choices include (but are not limited to) Pure Data,6
Max/MSP,7 or Csound.8 The first two in particular implement a common, dataflow-
6 https://puredata.info/.
7 https://cycling74.com/.
8 https://csound.com/.
oriented paradigm [62] and use a visual patch language where “the diagram is the
program”: Farnell [31] argues that this paradigm is particularly suited for proce-
dural audio as it has a natural congruence with the abstract model resulting from
the design process. On the other hand, integrating these environments into the most
widespread gaming/VR engines is not straightforward: at the time of writing, some
active open-source projects include libpd [16], a C library that turns Pure Data into an
embeddable audio synthesis library and provides wrappers for a range of languages,
and Cabbage [97], a framework for developing audio plugins in Csound, includ-
ing plugins for the FMOD middleware. Commercial gaming/VR engines typically
provide limited functionalities to support procedural sound design, although some
recent developments may hint at an ongoing change of perspective: as an example,
the Blueprint visual scripting system within the Unreal Engine has been used for
dataflow-oriented procedural audio programming, also using some native synthesis
(subtractive, etc.) capabilities.
All of the tools mentioned above still require working at a low level of abstraction,
implying that the sound designer must have the technical skills needed to deal with
low-level synthesis methods and parameters, and at the same time limiting produc-
tivity. There is a clear need for tools that allow the designer to work at higher levels
of abstraction. One instructive example is provided by the Sound Design Toolkit
(SDT), an open-source software package developed over several years [9,25] which
provides a set of sound models for the interactive generation of several acoustic phe-
nomena. In its current embodiment, SDT is composed of a core C library exposing
an API, plus a set of wrappers for Max and Pure Data, and a related collection of
patches and help files. Interestingly, the collection is based on a hierarchical taxon-
omy of everyday sound events which follows very closely the one depicted in Fig. 2.3
and implements a rich subset of its items. The designer has access to both low-level
parameters (e.g., the modal frequencies of a basic solid resonator) and to high-level
ones (e.g., the initial height of a bouncing object).
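The following sketch illustrates the general idea of such high-level control (it is not the SDT API, only a generic illustration): a single high-level parameter, the initial drop height of a bouncing object, is mapped to a sequence of low-level impact events (times and velocities) that could in turn drive a modal impact model.

```python
import math

def bounce_events(height=1.0, restitution=0.7, g=9.81, min_velocity=0.05):
    """Map a high-level parameter (initial drop height) to low-level impact
    events (time, impact velocity), which could then drive an impact model."""
    events = []
    t = math.sqrt(2.0 * height / g)       # time of the first impact
    v = math.sqrt(2.0 * g * height)       # velocity at the first impact
    while v > min_velocity:
        events.append((t, v))
        v *= restitution                  # energy lost at each bounce
        t += 2.0 * v / g                  # flight time until the next impact
    return events

for when, speed in bounce_events(height=0.5):
    print(f"impact at {when:.3f} s, velocity {speed:.2f} m/s")
```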
Commercial products facilitating the designer’s workflow are also far from abun-
dant: Lesound9 (formerly AudioGaming) sells a set of plugins for FMOD and Wwise
that include procedural simulations of wind, rain, motor, and weather sounds, while
Audiokinetic (the developer of Wwise) offers the SoundSeed plugin series,
which includes procedural generation of wind and whooshing sounds as well as
impact sounds. Nemisindo10 provides a web-based platform for real-time synthe-
sis and manipulation of procedural audio, which stems from the FXive academic
project [8], but no plugin-based integration with VR engines or audio middleware
software is available at the time of writing.
A much-needed facilitating tool for the sound designer is one that automates part
of the design process, allowing in particular for automatic tuning of the parameters of
a procedural model starting from a target (e.g., recorded) sound. This would provide
a means to procedurally recreate a desired sound and, more generally, to ease the
design by providing a starting set of parameter values that can be further edited.
9 https://lesound.io/.
10 https://nemisindo.com.
In the context of modal synthesis, various authors have proposed automatic analy-
sis approaches for determining modal parameters from a target signal (e.g., an impact
sound). In this case, the parametrization of the model is relatively simple: every mode
at a given position is fully characterized by a triplet of scalars representing its fre-
quency, decay coefficient, and gain. This generalizes to an array of gains if multiple
points on the object are considered, or to continuous modal shapes as functions of
spatial coordinates on the object. Ren et al. [67] proposed a method that extracts
perceptually salient features from audio examples, together with a parameter estimation
algorithm that searches for the best material parameters for modal synthesis. Based on this
work, Sterling et al. [82] added a probabilistic model for the damping parameters
in order to reduce the effect of external factors (object support, background noise,
etc.) and non-linearities on the estimate of damping. Tiraboschi et al. [87] also pre-
sented an approach to the automatic estimation of modal parameters based on a target
sound, which employs a spectral modeling algorithm to track energy envelopes of
detected sinusoidal components and then performs linear regression to estimate the
corresponding modal parameters.
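As a simplified illustration of this kind of analysis (not the specific algorithms of [67, 82, 87]), the sketch below picks the strongest spectral peaks of a recorded impact, tracks their short-time magnitude envelopes, and fits a line to each log-envelope to obtain frequency, decay, and gain triplets; the window sizes and number of modes are arbitrary assumptions.

```python
import numpy as np

def estimate_modes(x, fs, n_modes=5, frame=2048, hop=512):
    """Crude modal parameter estimate (frequency, decay, gain) from a recorded
    impact: pick the strongest spectral peaks, track their magnitude over time,
    and fit a line to the log-envelope to obtain an exponential decay rate."""
    # Pick candidate modal frequencies from the spectrum of the early portion.
    spec = np.abs(np.fft.rfft(x[:frame] * np.hanning(frame)))
    bins = np.argsort(spec)[-n_modes:]
    freqs = bins * fs / frame

    # Short-time magnitude envelope of each selected bin.
    n_frames = 1 + (len(x) - frame) // hop
    env = np.zeros((n_modes, n_frames))
    for m in range(n_frames):
        seg = x[m * hop:m * hop + frame] * np.hanning(frame)
        env[:, m] = np.abs(np.fft.rfft(seg))[bins]

    modes = []
    t = np.arange(n_frames) * hop / fs
    for k in range(n_modes):
        log_env = np.log(env[k] + 1e-12)
        slope, intercept = np.polyfit(t, log_env, 1)   # log-linear fit
        modes.append({"freq_hz": float(freqs[k]),
                      "decay_per_s": float(-slope),    # amplitude ~ e^(-decay*t)
                      "gain": float(np.exp(intercept))})
    return modes
```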
While the case of solid objects and modal synthesis is a relatively simple one,
the issue of automatic parameter estimation has been largely disregarded for other
classes of sounds and models.
2.6 Conclusions
Our discussion in this chapter has hopefully shown that procedural approaches
offer extensive possibilities for designing sonic interactions in virtual environments.
And yet, as of today, the number of real-world applications and tools utilizing these
approaches is very limited. In fact, not much has changed since ten or fifteen years
ago, when other researchers observed a similar lack of interest from the industry [12,
29], with the same technical and cultural obstacles to adoption still in place. In a way
recent technological developments have further favored the use of sample-based
approaches: in particular, decreasing costs of RAM and secondary storage, as well
as optimized strategies to manage caching and prefetching of sound assets, have
made it possible to store ever larger amounts of data. This state of affairs mimics
closely what happened in the music industry during the last three decades: physics-
based techniques in particular have been around for a long time, but the higher sound
quality and accuracy of samples are still preferred over the flexibility of physical
models for the emulation of musical instruments.
Perhaps, then, the question is not whether procedural approaches can supplant
sample-based audio, but when, i.e., under what specific circumstances. In this chapter,
we have provided some elements, particularly links to a number of relevant percep-
tual and cognitive aspects, such as the plausibility and place illusions, the sense of
embodiment, and the sense of agency. We argue that procedural audio can compete
with samples in cases where either (i) very large amounts of data are needed to min-
imize repetition and support the plausibility illusion, or (ii) interactivity is needed
beyond an event-driven logic, in order to provide tight synchronization and plausible
variations with user actions, and to support the user’s sense of agency and body ownership.
One example of the first circumstance is provided by wind sounds: good record-
ings of real wind effects are technically difficult to come by and long recordings are
required to create convincing ambiences of windy scenes using looping, while on
the other hand procedurally generated wind sounds achieve high levels of realism.
It is therefore no surprise that the few commercially available tools for procedural
sound all include wind (see Sect. 2.5.3) and have been successfully employed also in
large productions.11 While wind falls in the category of adaptive, rather than interac-
tive sounds, two relevant examples for the second circumstance may be provided by
footsteps and sliding friction (bike braking, hinges squeaking, rubbing, etc.): besides
requiring large amounts of data and randomization to avoid repetition, these sounds
arise in response to complex and continuous motor actions by the user, which cannot
be fully captured by an event-driven logic.
Future research and development should therefore focus on cases where proce-
dural models can compete with samples, looking more deeply into the effects on the
plausibility illusion, sense of agency, and sense of body ownership. From a more tech-
nical perspective, promising directions for future research include the development
of dynamic LOAD techniques, as well as high-level authoring tools and automation.
Acknowledgements This chapter is partly based on the ideas and materials which I developed for
my course “Sound in Interaction”, held at the University of Milano for the MSc degree in Computer
Science.
References
1. Adrien, J.-M. in Representations of Musical Signals (eds De Poli, G., Piccialli, A., Roads, C.)
269-297 (MIT Press, Cambridge, MA, 1991).
2. Ament, V. T.: The Foley grail: The art of performing sound for film, games, and animation
Second edition (CRC Press, New York, 2014).
3. An, S. S., James, D. L., Marschner, S.: Motion-driven Concatenative Synthesis of Cloth
Sounds. ACM Trans. Graphics 31 (July 2012).
4. Avanzini, F., Rocchesso, D., Serafin, S.: Friction sounds for sensory substitution, in Proc. Int.
Conf. Auditory Display (ICAD04) (Sidney, July 2004).
5. Avanzini, F. in Sound to Sense, Sense to Sound. A State of the Art in Sound and Music
Computing (eds Rocchesso, D., Polotti, P.) 345–396 (Logos Verlag, Berlin, 2008).
6. Avanzini, F., Crosato, P. in Haptic and audio interaction design (eds McGookin, D.,
Brewster, S.) 24–35 (Lecture Notes in Computer Science 4129/2006, Springer Verlag,
Berlin/Heidelberg, 2006).
11 As an example, the procedural wind simulator by Lesound has been reportedly used for generating
ambiences in Quentin Tarantino’s Django Unchained, see http://lesound.io/product/audiowind-
pro/.
7. Avanzini, F., Serafin, S., Rocchesso, D.: Interactive simulation of rigid body interaction with
friction-induced sound generation. IEEE Trans. Speech Audio Process. 13, 1073–1081 (2005).
8. Bahadoran, P., Benito, A., Vassallo, T., Reiss, J. D.: FXive: A web platform for procedural
sound synthesis, in Proc. 144 Audio Engin. Soc. Conv. (Milano, 2018).
9. Baldan, S., Delle Monache, S., Rocchesso, D.: The sound design toolkit. SoftwareX 6, 255–
260 (2017).
10. Baldan, S., Lachambre, H., Delle Monache, S., Boussard, P.: Physically informed car engine
sound synthesis for virtual and augmented environments, in Proc. IEEE Int. Workshop on
Sonic Interactions for Virtual Environments (SIVE2015) (Arles, 2015), 21–26.
11. Bormann, K.: Presence and the utility of audio spatialization. Presence: Teleoperators and
Virtual Environment 14, 278–297 (2005).
12. Böttcher, N.: Current problems and future possibilities of procedural audio in computer games.
Journal of Gaming & Virtual Worlds 5, 215–234 (2013).
13. Botvinick, M., Cohen, J.: Rubber hands ’feel’ touch that eyes see. Nature 391, 756–756
(1998).
14. Bresin, R., Papetti, S., Civolani, M., Fontana, F.: Expressive sonification of footstep sounds,
in Proc. Interactive Sonification Workshop (Stockholm, 2010), 51–54.
15. Bresin, R. et al.: Auditory feedback through continuous control of crumpling sound synthesis,
in Proc. Workshop Sonic Interaction Design (CHI2008) (Firenze, 2008), 23–28.
16. Brinkmann, P., Wilcox, D., Kirshboim, T., Eakin, R., Alexander, R.: Libpd: Past, Present, and
Future of Embedding Pure Data, in Proc. Pure Data Convention (New York, 2016).
17. Caspar, E. A., Cleeremans, A., Haggard, P.: The relationship between human agency and
embodiment. Consciousness and cognition 33, 226–236 (2015).
18. Chadwick, J. N., James, D. L.: Animating Fire with Sound. ACM Trans. Graphics 30 (2011).
19. Chen, L., Vroomen, J.: Intersensory binding across space and time: a tutorial review. Attention,
Perception, & Psychophysics 75, 790–811 (2013).
20. Collins, K. in Essays on Sound and Vision (eds Richardson, J., Hawkins, S.) 263–298 (Helsinki
University Press, Helsinki, 2007).
21. Cook, P. R.: Real sound synthesis for interactive applications (CRC Press, 2002).
22. Cummings, J. J., Bailenson, J. N.: How immersive is enough? A meta-analysis of the effect of
immersive technology on user presence. Media Psychology 19, 272–309 (2016).
23. Dall’Avanzi, I., Yee-King, M.: Measuring the impact of level of detail for environmental
soundscapes in digital games, in Proc. 146 Audio Engin. Soc. Conv. (London, 2019).
24. David, N., Newen, A., Vogeley, K.: The “sense of agency” and its underlying cognitive and
neural mechanisms. Consciousness and cognition 17, 523–534 (2008).
25. Delle Monache, S., Polotti, P., Rocchesso, D.: A toolkit for explorations in sonic interaction
design, in Proc. Int. Conf. Audio Mostly (AM2010) (Piteå, 2010), 1–7.
26. Dobashi, Y., Yamamoto, T., Nishita, T.: Real-time Rendering of Aerodynamic Sound using
Sound Textures based on Computational Fluid Dynamics, in Proc. ACM SIGGRAPH 2003
(San Diego, 2003), 732–740.
27. Durr, G., Peixoto, L., Souza, M., Tanoue, R., Reiss, J. D.: Implementation and evaluation of
dynamic level of audio detail, in Proc. 56th AES Int. Conf. Audio for Games (London, 2015).
28. Ernst, M. O., Bülthoff, H. H.: Merging the senses into a robust percept. TRENDS in Cognitive
Sciences 8, 162–169 (2004).
29. Farnell, A.: An introduction to procedural audio and its application in computer games (2007).
URL http://obiwannabe.co.uk/html/papers/proc-audio/proc-audio.pdf. Accessed March 29,
2021.
30. Farnell, A.: Designing sound (MIT Press, 2010).
31. Farnell, A. in Game sound technology and player interaction: Concepts and developments
(ed Grimshaw, M.) 313–339 (Information Science Reference, 2011).
32. Fouad, H., Hahn, J. K., Ballas, J. A.: Perceptually Based Scheduling Algorithms for Real-time
Synthesis of Complex Sonic Environments, in Proc. Int. Conf. Auditory Display (ICAD97)
(Palo Alto, 1997).
33. Gaver, W. W.: How do we hear in the world? Explorations of ecological acoustics. Ecological
Psychology 5, 285–313 (1993).
34. Gaver, W. W.: What in the world do we hear? An ecological approach to auditory event
perception. Ecological Psychology 5, 1–29 (1993).
35. Gibson, J. J.: The ecological approach to visual perception (Lawrence Erlbaum Associates,
Mahwah, NJ, 1986).
36. Giordano, B., Avanzini, F. in Multisensory Softness (ed Luca, M. D.) 49–84 (Springer Verlag,
London, 2014).
37. Hacıhabiboğlu, H. in Game Dynamics: Best Practices in Procedural and Dynamic Game
Content Generation (eds Korn, O., Lee, N.) 47–69 (Springer International Publishing, Cham,
2017).
38. Hendrix, C., Barfield, W.: The Sense of Presence within Auditory Virtual Environments.
Presence: Teleoperators and Virtual Environment 5, 290–301 (1996).
39. Hiller, L., Ruiz, P.: Synthesizing Musical Sounds by Solving the Wave Equation for Vibrating
Objects: Part I. J. Audio Eng. Soc. 19, 462–470 (1971).
40. Hiller, L., Ruiz, P.: Synthesizing Musical Sounds by Solving the Wave Equation for Vibrating
Objects: Part II. J. Audio Eng. Soc. 19, 542–551 (1971).
41. Jack, R. H., Stockman, T., McPherson, A.: Effect of latency on performer interaction and
subjective quality assessment of a digital musical instrument, in Proc. Int. Conf. Audio Mostly
(AM’16) (Norrköping, 2016), 116–123.
42. Jørgensen, K. in Game sound technology and player interaction: Concepts and developments
(ed Grimshaw, M.) 78–97 (Information Science Reference, 2011).
43. Kaaresoja, T., Brewster, S., Lantz, V.: Towards the temporally perfect virtual button: touch-
feedback simultaneity and perceived quality in mobile touchscreen press interactions. ACM
Trans. Applied Perception 11, 1–25 (2014).
44. Kelly, J. L., Lochbaum, C. C.: Speech synthesis, in Proc. 4th Int. Congr. Acoustics (Copen-
hagen, 1962), 1–4.
45. Kilteni, K., Groten, R., Slater, M.: The sense of embodiment in virtual reality. Presence:
Teleoperators and Virtual Environments 21, 373–387 (2012).
46. Lago, N. P., Kon, F.: The quest for low latency, in Proc. Int. Computer Music Conf.
(ICMC2004) (Miami, 2004).
47. Larsson, P., Väljamäe, A., Västfjäll, D., Tajadura-Jiménez, A., Kleiner, M. in The engineering
of mixed reality systems (eds Dubois, E., Gray, P., Nigay, L.) 143–163 (Springer, 2010).
48. Lester, M., Boley, J.: The effects of latency on live sound monitoring, in Proc. 123 Audio
Engin. Soc. Convention (New York, 2007).
49. Liljedahl, M. in Game sound technology and player interaction: Concepts and developments
(ed Grimshaw, M.) 22–43 (Information Science Reference, 2011).
50. Liu, S., Cheng, H., Tong, Y.: Physically-Based Statistical Simulation of Rain Sound. ACM
Trans. Graphics 38 (2019).
51. Liu, S., Manocha, D.: Sound Synthesis, Propagation, and Rendering: A Survey. arXiv preprint.
2020.
52. Luebke, D. et al.: Level of detail for 3D graphics (Morgan Kaufmann, 2003).
53. Magill, R. A., Anderson, D. I.: Motor learning and control: Concepts and applications.
Eleventh edition (McGraw-Hill New York, 2017).
54. Mäki-Patola, T., Hämäläinen, P.: Latency tolerance for gesture controlled continuous sound
instrument without tactile feedback, in Proc. Int. Computer Music Conf. (ICMC2004) (Miami,
2004).
55. Michaels, C. F., Carello, C.: Direct Perception (Prentice-Hall, Englewood Cliffs, NJ, 1981).
56. Moeck, T. et al.: Progressive perceptual audio rendering of complex scenes, in Proc. Symp.
on Interactive 3D Graphics and Games (I3D’07) (Seattle, 2007), 189–196.
57. Nordahl, R., Nilsson, N. C. in The Oxford handbook of interactive audio (eds Collins, K.,
Kapralos, B., Tessler, H.) (Oxford University Press, 2014).
58. O’Regan, J. K., Noë, A.: A sensorimotor account of vision and visual consciousness. Behav-
ioral and Brain Sciences 24, 883–917 (2001).
59. Picard, C., Tsingos, N., Faure, F.: Retargetting Example Sounds to Interactive Physics-Driven
Animations, in Proc. AES Conf. Audio in Games (London, 2009).
60. Poeschl, S., Wall, K., Doering, N.: Integration of spatial sound in immersive virtual environ-
ments an experimental study on effects of spatial sound on presence, in Proc. IEEE Conf.
Virtual Reality (Orlando, 2013), 129–130.
61. Pörschmann, C.: One’s own voice in auditory virtual environments. Acta Acustica un. w.
Acustica 87, 378–388 (2001).
62. Puckette, M.: Max at seventeen. Computer Music J. 26, 31–43 (2002).
63. Raghuvanshi, N., Lin, M. C.: Physically Based Sound Synthesis for Large-Scale Virtual
Environments. IEEE Computer Graphics and Applications 27, 14–18 (2007).
64. Rath, M., Rocchesso, D.: Continuous sonic feedback from a rolling ball. IEEE MultiMedia
12, 60–69 (2005).
65. Rath, M., Fontana, F. in The Sounding Object (eds Rocchesso, D., Fontana, F.) 173–204
(Mondo Estremo, Firenze, 2003).
66. Ren, Z., Yeh, H., Lin, M. C.: Synthesizing contact sounds between textured models, in Proc.
IEEE Conf. Virtual Reality (Waltham, 2010), 139–146.
67. Ren, Z., Yeh, H., Lin, M. C.: Example-guided physically based modal sound synthesis. ACM
Trans. on Graphics 32, 1 (2013).
68. Risset, J.-C., Wessel, D. L. in The psychology of music (ed Deutsch, D.) Second edition,
113–169 (Elsevier, 1999).
69. Rocchesso, D., Bresin, R., Fernstrom, M.: Sounding objects. IEEE MultiMedia 10, 42–52
(2003).
70. Schwarz, D., Cahen, R., Brument, F., Ding, H., Jacquemin, C.: Sound level of detail in interac-
tive audiographic 3D scenes, in Proc. Int. Computer Music Conf. (ICMC2011) (Huddersfield,
2011), 312–315.
71. Selfridge, R., Moffat, D., Reiss, J. D.: Sound synthesis of objects swinging through air using
physical models. Applied Sciences 7, 1177 (2017).
72. Sheridan, T. B., Furness, T. A. (eds.): Premier Issue, Presence: Teleoperators and Virtual
Environment, vol. 1 (1992).
73. Sikström, E., De Götzen, A., Serafin, S.: The role of sound in the sensation of ownership of a
pair of virtual wings in immersive VR, in Proc. Int. Conf. Audio Mostly (AM’14) (Aalborg,
2014), 1–6.
74. Sikström, E., De Götzen, A., Serafin, S.: Self-characteristics and sound in immersive virtual
reality - Estimating avatar weight from footstep sounds, in Proc. IEEE Conf. Virtual Reality
(Arles, 2015), 283–284.
75. Sinclair, J.-L.: Principles of Game Audio and Sound Design: Sound Design and Audio Imple-
mentation for Interactive and Immersive Media (CRC Press, 2020).
76. Skarbez, R., Brooks Jr, F. P., Whitton, M. C.: A survey of presence and related concepts. ACM
Computing Surveys 50, 1–39 (2017).
77. Slater, M.: Place illusion and plausibility can lead to realistic behaviour in immersive virtual
environments. Phil. Trans. R. Soc. B 364, 3549–3557 (2009).
78. Smith, J. O.: Physical Audio Signal Processing. Online book. 2010. URL http://ccrma.
stanford.edu/~jos/pasp/. Accessed March 11, 2021.
79. Smith, J. O.: Virtual acoustic musical instruments: Review and update. J. New Music Res.
33, 283–304 (2004).
80. Sonnenschein, D.: Sound design: The expressive power of music, voice, and sound effects in
cinema (Michael Wiese Productions, 2001).
81. Human Walking in Virtual Environments: Perception, Technology, and Applications (eds
Steinicke, F., Visell, Y., Campos, J., Lecuyer, A.) (Springer Verlag, New York, 2013).
82. Sterling, A., Rewkowski, N., Klatzky, R. L., Lin, M. C.: Audio-Material Reconstruction for
Virtualized Reality Using a Probabilistic Damping Model. IEEE Trans. on Visualization and
Comp. Graphics 25, 1855–1864 (2019).
83. Stevenson, R. A. et al.: Identifying and quantifying multisensory integration: a tutorial review.
Brain Topography 27, 707–730 (2014).
84. Stockburger, A.: The game environment from an auditory perspective, in Proc. Level Up:
Digital Games Research Conference (eds Copier, M., Raessens, J.) (Utrecht, 2003).
85. Tajadura-Jiménez, A. et al.: As light as your footsteps: altering walking sounds to change
perceived body weight, emotional state and gait, in Proc. ACM Conf. on Human Factors in
Computing Systems (Seoul, 2015), 2943–2952.
86. Takala, T., Hahn, J.: Sound Rendering. Computer Graphics 26, 211–220 (1992).
87. Tiraboschi, M., Avanzini, F., Ntalampiras, S.: Spectral Analysis for Modal Parameters Linear
Estimate, in Proc. Int. Conf. Sound and Music Computing (SMC2020) (Torino, 2020), 276–
283.
88. Tsingos, N., Gallo, E., Drettakis, G.: Perceptual audio rendering of complex virtual environ-
ments. ACM Trans. on Graphics (TOG) 23, 249–258 (2004).
89. Välimäki, V., Pakarinen, J., Erkut, C., Karjalainen, M.: Discrete-time modelling of musical
instruments. Rep. Prog. Phys. 69, 1–78 (2006).
90. Van den Doel, K.: Physically based models for liquid sounds. ACM Trans. Applied Perception
2, 534–546 (2005).
91. Van den Doel, K., Kry, P. G., Pai, D. K.: FoleyAutomatic: Physically-based Sound Effects for
Interactive Simulation and Animation, in Proc. ACM SIGGRAPH 2001 (Los Angeles, 2001),
537–544.
92. Van den Doel, K., Pai, D. K. in Audio Anecdotes (ed Greenebaum, K.) (AK Peters, Natick,
MA, 2004).
93. Van Vugt, F. T., Tillmann, B.: Thresholds of auditory-motor coupling measured with a simple
task in musicians and non-musicians: was the sound simultaneous to the key press? PLoS
One 9, e87176 (2014).
94. Varela, F., Thompson, E., Rosch, E.: The Embodied Mind (MIT Press, Cambridge, MA,
1991).
95. Visell, Y. et al.: Sound design and perception in walking interactions. Int. J. Human-Computer
Studies 67, 947–959 (2009).
96. Vroomen, J., Keetels, M.: Perception of intersensory synchrony: a tutorial review. Attention,
Perception, & Psychophysics 72, 871–884 (2010).
97. Walsh, R.: Audio plugin development with cabbage, in Proc. Linux Audio Conf. (Maynooth,
2011), 47–53.
98. Wang, K., Liu, S.: Example-based synthesis for sound of ocean waves caused by bubble
dynamics. Comput. Anim. and Virtual Worlds 29, e1835 (2018).
99. Wessel, D., Wright, M.: Problems and prospects for intimate musical control of computers.
Computer Music J. 26, 11–22 (2002).
100. Zheng, C., James, D. L.: Rigid-body fracture sound with precomputed soundbanks. ACM
Trans. Graphics 29 (2010).
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 3
Interactive and Immersive Auralization
Nikunj Raghuvanshi and Hannes Gamper
Abstract Real-time auralization is essential in virtual reality (VR), gaming, and
architecture to enable an immersive audio-visual experience. The audio rendering
must be congruent with visual feedback and respond with minimal delay to interactive
events and user motion. The wave nature of sound poses critical challenges for
plausible and immersive rendering and leads to enormous computational costs. These
costs have only increased as virtual scenes have progressed away from enclosures
toward complex, city-scale scenes that mix indoor and outdoor areas. However, hard
real-time constraints must be obeyed while supporting numerous dynamic sound
sources, frequently within a tightly limited computational budget. In this chapter, we
provide a general overview of VR auralization systems and approaches that allow
them to meet such stringent requirements. We focus on the mathematical foundation,
perceptual considerations, and application-specific design requirements of practical
systems today, and the future challenges that remain.
3.1 Introduction
Audition and vision are unique among our senses: they perceive propagating waves.
As a result, they bring us detailed information not only of our immediate surroundings
but of the world much beyond as well. Imagine talking to a friend in a cafe, the door is
open, and outside is a bustling city intersection. While touch and smell give a detailed
sense of our immediate surroundings, sight and sound tell us we are conversing with a
friend, surrounded by other people in the cafe, immersed in a city, its sounds streaming
in through the door. Virtual reality ultimately aims to re-create this sense of presence
and immersion in a virtual environment, enabling a vast array of applications for
society, ranging from entertainment to architecture and social interaction without
the constraints of distance.
N. Raghuvanshi (✉) · H. Gamper
Microsoft Research, Redmond, USA
e-mail: nikunjr@microsoft.com
H. Gamper
e-mail: hannes.gamper@microsoft.com
© The Author(s) 2023
M. Geronazzo and S. Serafin (eds.), Sonic Interactions in Virtual Environments,
Human–Computer Interaction Series, https://doi.org/10.1007/978-3-031-04021-4_3
Rendering. To reproduce the audio-visual experience given in the example above,
one requires a dynamic, digital 3D simulation of the world describing how both light
and sound would be radiated, propagated, and perceived by an observer immersed
in the computed virtual fields of light and sound. The world model usually takes
the form of a 3D geometric description composed of triangulated meshes and sur-
face materials. Sources of light and sound are specified with their 3D positions and
radiative properties, including their directivity and the energy emitted within the
perceivable frequency range. Given this information as input, special algorithms
produce dynamic audio-visual signals that are displayed to the user via screens and
speaker arrays or stereoscopic head-mounted displays and near-to-ear speakers or
headphones. This is the overall process of rendering, whose two components are
visualization and auralization (or visual- and audio-rendering).
Rendering has been a central problem in both the graphics and audio communities
for decades. While the initial thrust for graphics came from computer-aided design
applications, within audio, room acoustic auralization of planned auditoria and con-
cert halls was a central driving force. The technical challenge with rendering is that
modeling propagation in complex worlds is immensely compute-intensive. A naïve
implementation of classical physical laws governing optics and acoustics is found
to be many orders of magnitude slower than required (elaborated in Sect. 3.2.1).
Furthermore, the exponential increase in compute power governed by Moore’s law
has begun to stall in the last decade due to fundamental physical limits [97]. These
two facts together mean that modeling propagation quickly enough for practical use
requires research into specialized system architectures and simulation algorithms.
Perception and Interactivity. A common theme in rendering research is that quanti-
tative accuracy as required in engineering applications is not the primary goal. Rather,
perception plays the central role: one must find ways to compute those aspects of
physical phenomena that inform our sensory system. Consequently, initial graphics
research in the 1970s focused on visible-surface determination [54] to convey spa-
tial relations and object silhouettes, while initial room acoustics research focused
on reverberation time [60] to convey presence in a room and indicate its size. With
that foundation, subsequent research has been devoted toward increasing the amount
of detail to reach “perceptually authentic” audio-visual rendering: one that is indis-
tinguishable from an audio-visual capture of a real scene. Research has focused on
the coupled problems of increasing our knowledge of psycho-physics, and designing
fast techniques that leverage this knowledge to reduce computation while providing
the means to test new psycho-physical hypotheses.
The interactivity of virtual reality and games adds an additional dimension of
difficulty. In linear media such as movies, the sequence of events is fixed, and com-
putation times of hours or days for pre-rendered digital content can be acceptable,
with human assistance provided as necessary. However, interactive applications can-
not be pre-rendered in this way, as the user actions are not known in advance. Instead,
the computer must perform real-time rendering: as events unfold based on user input,
the system must model how the scene would look and sound from moment to moment
as the user moves and interacts with the virtual world. It must do so with minimal
latency of about 10–100 ms, depending on the application. Audio introduces the
additional challenge of a hard real-time deadline. While a visual frame rendered
slightly late is not ideal but perhaps acceptable, audio lags may result in silent gaps
in the output. Such signal discontinuities annoy the user and break immersion and
presence. Therefore, auralization systems in VR tend to prioritize computational
efficiency and perceptual plausibility while building toward perceptual authenticity
from that starting point.
Goal. The purpose of this chapter is to present the fundamental concepts and design
principles of modern real-time auralization systems, with an emphasis on recent
developments in virtual reality and gaming applications. We do not aim for an exhaus-
tive treatment of the theory and methods in the field. For such a treatment, we refer
the reader to Vorländer’s treatise on the subject [102].
Organization. We begin by outlining the computational challenges and the result-
ing architectural design choices of real-time auralization systems in Sect. 3.2. This
architecture is then formalized via the Bidirectional Impulse Response (BIR), Head-
Related Transfer Functions (HRTFs), and rendering equation in Sect. 3.3. In Sect. 3.4,
we summarize relevant psycho-acoustic phenomena in complex VR scenes and elab-
orate on how one must balance a believable rendering with real-time constraints
among other system design factors in Sect. 3.5. We then discuss in Sect. 3.6 how the
formalism, perception, and design constraints come together into the deterministic-
statistical decomposition of the BIR, a powerful idea employed by most auralization
systems. Section 3.7 provides a brief overview of the two common approaches to
acoustical simulation: geometric and wave-based methods. In Sect. 3.8, we discuss
some example systems in use today in more depth, to illustrate how they balance the
various constraints informing their design decisions, followed by the conclusion in
Sect. 3.9.
3.2 Architecture of Real-time Auralization Systems
In this section, we discuss the specific physical aspects of sound that make it compu-
tationally difficult to model, which motivates a modular, efficient system architecture.
3.2.1 Computational Cost
To understand the specific modeling concerns of auralization, it helps to juxtapose
with light simulation in games and VR applications. In particular:
•Speed: The propagation speed of sound is low enough that we perceive its various
transient aspects such as initial reflections and reverberation, which carry distinct
perceptible information, while light propagation can be treated as instantaneous;
•Phase: Everyday sounds are often coherent or harmonic signals whose phase
must be treated carefully throughout the auralization pipeline to avoid audible
distortions such as signal discontinuities, whereas natural light sources tend to be
incoherent;
•Wavelength: Audible sound wavelengths are comparable to the size of architec-
tural and human features (cm to m) which makes wave diffraction ubiquitous.
Unlike visuals, audible sound is not limited by line of sight.
Given the unique characteristics of sound propagation outlined above, auralization
must begin with a fundamental treatment of sound as a transient, coherent wave
phenomenon, while lighting can assume a much simpler geometric formulation of ray
propagation for computing a stochastic, steady-state solution [57]. Auralization must
carefully approximate the relevant physical mechanisms underlying the vibration of
objects, propagation in air, and scattering by the listener’s body. All these mechanisms
require modeling highly oscillatory wave fields that must be sufficiently sampled in
space and time, giving rise to the tremendous computational expense of brute-force
simulation.
Assume some physical domain of interest with diameter D, highest frequency of
interest ν_max, and propagation speed c. The smallest propagating wavelength of
interest is c/ν_max. Thus, the total number of degrees of freedom in the space-time
volume of interest is N_dof = (2 D ν_max / c)^4. The factor of two is due to the Nyquist
limit, which enforces two degrees of freedom per oscillation. As an example, for full
audible-bandwidth simulation of sound propagation up to ν_max = 20,000 Hz in a
scene that is D = 100 m across, with c = 340 m/s in air, N_dof = 1.9 × 10^16. For
an update interval of 60 ms to meet latency requirements for interactive listener
head orientation updates [22], one would thus need a computational rate of over 100
PetaFLOPS. By comparison, a typical game or VR application will allocate a single
CPU core for audio with a computational rate in the range of tens of GigaFLOPS,
which is too slow by a factor of at least one million. This gap motivates research in
the area.
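The back-of-the-envelope estimate above is easy to reproduce; the sketch below assumes, as the text implicitly does, roughly one floating-point operation per degree of freedom per update.

```python
# Degrees of freedom for brute-force wave simulation and the implied FLOP rate.
D = 100.0        # scene diameter in m
nu_max = 20e3    # highest frequency of interest in Hz
c = 340.0        # speed of sound in m/s
update = 0.06    # update interval in s (60 ms)

n_dof = (2 * D * nu_max / c) ** 4
flops = n_dof / update   # assuming ~1 operation per DOF per update

print(f"N_dof  ~ {n_dof:.1e}")                  # ~1.9e16
print(f"needed ~ {flops / 1e15:.0f} PetaFLOPS")  # a few hundred PetaFLOPS
```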
3.2.2 Modular Design
Since pioneering work in the 1990s such as DIVA [86,96], most real-time auraliza-
tion systems follow the modular architecture shown in Fig. 3.1. This architecture results
in a flexible implementation and significant reduction of computational complexity,
without substantially impacting simulation accuracy in cases of practical interest.
Rather than simulating the global scene as a single system which might be pro-
hibitively expensive (see Sect. 3.2.1), the problem is divided into three components
in a causal chain without feedback:
•Production: Sound is first produced at the source due to vibration, which, com-
bined with local self-scattering, results in a direction-dependent radiated source
signal;
•Propagation: The radiated sound diffracts, scatters, and reflects in the scene to
result in a direction-dependent sound field at the listener location;
•Spatialization: The sound field is heard by the listener. The spatialization com-
ponent computes transducer signals for playback, taking the listener’s head orien-
tation into account. In the case of using headphones, this implies accounting for
scattering due to the listener’s head and shoulders, as described by the head-related
transfer function (HRTF).
Fig. 3.1 Modular architecture of real-time auralization systems. The propagation of sound emitted
from each source is simulated within the 3D environment to compute a directional sound field
immersing the listener. This field is given to the spatializer component that computes appropriate
transducer signals for headphone or speaker playback
Our focus in this chapter will be on the latter two components; sound production tech-
niques such as physical-modeling synthesis are covered in Chap. 2. Here, we assume
a source modeled as a (monophonic) radiated signal combined with a direction-
dependent radiation pattern.
This separation of the auralization problem into different components is key for
efficient computation. Firstly, the perceptual characteristics of all three components
may be studied separately and then approximated with tailored numerical meth-
ods. Secondly, since the final rendering is composed of these separate models, they
can be flexibly modified at runtime. For instance, a source’s sound and directivity
pattern may be updated, or the listener orientation may change, without expensive
re-computation of global sound propagation. Section 3.3 will formalize this idea.
Limitations. This architecture is not a good fit for cases with strong near-field inter-
action. For instance, if the listener’s head is close to a wall, there can be non-negligible
multiple scattering, so the feedback between propagation and spatialization cannot
be ignored. This can be an important scenario in VR [69]. Similarly, if one plays a
trumpet with its bell very close to a surface, the resonant modes and radiated sound
will be modified, much like placing a mute, which is a case where there is feedback
between all three components outlined above. Thus, numerical simulations for musi-
cal acoustics tend to be quite challenging. The interested reader can consult Bilbao’s
text on the subject [12] and a more recent overview [14]. In the computer graphics
community, the work in [104] also shows sound production and propagation mod-
eled directly without the separability assumption, with special emphasis on handling
dynamic geometry, for application in computer animation. Such simulations tend to
be off-line, but modern graphics cards have become fast enough for approximate
modeling of interactive 2D wind instruments in real-time [6].
3.2.3 Propagation
The propagation component takes the locations of a source and listener in the scene
to predict the scene’s acoustic response, modeling salient effects such as diffracted
occlusion, initial reflections, and reverberation. Combined with source sounds and
radiation patterns, it outputs a directional sound field to the listener. Propagation is
usually the most compute-intensive portion of an auralization pipeline, motivating
many techniques and systems, which we will discuss in Sects. 3.7 and 3.8. The
methods have two assumptions in common.
Linearity. For most auralization applications, it is safe to assume that sound ampli-
tudes remain low enough to obey ideal linear propagation, modeled by the scalar
wave equation. As a result, the sound field at the listener location is a linear sum-
mation of contributions from all sound sources. There are some cases in games and
VR when the assumption of linearity may be violated, for instance with explosions
or brass instruments. In most such cases, the non-linear behavior is restricted to the
vicinity of the event and may be treated via a first-order perturbative approximation
which amounts to linear propagation with a locally varying sound speed [4,27].
Quasi-static scene configuration. Interactive scenes are dynamic, but most prop-
agation methods assume that the problem may be treated as quasi-static. At some
fixed update rate, such as a visual frame, they take a static snapshot of the scene
shape as well as the locations of the source and listener within it. Then propagation
is modeled assuming a linear, time-invariant system for the duration of the visual
frame. The computed response for each sound source is smoothly interpolated over
frames to ensure a dynamic rendering free of artifacts to the listener.
Fast-moving sources need to be treated with additional care as direct interpolation
of acoustic responses can become error-prone [80]. An important related aspect is
the Doppler Shift on first arrival, a salient, audible effect. It may be approximated
in the source model by modifying the radiated signal based on source and listener
velocities, or by interpolating the propagation delay of the initial sound. Another
case violating the quasi-static assumption is that of aero-acoustic sounds radiated from
fast object motion through the air. These can be approximated within the source
model with Lighthill’s acoustic analogy [53], with subsequent linear propagation for
real-time rendering [30,31].
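As a small illustration of the first-arrival Doppler approximation mentioned above (a generic textbook formula, not tied to any particular engine), the sketch below computes the frequency-scaling factor from the source and listener velocity components along the line connecting them; this factor could drive resampling of the radiated signal or interpolation of the initial-arrival delay.

```python
import numpy as np

def doppler_factor(src_pos, src_vel, lst_pos, lst_vel, c=340.0):
    """Classic Doppler factor f_observed / f_emitted for a moving source and
    listener, using velocity components along the source-to-listener line."""
    direction = np.asarray(lst_pos, float) - np.asarray(src_pos, float)
    dist = np.linalg.norm(direction)
    if dist < 1e-9:
        return 1.0
    direction /= dist
    v_src = np.dot(src_vel, direction)   # source speed toward the listener
    v_lst = np.dot(lst_vel, direction)   # listener speed away from the source
    return (c - v_lst) / (c - v_src)

# A source approaching a static listener at 20 m/s is shifted up by ~6%.
print(doppler_factor([0, 0, 0], [20, 0, 0], [100, 0, 0], [0, 0, 0]))
```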
3.2.4 Spatialization
In a virtual reality scenario, the target of the audio rendering engine is typically a
listener located within the virtual scene experiencing the virtual acoustic environment
with both ears. For this experience to feel plausible or natural, sound should be
rendered to the user’s ears as if they were actually present in the virtual scene. The
architecture in Fig. 3.1 neglects the effect of the listener on global sound propagation.
The spatialization system (shown to the right in the figure) inserts the listener virtually
into the scene and requires additional processing. A properly spatialized virtual sound
source should be perceived by the listener as emanating from a given location. In the
simplest case of free-field propagation, a sound source can be positioned virtually by
convolving the source signal with a pair of filters (also known as head-related transfer
functions (HRTFs)). This results in two ear input signals that can be presented directly
to the listener over headphones. For a more complex virtual scene containing multiple
sound sources as well as their acoustic interactions with the virtual environment,
spatialization entails encoding appropriate localization cues to the sound field at
the listener’s ear entrances. Common approaches include spherical-harmonics based
rendering (“Ambisonics”) [42,67] as well as object-based rendering [17].
HRTFs. If the sound is played back to the listener via headphones, this implies
simulating the filtering that sound undergoes in a real sound field as it enters the
ear entrances, due to reflections and scattering from the listener’s torso, head, and
pinnae. A convenient way to describe this filtering behavior is via the HRTFs. The
HRTFs are a function of the direction of arrival and contain the localization cues
that the human auditory system decodes to determine the direction of an incoming
wavefront. HRTFs for a particular listener are usually constructed via measurements
in an anechoic chamber [40], though recent efforts exist to derive HRTFs for a listener
on the fly without an anechoic chamber [50,61], by adapting or personalizing existing
HRTF databases using anthropometric features [15,38,41,89,106], or by capturing
image or depth data to model the HRTFs numerically [20,58,65]. For a review of
HRTF personalization techniques, refer to Chap. 4 and see [48]. The HRTFs can be
tabulated as two spherical functions H{l,r}(s, t) that encapsulate the angle-dependent
acoustic transfer in the free field to the left and right ears. The set of incident angles
s contained in the HRTF dataset is typically dictated by the HRTF measurement
setup [5,39]. The process of applying HRTFs to a virtual source signal to encode
localization cues is referred to as binaural spatialization.
Spatialization for loudspeaker arrays is also possible, commonly performed using
channel-based methods such as Vector Base Amplitude Panning [72] or Ambison-
ics [42]. It is also possible to physically reproduce the virtual directional sound field
using Wave Field Synthesis [2] with large loudspeaker arrays. For the rest of this
chapter, we will focus on binaural spatialization, although most of the discussion can
be easily adapted to loudspeaker reproduction as discussed in Chap. 5.
Spherical-harmonics based rendering. Various methods exist to spatialize acoustic
scenes. A convenient description of directional fields is via spherical harmonics
(SHs) or Ambisonics [43]. Given a SH representation of a scene, binaural ear input
signals can be obtained directly via filtering with a SH representation of the listener’s
HRTFs [29]. However, encoding complex acoustic scenes to SHs of sufficiently high
order while minimizing audible artifacts can be challenging [10,11,19,51]. The
openly available Resonance Audio [47] system follows this approach.
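To make the SH idea concrete at the lowest order, the sketch below encodes mono sources into a first-order Ambisonics mix; the ACN channel ordering and SN3D normalization are assumed conventions, and binaural output would then be obtained by filtering these channels with an SH-domain representation of the HRTFs, as in the systems cited above.

```python
import numpy as np

def encode_foa(signal, azimuth, elevation):
    """Encode a mono signal as first-order Ambisonics (ACN order: W, Y, Z, X,
    SN3D normalization) for a plane wave from the given direction (radians)."""
    gains = np.array([
        1.0,                                        # W (omnidirectional)
        np.sin(azimuth) * np.cos(elevation),        # Y
        np.sin(elevation),                          # Z
        np.cos(azimuth) * np.cos(elevation),        # X
    ])
    return gains[:, None] * np.asarray(signal)[None, :]   # shape (4, N)

# Two sources encoded and mixed into one first-order scene.
fs = 48000
t = np.arange(fs) / fs
scene = (encode_foa(np.sin(2 * np.pi * 440 * t), np.radians(30), 0.0) +
         encode_foa(np.sin(2 * np.pi * 220 * t), np.radians(-90), np.radians(10)))
```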
Object-based rendering. In this chapter, we will follow the direct parameterization
over time and angle of arrival, which is also common in practice, as in the illustrative
auralization system we discuss in Sect. 3.8.4. The system directly outputs signals
and directions, suitable for spatialization by applying appropriate HRTF pairs. The
description of the acoustic propagation problem from a source to the listener in terms
of a directional sound field as presented in Sect. 3.3.4 results in a convenient interface
between the propagation model and the spatialization engine.
This provides three major advantages. Firstly, it enables a modular system design
that treats propagation modeling and (real-time) spatialization as separate problems
that are solved by independent sub-systems. This separation in turn allows improving
and optimizing the sub-systems individually and can lead to significant computa-
tional cost savings. Secondly, a description of a sound field enveloping the listener
in terms of time and angle of arrival is equivalent to an object-based representa-
tion, which is a well-established input format for existing spatialization software,
thus allowing the system designer to build easily on existing spatialization systems.
Finally, psycho-acoustic research on perceptual limits of human spatial hearing, such
as just-noticeable differences, is expressed as a function of time and angle of arrival
(Sect. 3.4). Knowledge of these perceptual limits can be exploited for further com-
putational savings.
3.3 Mathematical Model
Auralization may be formalized as a linear, time-invariant process as follows. Assume
a quasi-static state of the world at the current visual frame. To auralize a sound source,
consider its current pose (position and orientation) to determine its directional sound
radiation and then model propagation and spatialization as a feed-forward chain of
linear filters. Those filters in turn depend on the current world shape and listener
pose, respectively.
Notation. For the remainder of this chapter, for any quantity (·) referring to the
listener, we use a prime (·)′ to denote the corresponding quantity referring to the source.
In particular, x is the listener location and x′ the source location. Temporal convolution is
denoted by ∗.
3.3.1 The Green’s Function
With the linearity and time-invariance assumptions, along with the absence of mean
flow or wind, the Navier-Stokes equations simplify to the scalar wave equation that
models propagating longitudinal pressure deviations from quiescent atmospheric
pressure [70]:
$$\left(\frac{1}{c^2}\,\partial_t^2 - \nabla_x^2\right) p(t, x, x') = \delta(t)\,\delta(x - x'), \qquad (3.1)$$
where c = 340 m/s is the speed of sound and ∇²ₓ is the 3D Laplacian operator ranging over
x. The solution is performed on some 3D domain provided by the scene’s shape, with
appropriate boundary conditions to model the frequency-dependent absorptivity of
physical materials.
Sound propagation is induced by a pulsed excitation at time t = 0 and source
location x′, with δ(·) denoting the Dirac delta function. The solution p(t, x, x′) is the
Green’s function that fully describes the scene’s global wave transport, including
diffraction and scattering. The principle of acoustic reciprocity ensures that source
and listener positions are interchangeable [70]:
$$p(t, x, x') = p(t, x', x). \qquad (3.2)$$
For treating general scenes, a numerical solver must be employed to discretely sample
Green’s function in space and time. This includes accurate wave-based methods that
directly solve for the time-evolving field on a grid, or fast geometric methods that
employ the high-frequency Eikonal approximation. We will discuss solution methods
in Sect. 3.7.
In principle, Green’s function has complete information [3], including direction-
ality, which can be extracted via spatio-temporal convolution of p(t, x, x′) with
volumetric source and listener distributions that can model arbitrary radiation pat-
terns [13] and listener directivity [91]. But such an approach is too expensive for
real-time evaluation on large scenes, requiring temporal convolution and spatial
quadrature over sub-wavelength grids that need to be repeated when either the source
or listener moves. Geometric techniques cannot follow such an approach at all, as
they do not model wave phase.
This is where modularity (Sect. 3.2.2) becomes indispensable: the source and
listener are not directly included within the propagation simulation, but are instead
incorporated via tabulated directivity functions that result from their local radiation
and scattering characteristics. Below, we formulate the propagation component of
this modular approach, beginning with the simplest case of an isotropic source and
listener, building up to a fully bidirectional representation that can be combined with
arbitrary source and listener directivity during rendering.
3.3.2 Impulse Response
Consider an isotropic (omni-directional) sound source located at x′ that is emitting a coherent pressure signal q(t). The resulting pressure signal at listener location x can be computed using a temporal convolution:

q(t; x, x') = q(t) \ast p(t; x, x'). \qquad (3.3)

Here, p(t; x, x′) is obtained by evaluating Green's function between the listener and source locations (x, x′). We denote this evaluation by placing them after the semi-colon, p(t; x, x′), to signify they are held constant, yielding a function of time alone. This function is the (monaural) impulse response capturing the various acoustic path delays and amplitudes from the source to the listener via the scene. The vibrational aspects of how the source event generated the sound q(t) are abstracted away; it may be synthesized at runtime, or read out from a pre-recorded file and freely substituted.
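As a concrete illustration of (3.3), the following minimal sketch (assuming NumPy and SciPy; the impulse response array and signal here are toy placeholders) auralizes a dry source signal by convolving it with a monaural impulse response:

    import numpy as np
    from scipy.signal import fftconvolve

    def auralize_mono(q, ir):
        """Convolve a dry source signal q(t) with a monaural impulse
        response p(t; x, x') to obtain the signal heard at the listener."""
        # Linear convolution; the output has length len(q) + len(ir) - 1.
        return fftconvolve(q, ir)

    # Toy example: a short 1 kHz tone burst through a two-spike response
    # (direct arrival plus one delayed, attenuated echo).
    fs = 48000
    t = np.arange(fs) / fs
    q = np.sin(2 * np.pi * 1000 * t) * (t < 0.1)
    ir = np.zeros(fs // 2)
    ir[0] = 1.0                # direct arrival
    ir[int(0.05 * fs)] = 0.3   # echo after 50 ms
    out = auralize_mono(q, ir)

Because the propagation filter is independent of how q(t) was produced, the same impulse response can be reused for any source signal, which is exactly the modularity noted above.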
3.3.3 Directional Impulse Response
The directional impulse response d(t, s; x, x′) [32] generalizes the impulse response p(t; x, x′) to include the direction of arrival, s. Intuitively, it is the signal obtained by the listener if they were to point an ideal directional microphone in direction s when the source at x′ emits an isotropic impulse.

Given a directional impulse response, spatialization for the listener can be performed to reproduce the directional listening experience via

q_{\{l,r\}}(t; x, x') = q(t) \ast \int_{S^2} d(t, s; x, x') \ast H_{\{l,r\}}\!\left(R^{-1}(s), t\right) ds, \qquad (3.4)

where H_{l,r}(s, t) are the left and right HRTFs of the listener as discussed in Sect. 3.2.4, R is a rotation matrix mapping from the head to the world coordinate system, and s ∈ S² represents the space of incident spherical directions forming the integration domain. Note the advantage of separating propagation (directional impulse response) from spatialization (HRTF application). The expensive simulation necessary for solving (3.1) can ignore the listener's body entirely, which is inserted later, taking its dynamic rotation R into account, via separately tabulated HRTFs as in (3.4).
Fig. 3.2 Bidirectional impulse response (BIR). An impulse radiates from source position x′, propagates through a scene, and arrives, via two paths in this simple case, at listener position x. The paths radiate in directions s′_1 and s′_2 and arrive from directions s_1 and s_2, respectively, with delays based on the respective path lengths. The bidirectional impulse response (BIR), denoted by D(t, s, s′; x, x′), contains this time-dependent directional information. Evaluating for specific radiant and incoming directions isolates arrivals, as shown on the right (figure adapted from [26])
3.3.4 Bidirectional Impulse Response (BIR) and Rendering
Equation
The above still leaves out direction-dependent radiation at the source. A complete description of auralization for localized sound sources can be achieved by the natural extension to the bidirectional impulse response (BIR) [26], an 11-dimensional function of the wave field, D(t, s, s′; x, x′), illustrated in Fig. 3.2. Analogous to the HRTF, the source's radiation pattern is tabulated in a source directivity function (SDF), S(s′, t), such that its radiated signal in any direction s′ is given by q(t) ∗ S(t; s′).
We can now write the (binaural) rendering equation:
q_{\{l,r\}}(t; x, x') = q(t) \ast \iint_{S^2 \times S^2} D(t, s, s'; x, x') \ast S\!\left(R'^{-1}(s'), t\right) \ast H_{\{l,r\}}\!\left(R^{-1}(s), t\right) ds'\, ds, \qquad (3.5)
where R is a rotation matrix mapping from the listener's head to the world coordinate system, R′ maps rotation from the source to the world coordinate system, and the double integral varies over the space of both incident and emitted directions s, s′ ∈ S². A similar formulation can be obtained for speaker-based rendering by using, for instance, VBAP speaker panning weights [72] instead of HRTFs.

The BIR is convolved with the source's and listener's free-field directional responses S and H_{l,r}, respectively, while accounting for their rotation, since (s, s′) are in world coordinates, to capture modification due to directional radiation and reception. The integral repeats this for all combinations of (s, s′), yielding the net binaural response. This is finally convolved with the emitted signal q(t) to obtain a
binaural output that should be delivered to the entrances of the listener’s ear canals.
Finally, if multiple sound sources are present, this process is repeated for each source
and the results are summed.
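In a discrete implementation, the double integral in (3.5) reduces to a finite sum over sampled direction pairs. The sketch below is illustrative only (the arrays bir, sdf, and hrtf_l/hrtf_r are hypothetical, and the head and source rotations are assumed to have been applied when selecting direction indices); it is not any particular system's API:

    import numpy as np
    from scipy.signal import fftconvolve

    def render_binaural(q, bir, sdf, hrtf_l, hrtf_r):
        """Discrete approximation of the binaural rendering equation (3.5).

        bir[j, k]   : BIR taps for arrival direction j and radiant direction k
        sdf[k]      : source directivity filter taps for radiant direction k
        hrtf_l/r[j] : left/right HRTF taps for arrival direction j
        """
        acc_l, acc_r = 0.0, 0.0
        for j in range(bir.shape[0]):         # arrival directions s
            for k in range(bir.shape[1]):     # radiant directions s'
                path = fftconvolve(bir[j, k], sdf[k])
                acc_l = acc_l + fftconvolve(path, hrtf_l[j])
                acc_r = acc_r + fftconvolve(path, hrtf_r[j])
        # By linearity, the dry signal is convolved once with the summed
        # directional response for each ear.
        return fftconvolve(q, acc_l), fftconvolve(q, acc_r)

    # Dummy data: 8 arrival x 8 radiant directions, short random filters.
    rng = np.random.default_rng(0)
    bir = rng.standard_normal((8, 8, 256)) * 1e-2
    sdf = rng.standard_normal((8, 64)) * 1e-1
    hrtf_l = rng.standard_normal((8, 128)) * 1e-1
    hrtf_r = rng.standard_normal((8, 128)) * 1e-1
    out_l, out_r = render_binaural(rng.standard_normal(4800),
                                   bir, sdf, hrtf_l, hrtf_r)

The brute-force double loop over direction pairs is precisely what the band-limited and spherical-harmonic accelerations discussed in Sect. 3.3.5 aim to avoid.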
Bidirectional decomposition and reciprocity. The bidirectional impulse response generalizes the more restrictive notions of impulse response in (3.4) and (3.3), illustrated in Fig. 3.2. The directional impulse response can be obtained by integrating over all radiating directions s′ and yields directional effects to the listener for an omnidirectional source:

d(t, s; x, x') \equiv \int_{S^2} D(t, s, s'; x, x')\, ds'. \qquad (3.6)

Similarly, a subsequent integration over directions to the listener, s, yields back the monaural impulse response, p(t; x, x′).
The BIR admits direct geometric interpretation. With source and listener located at (x′, x), respectively, consider any pair of radiated and arrival directions (s′, s). In general, multiple paths connect these pairs, (x′, s′) → (x, s), with corresponding delays and amplitudes, all of which are captured by D(t, s, s′; x, x′). Figure 3.2 illustrates a simple case. The BIR is thus a fully reciprocal description of sound propagation within an arbitrary scene. Interchanging source and listener, all propagation paths reverse:

D(t, s, s'; x, x') = D(t, s', s; x', x). \qquad (3.7)

This reciprocal symmetry mirrors that for the underlying wave field, p(t; x, x′) = p(t; x′, x), and requires a full bidirectional description. In particular, the directional impulse response is non-reciprocal.
3.3.5 Band-limitation and the Diffraction Limit
It is important to remember that the bidirectional impulse response is a mathemati-
cally convenient intermediate representation only, and cannot be realized physically.
The only physically observed quantity is the final rendered audio, q_{l,r}(t; x, x′). In particular, the BIR representation allows unlimited resolution in time and direction.
tion. The source signal, q(t), is temporally band-limited for typical sounds, due to
aggressive absorption in solid media and air as frequency increases. Similarly, audi-
tory perception is limited to 20 kHz. Band-limitation holds for directional resolution
as well because of the diffraction limit [16] which places a fundamental restriction
on the angular resolution achievable with a spatially finite radiator or receiver.
For a propagating wavelength λ, the diffraction-limited angular resolution scales
as D/λ, where D is the diameter of an enclosing sphere, such as around a radiating
object, or the listener’s head and shoulders in the case of HRTFs [105]. Therefore, all
the convolutions and spherical quadratures in (3.5) may be performed on a discretiza-
tion with sufficient sub-wavelength resolution at the highest frequency of interest.
Alternatively, it is common to perform time convolutions in the frequency domain via the Fast Fourier Transform (FFT) for efficiency. Similarly, spherical harmonics (SH) form an orthonormal linear basis over the sphere and can be used to reduce the spherical quadrature of a product of functions to an inner product of their SH coefficients. An end-to-end auralization system using this approach was shown
in [63].
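For instance, a linear convolution can be computed in the frequency domain by zero-padding both signals to the full output length, multiplying their spectra, and transforming back; a minimal NumPy sketch:

    import numpy as np

    def fft_convolve(a, b):
        """Linear convolution via the FFT: zero-pad to the full output
        length so the circular convolution equals the linear one."""
        n = len(a) + len(b) - 1
        nfft = 1 << (n - 1).bit_length()   # next power of two, for speed
        A = np.fft.rfft(a, nfft)
        B = np.fft.rfft(b, nfft)
        return np.fft.irfft(A * B, nfft)[:n]

    # Matches direct convolution to numerical precision.
    a, b = np.random.randn(4800), np.random.randn(256)
    assert np.allclose(fft_convolve(a, b), np.convolve(a, b))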
3.4 Structure and Perception of the Bidirectional Impulse
Response (BIR)
To explain how the theory outlined above can be put into practice, we first review the physical and perceptual structure of the BIR, followed by a discussion of how auralization systems approximate it in various ways.
3.4.1 Physical Structure
The structure of a typical (bidirectional) impulse response may be understood in three
phases in time, as illustrated in Fig. 3.3. First, the emitted sound must propagate via the
shortest path, potentially diffracting around obstruction edges to reach the listener
after some onset delay. This is the initial (or “direct”) sound. The initial sound is
followed by early reflections due to scattering and reflection from scene geometry.
As sound continues to scatter multiple times from the scene, the temporal arrival density of reflections increases, while the energy of an individual arrival decreases due to absorption at material boundaries and in the air. Over time, with sufficient scattering, the response approaches decaying Gaussian noise, which is referred to as late reverberation. The transition from early reflections to late reverberation is demarcated by the mixing time [1, 98].

Fig. 3.3 Structure of the bidirectional impulse response, showing the arrival and radiant directions of individual paths: the initial sound after the onset delay, early reflections after the initial time delay gap, and late reverberation beyond the mixing time (figure adapted from [26])
As we discuss next, each of these phases makes a distinct contribution to the overall spatial perception of a sound. These properties of human auditory perception play a key role in informing how one might approximate the rendering equation (3.5)
within limited computational resources, while still retaining an immersive auditory
experience. A more detailed review of perception of room acoustics can be found
in [37] and [60]. All observations and terms below can be found in these references,
unless otherwise noted.
3.4.2 Initial (“Direct”) Sound
Our perception strongly relies on the initial sound to localize sound sources, a phe-
nomenon called the precedence effect [62]. Referring to Fig. 3.3, if there is a sec-
ondary arrival that is roughly within 1 ms of the initial sound, we perceive a direction
intermediate between the two arrival directions, termed summing localization, rep-
resenting the temporal resolution of spatial hearing. Beyond this 1 ms time window,
our perceptual system exerts a strongly non-linear suppression effect, so people do
not confuse the direction of strong reflections with the true heading of the sound.
Under this suppression, sometimes called the Haas effect, a later arrival may need to be as much as 10 dB louder than the initial sound to significantly affect the perceived direction. Note that
this is not to say that the later arrival is not perceived at all, only that its effect is not
to substantially change the localized direction.
Consider the case shown in Fig. 3.3, and assume the walls do not substantially
transmit sound. The sound shown inside the room would be localized by the listener
outside as arriving from the direction of the doorway, rather than the line of sight.
Such cues are a natural part of how we navigate to visually occluded events in
everyday life. The upshot is that in virtual reality, the initial sound path may be
multiply-diffracted and must be modeled with particular care so that the user gets
localization cues consistent with the virtual world.
3.4.3 Early Reflections
Early reflections directly affect the perception of source properties such as loud-
ness, width, and distance while also informing the listener about surrounding scene
geometry such as nearby reflectors. A copy of a sound following the initial arrival
is perceptually fused up until a delay called the echo threshold, beyond which it is
perceived as a separate auditory event. The echo threshold varies between 10 ms
for impulsive sounds, through 50 ms for speech to 80 ms for orchestral music [62,
Table 1].
The impact of the loudness of early reflections is important in two ways. Firstly,
the perception of source distance is known to correlate with the energy ratio between
initial sound and remaining response (whose energy mostly comes from early reflec-
tions), called the direct-to-reverberant ratio (DRR) [92]. This is often also called the
“wet ratio” by audio designers. Secondly, how well one can understand and localize
sounds depends on the ratio of the energy of the direct sound and early reflections in the first 50 ms to the rest of the response, as measured by clarity (C50).
The directional distribution of reflections conveys important detail about the size
and shape of the local environment around the listener and source. The ratio of
reflected energy arriving horizontally and perpendicular to the initial sound is called
lateral energy fraction and contributes to the perception of spaciousness and affects
the apparent source width. Further, in VR, strong individual reflections from surfaces
close to the listener provide an important proximity cue [69].
Thus, an auralization system must model strong initial reflections as well as the
aggregate energy and directionality of later reflections up to the first 80 ms to ensure
important cues about the sound source and environment are conveyed.
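Both ratios can be estimated directly from a (monaural) impulse response once the onset index of the direct sound is known. A minimal sketch, assuming a NumPy array sampled at rate fs and a short window (here 2.5 ms, a common but by no means universal convention) that captures the direct sound:

    import numpy as np

    def drr_and_c50(ir, fs, onset, direct_ms=2.5):
        """Estimate the direct-to-reverberant ratio (DRR) and clarity (C50),
        both in dB, from a monaural impulse response."""
        e = ir.astype(float) ** 2
        n_direct = onset + int(direct_ms * 1e-3 * fs)  # end of direct window
        n50 = onset + int(0.050 * fs)                  # 50 ms after onset
        drr = 10 * np.log10(e[onset:n_direct].sum() / e[n_direct:].sum())
        c50 = 10 * np.log10(e[onset:n50].sum() / e[n50:].sum())
        return drr, c50

    # Example on a synthetic response: direct spike plus decaying noise tail.
    fs = 48000
    ir = np.random.randn(fs) * np.exp(-np.arange(fs) / (0.3 * fs)) * 0.05
    ir[0] = 1.0
    print(drr_and_c50(ir, fs, onset=0))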
3.4.4 Late Reverberation
The reverberation time, T60, is the time taken by the reverberant energy to decay
by 60 dB. Since the reverberation contains numerous, lengthy paths through the
scene, it provides a sense of the overall scene, such as its size. The T60 is frequency-
dependent; the relative decay rate across various frequencies informs the listener
about the acoustic materials in a scene and atmospheric absorption.
The aggregate directional properties of reverberation affect listener envelopment
which is the perception of being present in a room and immersed in its reverberant
field (see Chap. 11 and Sect. 11.4.3 for further discussions on related topics). In
virtual reality, one may often be present outside a room containing sounds and any
implausible envelopment becomes especially distracting. For instance, consider the
situation in Fig. 3.3—rendering an enveloping room reverberation for the listener
will sound wrong, since the expectation would be low envelopment.
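In practice, the T60 is usually estimated from a measured or simulated impulse response via Schroeder backward integration of the squared response, fitting a line to the resulting energy decay curve (commonly over the −5 to −35 dB range) and extrapolating to 60 dB of decay. A minimal sketch:

    import numpy as np

    def estimate_t60(ir, fs, fit_lo_db=-5.0, fit_hi_db=-35.0):
        """Estimate the reverberation time from an impulse response using
        Schroeder backward integration and a line fit to the decay curve."""
        energy = ir.astype(float) ** 2
        edc = np.cumsum(energy[::-1])[::-1]      # Schroeder energy decay curve
        edc_db = 10 * np.log10(edc / edc[0])
        t = np.arange(len(ir)) / fs
        mask = (edc_db <= fit_lo_db) & (edc_db >= fit_hi_db)
        slope, _ = np.polyfit(t[mask], edc_db[mask], 1)   # dB per second
        return -60.0 / slope                      # time to decay by 60 dB

    # Synthetic exponential tail with a known 0.5 s reverberation time.
    fs = 48000
    t = np.arange(fs) / fs
    ir = np.random.randn(fs) * 10 ** (-3 * t / 0.5)
    print(estimate_t60(ir, fs))   # should be close to 0.5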
3.5 System Design Considerations for VR Auralization
Many types of real-time auralization systems exist today that approximate the rendering equation (3.5), and in particular the evaluation of the scene's sound propagation (i.e., the BIR, D(t, s, s′; x, x′)), which is typically the most compute-intensive portion. They gain efficiency by making approximations based on the intended application, with knowledge of the limits of auditory perception.
3.5.1 Room Auralization
The roots of auralization research lie in the area of computational modeling of room
acoustics, an active area of research with developments dating back at least 50
years [7,60]. The main objective of these computer models has been to aid in the
architectural design of enclosures, such as offices, classrooms, and concert halls. The
predictions of these models can then be used by acousticians to propose architec-
tural design changes or acoustic treatments to improve the reverberant properties of
a particular room or hall, such as speech intelligibility in a classroom. This requires
models that simulate the room’s first reflections and reverberation with perceptual
authenticity. The direct path in such applications can often be computed analytically
since the line of sight is rarely blocked. We direct the reader to Gade’s book chapter
[37] on the subject of room acoustics for an excellent summary of the requirements,
metrics, and methods in the field from the viewpoint of concert hall design.
While initially the computer models could only produce quantitative estimates
of room acoustic parameters, with increasing compute power, real-time auralization
systems were proposed near the beginning of the millennium [86]. As we will discuss
in more detail shortly, geometric methods are standard in the area today because they
are especially well-suited for modeling a single enclosure where visual occlusion
between sounds and listener is not dominant. This holds very well in any hall designed
for speech or music. Room auralization is available today in commercial packages
such as ODEON [82] and CATT [28].
3.5.2 VR Auralization
The concerns of real-time VR auralization are quite distinct along a number of dimensions, which result from going from an individual room to a scene that can span entire city blocks with numerous indoor and outdoor areas. This results in a unique
set of considerations that we enumerate below, for two reasons. Firstly, they provide
a framing for understanding current research in the area and the trade-offs current
systems make, which we will discuss in the following sections. Secondly, we hope
that the concise listing of practical problems motivates new research in the area, as
no system today can meet all these criteria.
1. Real time within limited computation. A VR application’s auralization com-
ponent can usually only use a single or a few CPU cores for audio simulation
at runtime, since resources must be shared with simulating other aspects of the
world, such as rigid-body collisions, character animation, and AI path planning. In contrast, owing to the application, in room acoustic auralization one can
consume a majority of the resources of a computer including the parallel com-
pute power of modern graphics cards. With power-efficient mobile processors
integrated into phones and standalone head-mounted displays, the pressure to
minimize computation has only increased.
2. Scene complexity and non line of sight. Room acoustics theory often starts by
assuming a single connected space such as a concert hall that has lines of sight
from the stage to all listener locations. This allows for a powerful simplification
of the sound field as an analytically computable direct sound combined with a
diffuse reverberant field. Modern VR systems for building and game acoustics
consider the much broader class of all scenes such as a building floor with many
rooms, or a street canyon with buildings that may be entered. These are complex
scenes not just in the sense of surface detail but also in that the air volume
is topologically complex, with many concavities. As a result, non line of sight
cases are common. For instance, hearing sounds in the same room with plausible
reverberation can be as important as not hearing sounds inside another room, or
hearing sounds from unseen sources diffracted around a corner or door.
3. Perception. Physical accuracy is important to VR auralization not as a goal in
itself but rather in so far as it impacts sensory immersion. This opens opportu-
nities for fast approximations, and deeply informs practical systems that scale
their errors based on the acuity of the human auditory system. This observa-
tion underlies the deterministic-statistical decomposition discussed in the next
section. Further, in many applications such as games, plausibility can be suffi-
cient as a starting point, while for instance in auralizing building acoustics one
might need perceptual authenticity.
4. Dynamic sounds. VR auralization must often support dynamic sound sources
that can translate and rotate. The rendering must respond with low latency and
without distracting artifacts, even for fast source motion. This adds significant
complexity to a minimum-viable practical system. However, in architectural
acoustic systems, static sound sources can be a feasible starting point.
5. Dynamic geometry. In many applications, the scene geometry can be changed
interactively. This may be while designing a virtual space, in which case an
acoustical system for static scenes may re-compute on the updated geometry;
depending on the system this can take seconds to hours. The more challenging
case is when the geometry is changing in real time. The change might be “locally
dynamic”, such as opening a door or moving an obstruction. Since such changes
are localized in an otherwise static scene, many systems are able to model such
effects. Lastly, the scene may be “globally dynamic”, where there might be
unpredictable global changes, such as when a game player creates a building in
Minecraft or Fortnite and expects to hear the audio rendering adapt to it in real
time—while this has the most practical utility it is also the most challenging
case.
6. Robustness. VR requires high robustness given unpredictable user inputs. This
means the severity and frequency of large outlying errors may matter more
than the average error. For instance, as the listener moves quickly through a scene, across multiple rooms, the variation in reverberation and diffracted occlusion
must stay smooth reliably. This is a tightly restrictive constraint: a technique
that has large outlying errors may not be viable in immersive VR regardless of
its average error. As an example, an implausible error in calculating occlusion
with only 0.1% probability for an experience running at 30 frames per second
means distracting the user every 33 s on average. This deteriorates to 3.3 s with
10 sound sources and so on.
7. Scalability. The system should ideally expose compute-quality trade-offs along
two axes. Firstly, VR scenes can contain hundreds to thousands of dynamic sound
sources, and it is desirable if the signal processing can scale from high-quality
rendering of a few sound sources to lower quality (but still plausible) render-
ing for numerous sound sources. Secondly, the acoustical simulation should
also allow methods for reducing quality gracefully as scene size increases: for instance, from high-quality propagation modeling of a conference room up to a rough simulation of an entire city.
8. Automation. For VR applications, it is preferable to avoid any per-scene man-
ual work, such as geometric scene simplification. Game scenes in particular can
span over kilometers with multiple buildings designed iteratively during the pro-
duction process. This makes manual simplification a major hurdle for practical
usage. The auralization system should ideally ingest complex scenes with millions of polygons directly, and perform any necessary simplification while minimizing the required human expertise or input, unlike in room auralization.
9. Artistic direction. VR often requires the final rendering to be controlled by
a sound designer. For instance, the reverberation and diffracted occlusion on
important dialogue might be reduced to boost speech intelligibility in a game.
Or one might want to re-map the dynamic range of the audio rendering with
the limits of the audio reproduction system or user comfort in mind. A viable
system must provide methods that allow such design intent to be expressed and
influence the auralization process appropriately.
3.6 Rendering the BIR: the Deterministic-Statistical
Decomposition
A powerful technique employed by most real-time auralization systems is to decom-
pose the BIR as a sum of a deterministic and statistical component. This is deeply
informed by acoustical perception (Sect. 3.4) and is key to enabling the computational
trade-offs VR auralization must contend with, as described in the prior section. The
initial sound and strong early reflections, such as sound heard via a portal or echoes
heard from nearby large surfaces, are treated deterministically: that is, simulated and
rendered in physical detail, and updated in real time based on the dynamic source and
listener pose and scene geometry. Weak early reflections and late reverberation are
represented only statistically, ignoring the precise details of each of the amplitudes
and delays of thousands of arrivals or more, which are perceived in aggregate.
To formalize, the BIR is decomposed as
D(t, s, s'; x, x') = D_d(t, s, s'; x, x') + D_s(t, s, s'; x, x'). \qquad (3.8)

Referring to Fig. 3.3, the initial sound and early reflection spikes deemed perceptually salient can be included accurately in D_d. The residual is D_s, which is usually modeled as noise characterized by its perceptually relevant statistical properties.
Substituting into the rendering equation (3.5) and observing linearity, we have
q_{\{l,r\}}(t; x, x') = \sum_{\{d,s\}} q(t) \ast \iint_{S^2 \times S^2} D_{\{d,s\}} \ast S\!\left(R'^{-1}(s'), t\right) \ast H_{\{l,r\}}\!\left(R^{-1}(s), t\right) ds'\, ds, \qquad (3.9)
so that the input mono signal, q(t), is split off as input into separate filtering pro-
cesses for the two components, whose binaural outputs are summed. This is a fairly
standard architecture followed by both research and commercial systems, as the two
components may be approximated independently with perception and the particu-
lar application in mind. For the remainder of this section, we will assume the BIR
components have been computed and focus on the signal processing for rendering.
The next section will discuss how this decomposition informs the design of acoustic
simulation methods.
3.6.1 Deterministic Component, D_d

The deterministic component, D_d, is typically represented as a set of n_d peaks:

D_d(t, s, s'; x, x') \approx \sum_{i=0}^{n_d - 1} a_i(t) \ast \delta(t - \tau_i)\, \delta(s' - s'_i)\, \delta(s - s_i). \qquad (3.10)
Each term represents an echo of the emitted impulse that arrives at the listener position after a delay of τ_i from world direction s_i, having been previously radiated from the source in world direction s′_i at time t = 0. The amplitude filter a_i(t) captures transport effects along the path from edge diffraction, scattering, and frequency-dependent transmission/reflection from scene geometry. Note that the amplitude filter is causal, i.e., a_i(t) = 0 for t < 0, and by convention τ_{i+1} > τ_i. The parameter n_d is key for trading between rendering quality and computational resources. It is usual to at least treat the initial sound path deterministically (i.e., n_d ≥ 1) because of its high importance for localization due to the Precedence Effect. Audio engines will usually designate this (i = 0) as the "dry" path with separate design controls due to its perceptual importance.
Substituting from (3.10) into Eq. (3.9), we get
q_{\{l,r\},d}(t) = \sum_{i=0}^{n_d - 1} q(t) \ast \delta(t - \tau_i) \ast a_i(t) \ast S\!\left(R'^{-1}(s'_i), t\right) \ast H_{\{l,r\}}\!\left(R^{-1}(s_i), t\right). \qquad (3.11)
Thus, each path’s processing is a linear filter chain whose binaural output is summed
to render the deterministic component to the listener. Reading the equation from left
to right: for each path, take the monophonic source signal and input it to a delay line.
Read the delay line at (fractional) delay τ_i and filter the output based on the amplitude filter a_i, then filter it based on the source's radiation pattern. The lookup via R′⁻¹(s′_i) signifies that one must rotate the radiant direction of the path from world space to the local coordinate system of the source's spherical radiation pattern data.
Finally, the last factor makes concrete the modularity shown in Fig. 3.1: the result-
ing monophonic signal from this prior processing is sent to the spatializer module
as arriving from direction R⁻¹(s_i) relative to the listener. One is free to substitute
any spatializer to separately trade off quality and speed of spatialization versus other
costs and priorities for the system. One could even use multiple spatialization tech-
niques, such as high-quality spatialization for the initial path, and lower fidelity for
reflections. In a software implementation, the spatializer often acts as a sink for
monophonic signals, processing each, mixing their outputs, and sending them to a
low-level audio engine for transmission to transducers, thus performing the summa-
tion in (3.11) as well.
Similar to the choice of spatializer, the details of all other filtering operations are highly flexible. For the amplitude filter a_i, the simplest realization is to multiply by a scalar for average magnitude over frequencies, thus representing arrivals with idealized Dirac spikes. But for the initial sound filter a_0, even in a minimalistic setting it is common to apply a low-pass filter to capture the audible muffling of visually occluded sounds. A more accurate implementation accounting for frequency-dependent boundary impedance could use equalization filters in octave bands. For source directivity, it is common to measure and store radiation patterns as third-octave or octave-band data tabulated over the sphere of directions while ignoring phase. Convolution can then be realized via modern fast graphic equalizer algorithms that employ recursive time-domain filters [68].
The commutative and associative properties of convolution are a powerful tool
to optimize signal processing. The ordering of filters in (3.11) has been chosen to
illustrate this. The delay is applied in the very first operation. This makes it so that
we only need one single-write-multiple-read delay line shared across all paths. The
signal q(t) is written as input, and each path reads out at delay τ_i. This is a commonly used optimization. Further, one may then use the associative property to group the factors a_i(t) ∗ S(R′⁻¹(s′_i), t). If both are implemented, say, using an octave-band graphic equalizer, then the per-band amplitudes can be multiplied first and provided
to a single instance of the equalizer—a nearly two-fold reduction in equalization
compute. These optimizations illustrate the importance of linearity and modularity
in the efficient implementation of auralization systems.
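A per-sample sketch of the shared single-write-multiple-read delay line is given below (hypothetical names; real engines process audio in blocks, use fractional-delay interpolation, and fold in the equalization and spatialization stages discussed above):

    import numpy as np

    class SharedDelayLine:
        """Single-write-multiple-read delay line: the dry signal q(t) is
        written once per sample, and every path reads at its own delay."""
        def __init__(self, max_delay_samples):
            self.buf = np.zeros(max_delay_samples)
            self.write_pos = 0

        def write(self, x):
            self.buf[self.write_pos] = x
            self.write_pos = (self.write_pos + 1) % len(self.buf)

        def read(self, delay_samples):
            # Nearest-sample read; a real implementation interpolates
            # fractionally to avoid artifacts for moving sources.
            idx = (self.write_pos - 1 - int(round(delay_samples))) % len(self.buf)
            return self.buf[idx]

    # Per sample: write the dry source once, then each path applies its own
    # broadband gain (standing in for a_i and the source directivity) before
    # being handed to the spatializer for its arrival direction.
    line = SharedDelayLine(48000)
    paths = [(0.0, 1.0), (2400.0, 0.3)]   # (delay in samples, gain) per path
    for x in np.random.randn(100):        # dummy input samples
        line.write(x)
        path_outputs = [g * line.read(d) for d, g in paths]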
3.6.2 Statistical Component, D_s

The central concept for rendering the statistical component, D_s, is to use an analysis-synthesis approach [56]. The analysis phase performs lossy perceptual coding of the statistical component of the BIR, D_s, to compute D̄_s as the energy envelope of the response, summing over time, frequency, and direction. We use the over-bar notation f̄(ȳ) to indicate that y is sub-sampled, and that f's corresponding energy is appropriately summed at each sample of ȳ without loss via some windowing. For instance, if p(t) is an impulse response, p̄(t̄) indicates the corresponding echogram, which is the histogram of p²(t) sampled at some time-bin centers, t̄. This notation is introduced to indicate the reduction in the sampling rate of y, and the loss of fine structure information in f at its original sampling rate, such as phase.
Parametric reverberation. During real-time rendering, the description captured in D̄_s can be synthesized using fast parametric reverberation techniques: the "parameters" being statistical properties that determine D̄_s, as we will discuss. The key
advantage is that since the fine structure of the response in time, frequency, and
direction is left unspecified, one has vast freedom in choosing efficient techniques.
These techniques often rely on recursive time-domain filtering which can potentially
make the CPU cost far smaller than applying a few seconds long filter via frequency-
domain convolution. The research problem is to make the artificial reverberation
sound natural. Among other concerns, the produced reverberation must have realis-
tically high temporal echo density and sound colorless, not introducing perceivable
spectral or temporal modulations that cannot be controlled. For further reading, we
point readers to the extensive survey in [99]. In the following, we focus on how one might characterize D̄_s.
Energy Decay Relief (EDR). The EDR [56] is a central concept for statistical encoding of acoustical responses. Consider a monaural impulse response, p(t). The EDR, p̄(t̄, ω̄), is computed by performing short-time Fourier analysis on p(t) to compute how its energy spectral density, integrated over perceptual frequency bands with centers ω̄, varies over time-bin centers t̄. It can be visualized as a spectrogram. Frequency dependence results from the materials of the boundary (e.g., wood tends to be more
dependence results from materials of the boundary (e.g., wood tends to be more
absorbent at high frequencies compared to concrete) and atmospheric absorption.
Frequency band centers are typically spaced by octaves for real-time auralization,
and time bins typically have a width of around 10 ms.
The reduced sampling rate makes the EDR, p̄, already quite compact compared to p, which is a highly oscillatory noisy signal at audio sample rates. Further, the EDR is smooth in time: it exhibits slow variation during early reflections (especially if the strong peaks have been separated out already into D_d) followed by monotonic decay during late reverberation. This opens up many avenues for a low-dimensional description with a few parameters. For instance, for a single enclosure, the EDR in each frequency band may be well-approximated by an exponential decay, resulting in a compact description for the late reverberation parameterized by the initial energy, p̄_0, and the 60-dB decay time, T60, in each frequency band:
\bar{p}(\bar{t}, \bar{\omega}) \approx \bar{p}_0(\bar{\omega})\, 10^{-6\,\bar{t}/T_{60}(\bar{\omega})}. \qquad (3.12)
Apart from substantial further compression, the great advantage of such a paramet-
ric description is that it is easy to interpret, allowing artistic direction. Reverberation
plugins will typically provide p̄_0 as a combination of a broadband "wet gain" and a graphic equalizer, as well as the decay times, T60(ω̄), over frequency bands. For
interactive auralization, the artist can exert aesthetic control by the simple means
of modifying the reverberation parameters produced from acoustic simulation. For
instance, when the player enters a narrow tunnel in VR, footsteps might get a realistic initial power (p̄_0) to convey the constricted space, yet speech might have the wet gain
reduced to increase the clarity (C50) and improve the intelligibility of dialogue.
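As an illustration of how the parametric description (3.12) can drive synthesis, the sketch below shapes independent noise in octave bands with exponential envelopes derived from p̄_0(ω̄) and T60(ω̄). It is only a conceptual stand-in for the efficient recursive reverberators used in practice, and the band layout and values are made up:

    import numpy as np
    from scipy.signal import butter, sosfilt

    def synth_late_reverb_ir(p0, t60, centers, fs, dur):
        """Shape band-passed noise with per-band exponential decays so that
        the result approximately realizes the EDR model of (3.12).

        p0, t60 : per-band initial energy and 60-dB decay time
        centers : octave-band center frequencies in Hz
        """
        n = int(dur * fs)
        t = np.arange(n) / fs
        ir = np.zeros(n)
        for e0, rt, fc in zip(p0, t60, centers):
            sos = butter(2, [fc / np.sqrt(2), fc * np.sqrt(2)],
                         btype='bandpass', fs=fs, output='sos')
            noise = sosfilt(sos, np.random.randn(n))
            envelope = np.sqrt(e0) * 10 ** (-3 * t / rt)  # -60 dB energy at T60
            ir += envelope * noise
        return ir

    # Example: a tail whose high frequencies decay faster than its lows.
    ir = synth_late_reverb_ir(p0=[1.0, 0.8, 0.5], t60=[1.2, 1.0, 0.6],
                              centers=[250, 1000, 4000], fs=48000, dur=1.5)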
Bidirectional EDR. For an enclosure where conditions approach ideal diffuse rever-
beration, the EDR can be a sufficient description. Parametric reverberators will typ-
ically ensure that the same EDR is realized at both the ears but that the fine structure
is mutually decorrelated, so that the reverberation is perceived by the listener as
outside their head. However, in VR applications it becomes important to model the
directionality inherent in reverberation because it can become strongly anisotropic.
For instance, a visually occluded sound in another room heard through a door will
be temporally diffuse, but directionally localized towards the door.
The concept of EDR can be extended naturally to the bidirectional EDR, D̄_s(t̄, ω̄, s̄, s̄′; x, x′), which adds dependence on direction for both source and listener. It can be constructed and interpreted as follows. Consider a source located at x′ that radiates a Dirac impulse in a beam centered around directional bin center s̄′. After propagating through the scene, it is received by the listener at location x, who beamforms in the direction s̄ and then computes the EDR on the received time-dependent signal. The bidirectional EDR thus captures the frequency-dependent energy decay for all direction-bin pairs {s̄, s̄′}.
Invoking the exponential decay model, the bidirectional EDR may be approxi-
mated as
\bar{D}_s(\bar{t}, \bar{\omega}, \bar{s}, \bar{s}'; x, x') \approx \bar{p}_0(\bar{\omega}, \bar{s}, \bar{s}'; x, x')\, 10^{-6\,\bar{t}/T_{60}(\bar{\omega}, \bar{s}, \bar{s}'; x, x')}. \qquad (3.13)
Due to the curse of dimensionality, simulating and rendering the bidirectional EDR
can get quite costly despite the simplifications. In practice, one must choose the sam-
pling resolution of all the parameters judiciously depending on the application. An
extreme case of this is when we sum over the entire range of a parameter, effectively
removing it as a dimension.
Let’s consider one example that illustrates the kind of trade-offs offered by sta-
tistical modeling in balancing rendering quality and computational complexity. One
may profitably compute the T60 for energy summed over all listener directions s and source directions s′, which amounts to computing the monophonic EDR to derive the reverberation time. In that case, one obtains a simplified hybrid approximation:
\bar{D}_s(\bar{t}, \bar{\omega}, \bar{s}, \bar{s}'; x, x') \approx \bar{p}_0(\bar{\omega}, \bar{s}, \bar{s}'; x, x')\, 10^{-6\,\bar{t}/T_{60}(\bar{\omega}; x, x')}. \qquad (3.14)
The first factor still captures strong anisotropy in reverberant energy, such as reverberation heard by a listener as streaming from a portal, or reverberant power being higher when a human speaker faces a close-by reverberant chamber rather than away from it. In fact, p̄_0(ω̄, s̄, s̄′; x, x′) can be understood as a multiple-input-multiple-output (MIMO) frequency-dependent transfer matrix for incoherent energy between a source and receiver, with directional channels sampled via s̄′ and s̄, respectively. The approximation lies in the second factor: directionally varying decay times for a single sound source are not modeled, which may be quite subtle to perceive in many cases.
3.7 Computing the BIR
Acoustic simulation is the key computationally expensive task in modern auralization
systems due to the high complexity of today’s virtual scenes. In particular, at every
visual frame, for all source and listener pairs with locations (x, x′), the system must compute the BIR D(t, s, s′; x, x′), which may then be applied to each source's audio as discussed in the prior section. There are two distinct ways the problem may be
approached: geometric and wave-based methods. In this section, we will discuss the
fundamental ideas behind these techniques.
3.7.1 Geometric Acoustics (GA)
Geometric methods approximate sound propagation via the zero-wavelength (infi-
nite frequency) asymptotic limit of the wave equation (3.1). Borrowing terminology
from fluid mechanics, this yields a Lagrangian approach, where packets of energy
are tracked explicitly through the scene as they travel along rays and repeatedly
scatter into multiple packets in all directions each time they hit the scene boundary.
The key strength of geometric methods is speed and flexibility: compared to a full-
bandwidth wave simulation, tracing rays can be much cheaper, and it is much easier
to incorporate physical phenomena and construct the BIR, assembled by explicitly
constructing paths connecting source to listener. Today, these methods are standard
in the area of room auralization.
Their key challenges fall into two categories. Firstly, one must efficiently search for paths connecting source to listener through complex scenes. Searching costs computation; doing too little can under-sample the response, causing audible jumps in the
rendering. Secondly, diffraction at audible wavelengths must be considered explicitly
(since it is not present by default) to ensure plausibility. Both must be incorporated
while balancing smooth rendering for moving sources and listener against the CPU
cost of geometric analysis inherent in path search.
Below, we briefly elaborate on the general design of GA systems and practical
implications for VR auralization, and refer the reader to Savioja and Svensson’s
excellent survey on the recent developments in GA techniques [87].
Simplified geometry. Due to the zero-wavelength approximation, geometric methods remain sensitive to geometric detail at scales far below audible wavelengths. For instance, if one directly used a visual mesh for GA simulation, a coffee mug could
create a strong audible echo if the source and listener are connected by a specu-
lar reflection path hitting the cup. Such specular glints are observed for light, but
not sound with its much longer wavelength. So, it becomes important to build an
equivalent simplified acoustical model of the scene which captures only large facets,
combined with coefficients that summarize scattering due to diffraction. For instance,
the seating area in a concert hall might be replaced with an enclosing box with an
equivalent scattering coefficient. This process requires the user to have a degree of
acoustical expertise, and inaccuracies can result without carefully specified geom-
etry and boundary data [21]. However, for VR auralization, automation is highly
desirable, with some recent work along these lines [88].
Deterministic-statistical decomposition. Geometric methods directly incorporate
the deterministic-statistical decomposition in the simulation process to reduce CPU
burden. In particular, the two components D_d and D_s are typically computed and rendered separately and then mixed in the final rendering to balance quality and speed.
GA methods perform a deterministic path search only up to a certain number of
bounces on the scene boundary, called the reflection order. This is a key parameter
for GA systems because it has a sensitive impact on both performance and render-
ing quality, varying by system and application. Typically, the user can specify this
parameter, which then implicitly determines the number of deterministic peaks rendered, n_d, in (3.10). To accelerate path search, early methods [86] proposed using
the image source method [7], which is well-suited for single enclosures but scales
exponentially with reflection order and does not account for edge diffraction.
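To make the image source idea concrete, the sketch below computes the direct path and all first-order image sources for an empty axis-aligned box room, returning a delay and broadband amplitude per arrival. It is a minimal sketch: real systems add visibility and validity checks, higher reflection orders, frequency-dependent wall absorption, and edge diffraction.

    import numpy as np

    def first_order_image_sources(src, listener, room, c=340.0, beta=0.9):
        """First-order image source method for an axis-aligned box room.

        src, listener : 3-D points inside the room
        room          : (Lx, Ly, Lz), walls at 0 and L along each axis
        Returns (delay_seconds, amplitude) per arrival, including the direct
        path, with 1/r spreading and a broadband reflection factor beta.
        """
        src = np.asarray(src, dtype=float)
        listener = np.asarray(listener, dtype=float)
        room = np.asarray(room, dtype=float)
        images = [src]                              # the direct path
        for axis in range(3):
            for wall in (0.0, room[axis]):
                img = src.copy()
                img[axis] = 2 * wall - img[axis]    # mirror across the wall
                images.append(img)
        arrivals = []
        for k, img in enumerate(images):
            r = np.linalg.norm(img - listener)
            gain = (1.0 if k == 0 else beta) / max(r, 1e-6)
            arrivals.append((r / c, gain))
        return arrivals

    arrivals = first_order_image_sources(src=[2.0, 3.0, 1.5],
                                         listener=[5.0, 4.0, 1.5],
                                         room=(8.0, 6.0, 3.0))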
Following work on beam tracing, [36] showed that in multi-room scenes, pre-
computing a beam-tree data structure can at once control the exponential scaling
and also incorporate edge diffraction which is crucial for plausibility in such densely
occluded scenes. The system introduced precomputation as a powerful technique
for reducing runtime acoustics computation, which most modern systems employ at
least to some degree.
A key general concept employed in the beam tracing work in [36] is the room-portal decomposition: an indoor scene with many rooms is approximately decomposed into a set of simplicial convex shapes that represent room volume, connected by flat portals representing doors. This is a frequently used method in GA systems, as
it allows efficient deterministic path search on the discrete graph formed by rooms as
nodes and portals as connecting edges. However, room-portal decomposition does
not generalize to outdoor or mixed scenes, which is a key limitation that recent
research is focusing on to allow fast deterministic search of high-order diffraction
paths [34,88].
Techniques developed for light transport in the computer graphics community are
a great fit for computing the statistical component owing to its phase incoherence.
Many methods are possible, such as those based on radiosity [8,93]. Stochastic
path tracing is a standard method in both graphics and acoustics communities today,
used originally by DIVA [86] and in modern systems like RAVEN [90]. More recent
improvements use bidirectional path tracing [24], which directly exploits the bidi-
rectional reciprocity principle (3.7) to accelerate computation.
GA methods cannot construct the fine structure of the reverberant portion of the
response, but as we discussed in Sect. 3.6.2, it is often sufficient to build the bidirectional energy decay relief, D̄_s(t̄, ω̄, s̄, s̄′; x, x′), or some lower dimensional approx-
imation ignoring directionality. With path tracing techniques, this is directly accom-
plished by accumulating into a histogram indexed on all the function parameters—
each path represents an energy packet that accumulates into its corresponding his-
togram bin. The key parameter trading quality and cost is the number of paths sampled
so that the energy value in each histogram bin is sufficiently converged.
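The accumulation itself is simple; the sketch below bins path contributions by coarse time and by six signed Cartesian direction bins for both arrival and radiant directions (the random tuples merely stand in for the output of an actual stochastic path tracer, which is not shown):

    import numpy as np

    def accumulate_energy_histogram(path_records, fs_hist=100, n_dir=6, dur=2.0):
        """Accumulate traced path contributions into a coarse directional
        energy histogram approximating the bidirectional energy decay relief.

        path_records : iterable of (delay_s, energy, arrival_bin, radiant_bin)
        """
        n_t = int(dur * fs_hist)
        hist = np.zeros((n_t, n_dir, n_dir))
        for delay, energy, j, k in path_records:
            t_bin = int(delay * fs_hist)
            if t_bin < n_t:
                hist[t_bin, j, k] += energy
        return hist

    # Stand-in for a path tracer's output: random delays, energies, bins.
    rng = np.random.default_rng(0)
    records = [(rng.uniform(0, 2), rng.uniform(0, 1e-3),
                rng.integers(6), rng.integers(6)) for _ in range(10000)]
    hist = accumulate_energy_histogram(records)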
With simplified scenes admitting a room-portal decomposition one can expect
robust convergence, or even use approximations that avoid path tracing altogether
[94], but for path tracing in complex VR scenes, the required number of paths for a
converged histogram can vary significantly based on the source and listener locations, {x, x′}. For instance, if they are connected only through a few narrow apertures in the scene, it can be hard to find connecting paths despite extensive random searching. There is precedent for such issues in computer graphics as well [101], representing a frontier for new research with systematic convergence studies, as initiated in [24].
3.7.2 Wave Acoustics (WA)
Wave acoustic methods take an Eulerian approach: space-time is discretized onto a fictitious background, such as a uniform discrete grid, and then one updates the pressure
amplitude in each cell at each time-step. Paths are not constructed explicitly, so as
energy scatters in various directions from scene surfaces, the amount of information
tracked does not change. Thus, arbitrary combinations of diffraction and scattering
are naturally captured by wave methods. By placing a source at x′ and running a volumetric simulation for a sufficient duration, a discrete approximation of Green's function p(t, x; x′) is directly produced. The BIR D(t, s, s′; x, x′) may then be computed via accurate plane-wave decomposition in a volume centered at the source and listener locations [2, 91], or via the much faster approximation using instantaneous flux density [26], first applied to audio coding in [74].
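To make the Eulerian viewpoint concrete, the sketch below advances the scalar wave equation (3.1) on a uniform 2-D grid with a standard second-order leapfrog (FDTD) update, using a Courant-stable time-step and simple pressure-release (p = 0) boundaries; production solvers work in 3-D and add frequency-dependent boundary impedance, absorbing layers, and proper source injection:

    import numpy as np

    def fdtd_2d(nx=200, ny=200, steps=400, c=340.0, dx=0.05):
        """Leapfrog FDTD update of the 2-D scalar wave equation on a grid."""
        dt = dx / (c * np.sqrt(2))          # Courant stability limit in 2-D
        lam2 = (c * dt / dx) ** 2
        p_prev = np.zeros((nx, ny))
        p = np.zeros((nx, ny))
        p[nx // 2, ny // 2] = 1.0           # impulsive source at the center
        for _ in range(steps):
            lap = (-4.0 * p[1:-1, 1:-1]
                   + p[2:, 1:-1] + p[:-2, 1:-1]
                   + p[1:-1, 2:] + p[1:-1, :-2])
            p_next = np.zeros_like(p)
            p_next[1:-1, 1:-1] = (2.0 * p[1:-1, 1:-1]
                                  - p_prev[1:-1, 1:-1] + lam2 * lap)
            p_prev, p = p, p_next           # boundary cells stay at p = 0
        return p

    field = fdtd_2d()   # pressure snapshot after the simulated time-steps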
Numerical solvers. The main challenge of wave methods is their computational cost.
Since wave solvers directly resolve the detailed wave field by discretizing space and
time, their cost scales as the fourth power of the maximum simulated frequency and
third power of the scene diameter, due to Nyquist criteria as outlined in Sect. 3.2.1.
This made them outright infeasible for most practical uses until the last decade,
apart from low-frequency modal simulations up to a few hundred Hertz. However,
they have seen a resurgence of interest over the last decade, with many kinds of
solvers being actively researched today for auralization, such as spectral methods [52,
77], finite difference methods [49,85], and the finite element method [71,103].
Alongside the progress in numerical methods, the increased computational power of
CPUs and graphics processors, as well as the availability of increased RAM, now
allows simulations of practical cases of interest, such as concert halls, up to mid-
frequencies (1 kHz and beyond). This is still short of complete audible bandwidth,
and it is common to use approximate extrapolation beyond the band-limit frequency.
The compute times, on the order of a few hours, remain suitable only for off-line computation. The availability of commodity cloud computation has further aided the wider
applicability of wave methods despite the cost.
Precomputation and static scenes. The idea of precomputation has been central
to the increasing application of wave methods in VR auralization. Real-time aural-
ization with wave methods was first shown to be viable for complex scenes in [80].
The method performs multiple simulations off-line and the resulting (monophonic)
impulse responses are encoded and stored in a file. At runtime, this file is loaded,
and the sampled acoustical data are spatially interpolated for a dynamic source and
listener which informs spatialization of the source audio. This overall architecture is
followed by most wave-based auralization methods.
The disadvantage of precomputation is that it is limited to static scenes. However,
it has the great benefit that the fidelity of acoustical simulation becomes decoupled
from runtime CPU usage. One may perform a detailed simulation directly on complex
scene geometry ensuring robust results at runtime. These trade-offs are highly analo-
gous to “light baking” which is a common feature of game engines today: expensive
global illumination is simulated beforehand on static scenes to ensure fast runtime
rendering. Similar to developments in lighting, one can conceivably incorporate local
dynamism, such as additional occlusion from portals [76] or moving objects [84], in the future.
Parametric encoding. The key research challenge introduced by precomputation is
that the BIR field D(t, s, s′, x, x′) is 11-dimensional and highly oscillatory. Capturing it in detail can easily take an impractical amount of storage. Spatial audio coding
methods such as DirAC [73,74] demonstrate a path forward, in that they extract and
render perceptual properties from directional audio recordings rather than trying to
re-create the physical sound field. This in turn is similar in spirit to audio coding
methods such as MP3 where precise waveform reconstruction is eschewed in favor
of controllable trade-offs between perceived quality and compressed size.
These observations have motivated a new thread of auralization research on wave-
based parametric methods [26,78,79] that combine precomputed wave acoustics
with compact, perceptual coding of the resulting BIR fields. Such methods are prac-
tical enough today to be employed in many gaming applications. The deterministic-
statistical decomposition plays a crucial role in this encoding stage, as we will elab-
orate in Sect. 3.8.4 when we discuss [26] in more detail.
Physical encoding. In a parallel thread, there has been work on methods that directly
approximate and convolve the complete BIR without involving perceptual coding.
The equivalent source method was proposed in [63,64], at the expense of restricting
to scenes that are a sparse set of exterior-scattering building facades. More recent
methods for high-quality building auralization have been developed, which sample
and interpolate BIRs for dynamic rendering [55]. The advantage is that no inherent
assumptions are made about the perception or the structure of the BIR, but in turn,
such systems tend to be more expensive and current technology is limited to static
sound sources.
3.8 Auralization Systems
In this section, we will discuss a few illustrative example systems in more detail. We
emphasize that this should not be interpreted as a representative survey. Instead, our
aim is to illustrate how the design of practical systems can vary widely depending on
the intended application, chosen algorithms, and in particular how systems choose
to prioritize a subset of the design constraints (Sect. 3.5). Most of these systems are
available for download and experimentation.
3.8.1 Room Acoustics for Virtual Environments (RAVEN)
RAVEN [90] is a research system built from the ground up aiming for perceptually
authentic and real-time auralization in VR. The computational budget is thus on the
high side, such as all the resources of a single or few networked computers. This is
in line with the intended application: for an acoustician evaluating a planned design,
it is more important to hear a result with reliable predictive value, and the precise
amount of computation does not matter as long as it is real time. RAVEN is a great
example of the archetypal decisions involved in the end-to-end design of modern
real-time geometric systems.
A key assumption in the system is that the scene is a typical building floor. Many design decisions and efficiencies follow naturally. Chiefly, one can employ the room-portal
decomposition as discussed in Sect. 3.7.1. Local scene dynamism is also allowed by
the system, such as opening or closing doors, with limited precomputation on the
scene geometry. However, like most geometric acoustic systems, the scene geometry
has to be manually simplified with acoustical expertise to achieve the simplified cells
required by rooms and portals. Flexible signal processing that can include artistic
design need not be considered, since the application is physical prediction.
RAVEN models diffraction on both the deterministic and statistical components
of the BIR. The former uses the image source method, with reflection orders up to 3
for real-time evaluation. Edge sources are introduced to account for diffraction paths
that, e.g., first undergo a bounce from a flat surface and then diffract around a portal
edge. Capturing such effects is especially important for smooth results on dynamic
source and listener motion, which RAVEN carefully models.
The statistical component uses stochastic ray tracing with improved convergence
using the “diffuse rain” technique [90]. To model diffraction for reverberation, a
probabilistic scheme is used [95] that deflects rays that pass close enough to scene
edges. Since the precise reconstruction of the reverberant characteristics is of central
importance in architectural acoustics, RAVEN models the complete bidirectional
energy decay relief, as illustrated in [90, Fig. 5.19].
3.8.2 Wwise Spatial Audio
Audiokinetic’s Wwise [9] is a commonly employed audio engine in video games,
alongside many other audio design applications. Wwise provides both geometric
acoustical simulation and HRTF spatialization using either object-based or spherical-
harmonic processing (Sect. 3.2.4). The system stands in illustrative contrast to
RAVEN, showing how different application needs can deeply shape technical choices
of auralization systems. A detailed description of ideas and motivation can be found
in the series of white papers [23].
Gaming applications require very low CPU utilization (fraction of a single core)
without requiring physical accuracy. But one needs to approximate carefully. The
rendering must stay perceptually believable, such as smooth acoustic changes on
fast source motion or visual occlusion. Minimizing precomputation is desirable for
reducing artist iteration times. Finally, the ability of artists to interpret the acoustic
simulation and design the rendered output is paramount.
To meet these goals, Wwise also starts with a deterministic-statistical decom-
position. Like most geometric systems, the user must provide a simplified audio
geometry for the scene, which is the bulk of the work. Once this is done, the system
responds interactively without precomputation. The initial sound is derived based
on an explicit path search on simplified geometry at runtime, with reflections mod-
eled via image sources up to some user-controlled reflection order (usually ~3 for
efficiency).
Importantly, rather than estimating diffraction losses based on physical approxi-
mations such as the Uniform Theory of Diffraction [59] that cost CPU, the system
exposes an abstract "diffraction coefficient" that varies smoothly as the sound source and the corresponding image sources transition between visual occlusion and visibility.
This ameliorates the key perceptual deficit of audible loudness jumps that result when
diffraction is ignored. The audio designer can draw a function in the user interface
to map the diffraction coefficient to loudness attenuation. This design underlines
how practical systems balance CPU cost, plausible rendering, and artistic control.
Note how just reducing accuracy to gain CPU is not the path taken: instead, one
must carefully understand which physical behaviors must be preserved to not violate
our (stringent) sensory expectations, such as that sound fields rarely show a sudden
audible variation on small movement in everyday life.
For modeling the statistical component, the system avoids costly stochastic ray
tracing in favor of reverberation flow modeled on a room-portal decomposition of
the simplified scene. The design is in the vein of [94], with diffuse energy flow on
a graph composed of rooms as nodes and portals as edges. However, in keeping
with the primary goal of audio design, the user is free to choose or parametrically
design individual filters for each room, while the system ensures that the net result
correctly accumulates reverberation and spatializes it as streaming to the listener
from (potentially) multiple portals. Again, plausibility, performance, and design are
prioritized over adherence to accuracy, keeping in mind the primary use case of
scalable rendering for games and VR.
3.8.3 Steam Audio and Resonance Audio
Steam Audio [100] and Resonance Audio [46] are geometric acoustics systems also
designed for gaming and VR applications with similar considerations as Wwise Spa-
tial Audio. They both offer HRTF spatialization combined with geometric acoustics
modeling; however, diffraction is ignored. A distinctive aspect of Steam Audio is
the capability to precompute room reverberation filters (i.e., the statistical compo-
nent) directly from scene geometry without requiring any simplification, auralized
dynamically based on listener location. Resonance Audio on the other hand primar-
ily focuses on highly efficient spatialization [47] that scales down to mobile devices
for numerous sources, using up to third-order spherical harmonics. In fact, Reso-
nance Audio can be used as a plugin within the Wwise audio engine to perform
spatialization, illustrating the utility of the modular design of auralization systems
(Sect. 3.2).
3.8.4 Project Acoustics (PA)
We now consider a wave-based system, Project Acoustics [66], which has shown
practical viability for gaming [81] and VR [45] experiences recently. We summarize
its key design ideas here; technical details can be found in [26,78,79]. As is typical
for wave acoustics systems (Sect. 3.7.2), costly simulation is performed in a pre-
computation stage, shown on the left of Fig. 3.4. Many simulations are performed in parallel that collectively sample and compress the entire BIR field D(t, s, s′, x, x′) into an acoustic dataset. With today's commodity cloud computing resources, com-
plete game scenes may be processed in less than an hour.
The bidirectional reciprocity principle (3.7) plays an important role. The listener
location, x, is typically restricted in motion to head height above the ground, thus
varying in two dimensions rather than three, such as the floors of a building. Potential
listener locations are sampled in the lowered dimension adapting to local geome-
try [25]. Note that source locations, x′, may still vary in three dimensions. Then, a
series of 3D wave simulations are performed with each potential listener location
acting as source during simulation. The reduction in BIR field dimension by one
yields an order-of-magnitude reduction in data size.
Fig. 3.4 High-level architecture of Project Acoustics’ wave-based parametric auralization. Preprocessing (left): wave simulation and encoding of parameter fields into an acoustic dataset. Runtime (right): perceptual parameter lookup, aesthetic modification, and spatialization
Project Acoustics’ main idea is to employ lossy perceptual encoding on the
BIR field to bring it within practical storage budgets of a few hundred MB. The
deterministic-statistical decomposition is employed at this stage. The initial arrival
time and direction are encoded explicitly to ensure the correct localization of the
sound, and the rest of the response is encoded statistically (i.e., n_d = 1, referring
to Sect. 3.6.1). An example simulation snapshot is shown in Fig. 3.4 with the cor-
responding initial path encoding visualized on the right. Color shows frequency-
averaged loudness, and arrows show the localized direction at the listener location,
x, with the source location x′ varying over the image. For instance, any source inside
the room would be localized by the listener as arriving from the door, so the arrows
inside the room consistently point in the door-to-listener direction. The perceptual
parameters vary smoothly over space, mirroring our everyday experience, allowing
further compression via entropy coding [78].
The statistical component simplifies (3.14) further to average over all simulated
frequencies, approximating the bidirectional energy decay relief as

$$\bar{\bar{D}}_s(\bar{t}, \bar{\omega}, \bar{s}, \bar{s}'; x, x') \approx \bar{p}_0(\bar{s}, \bar{s}'; x, x')\, 10^{-6\bar{t}/T_{60}(x, x')}.$$

The directions $\{\bar{s}, \bar{s}'\}$ sample the six signed Cartesian directions, thus discretizing
$\bar{p}_0$ to a 6 × 6 “reflections transfer” matrix that compactly approximates directional
reverberation, alongside a single $T_{60}$ value across direction and frequency. Visualizations
of the reflections transfer matrix can be found in [26] that illustrate how it captures
anisotropic effects like directional reverberation from portals or nearby reverberant chambers.
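Read numerically, the encoding amounts to a 6 × 6 matrix of initial directional energies scaled by a shared exponential decay. The sketch below evaluates this approximation with invented values (it is not data from Project Acoustics):

```python
import numpy as np

# Six signed Cartesian directions used to discretize source and listener directions.
DIRS = ["+x", "-x", "+y", "-y", "+z", "-z"]

# Hypothetical 6x6 reflections transfer matrix: entry [i, j] is the initial
# reflected energy leaving the source toward DIRS[i] and arriving at the
# listener from DIRS[j]. Here most reverberant energy arrives from +x,
# e.g. through a portal on that side (anisotropic reverberation).
p0 = np.full((6, 6), 0.02)
p0[:, 0] = 0.10

T60 = 1.2  # single decay time in seconds, shared across direction and frequency

def decay_relief(t):
    """Approximate bidirectional energy decay relief at delay t (seconds):
    the frequency-averaged energy falls by 60 dB over T60."""
    return p0 * 10.0 ** (-6.0 * t / T60)

# Directional energy arriving at the listener 0.5 s into the reverberant tail.
arriving = decay_relief(0.5).sum(axis=0)
for direction, energy in zip(DIRS, arriving):
    print(f"from {direction}: {energy:.5f}")
```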
One can observe that this encoding is quite simplified and can be expected to
only plausibly reproduce the simulated BIR field. The choices result from the sys-
tem’s goal: capturing key geometry-dependent audio cues within a compact storage
budget—too large a size simply obviates practical use. For instance, one could encode
much more detailed information such as numerous (n_d ≈ 20–50) individual reflec-
tion peaks [80] but that is far too costly, in turn motivating recent research on how one
might trade between the number of encoded peaks (n_d) and perceived authenticity [18].
Generally speaking, precomputed systems shift the trade-off from quality-versus-
CPU as with runtime propagation simulation to quality-versus-storage (Sects. 3.8.1
and 3.8.2). This holds regardless of whether the precomputation is geometric (Steam
Audio) or wave-based (Project Acoustics). Precomputation can introduce limitations
such as slower artist turnaround times and static scenes, but in return significantly
lowers the barrier to viability whenever the available CPU is severely restricted,
which is the case for gaming applications or untethered VR platforms.
Wave simulation forces precomputation in today’s systems due to its high compu-
tational cost, but its advantage compared to geometric methods is that complex visual
scene geometry is processed directly, without requiring any manual simplification.
Further, arbitrary order of diffraction around detailed geometry in general scenes
(trees, buildings, chairs, etc.) is modeled, which avoids the risk of not sampling a
salient path. In sum, one pays a high, fixed precomputation cost largely insensitive
to scene complexity, and if that is feasible, obtains robust results directly from visual
geometry with a low CPU cost.
As discussed in Sect. 3.6.2, parametric approaches enable intuitive controls for
sound designers, which is of crucial importance in gaming applications, as we also
saw in the design of the Wwise Spatial Audio system. In the case of PA, the parameters
are looked up at each source-listener location pair at runtime (right of Fig. 3.4), and
it becomes possible for the artist to specify dynamic aesthetic modifications of the
physically-based baseline produced by simulation [44]. The sounds and modified
acoustic parameters can then be sent to any efficient parametric reverberation and
spatialization sub-system for rendering the binaural output.
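A rough sketch of that runtime path follows (the grid-keyed lookup table, parameter names and modification controls are hypothetical; real systems store compressed parameter fields rather than a Python dictionary):

```python
# Hypothetical precomputed table: (listener cell, source cell) -> perceptual
# parameters sampled from the encoded acoustic dataset.
acoustic_dataset = {
    ((0, 0), (3, 1)): {"onset_delay_ms": 18.0, "loudness_db": -9.0,
                       "arrival_dir": (0.8, 0.6, 0.0), "t60_s": 1.4},
}

def quantize(position, cell_size=1.0):
    """Toy 2-D grid key for a 3-D position (x, y, z)."""
    return tuple(int(round(c / cell_size)) for c in position[:2])

def lookup(listener_pos, source_pos):
    return acoustic_dataset.get((quantize(listener_pos), quantize(source_pos)))

def apply_aesthetics(params, t60_multiplier=1.0, extra_gain_db=0.0):
    """Designer-driven tweak of the physically based baseline, e.g. a longer
    reverb tail for dramatic effect (illustrative controls)."""
    out = dict(params)
    out["t60_s"] *= t60_multiplier
    out["loudness_db"] += extra_gain_db
    return out

params = lookup((0.2, -0.1, 1.6), (3.1, 0.9, 1.6))
if params is not None:
    render_params = apply_aesthetics(params, t60_multiplier=1.5)
    print(render_params)  # handed off to a parametric reverb + HRTF spatializer
```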
3.9 Summary and Outlook
Creating an immersive and interactive sonic experience for virtual reality applications
requires auralizing complex 3D scenes robustly and within tight real-time constraints.
To meet these requirements, real-time systems follow a modular approach of dividing
the problem into sound production, propagation, and spatialization. These can be
mathematically formulated via the source directivity function, bidirectional impulse
responses (BIR), and head-related transfer functions (HRTFs), respectively, leading
to a general framework. Human auditory perception of acoustic responses deeply
informs most systems, motivating optimizations such as the deterministic-statistical
decomposition of the BIR.
We discussed many design considerations that inform the design of practical sys-
tems. We illustrated with a few auralization systems how the application requirements
shape design choices, ranging from perceptual authenticity in architectural acous-
tics, to game engines where believability, audio design, and CPU usage take central
priority. With more development, one can hope for auralization systems in the future
that are capable of scaling their quality-compute trade-offs to span all applications
of VR auralization. Such a convergent evolution would be in line with current trends
in visual rendering where off-line photo-realistic rendering techniques and real-time
game techniques are becoming increasingly unified [33].
Looking to the future, real-time auralization faces two major research challenges:
scalability and scene dynamics. Game and VR scenes are trending toward completely
open worlds where entire cities are modeled at once, spanning tens of kilometers, with
numerous sound sources, where very few assumptions can be made about the scene’s
geometry or complexity. Similar considerations hold for engineering prediction of
outdoor acoustics, such as noise levels in a city. We need real-time techniques that
can scale to such challenging scenarios within CPU budgets, perhaps by analogy with
level-of-detail techniques used in graphics. Scene dynamism is a related challenge.
Many current game engines allow users to make global changes to immersive 3D
worlds in real time. Dynamic techniques are required that can model, for instance,
the diffraction loss around a just-created wall within tolerable latency. Progress in
this direction has only just begun [35,75,83,84].
The open challenge for the future is to build real-time auralization systems that can
gracefully scale from plausible to accurate audio rendering for complex, dynamic,
city-scale scenes depending on available computational resources. There is much to
be done, and many undiscovered, foundational ideas remain.
References
1. Abel, J. S., Huang, P.: A Simple, Robust Measure of Reverberation Echo Density in Audio
Engineering Society Convention 121 (2006).
2. Ahrens, J.: Analytic Methods of Sound Field Synthesis (T-Labs Series in Telecommunication
Services) (Springer, 2014).
3. Ajdler, T., Sbaiz, L., Vetterli, M.: The Plenacoustic Function and Its Sampling. Signal Pro-
cessing, IEEE Transactions on 54, 3790–3804 (2006).
4. Albert, D. G., Liu, L.: The Effect of Buildings on Acoustic Pulse Propagation in an Urban
Environment. The Journal of the Acoustical Society of America 127, 1335–1346 (2010).
5. Algazi, V. R., Duda, R. O., Thompson, D. M., Avendano, C.: The CIPIC HRTF database in
Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio
and Acoustics (Cat. No. 01TH8575) (2001), 99–102.
6. Allen, A., Raghuvanshi, N.: Aerophones in Flatland: Interactive Wave Simulation of Wind
Instruments. ACM Trans. Graph. 34 (2015).
7. Allen, J. B., Berkley, D. A.: Image Method for Efficiently Simulating Small-Room Acoustics.
J. Acoust. Soc. Am 65, 943–950 (1979).
8. Antani, L., Chandak, A., Taylor, M., Manocha, D.: Direct-to-Indirect Acoustic Radiance
Transfer. IEEE Transactions on Visualization and Computer Graphics 18, 261–269 (2012).
9. AudioKinetic Inc.: Wwise https://www.audiokinetic.com/products/wwise/. 2018.
10. Avni, A. et al.: Spatial perception of sound fields recorded by spherical microphone arrays with
varying spatial resolution. The Journal of the Acoustical Society of America 133, 2711–2721
(2013).
11. Ben-Hur, Z., Brinkmann, F., Sheaffer, J., Weinzierl, S., Rafaely, B.: Spectral equalization
in binaural signals represented by order-truncated spherical harmonics. The Journal of the
Acoustical Society of America 141, 4087–4096 (2017).
12. Bilbao, S.: Numerical Sound Synthesis: Finite Difference Schemes and Simulation in Musical
Acoustics First (Wiley, 2009).
13. Bilbao, S., Hamilton, B.: Directional Sources in Wave-Based Acoustic Simulation. IEEE/ACM
Transactions on Audio, Speech, and Language Processing 27, 415–428 (2019).
14. Bilbao, S. et al.: Physical Modeling, Algorithms, and Sound Synthesis: The NESS Project.
Computer Music Journal 43, 15–30 (2020).
15. Bilinski, P., Ahrens, J., Thomas, M. R., Tashev, I. J., Platt, J. C.: HRTF magnitude synthesis
via sparse representation of anthropometric features in 2014 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP) (2014), 4468–4472.
16. Born, M., Wolf, E.: Principles of Optics: 60th Anniversary Edition 7th edition. English (Cam-
bridge University Press, Cambridge, 2019).
17. Breebaart, J. et al.: Spatial Audio Object Coding (SAOC) - The Upcoming MPEG Standard
on Parametric Object Based Audio Coding English. In (Audio Engineering Society, 2008).
18. Brinkmann, F., Gamper, H., Raghuvanshi, N., Tashev, I.: Towards Encoding Perceptually
Salient Early Reflections for Parametric Spatial Audio Rendering English. in Audio Engineer-
ing Society Convention 148 (Audio Engineering Society, 2020).
19. Brinkmann, F., Weinzierl, S.: Comparison of Head-Related Transfer Functions Pre-Processing
Techniques for Spherical Harmonics Decomposition. English (2018).
20. Brinkmann, F. et al.: A Cross-Evaluated Database of Measured and Simulated HRTFs Includ-
ing 3D Head Meshes, Anthropometric Features, and Headphone Impulse Responses. en.
Journal of the Audio Engineering Society 67, 705–718 (2019).
21. Brinkmann, F. et al.: A Round Robin on Room Acoustical Simulation and Auralization. J.
Acoustical Soc. of Am. (2019).
22. Brungart, D. S.,Kordik, A. J., Simpson, B. D.: Effects of Headtracker Latency in Virtual Audio
Displays. en. J. Audio Eng. Soc. 54, 13 (2006).
23. Buffoni, L.-X.: A Wwise Approach to Spatial Audio (Blog Series)
https://blog.audiokinetic.com/a-wwise-approach-to-spatial-audiopart-1/. 2020.
24. Cao, C., Ren, Z., Schissler, C., Manocha, D., Zhou, K.: Interactive Sound Propagation with
Bidirectional Path Tracing. ACM Transactions on Graphics (TOG) 35, 180 (2016).
25. Chaitanya, C. R. A., Snyder, J. M., Godin, K., Nowrouzezahrai, D., Raghuvanshi, N.: Adaptive
Sampling for Sound Propagation. IEEE Trans. on Vis. Comp. Graphics 25, 1846–1854 (2019).
26. Chaitanya, C. R. A. et al.: Directional Sources and Listeners in Interactive Sound Propagation
Using Reciprocal Wave Field Coding. ACM Transactions on Graphics (SIGGRAPH 2020) 39
(2020).
27. Cooper, C. M., Abel, J. S.: Digital Simulation of "Brassiness" and Amplitude-Dependent
Propagation Speed in Wind Instruments in Proc. 13th Int. Conf. on Digital Audio Effects
(DAFx-10) (2010), 1–6.
28. Dalenbäck, B.-I.: CATT-Acoustic Software https://www.catt.se/. 2021.
29. Davis, L. S. et al.: High Order Spatial Audio Capture and Its Binaural Head-Tracked Playback
Over Headphones with HRTF Cues English. In Audio Engineering Society Convention 119
(Audio Engineering Society, 2005).
30. Dobashi, Y., Yamamoto, T., Nishita, T.: Real-Time Rendering of Aerodynamic Sound Using
Sound Textures Based on Computational Fluid Dynamics. ACM Trans. Graph. 22, 732–740
(2003).
31. Dobashi, Y., Yamamoto, T., Nishita, T.: Synthesizing Sound from Turbulent Field Using
Sound Textures for Interactive Fluid Simulation. Computer Graphics Forum (Proc. EURO-
GRAPHICS 2004) 23, 539–546 (2004).
32. Embrechts, J.-J.: Review on the Applications of Directional Impulse Responses in Room
Acoustics in Actes Du CFA 2016 (2016).
33. Epic Games: Unreal Engine 5 Documentation https://docs.unrealengine.com/5.0/en-
US/RenderingFeatures/Lumen/. 2020.
34. Erraji, A., Stienen, J., Vorländer, M.: The Image Edge Model. en. Acta Acustica 5, 17 (2021).
35. Fan, Z., Vineet, V., Gamper, H., Raghuvanshi, N.: Fast Acoustic Scattering Using Convolu-
tional Neural Networks in ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP) (2020), 171–175.
36. Funkhouser, T. et al.: A Beam Tracing Method for Interactive Architectural Acoustics. The
Journal of the Acoustical Society of America 115, 739–756 (2004).
37. Gade, A. in Springer Handbook of Acoustics (ed Rossing, T.) Chap. 9 (Springer, 2007).
38. Gamper, H., Johnston, D., Tashev, I. J.: Interaural time delay personalisation using incomplete
head scans in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP) (2017), 461–465.
39. Gardner, B., Martin, K., et al.: HRTF Measurements of a KEMAR Dummy-Head Micro-
phone (Vision and Modeling Group, Media Laboratory, Massachusetts Institute of Technol-
ogy, 1994).
40. Gardner, W. G., Martin, K. D.: HRTF measurements of a KEMAR. The Journal of the Acous-
tical Society of America 97, 3907–3908 (1995).
41. Geronazzo, M., Spagnol, S., Bedin, A., Avanzini, F.: Enhancing vertical localization with
image-guided selection of non-individual head-related transfer functions in 2014 IEEE Inter-
national Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014), 4463–
4467.
42. Gerzon, M. A.: Ambisonics in Multichannel Broadcasting and Video. J. Audio Eng. Soc. 33,
859–871 (1985).
43. Gerzon, M. A.: Periphony: With-Height Sound Reproduction. J. Audio Eng. Soc 21, 2–10
(1973).
44. Godin, K., Gamper, H., Raghuvanshi, N.: Aesthetic Modification of Room Impulse Responses
for Interactive Auralization in AES International Conference on Immersive and Interactive
Audio (Audio Engineering Society, 2019).
45. Godin, K. W., Rohrer, R., Snyder, J., Raghuvanshi, N.: Wave Acoustics in a Mixed Reality
Shell in AES Conf. on Audio for Virt. and Augmented Reality (AVAR) (2018).
46. Google Inc.: Resonance Audio https://developers.google.com/resonance-audio/. 2018.
47. Gorzel, M. et al.: Efficient Encoding and Decoding of Binaural Sound with Resonance Audio
in AES International Conference on Immersive and Interactive Audio (2019).
48. Guezenoc, C., Seguier, R.: HRTF individualization: A survey. arXiv preprint
arXiv:2003.06183 (2020).
49. Hamilton, B., Bilbao, S.: FDTD Methods for 3-D Room Acoustics Simulation With High-
Order Accuracy in Space and Time. IEEE/ACM Transactions on Audio, Speech, and Language
Processing 25 (2017).
50. He, J., Ranjan, R., Gan, W.-S.: Fast continuous HRTF acquisition with unconstrained move-
ments of human subjects in 2016 IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP) (2016), 321–325.
51. Hold, C., Gamper, H., Pulkki, V., Raghuvanshi, N., Tashev, I.: Improving Binaural Ambison-
ics Decoding by Spherical Harmonics Domain Tapering and Coloration Compensation in
Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
(2019).
52. Hornikx, M., Forssén, J.: Modelling of Sound Propagation to Three-Dimensional Urban
Courtyards Using the Extended Fourier PSTD Method. Applied Acoustics 72, 665–676
(2011).
53. Howe, M. S.: Theory of Vortex Sound 1st edition. English (Cambridge University Press, New
York, 2002).
54. Hughes, J. et al.: Computer Graphics: Principles and Practice 3rd edition. English (Addison-
Wesley Professional, Upper Saddle River, New Jersey, 2013).
55. Jörgensson, F. K. P.: Wave-Based Virtual Acoustics English. PhD thesis (Technical University
of Denmark, 2020).
56. Jot, J.-M.: An Analysis/Synthesis Approach to Real-Time Artificial Reverberation in [Pro-
ceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal
Processing 2(1992), 221–224.
57. Kajiya, J. T.: The Rendering Equation in Proceedings of the 13th Annual Conference on
Computer Graphics and Interactive Techniques 20 (ACM, New York, NY, USA, 1986), 143–
150.
58. Katz, B. F.: Boundary element method calculation of individual head-related transfer function.
I. Rigid model calculation. The Journal of the Acoustical Society of America 110, 2440–2448
(2001).
59. Kouyoumjian, R., Pathak, P.: A Uniform Geometrical Theory of Diffraction for an Edge in a
Perfectly Conducting Surface. Proceedings of the IEEE 62, 1448–1461 (1974).
60. Kuttruff, H.: Room Acoustics Fourth (Taylor & Francis, 2000).
61. Li, S., Tobbala, A., Peissig, J.: Towards Mobile 3D HRTF Measurement English. in (Audio
Engineering Society, 2020).
62. Litovsky, R. Y., Colburn, S. H., Yost, W. A., Guzman, S. J.: The Precedence Effect. The
Journal of the Acoustical Society of America 106, 1633–1654 (1999).
63. Mehra, R., Antani, L., Kim, S., Manocha, D.: Source and Listener Directivity for Interactive
Wave-Based Sound Propagation. IEEE Transactions on Visualization and Computer Graphics
20, 495–503 (2014).
64. Mehra, R. et al.: Wave-Based Sound Propagation in Large Open Scenes Using an Equivalent
Source Formulation. ACM Trans. Graph. 32 (2013).
65. Meshram, A. et al.: P-HRTF: Efficient personalized HRTF computation for high-fidelity spa-
tial sound in 2014 IEEE International Symposium on Mixed and Augmented Reality (ISMAR)
(2014), 53–61.
66. Microsoft Corp.: Project Acoustics https://aka.ms/acoustics. 2018.
67. Noisternig, M., Sontacchi, A., Musil, T., Höldrich, R.: A 3D ambisonic based binaural sound
reproduction system in Audio Engineering Society Conference: 24th International Confer-
ence: Multichannel Audio, The New Reality (2003).
68. Oliver, R. J., Jot, J.-M.: Efficient Multi-Band Digital Audio Graphic Equalizer with Accurate
Frequency Response Control in Audio Engineering Society Convention 139 (2015).
69. Paasonen, J., Karapetyan, A., Plogsties, J., Pulkki, V.: Proximity of Surfaces - Acoustic and
Perceptual Effects. J. Audio Eng. Soc 65, 997–1004 (2017).
70. Pierce, A. D.: Acoustics: An Introduction to Its Physical Principles and Applications (Acous-
tical Society of America, 1989).
71. Pind, F. et al.: Time Domain Room Acoustic Simulations Using the Spectral Element Method.
The Journal of the Acoustical Society of America 145, 3299–3310 (2019).
72. Pulkki, V.: Virtual Sound Source Positioning Using Vector Base Amplitude Panning. English.
Journal of the Audio Engineering Society 45, 456–466 (1997).
73. Pulkki, V.: Spatial Sound Reproduction with Directional Audio Coding. J. Audio Eng. Soc.
(2007).
74. Pulkki, V., Merimaa, J.: Spatial Impulse Response Rendering II: Reproduction of Diffuse Sound
and Listening Tests. J. Aud. Eng. Soc. 54, 3–20 (2006).
75. Pulkki, V., Svensson, U. P.: Machine-Learning-Based Estimation and Rendering of Scattering
in Virtual Reality. J. Acoust. Soc. Am. 145, 2664–2676 (2019).
76. Raghuvanshi, N.: Dynamic Portal Occlusion for Precomputed Interactive Sound Propagation.
arXiv:2107.11548 [cs, eess] (2021).
77. Raghuvanshi, N., Narain, R., Lin, M. C.: Efficient and Accurate Sound Propagation Using
Adaptive Rectangular Decomposition. IEEE Transactions on Visualization and Computer
Graphics 15, 789–801 (2009).
78. Raghuvanshi, N., Snyder, J.: Parametric Wave Field Coding for Precomputed Sound Propa-
gation. ACM Transactions on Graphics (TOG) - Proceedings of ACM SIGGRAPH 2014 33
(2014).
79. Raghuvanshi, N., Snyder, J.: Parametric Directional Coding for Precomputed Sound Propa-
gation. ACM Trans. Graph. (2018).
80. Raghuvanshi, N., Snyder, J., Mehra, R., Lin, M. C., Govindaraju, N. K.: Precomputed Wave
Simulation for Real-Time Sound Propagation of Dynamic Sources in Complex Scenes. ACM
Transactions on Graphics 29 (2010).
81. Raghuvanshi, N., Tennant, J., Snyder, J.: Triton: Practical Pre-Computed Sound Propagation
for Games and Virtual Reality. The Journal of the Acoustical Society of America 141, 3455–
3455 (2017).
82. Rindel, J. H., Christensen, C. L.: The Use of Colors, Animations and Auralizations in Room
Acoustics in Internoise 2013 (2013).
83. Rosen, M., Godin, K. W., Raghuvanshi, N.: Interactive Sound Propagation for Dynamic
Scenes Using 2D Wave Simulation. en. Computer Graphics Forum 39, 39–46 (2020).
84. Rungta, A., Schissler, C., Rewkowski, N., Mehra, R., Manocha, D.: Diffraction Kernels for
Interactive Sound Propagation in Dynamic Environments. IEEE Transactions on Visualization
and Computer Graphics 24, 1613–1622 (2018).
85. Savioja, L.: Real-Time 3D Finite-Difference Time-Domain Simulation of Mid-Frequency
Room Acoustics in 13th International Conference on Digital Audio Effects (2010).
86. Savioja, L., Huopaniemi, J., Lokki, T., Väänänen, R.: Creating Interactive Virtual Acoustic Envi-
ronments. J. Audio Eng. Soc. (1999).
87. Savioja, L., Svensson, U. P.: Overview of Geometrical Room Acoustic Modeling Techniques.
The Journal of the Acoustical Society of America 138, 708–730 (2015).
88. Schissler, C., Mehra, R., Manocha, D.: High-Order Diffraction and Diffuse Reflections for
Interactive Sound Propagation in Large Environments. ACM Transactions on Graphics (TOG)
33, 39 (2014).
89. Schonstein, D., Katz, B. F.: HRTF selection for binaural synthesis from a database using
morphological parameters in International Congress on Acoustics (ICA) (2010).
90. Schröder, D.: Physically Based Real-Time Auralization of Interactive Virtual Environments
(Logos Verlag, 2011).
91. Sheaffer, J., Van Walstijn, M., Rafaely, B., Kowalczyk, K.: Binaural Reproduction of Finite
Difference Simulations Using Spherical Array Processing. IEEE/ACM Trans. Audio, Speech
and Lang. Proc. 23, 2125–2135 (2015).
92. Shinn-Cunningham, B. G.: Distance cues for virtual auditory space in Proceedings of the
IEEE-PCM 2000 (2000), 227–230.
93. Siltanen, S., Lokki, T., Kiminki, S., Savioja, L.: The Room Acoustic Rendering Equation. J.
Acoust. Soc. Am. (2007).
94. Stavrakis, E., Tsingos, N., Calamia, P.: Topological Sound Propagation with Reverberation
Graphs. Acta Acustica/Acustica - the Journal of the European Acoustics Association (EAA)
(2008).
95. Stephenson, U. M., Svensson, U. P.: An Improved Energetic Approach to Diffraction Based
on the Uncertainty Principle in 19th Int. Cong. on Acoustics (ICA) (2007).
96. Takala, T., Hahn, J.: Sound Rendering. SIGGRAPH Comput. Graph. 26, 211–220 (1992).
97. Theis, T. N., Wong, H.-S. P.: The end of Moore’s law: A new beginning for information
technology. Computing in Science & Engineering 19, 41–50 (2017).
98. Tukuljac, H. P. et al.: A Sparsity Measure for Echo Density Growth in General Environments
in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (2019), 1–5.
99. Välimäki, V., Parker, J. D., Savioja, L., Smith, J. O., Abel, J. S.: Fifty Years of Artificial
Reverberation. IEEE Transactions on Audio, Speech, and Language Processing 20, 1421–
1448 (2012).
100. Valve Corporation: Steam Audio.
101. Veach, E., Guibas, L. J.: Metropolis Light Transport in Proceedings of the 24th Annual
Conference on Computer Graphics and Interactive Techniques (ACM Press/Addison-Wesley
Publishing Co., USA, 1997), 65–76.
102. Vorländer, M.: Auralization: Fundamentals of Acoustics, Modelling, Simulation, Algorithms
and Acoustic Virtual Reality (RWTH edition) First (Springer, 2007).
103. Wang, H., Sihar, I., Pagán Muñoz, R., Hornikx, M.: Room Acoustics Modelling in the Time-
Domain with the Nodal Discontinuous Galerkin Method. The Journal of the Acoustical Society
of America 145, 2650–2663 (2019).
104. Wang, J.-H., Qu, A., Langlois, T. R., James, D. L.: Toward Wave-based Sound Synthesis for
Computer Animation. ACM Trans. Graph. 37, 109:1–109:16 (2018).
105. Zhang, W., Abhayapala, T. D., Kennedy, R. A., Duraiswami, R.: Insights into Head-Related
Transfer Function: Spatial Dimensionality and Continuous Representation. The Journal of
the Acoustical Society of America 127, 2347–2357 (2010).
106. Zotkin, D., Hwang, J., Duraiswami, R., Davis, L. S.: HRTF personalization using anthropo-
metric measurements in 2003 IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics (IEEE Cat. No. 03TH8684) (2003), 157–160.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 4
System-to-User and User-to-System
Adaptations in Binaural Audio
Lorenzo Picinali and Brian F. G. Katz
Abstract This chapter concerns concepts of adaptation in a binaural audio context (i.e.
headphone-based three-dimensional audio rendering and associated spatial hearing
aspects), considering first the adaptation of the rendering system to the acoustic
and perceptual properties of the user, and second the adaptation of the user to the
rendering quality of the system. We start with an overview of the basic mechanisms of
human sound source localisation, introducing expressions such as localisation cues
and interaural differences, and the concept of the Head-Related Transfer Function
(HRTF), which is the basis of most 3D spatialisation systems in VR. The chapter then
moves to more complex concepts and processes, such as HRTF selection (system-
to-user adaptation) and HRTF accommodation (user-to-system adaptation). State-
of-the-art HRTF modelling and selection methods are presented, looking at various
approaches and at how these have been evaluated. Similarly, the process of HRTF
accommodation is detailed, with a case study employed as an example. Finally, the
potential of these two approaches is discussed, considering their combined use in
a practical context, as well as introducing a few open challenges for future research.
4.1 Introduction
Binaural technology is the sound spatialisation solution that comes closest to
real-life listening. It attempts to mimic the entirety of acoustic cues associated with the
human localisation of sounds, reproducing the corresponding acoustic pressure signal
at the entrance of the two ear canals of the listener (binaural literally means “related to
two ears”). These two signals should be a complete and sufficient representation of the
sound scene, since they are the only information that the auditory system requires in
order to identify the 3D location of a sound source. Thus, binaural rendering of spatial
information is fundamentally based on the production (either through recording or
synthesis) of localisation cues that are the consequence of the incident sound upon
the listener’s torso, head, and ears on the way to the ear canal, and subsequently to
the eardrums. These cues are, namely, the ITD (interaural time difference), the ILD
(interaural level difference) and spectral cues [48,68]. Their combined effects are
represented by the Head-Related Transfer Function (HRTF), which characterises the
spectro-temporal filtering of a locus of source positions around a given head.1
L. Picinali
Imperial College London, South Kensington Campus, London SW7 2AZ, UK
e-mail: l.picinali@imperial.ac.uk
B. F. G. Katz
Sorbonne Université, CNRS, UMR 7190, Institut Jean Le Rond d’Alembert,
Lutheries - Acoustique - Musique, Paris, France
e-mail: brian.katz@sorbonne-universite.fr
4.1.1 Localisation Cues and Their Individual Nature
The ILD and ITD as a function of source position are determined principally by
the size and shape of the head, as well as the position of the ears on the two sides.
In order to better understand these localisation cues, Fig. 4.1 shows how ITD and
ILD vary as a function of both distance (1.5–10m) and azimuth. This comparison
highlights potential effects of ITD/ILD mismatch, especially if they occur near the
interaural axis where they can affect distance perception. The results were obtained
by Boundary Element Method (BEM) simulation of the HRTF using the open-source
mesh2hrtf software [110,111]. The mesh employed was obtained from an MRI
scan of a Neumann dummy recording head (model KU-100), previously used in
HRTF computation [32] and measurement [4] comparisons. These cues vary as a
function of frequency. For this example, the ITD was calculated using the Thresh-
old lp –30 dB method (for a summary of various ITD estimation methods see [50]),
which detects the first onset using a –30 dB relative threshold on a 3 kHz low-pass fil-
tered version of the HRIR, as this has been shown to be the most perceptually relevant
method for ITD estimation among 32 different estimation methods and variants [7,
50]. The ILD was calculated as the difference of left and right HRIR RMS values,
after applying a 3kHz high-pass filter. The use of low-pass and high-pass filters for
the two different acoustic cues is based on previous studies showing the frequency
dependence of the different auditory cues [101], with ITD being dominated by low-
frequency content (with interpretation of phase information being inconclusive for
wavelengths smaller than the head dimensions) and ILD varying more significantly with
high-frequency content (where the wavelength is less than the dimensions of the
head). The application of a 2–3 kHz filter can be used to generally separate the con-
tributions of the pinnae in the HRIR [50]. One can observe that ITD varies little over
the simulated distance range, while becoming more vague and ambiguous near the
interaural axis. In contrast, the ILD varies with distance in the same interaural axis
range of 70°–110°.
1 We use the term HRTF to indicate the set of filters, each representing a pair of transfer functions
from a point source in space at a given distance around a given head to the left and right ear,
normalised by the transfer function with the body absent. The plural, HRTFs, therefore, represents
a collection of more than one HRTF, typically for different heads or test conditions. The head-related
impulse response or HRIR is the time domain transform of the HRTF.
Fig. 4.1 Isocontours for ITD (left) and ILD (right) as a function of azimuth (in degrees) and radial
distance (from 1.5 to 10 m) obtained via numerical simulation of the HRTF of a dummy head (not
shown to scale). ITD (3 kHz low-pass Head-Related Impulse Response, HRIR; Threshold, –30 dB
first onset method) 50 µs contours. ILD (3 kHz high-pass HRIR, RMS difference) 1 dB contours
(from [48])
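To make the estimation conventions used for Fig. 4.1 concrete, the following sketch computes an ITD and ILD from a pair of HRIRs (a simplified illustration on synthetic impulse responses; the filter order, the 48 kHz rate and the exact onset logic are assumptions, not the precise pipelines compared in [7, 50]):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 48000  # assumed sample rate (Hz)

def itd_threshold(hrir_l, hrir_r, fs=FS, cutoff=3000.0, thresh_db=-30.0):
    """ITD via first-onset detection: low-pass each HRIR at 3 kHz, then take
    the first sample exceeding a -30 dB threshold relative to its peak."""
    sos = butter(4, cutoff, btype="low", fs=fs, output="sos")
    def onset(h):
        x = sosfiltfilt(sos, h)
        return np.argmax(np.abs(x) >= np.max(np.abs(x)) * 10 ** (thresh_db / 20.0))
    return (onset(hrir_l) - onset(hrir_r)) / fs  # seconds; positive: right ear leads

def ild_rms(hrir_l, hrir_r, fs=FS, cutoff=3000.0):
    """ILD as the RMS level difference of 3 kHz high-pass filtered HRIRs."""
    sos = butter(4, cutoff, btype="high", fs=fs, output="sos")
    rms = lambda x: np.sqrt(np.mean(sosfiltfilt(sos, x) ** 2))
    return 20.0 * np.log10(rms(hrir_l) / rms(hrir_r))  # dB; positive: left louder

# Synthetic stand-in HRIRs: a delayed, attenuated click at each ear,
# mimicking a source roughly to the listener's right.
def click(delay_samples, gain, n=512):
    h = np.zeros(n)
    h[delay_samples] = gain
    return h

hrir_left, hrir_right = click(80, 0.5), click(52, 1.0)
print(f"ITD = {itd_threshold(hrir_left, hrir_right) * 1e6:.0f} us, "
      f"ILD = {ild_rms(hrir_left, hrir_right):.1f} dB")
```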
Other physical interactions between the sound wave and the torso, head, and pin-
nae (the external parts of the ear) introduce a range of spectral cues (principally
through series of peaks and notches) which can be used to judge whether a sound
source is e.g. above or below, to the front or rear of the listener, while ITD and ILD
remain relatively unchanged. Considering the various morphological regions of the
pinnae, as indicated later in Sect. 4.2.1—Fig. 4.2a, each of these is potentially related
to specific characteristics of the HRTF filters. As such, individual morphological vari-
ations will result in different HRTFs. When reproducing binaural audio, it has been
experimentally demonstrated that using an HRTF that does not match the one of
the listener has a detrimental effect on the accuracy and realism of virtual sound
perception. For example, it has been noted that listeners are able to localise virtual
sounds that have been spatialized using their own HRTFs with a similar accuracy
to free field listening, though some studies have shown poorer elevation judgements
and increased front-back confusions [67], which may be due to the idealised ane-
choic nature of HRTFs and the importance of slight head movements and associated
dynamic cues [37,102]. These errors can significantly increase when using someone
else’s HRTF [99]. Furthermore, using non-individual HRTFs (see Sect. 4.1.2) has
been shown to affect various perceptual attributes when considering complex scenes,
in addition to those associated with source localisation: i.e. Coloration, Externalisa-
tion, Immersion, Realism and Relief/Depth [87]. In this chapter, the primary focus
is on localisation as the perceptual evaluation metric. Chapter 5 introduces and dis-
cusses other relevant metrics.
4.1.2 Minimising HRTF Mismatch Between the System
and the Listener
Various means have been investigated to minimise erroneous or conflicting binaural
acoustic localisation cues relative to the natural cues delivered to the auditory sys-
tem and, as such, improve the quality of the resulting binaural rendering. Majority
of research has focused on improving the similarity between the rendering sys-
tems’ localisation cues and those of the individual listener. This is generally termed
“individualisation” or “individualised” binaural rendering. To clarify questions of
nomenclature, we propose the following terms:
• individual to identify the HRTF of the user;
• individualised or personalised to indicate an HRTF modified or selected to best
accommodate the user;
• non-individual or non-individualised to indicate an HRTF that has not been tailored
to the user; and
• dummy head or so-called generic HRTF sets are specific instances of non-
individual HRTFs, often designed with the goal of representing a certain pool
of subjects.
While not exhaustive, a general overview of individualisation methods is discussed
here.
Binaural Recordings and Synthesis
The first and most direct method to create an individual rendering is to perform
the recording with binaural microphones placed in the ear canal of the listener. This
is, however, in most cases an impractical solution. The second, still rather direct,
method is to measure the HRTF of an individual for a collection of spatial positions
and to then use this individual HRTF to produce an individual binaural synthesis
rendering through convolution of the sound source with the relevant incident direc-
tion HRTF [14, 105]. While this is the most common method employed to date,
it is generally limited to those with the facilities and equipment to carry out such
measurements [4].
The general pros and cons between binaural recordings and binaural synthe-
sis merit mention. While individual binaural recordings provide arguably the most
accurate 3D audio capture/reproduction method, they require the sonic environment
and the individual to be situated accordingly. For any reasonable production, this
would resemble a theatrical piece being performed around the individual in a first
person context. The recording would capture the acoustic detail of the soundscape,
including reflections from various surfaces, diffraction and scattering effects. How-
ever, the head orientation of the individual would be encoded into the recording,
imposed on the listener at playback. If presented to another individual, the issues of
HRTF mismatch are introduced, degrading the spatial audio quality to an unknown
degree for each individual. In laboratory conditions, this method suffers additional
difficulty, as the individual takes part in the recording, making the presentation of
unfamiliar material difficult. In contrast, binaural synthesis allows for the scripting,
manipulation and mixing of 3D scenarios without the intended listener present. With
real-time synthesis, head tracking can be incorporated allowing freedom of move-
ment by the individual, a basic requirement for VR applications. HRTF mismatch is
alleviated through the use of individual HRTFs. However, the quality of the produc-
tion is affected by the level of detail in the acoustic simulation of the environment,
including elements such as source and surface properties. Highly complex scenes
and acoustic environments can require significant computational resources (the inter-
ested reader can refer to Chap. 3 for further details on this topic). Spatial synthesis
using HRTF data is also affected by the measurement conditions of the employed
HRTF, predominantly the measurement distance. If sound sources are to be rendered
at various distances, this requires either multiple HRTF datasets, or deformation of
the individual HRTF data to approximate such changes in distance. Further discus-
sion of these details is beyond the scope of this chapter. In continuing, the focus will
be limited to questions concerning the individual nature of the HRTF as integrated
into an auditory VR environment through binaural synthesis.
Introduction to System-to-User and User-to-System Adaptation
A variety of alternative methods exist in order to improve the match between the
HRTF used for the rendering and the specific HRTF of the listener. It is the aim of
this chapter to present an overview of those approaches that have been evaluated
and validated through experimental research. In order to map the various methods
and at the same time simplify the narrative and facilitate the reading, the text has
been organised in two separate sections. Section 4.2 presents research which looks
at matching the rendering system to the specific listener (system-to-user adaptation),
thus aiming to provide every individual with the best HRTF possible. Section 4.3
looks at the problem from a diametrically opposite point of view, introducing studies
where the listener is trained in order to adapt to the rendering system (user-to-system
adaptation), therefore aiming at improving the performance of a specific individual
when using non-individual HRTFs.
While a rather extensive number of studies exist on the topic of system-to-user
adaptation, a more limited amount of research has been carried out focusing on user-
to-system adaptation. For this reason, while Sect. 4.2 is presented as an extensive
review of several research projects, Sect. 4.3, after an initial overview, then dives
more in depth into one specific study carried out by this chapter’s authors, giving
details of the methodology and briefly discussing the results. Section 4.5 concludes
by presenting a brief overview of open challenges on this topic.
4.2 System-to-User Adaptation: HRTF Synthesis and
Selection
Two main approaches exist for obtaining individual (or at least personalised) HRTFs
without having to measure them acoustically. The first one focuses on numerical
simulations, therefore using mathematical methods to generate an HRTF for a given
individual from 3D models of the head, torso, and pinnae. Techniques such as the
Boundary Element Method (BEM), Finite Element Method (FEM), and Finite Dif-
ference Time Domain (FDTD) method, which are commonly employed in diffraction,
scattering, and resonance problems, allow one to calculate the HRTF of a given indi-
vidual from precise geometrical data (e.g. coming from a 3D scan of the head and
pinnae). These techniques have been used for this purpose since the late 1990s and
have shown increased uptake and success in recent years thanks to technological
advancements in domains such as high-performance computing and high-resolution 3D scanning.
An example of such a resulting 3D mesh from a Neumann KU-100 dummy head
can be seen in Fig. 4.2b. The second one relies on using HRTFs from available
datasets, either transforming them in order to provide a better fit for a given listener
or selecting a best fit considering, for example, preference or performance, e.g. using
a sound localisation task or signal metric. Due to the relative independence between
the ITD and the Spectral Cues, the HRTF can be decomposed and different elements
addressed by different methods, e.g. an ITD structural model can be used with best
fit selected Spectral Cues [22, 78].
As can be expected, each of these approaches comes with specific challenges.
Moreover, the success in employing one or the other depends significantly on factors
such as the available data (quantity and quality), the time constraints in order to run
the tests and the calculations, and the context for which the rendering is needed (i.e.
the requirements in terms of quality, interactivity, etc.). An overview of the various
techniques and related challenges, including solutions found through state-of-the art
research studies, is presented in the following sections.
Fig. 4.2 Pinna morphology nomenclature and example BEM mesh (from [91])
4.2.1 HRTF Modelling
Various attempts have been made to investigate the function of the pinna, linking
HRTFs to its morphology as well as that of the head and torso. Early work by Teran-
ishi and Shaw [93] looked at creating a physical model of the pinnae and analysing
the various excitation modes generated by a nearby point source. The model, based
on very simple geometries, showed responses similar to those of real data, and rep-
resented one of the first steps towards better understanding the spatially varying
acoustic role of the pinna. Similar work was done by Batteau [12], who created a
mathematical representation of the acoustical transformation performed by the pinna
and produced the first mathematically described theory of sound source localisation
based on a reflection-diffraction model. These studies were the baseline of research
carried out 30 or more years later, when the available computational power allowed
the creation of more complex models and their validation against experimental
measurements (e.g. [58]). Further modelling work was carried out looking at
simplified models and approximations. Notable examples are those of Genuit [26]
based on a structural simplification model of the pinnae, Algazi and colleagues [1]
based on an approximation of the head and the torso using ellipsoidal and spherical
models, and Spagnol and colleagues [89] looking at ray-tracing analysis of pinna
reflection patterns. It is relevant to note that many of the early studies focused on
models for understanding the various phenomena and principles involved, rather than
models for binaural audio rendering. For these early studies, much of the research
on spatial perception was carried out independently from acoustical/morphological
studies regarding the details of the pinnae.
Structural Modelling
One of the first experiments using these techniques applied to HRTFs (including
pinnae) was carried out by Katz [49,51,52]. This work focused on using BEM to
calculate HRTFs by modifying various aspects of the geometrical models, for exam-
ple, eliminating the pinna, changing the size and shape of the head, and accounting
for hair acoustic impedance. Results from numerical simulations were then compared
with experimental measures, validating the technique and improving our understand-
ing of the role of the pinnae in modifying the incoming sound in a direction-dependent
manner. Similar work was carried out in the same period by Kahana [44,46]. Such
simulations were initially limited, due to computational resources, to an upper fre-
quency of 6 kHz, then extended to 10 and 20 kHz in later studies [32,45]. Even
in these cases the validation was performed comparing the numerical model results
with experimental measurements showing a good match between the two, also in
light of the variances observed between different HRTF measurement systems for
the same individual [4,47]. The computational complexity of these numerical meth-
ods was a major limitation in the early years of using this technique for generating
HRTFs. Various optimisation techniques have since been proposed [35, 55, 70], allowing
significantly faster computation times with reasonable processing resources (i.e. no
longer needing super computers). This led to the development of easy-to-use and
open-source tools for the numerical calculation of HRTFs. A notable example is
mesh2hrtf [110], a software package centred on a BEM solver, as well as tools
for the pre-processing of geometry data, generation of evaluation grids and post-
processing of calculation results. It is essential here to consider a major challenge to
be tackled when approaching HRTF synthesis from geometrical models, which is the
acquisition and processing of the 3D models from which the HRTFs are computed.
Evaluations of various 3D scanning methods, specifically looking at capturing the
geometry of the pinnae, have been carried out [44,69,80].
Numerical simulations also brought significant benefits with regard to repeata-
bility, replicability and reproducibility. A comparison of different numerical tools
for simulating an HRTF from scan data by Greff and Katz [32] (here employing
the high-resolution scan of a Neumann KU-100 shown in Fig. 4.2b) showed little
variance. In contrast, a similar comparison of acoustical HRTF measurements using
the same head at different laboratories [4] showed significant variations between
resulting HRTFs. Another significant advantage of numerically modelling HRTFs
rather than measuring them is that with physical measurements on human subjects
it is difficult or impossible to isolate the influence of different morphological char-
acteristics on the actual HRTF filters.
Morphological Relationships
Exploring and modelling the relationship between geometrical features and filter
characteristics is indeed a very important step for advancing our understanding of
the spatial hearing processes. Research in this area was strongly advanced with the
distribution of the CIPIC HRTF database [2], which included associated morpho-
Fig. 4.3 Two pinna created with the parametric model developed in [91]
logical parameter data for most subjects. This effort was followed with the LISTEN
HRTF database [98], providing similar data. Benefiting from the power of numer-
ical simulation and controlled geometrical models, Katz and Stitt [91] investigated
the effect of morphological changes by varying specific morphological parameters,
an extension of the CIPIC set of morphological parameters to provide more unique
solutions. In order to do this, they created a Parametric Pinna Model (PPM) and
with BEM they investigated the sensitivity of the HRTF to specific morphological
alterations. Examples of pinnae created using this PPM can be seen in Fig. 4.3.
Evaluations included the use of auditory models [88] to identify those morpholog-
ical changes most likely to affect spatial hearing perception. In line with previous
studies, morphological features near to the rear of the helix were found to have little
influence on HRTF objective metrics, while the dimension of the concha had a much
more relevant impact, both looking at the directional and diffuse HRTF spectral com-
ponents.2 Other relevant findings include the importance of the region around the
triangular fossa, which is often not considered when looking at HRTF personalisa-
tion, and the fact that the relief (or depth, directions parallel to the interaural axis)
parameters were found to be at least as important as side-facing parameters, which
are more frequently cited in morphological/HRTF studies.
Such interest in binaural audio, combined with major advancements in terms of
available technologies, has encouraged the publication of large datasets of BEM-
generated HRTFs and correspondent high-accuracy 3D geometrical models. An
example is the Sydney York Morphological and Acoustic Recordings of Ears
(SYMARE) database [42], which was then followed by other examples of either head-
related or reduced-complexity pinnae-related datasets [18,34]. The availability
of such large datasets opened the door to the use of machine learning approaches to
tackle the issue of morphology-based HRTF personalisation. An example is the work
by Grijalva and colleagues [33], where a non-linear dimensionality reduction tech-
nique is used to decompose and reconstruct the HRTF for individualisation, focusing
on elements which vary the most between positions and across individuals. Results
may offer improved performance over linear methods, such as principal component
analysis (e.g. [81]).
2 The diffuse field component is the spatial average of the HRTF. When removed from the HRTF,
the result is a diffuse field equalised directional transfer function (DTF) [64].
HRTFs, Binaural Models and Perceptual Evaluations
It is evident that since the 1990s a large amount of work has been carried out looking
at synthesising HRTFs and better understanding the relationship between these and
morphological features of the pinnae, head and torso. Nevertheless, it must be reiter-
ated that very few of the reviewed studies have included perceptual evaluations on the
modelled HRTFs [18,56], and that in no case were such subject-based validations
extensive enough to fully support the use of synthesised HRTFs instead of measured
ones. It is therefore clear that significant research is still needed in order to develop
and validate models that can describe, classify and ultimately generate individual
HRTFs from a reduced set of parameters.
While numerical assessments can be very useful when trying to better explain
experimental results, they cannot be the only way to explore and validate the quality
of the rendering choices. Binaural models (e.g. [88]) could become an invaluable
tool to help overcome such limitations, as they offer a computational simulation of
binaural auditory processing and, in certain cases, also allow the prediction of listeners’
responses to binaural signals. Using them, it is possible to rapidly perform com-
prehensive evaluations that would be too time-consuming to implement as actual
auditory experiments (e.g. [17]).
An example of this approach can be found in [29], where an anthropometry-based
mismatch function between HRTF pairs, looking at the relationship between pinna
geometry and localisation cues, was used to select an optimal HRTF for a given
individual, specifically looking at vertical localisation. The outcome of the selection
was then evaluated using an auditory model which computed a mapping between
HRTF spectra and perceived spatial locations. While this study outlined that the best
fitting HRTF selected with the proposed method was predicted to yield a significantly
improved vertical localisation when compared to a selected generic HRTF, it must be
reiterated that the reliability of perceptual models is still to be thoroughly validated,
and potential biases can be identified and dealt with only through actual perceptual
evaluations. Another similar application of binaural models has been recently pub-
lished, focusing on the comparison between different Ambisonics-based binaural
rendering methods [25]. The very large number of independent variables (e.g. each
method was tested with Ambisonics orders from 1 to 44), as well as the complex-
ity of the interactions between such variables, would make it very challenging to
run perceptual evaluations with subjects. This study showed not only that models’
predictions were consistent with previous perceptual data, but also contributed to
validating the models’ ability to predict user responses to binaural signals.
It is likely that models will never be able to provide 100% accurate assessments
near to the zone of perfect reproduction, in part due to the difficulties in modelling
processes such as cognitive loading and procedural/perceptual learning. However,
it is reasonable to expect them to provide broadly correct predictions for larger
errors. This means that they could be particularly useful when prototyping rendering
algorithms and designing HRTF personalisation experiments, in order to rapidly
reduce the number of conditions and variables which are subsequently assessed
through real subject-based perceptual evaluations.
Artificial intelligence and machine learning should play an important role in such
future research, looking at improving both HRTF synthesis and selection processes,
as well as the accuracy and reliability of perceptual models.
4.2.2 HRTF Selection
A different approach for obtaining individual (or at least personalised) HRTFs with-
out having to acoustically measure them is to rely on available HRTF databases, either
transforming/tuning the transfer function according to certain subjective criteria, or
designing a process for selecting the best fitting HRTF for a given subject. Regarding
the first option, as mentioned at the beginning of this section, it is generally known that
frequency-independent ITDs from a given HRTF can be modified and personalised
according to e.g. the head circumference of a given listener [9]. Such a technique is
implemented in a few binaural spatialisers [22,78]. However, the personalisation of
other HRTF features, such as monaural and interaural Spectral Cues, presents more
significant challenges. Early works in this direction looked at improving vertical
localisation by scaling the HRTF in frequency [64,65]. Other “simpler” approaches
to tuning were found to be effective, for example, by manually modifying frequency
and phase for every HRTF direction, for the left and right ears independently [86].
Hwang and colleagues [40] carried out a principal component analysis on the CIPIC
HRTFs and used the output components to develop a customisation method based
on subjective tuning of a generalised HRTF. Such customisation allowed listeners
to perform significantly better in vertical perception and front-back discrimination
tasks. The same approach was used to modify and personalise a KEMAR HRTF,
resulting also in this case in significantly improved vertical localisation abilities [84].
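As an illustration of the frequency-independent ITD personalisation mentioned above, the sketch below rescales a dataset ITD using a spherical-head approximation; the Woodworth-style formula and the circumference-to-radius conversion are simplifying assumptions, not the specific method of [9]:

```python
import numpy as np

C_SOUND = 343.0  # speed of sound (m/s)

def spherical_head_itd(azimuth_deg, head_radius_m):
    """Woodworth-style spherical-head ITD: (a / c) * (theta + sin(theta)),
    with theta the lateral angle away from the median plane."""
    theta = np.radians(azimuth_deg)
    return head_radius_m / C_SOUND * (theta + np.sin(theta))

def personalise_itd(dataset_itd_s, dataset_radius_m, head_circumference_m):
    """Rescale an ITD taken from a measured HRTF set to a listener whose head
    radius is estimated from the head circumference."""
    listener_radius_m = head_circumference_m / (2.0 * np.pi)
    return dataset_itd_s * listener_radius_m / dataset_radius_m

# Example: ITD at 60 degrees azimuth for a dummy-head radius of 8.75 cm,
# rescaled for a listener with a 60 cm head circumference.
itd_dataset = spherical_head_itd(60.0, 0.0875)
itd_user = personalise_itd(itd_dataset, 0.0875, 0.60)
print(f"dataset ITD: {itd_dataset * 1e6:.0f} us, personalised ITD: {itd_user * 1e6:.0f} us")
```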
HRTF Selection Methods
Methods for selecting a best fit HRTF based on subjective criteria can be grouped
into two general categories: physical measurement-based matching and perceptual
selection. The first pertains to selecting an HRTF from an existing set based on mor-
phological measurements or sparse acoustical measurements. Of importance is the
determination of the relevant morphological features, as they pertain to spatial hearing
and HRTF-related cues, as examined by [91]. Zotkin and colleagues [112] looked at
a selection strategy based on matching certain anthropometric pinnae parameters of
the specific subject with those of HRTFs within a dataset, while providing associated
low-frequency information using a “head-and-torso” model. Comparison between
a non-personalised HRTF and the selected HRTF via this method showed height-
ened localisation accuracy and improved subjective perception of the virtual auditory
scene when using the latter. A similar approach was used by [81], where advanced
statistical methods were employed to create a subset of morphological parameters,
which were then employed for predicting what might be the subject’s preferred HRTF
based on measurement matching. HRTFs selected using this method performed bet-
ter than randomly selected ones. An alternate selection perspective was proposed
in [30], where a reflection model was applied to the picture of the pinnae of the
subject, facilitating the extraction of relevant anthropometric parameters which were
then used for selecting one or more HRTFs from an existing database. This selection
method resulted in a significant improvement in elevation localisation performances,
as well as an enhancement of the perceived externalisation of the simulated sources.
The relationship between features of the pinna shape and HRTF notches, focusing
specifically on elevation perception, was successfully used in [27] for selecting a best
fitting HRTF from pinna images. Interestingly, studies on Spectral Cues have sug-
gested the importance of notches over peaks in the HRTF [31]. Another work from
Geronazzo and colleagues [28] introduced a rather original approach by developing
the Mixed Structural Modelling (MSM), a framework for HRTF individualisation
which combines structural modelling and HRTF selection. The level of flexibility
of this solution, which allows modelled and recorded components to be mixed (therefore
HRTF selection and synthesis), is particularly promising when looking at the HRTF
personalisation process.
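A minimal sketch of the physical measurement-based matching described above might rank the HRTFs of a database by their distance to the listener in a standardised anthropometric feature space. The choice of features, the standardisation and the Euclidean distance used below are assumptions for illustration; published methods such as [81, 112] rely on more elaborate statistical selection and weighting of parameters.

```python
import numpy as np

def select_hrtf_by_anthropometry(listener_features, database_features):
    """Return database HRTF indices ranked by anthropometric similarity.

    listener_features : 1-D array of pinna/head measurements (e.g. cavum
        concha height, pinna height, head width), in consistent units.
    database_features : 2-D array, one row of the same measurements per
        HRTF set in the database.
    """
    db = np.asarray(database_features, dtype=float)
    target = np.asarray(listener_features, dtype=float)

    # Standardise each feature across the database so that no single
    # measurement dominates the distance purely because of its scale.
    mean, std = db.mean(axis=0), db.std(axis=0)
    db_z = (db - mean) / std
    target_z = (target - mean) / std

    # Euclidean distance in the standardised feature space
    distances = np.linalg.norm(db_z - target_z, axis=1)
    return np.argsort(distances)  # best match first

# Hypothetical example with three features and four candidate HRTF sets
database = [[19.0, 64.0, 152.0],
            [21.5, 68.0, 148.0],
            [18.0, 61.0, 155.0],
            [20.0, 66.0, 150.0]]
listener = [20.2, 65.5, 151.0]
print(select_hrtf_by_anthropometry(listener, database))
```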
HRTF Evaluation
It must be highlighted that whether selection is based on measured or perceptual data,
the evaluation of said method is necessarily perceptual as the final application is a
human-centred experience. With this in mind, a fundamental yet unanswered ques-
tion is: “What determines the suitability of an HRTF for a given subject?” [48]. When
establishing whether an HRTF is a good fit, should one look at how precisely sound
sources can be localised using that HRTF (direct approaches), or should other sub-
jective metrics (e.g. realism, spatial quality or overall preference) be employed [87]?
In employing perceptual selection, the choice of protocol becomes more critical.
In addition, as was observed with acoustical measurements, the repeatability of the
measurement apparatus (here the response of human subjects) must be examined and
taken into account. As an example, past studies using binaural audio rendering for
applications other than spatial hearing research (e.g. [74]) relied on simple percep-
tually based HRTF selection procedures which, at a later stage, resulted in being less
repeatable than originally thought [6]. Without extensive training as seen in some of
the principal earlier studies, the reliability of naive listeners (situations that are more
representative of applied uses of binaural audio than of studies on fundamental
auditory processing) must be taken into account. Early studies on HRTF
selection through ratings [53,74] assumed innate reliability in quality judgements.
Fig. 4.4 Trajectory graphic description reference for HRTF quality ratings: horizontal (left) and
median (right) plane trajectories indicating the start/stop position and trajectory direction (•)
(from [92])
More recently, studies have shown that such reliability cannot be assumed, but must
be evaluated, with some listeners being highly repeatable while others are not [6].
It can be assumed that different HRTFs will, for a given subject, result in different
performances in a sound source localisation task. From this we can infer that an
optimal HRTF could be selected looking at such performances, for example, using
metrics such as localisation errors and front-back and up-down confusion rates (see
Sect. 4.3.2 for metric definitions). This assumption has been the baseline of several
studies where an HRTF selection procedure was designed and evaluated based on
localisation performances [41,83,96]. Such methods previously required specialised
hardware, though current consumer Virtual Reality (VR) devices, thanks to their
increasingly higher performance in terms of tracking capabilities (e.g. [43]), can now
be employed for rendering and reporting the perceived direction of the sound source.
However, these methods still remain rather time-consuming, as a large number of
positions across the whole sphere should be evaluated in order to obtain reliable
results.
Alternatively, HRTF selection can be the result of subjective evaluations based
on indirect quality judgement approaches. Several research works have looked at
asking listeners to rate HRTFs based on the perceived quality of some descriptive
attributes, from the overall impression [106] to how well the auditory presentation
matched specifically described locations or movements of the virtual source [53,83,
85] (e.g. Fig. 4.4). Several methods have been introduced for ultimately being able
to select one or more best performing HRTFs; these include ranking [83], rating on
scales [6,53,82], multiple selection-elimination rounds [97] and pairwise compar-
isons [85,106]. In general, there seems to be an agreement on the fact that expert
assessors (as defined by [107]) perform significantly better (i.e. in a more reliable
and repeatable manner) than initiated assessors [6,54]. To gain further
insight into indirect method results, some work has been carried out to develop global
perceptual distance metrics with the aim of describing both HRTF and listener simi-
larities [8]. In addition to proposing and evaluating a set of perceptual metrics, this
work encourages further research into novel experiment design which could help in
minimising the need for data normalisation and, more importantly, outlines the need
for further investigations on the stability of these perceptual experiments/evaluations,
specifically looking at repeatability and training.
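As a simple illustration of rating-based selection and of the repeatability concern raised above, the sketch below averages repeated quality ratings to pick a best-rated HRTF and uses the mean inter-repetition Spearman correlation as a crude consistency check. The aggregation rule and the repeatability indicator are assumptions of this sketch, not the metrics proposed in [6] or [8].

```python
import numpy as np
from scipy.stats import spearmanr

def select_and_check_repeatability(ratings):
    """Pick the best-rated HRTF and report rating repeatability.

    ratings : 2-D array of shape (n_repetitions, n_hrtfs), one quality rating
    per HRTF and repetition (higher = better). Returns the index of the HRTF
    with the highest mean rating and the mean Spearman correlation between
    repetitions, used here as a crude repeatability indicator.
    """
    ratings = np.asarray(ratings, dtype=float)
    best = int(np.argmax(ratings.mean(axis=0)))

    # Correlate every pair of repetitions; low values flag unreliable raters
    n_rep = ratings.shape[0]
    rhos = []
    for i in range(n_rep):
        for j in range(i + 1, n_rep):
            rho, _ = spearmanr(ratings[i], ratings[j])
            rhos.append(rho)
    return best, float(np.mean(rhos))

# Three repetitions of ratings for five candidate HRTFs
example = [[4, 2, 5, 1, 3],
           [5, 1, 4, 2, 3],
           [4, 2, 5, 1, 2]]
print(select_and_check_repeatability(example))
```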
Methods Comparison
Few studies have examined the similarity between direct (i.e. localisation perfor-
mances) and indirect HRTF selection methods. Using an immersive VR reporting
system for the localisation test, results from [108] indicated a significant and pos-
itive mean correlation between HRTF selection based on localisation performance
and HRTF ranking/selection based on quality judgement; the best HRTF selected
according to one method received significantly better ratings according to the metrics of the
other method. In contrast, using a gestalt reporting method through the use of an
avatar representation of the listener’s head, results from [54] showed no signifi-
cant correlations. A number of protocol differences exist between these two studies,
including the type of tasks used for both methods, the user interface (see [10,11]
regarding localisation reporting method effects), the stimuli signals, as well as the
metrics evaluated in the quality judgement task.
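The agreement between direct and indirect selection can be quantified, for instance, with a rank correlation between the two orderings of the candidate HRTFs, in the spirit of the comparison reported in [108]. The following sketch uses invented per-HRTF results and a Spearman correlation; it is not the analysis performed in either study.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-HRTF results for one listener: mean polar angle error
# (degrees, lower = better) and mean quality rating (higher = better).
polar_error_deg = np.array([28.0, 41.5, 33.2, 25.7, 39.1])
quality_rating = np.array([3.8, 1.9, 2.7, 4.2, 2.1])

# Negate the error so that "better" points in the same direction for both
# measures, then correlate the two orderings of the candidate HRTFs.
rho, p_value = spearmanr(-polar_error_deg, quality_rating)
print(f"rank correlation between direct and indirect selection: {rho:.2f} (p={p_value:.3f})")

best_direct = int(np.argmin(polar_error_deg))
best_indirect = int(np.argmax(quality_rating))
print("same best HRTF selected:", best_direct == best_indirect)
```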
4.3 User-to-System Adaptation: HRTF Accommodation
The previous section examined HRTF selection and individualisation methods in
the signal domain. While such methods aim to provide every individual user with
the best HRTF possible, such approaches are not available in all conditions.
However, evidence is increasingly available showing that the adult brain is adaptable
to environmental changes. It has been demonstrated that this adaptability (or plas-
ticity) regarding spatial auditory processing can lead to a reduction in localisation
error over time when a listener's normal localisation cues are significantly
modified.
It has been established that one can adapt to modified HRTFs over time, with
ear moulds inserted in the pinnae [19,38,94,95], or with non-individual HRTFs
through binaural rendering [73,77,90,92,99,109]. Studies have shown that one can
adapt to distorted HRTFs, e.g. in [60] where participants suffering from hearing loss
learned to use HRTFs whose spectrum had been warped to move audio cues back into
frequency bands they could perceive. HRTF learning is not only possible, but lasting
in time [62,92,109]: users have been shown to retain performance improvements up
to 4 months after training [109]. Given enough time, participants using non-individual
HRTFs may achieve localisation performance on par with participants using their
own individual HRTFs [73,77,92].
This concept has been successfully used to improve user localisation performance
within virtual auditory environments when using non-individual HRTFs. Readers are
referred to [61,104] for more general reviews on the broader topic of HRTF learning.
4.3.1 Training Protocol Parameters
Learning methods explored in previous studies are often based on a localisation
task. This type of learning is referred to as explicit learning [61], as opposed to
implicit learning where the training task does not immediately focus participant
attention on localisation cues [73,92]. Performance-wise, there is no evidence to
suggest either type is better than the other. Implicit learning gives more leeway for
gamification of the task design. Gamification is increasingly applied to the design of
HRTF learning methods [39,73,90,92], and while its impact on HRTF learning
rates remains uncertain [90], its benefit for learning in general is well
established [36]. On the other hand, explicit learning more readily produces training
protocols where participants are consciously focusing on the learning process [63],
potentially helping with the unconscious re-adjustment of auditory spatial mapping.
As much as the nature of the task itself, the feedback provided can play an important
role during learning. VR technologies are increasingly relied upon to increase
feedback density in the hope of increasing HRTF learning rates (in Chap. 10,the
interested reader can find further insights on multisensory feedback in VR). While
results encourage the use of a visual virtual environment [60], it has been reported
that proprioceptive feedback alone can be used to improve learning rates [16,73].
Direct comparison of experimental results suggests that active learning with direct
feedback is more efficient (i.e. leads to faster improvement) than passive learning
from sound exposure [61]. There is also a growing consensus on the use of adaptive
(i.e. head-tracked) binaural rendering during training to improve learning rates [19],
despite the generalised use of static head-locked localisation tasks to assess perfor-
mance evolution [61]. It is not trivial to ascertain whether the benefit of head-tracked
rendering comes from continuous situated feedback improving audio cue recalibra-
tion, or from unbalanced comparison, as static head-locked rendering creates user
frustration and results in less sound exposure [90].
Studies on the training stimulus indicate that learning extends to more than the
signals used during learning [39,90]. This result is likely dependent on specific char-
acteristics of the stimuli and how these relate to auditory localisation mechanisms,
i.e. whether they present the transient energy and broad frequency content necessary
for auditory spatial discrimination [24,57,72].
There is no clear cut result on optimum training session duration and scheduling.
Training session durations reported in previous studies range from ≈8 min [66] to
≈2 h [60]. Comparative analysis argues in favour of several short training sessions
over long ones [61]. Training session spread is also widely distributed in the litera-
ture, ranging from all sessions in one day [57] to one every week or every other
week [92]. While results from [57] suggest that spreading training over time benefits
learning (all in 1 day versus spread over 7 days), outcomes from [73,92] indicate that
weekly and daily sessions result in the same overall performance improvement (for
equal total training duration). There are examples of latent learning (improvement
between sessions) in the literature [66], naturally encouraging the spread of training
sessions. Regardless of duration and spread, studies have shown that learning
saturation occurs after a while. In [59], most of the training effect took place within
the first 400 trials (≈160 min), a result comparable to that reported by [20] where
saturation was reached after 252 to 324 trials.
One of the critical questions not fully answered to date is the role of HRTF
fit in the training process, i.e. how similar the training HRTF is to the actual HRTF of
the individual. It would appear that a certain degree of affinity between a participant
and the training HRTF facilitates learning [73,92]. In contrast, lack of adaptation
can occur if the HRTF to be learned is too different from one’s own HRTF. This
is evidenced by mixed adaptation results in studies where ill-suited HRTF matches
were tested.
4.3.2 HRTF Accommodation Example
We present here, as an example, the HRTF learning study by Stitt et al. [92], which
examined the effect of adaptation to non-individual HRTFs. This study was cho-
sen for this example as it provides a controlled study over a significant number of
training sessions. As a “worst-case” real-world scenario, perceptually worst-rated
non-individual HRTFs were chosen by each subject to allow for maximum poten-
tial for improvement, another factor of interest in its design. This study is part of
a series of studies on the subject of user-to-system adaptation, providing continuity
of comparisons [15,73,77]. The methodology consisted of a training game and a
localisation test to evaluate performance carried out over 10 sessions. Subjects using
non-individual HRTFs (group W10) were tested alongside control subjects using
their own individual measured HRTFs (group C10).
Prior to any training, subjects were assigned non-individual HRTFs based on
quality judgements of rendered sound object trajectories for 7 HRTF sets, taken as
“perceptually orthogonal” [53]. These trajectories, shown in Fig. 4.4, were presented
to subjects as a reference. Following the results of [8], which examined the reliability
and repeatability of HRTF judgements by naive and experienced subjects, this rating
task was performed three times, leading to a total of six ratings per subject, counting
the two trajectories, with the overall judgement rating taken as the overall mean.
The lowest rated HRTF for each subject was then used as that subject’s worst-match
HRTF. This method is an improvement over alternate methods which are either
uncontrolled (e.g. a single HRTF used by all listeners) or limited in the extent of
relative spectral changes presented to subjects when compared to their individual
HRTFs.
The training procedure for the 10 sessions was devised as a simple game with a
searching task in which the listener had to find a target at a hidden position in some
direction (θ,φ), ignoring radial distance. Subjects searched for the hidden target by
moving the motion-tracked hand-held object around their head (see concept in Fig.
4.5). For the duration of the search, alternating pink/white noise (50–20000 Hz) with
an overall level of approximately 55 dBA measured at the ear was presented to the
listener, positioned at the location of the tracked hand-held object relative to the
subject's head.
Fig. 4.5 Training game concept design
This provided a link between the proprioceptively known position of
the subject’s own hand and spatial cues in the binaural rendering. The alternation
rate of the pink/white noise bursts increased with increasing angular proximity to the
target direction using a Geiger counter metaphor [71,79]. Once the subject reached
the intended target direction, a success sound would play, spatialised at the target’s
location. The training game lasted 12 min and subjects were instructed to find as many
targets as possible in the time available. Sessions 1–4 occurred at 1-week intervals,
while the remaining sessions occurred at 2-week intervals.
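The sonification underlying the game can be sketched as a simple mapping from the angular distance between the hand-held object and the hidden target to the alternation rate of the noise bursts. The linear mapping and the rate limits below are illustrative assumptions rather than the values used in the study.

```python
import numpy as np

def angular_distance_deg(dir_a, dir_b):
    """Great-circle angle in degrees between two direction vectors."""
    a = np.asarray(dir_a, dtype=float)
    b = np.asarray(dir_b, dtype=float)
    a /= np.linalg.norm(a)
    b /= np.linalg.norm(b)
    return np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))

def alternation_rate_hz(hand_dir, target_dir, min_rate=1.0, max_rate=12.0):
    """Map angular proximity to a pink/white noise alternation rate.

    The closer the hand-held object points to the hidden target, the faster
    the bursts alternate (Geiger counter metaphor). The linear mapping and
    the rate limits are illustrative, not the values used in the study.
    """
    error = angular_distance_deg(hand_dir, target_dir)  # 0..180 degrees
    proximity = 1.0 - error / 180.0                     # 1 at target, 0 opposite
    return min_rate + (max_rate - min_rate) * proximity

# Hand pointing roughly towards the target direction
print(alternation_rate_hz([0.9, 0.1, 0.2], [1.0, 0.0, 0.0]))
```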
It should be emphasised that no auditory localisation on the part of the subject
was actually required to accomplish this task, only tempo judgements of the alter-
nation rate of the pink/white noise bursts and proprioceptive knowledge of one’s
hand position. HRTF adaptation was therefore an implicit result of game play, but
not the task of the game as far as the participant was aware. This task was designed
to facilitate learning with source positions outside of the visual field of view, as well
as to function for individuals with visual impairments.
Performance Evaluation Metrics
The HRTF accommodation was evaluated via localisation tests. Subjects were pre-
sented a brief burst of noise (to limit the influence of any possible head movement
during playback) and would subsequently point in the perceived direction of the
sound using the hand-held object. No feedback was given to subjects regarding the
target position. The noise burst consisted of a train of three 40 ms Gaussian broad-
band noise pulses (20000 Hz) with a 2 ms raised cosine window applied at onset and
offset and 30 ms of silence between each burst [73]. There were 25 target directions
with 5 repetitions of each target, with the tested sphere covering a full 360° of azimuth
and –40° to 90° of elevation.
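A rough sketch of such a stimulus is given below. The burst, ramp and gap durations follow the description above, while the sampling rate, the noise generator and the normalisation are assumptions of this sketch.

```python
import numpy as np

def localisation_test_stimulus(fs=48000, n_bursts=3, burst_ms=40.0,
                               ramp_ms=2.0, gap_ms=30.0):
    """Train of Gaussian broadband noise bursts with raised-cosine ramps.

    Parameter defaults follow the description in the text (three 40 ms
    bursts, 2 ms onset/offset ramps, 30 ms of silence between bursts); the
    sampling rate and normalisation are assumptions.
    """
    n_burst = int(fs * burst_ms / 1000)
    n_ramp = int(fs * ramp_ms / 1000)
    n_gap = int(fs * gap_ms / 1000)

    rng = np.random.default_rng()
    burst = rng.standard_normal(n_burst)

    # Raised-cosine (Hann-shaped) onset and offset ramps
    ramp = 0.5 * (1 - np.cos(np.pi * np.arange(n_ramp) / n_ramp))
    burst[:n_ramp] *= ramp
    burst[-n_ramp:] *= ramp[::-1]

    gap = np.zeros(n_gap)
    train = np.concatenate([burst, gap] * (n_bursts - 1) + [burst])
    return train / np.max(np.abs(train))  # normalise to full scale

stimulus = localisation_test_stimulus()
print(len(stimulus) / 48000, "seconds")  # ~0.18 s in total
```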
Two types of metrics were used to analyse localisation errors: angular and confu-
sion metrics. The interaural coordinate system defines a lateral and a polar angle (Fig. 4.6a).
Fig. 4.6 Interaural polar coordinate system and associated polar angle cone-of-confusion zone
definitions
The lateral angle is the angle between the interaural axis and the line between the
origin and the source. The lateral angle corresponds to cones-of-confusion along which
the interaural cues (ITD and ILD) are approximately equal. A cone-of-confusion is
defined by the contour around the listener for a given ITD or ILD (see Fig. 4.1). For
ITD, these contours can be generally represented by a hyperbolic function, where
the difference in arrival time to the two ears is constant and the vertex is on the
interaural axis, between the two ears. The intersection of the ITD and ILD cones-of-
confusion for a given stimulus prescribes a closed curve (approaching a circle). The
ITD and ILD are insufficient to resolve the localisation ambiguity, requiring further
information, such as from Spectral Cues or head movements. The polar angle is the
angle between the horizontal plane and a perpendicular line from the interaural axis
to the point, such that the polar angle prescribes the source location on the cone-
of-confusion. The polar angle is primarily linked with the monaural, Spectral Cues
in the HRTF. This independence of binaural and Spectral Cues makes the interaural
coordinate system a natural choice when looking at localisation performance. If the
perceived ILD, ITD and Spectral Cues of a given source do not adequately coincide
with the expectations of the auditory system for a single point in space, uncertainty in
localisation response ensues. The most commonly referenced uncertainties are polar
angle confusions.
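A minimal conversion from a Cartesian source direction to the lateral and polar angles of the interaural coordinate system is sketched below; the axis convention and the angle ranges are assumptions of this sketch, as several equivalent conventions exist in the literature.

```python
import numpy as np

def interaural_polar(x, y, z):
    """Convert a Cartesian direction to interaural-polar coordinates.

    Axis convention (an assumption of this sketch): x points forward,
    y towards the left ear (the interaural axis), z upwards.
    Returns (lateral, polar) in degrees, with the lateral angle measured
    from the median plane (-90..90) and the polar angle measured around
    the cone-of-confusion (0 = front, 90 = above, +/-180 = behind).
    """
    v = np.array([x, y, z], dtype=float)
    v /= np.linalg.norm(v)
    lateral = np.degrees(np.arcsin(v[1]))
    polar = np.degrees(np.arctan2(v[2], v[0]))
    return lateral, polar

# A source up and to the front-left
print(interaural_polar(0.5, 0.5, 0.7))
```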
Polar angle confusions are classified using a traditional segmentation of the cone-
of-confusion [73,92], revised in [108]. The classification results in three potential
confusion types, front-back, up-down and combined, with a fourth type correspond-
ing to precision errors, represented schematically in Fig. 4.6b. The precision category
designates any response close enough to the real target so as not to be associated to
the other confusion types.
Fig. 4.7 Result analysis by subgroup. a Mean absolute polar angle error and 95% confidence
intervals for groups W10+, W10– and C10 across sessions 1–10. b Response classification analysis:
mean classification of results for group W10 by type (precision, front-back, up-down and combined
errors) for subgroups W10+ (—, 3 subjects) and W10– (--, 5 subjects) over sessions 1–10 (from [92])
In short, responses classified under precision are for those
within ±45° of the target angle, front-back classified errors are responses reflected in
the frontal plane, and those classified up-down are for those reflected in the transverse
plane. Any responses that fall outside of these regions are classified as combined type
errors.
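The classification can be sketched as follows, given target and response polar angles on (approximately) the same cone-of-confusion. Applying the same ±45° tolerance to the reflected targets is an assumption of this sketch rather than a detail taken from [73, 92, 108].

```python
import numpy as np

def wrapped_diff_deg(a, b):
    """Smallest absolute difference between two angles in degrees."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def classify_polar_response(target_polar, response_polar, tol=45.0):
    """Classify a localisation response on the cone-of-confusion.

    Angles are polar angles in degrees (0 = front, 90 = above, 180 = behind).
    Follows the segmentation described in the text: responses within +/-45
    degrees of the target are 'precision', responses close to the target
    reflected about the frontal plane are 'front-back', responses close to
    the target reflected about the transverse plane are 'up-down', anything
    else is 'combined'. The +/-45 degree tolerance applied to the reflected
    targets is an assumption of this sketch.
    """
    if wrapped_diff_deg(response_polar, target_polar) <= tol:
        return "precision"
    if wrapped_diff_deg(response_polar, 180.0 - target_polar) <= tol:
        return "front-back"
    if wrapped_diff_deg(response_polar, -target_polar) <= tol:
        return "up-down"
    return "combined"

print(classify_polar_response(30.0, 140.0))  # front-back (30 reflects to 150)
print(classify_polar_response(30.0, -20.0))  # up-down (30 reflects to -30)
```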
Performance Evaluation Results
Results examined the evolution of polar angle error and confusion rates. As a measure
of accommodation, the rate of improvement was defined as the gradient of the linear
regression of polar angle error. The rates of improvement for the 8 subjects spanned
values of 0.5° to 4.6°/session over sessions 5–10 (as results for initial sessions have
been shown to be influenced by procedural learning effects [59]). In contrast, results
for the control group over the same sessions spanned 0° to 2.2°/session. A clustering
analysis of the test group relative to the control group, C10, separated those whose
rate of improvement exceeded that of the control group (subgroup W10+) and the
remaining subjects (W10–) who did not. This second group failed to exhibit clear
HRTF adaptation results over and above that of the control group whose improvement
can be considered primarily as procedural learning.
The polar errors are shown in Fig. 4.7a for groups W10+, W10– and C10. Group
W10+ approached a similar level of absolute performance to C10. This demonstrates
that these subjects were able to adapt well to their worst-rated HRTF to a level
approaching subjects using their individually measured one. It also shows clearly that,
despite continuous training, some subjects (W10–) exhibited little or no improvement
beyond the procedural learning seen in C10.
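The rate of improvement defined above can be computed as the slope of a least-squares linear fit of polar angle error against session number. The sketch below uses invented data and reports the negated slope so that a positive value corresponds to decreasing error, matching the convention of the values quoted above.

```python
import numpy as np

def improvement_rate(sessions, polar_error_deg):
    """Rate of improvement in degrees per session.

    Defined, as in the text, via the gradient of a linear regression of mean
    absolute polar angle error against session number (negated here so that
    a positive value means the error decreases over sessions).
    """
    slope, _intercept = np.polyfit(sessions, polar_error_deg, deg=1)
    return -slope

# Hypothetical mean polar errors for one subject over sessions 5-10
sessions = np.arange(5, 11)
errors = np.array([42.0, 40.5, 37.8, 36.1, 34.9, 33.0])
print(f"{improvement_rate(sessions, errors):.1f} deg/session")
```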
The response classification results for groups W10+ and W10– are shown in Fig.
4.7b. At the outset of the study, it can be observed that up-down and front-back
type error rates are comparable between the two subgroups, with W10– exhibiting
more combined type errors. This metric could be a potential indicator for identifying
poor HRTF adaptation conditions. Subsequently, it can be clearly seen that group
W10+ exhibits a steady increase in precision classified responses, with reductions in
front-back errors over sessions 3–5 and subsequent reductions in combined errors.
In contrast, group W10– exhibits generally consistent response classifications across
sessions, with only small increases in precision classification mirrored by a decreas-
ing trend in front-back errors. For all subjects, it can be noted that the occurrence of
up-down errors is quite rare.
Results of this accommodation study show that adaptation to an individual’s per-
ceptually worst-rated HRTF can continue as long as training is provided, though the
rate of improvement decreases after a certain amount of training. A subgroup achieved
localisation performance levels approaching those of the control group with individual
HRTFs. These performance levels were comparable to those observed in [73] with an
identical test protocol, where subjects performed only three training sessions using
their perceptually best rated HRTF.
4.4 Discussion
It is clear that, while various methods and tools are available for selecting a best fit
HRTF for a given listener, there is no established evaluation protocol to determine
how well these methods work and compare with each other. While some work is
advancing in proposing common methodologies and metrics [75], the lack of estab-
lished methods raises some very relevant questions about the feasibility of a unique
HRTF selection task which performs reliably and independently of factors such
as the listener's expertise, the signals employed, the user interface, the context where
the tests are carried out and, more generally, the task for which the final quality
is judged. It seems evident that any major leap forward in this field is limited until
two primary issues are addressed: (1) the establishment of pertinent metrics to per-
ceptually assess HRTFs and (2) the relationship between these metrics and specific
characteristics of the signal domain HRTF filters.
The use of HRTF adaptation, in examining the results of this and previous studies,
has been shown to be a viable option to improve spatial audio rendering, at least
with regard to localisation. The level of adaptation achievable is related to the initial
suitability (perceptual similarity) between the system HRTF and the user’s individual
HRTF, with more suitable HRTFs showing more rapid adaptation. No significant
effect has been found regarding the specific training intervals, though spreading out
sessions is better than multiple sessions on the same day. The adaptation method
could be integrated into a stand-alone game application, or as part of the device setup
and personalisation configurations typical, to some degree, of most VR devices. The
major limitation, once the training HRTF is chosen, is the need for repeated training
sessions, and this must be made clear to users so that they do not expect ideal results
from the start.
Fig. 4.8 Example active HRTF learning training game. Training setup: (top-left) participant in
the experiment room, (bottom-left) third-person view of the training platform, (right) participant
viewpoint during the training (from [77])
The combination of user-to-system and system-to-user adaptation is a promis-
ing solution. While user-to-system adaptation appears limited by the initial training
HRTF employed, system-to-user adaptation methods provide various means of pro-
viding, if not a perfect individual HRTF, at least a close approximation. As such,
selection of a reasonably good HRTF match followed by user training could be a viable
real-world solution.
An example of such a tailored HRTF training has been tested in [77]. In this work,
as compared to the previously mentioned study in Sect. 4.3.2, the subject was aware of
the goal of the training, with specific HRTF-based localisation challenges presented
with increasing difficulty (see Fig. 4.8). In addition, a best-match HRTF condition was
employed using an interactive exploration method, rather than the general ranking
described in Sect. 4.2.2 and a worst-case selection scenario. Results indicated that
the proposed training program led to improved learning rates compared to those of
previous studies. A further addition of this study was the inclusion of a simulated
room acoustic response, moving from the typical anechoic conditions of previous
studies to a more natural acoustic for the user. Results showed that the addition of
the room acoustics improved HRTF adaptation rate across sessions.
4.5 Conclusions and Future Directions
While binaural audio and spatial hearing have been studied for over 100 years, major
advancements in these fields have occurred in the last two to three decades, possi-
bly thanks to progress in real-time computing technologies. It has been extensively
shown that everyone perceives spatial sound differently thanks to the particular shape
of their ears, head and torso. For this reason, either high-quality simulations need to
be uniquely tailored to each individual listener, or the listener needs to adapt to the
configuration (i.e. the HRTF on offer) of the rendering system, or some combi-
nation of both using individualised HRTFs. This chapter has provided an overview of
research aimed at systematically exploring, assessing and validating various aspects
of these two approaches. But while there is a good level of agreement on certain
notions and principles, e.g. that using non-individual HRTFs can result in impaired
localisation performance which can however be improved through perceptual train-
ing, there are still open challenges in need of further investigation.
A rather general but very important question that has yet to be addressed is how
we can measure whether a simulated immersive audio experience is suitable and of
sufficient quality for a given individual. Previous work has established a certain level
of standardisation for assessing general audio quality (e.g. related to telecommuni-
cation and audio compression algorithms), but equivalent work has yet to be carried
out in the field of immersive audio. Objective and subjective metrics for assessing
HRTF similarity have been explored and evaluated in the past [5], and recently pub-
lished research suggests that additional metrics might exist, e.g. looking at speech
understanding performance [21] or machine-learning-based artificial localisation tests [3,
13]. Nevertheless, extensive research is still needed in order to understand and model
low-level psychophysical (sensory) as well as high-level psychological (cognitive)
spatial hearing perception.
Factors other than choices related to binaural audio processing could also have an
impact on the overall perception of the rendered scenes. The fact that high-quality,
albeit non-interactive, immersive audio rendering can be achieved through record-
ings done with a simple binaural microphone, which by definition do not account for
individualised HRTFs, can be considered an example of the considerable complexity and
dimensionality of the problem. Matters such as the choice of audio content, the con-
text of the rendered scene, as well as the experience of the listener (e.g. whether they
have previously participated in immersive audio assessments) have been shown to be
relevant when assessing the perceived quality of the immersive audio rendering [6,
54]. Such a discussion finds a natural continuation in Chap. 5.
Looking more closely at the need to quantify the individually perceived quality
of the rendering, understanding the perceptual weighting of the morphological
factors contributing to spatial hearing becomes an essential target.
Data-based machine learning approaches may be a useful tool when tackling this, as
well as challenges related to user-to-system adaptation. Examples include allowing a
certain level of customisation of the training by individually and adaptively varying
the difficulty of the challenge, maximising learning and at the same time avoiding
an overload of sensory and cognitive capabilities. Further explorations on spatial
hearing adaptation should focus on exploring the transferability of the acquired training
between different hearing skills (e.g. [100]) and examining to what extent spatial
auditory training performed in VR is transferable to real-life tasks.
Another very relevant yet still under-explored area of research is employing cog-
nitive and psycho-physiological measurements when trying to assess both the quality
of rendered spatial hearing cues and the cognitive effort during HRTF training. In
the first case, measures related to behavioural performance, as well as electroen-
cephalographic markers of selective attention, could be used to assess the suitability
of immersive rendering choices [23], possibly opening the path towards passive
perceptual-based HRTF selection. In the second case, similar metrics, with the addi-
tion of other measures of listening effort such as pupil dilation [103], could be
employed for customising spatial hearing training routines, maximising outcomes
while maintaining engagement and feasibility of the proposed tasks.
Final Thoughts
While most studies have focused on laboratory conditions to isolate specific percep-
tion elements, recent context-relevant studies have begun to examine the impact of
spatial audio quality on task accomplishment. For example, [76] compared perfor-
mance in a first-person-shooter VR game context with different HRTF conditions.
Results showed that performance for extreme elevation target positions was affected by
the quality of HRTF matching. In addition, a subgroup of participants showed higher
sensitivity to HRTF choice than others. At the same time, low-level sensory percep-
tion is only one of the dimensions where immersive audio simulations can have a
significant impact. In order to significantly advance our understanding of the impact
of HRTF personalisation in virtually rendered scenes and tasks, research needs to
move beyond the evaluation of individual immersive audio tasks and metrics (e.g.
sound localisation and/or perceived quality of the rendering), moving towards the
evaluation of full experiences. The impact of immersive audio beyond perceptual
metrics such as localisation, externalisation and immersion [87] is an as yet unex-
plored area of research, specifically when related to social interaction, entering
the behavioural and cognitive realms.
In the past, several studies have been published in which auditory-based AR/VR
interactions were created and evaluated without considering HRTF choice or using
HRTF personalisation approaches that had not previously been appropriately vali-
dated from a perceptual point of view, or ignoring the effects of HRTF accom-
modation, or blaming them in order to justify unexpected results. Considering our
current knowledge and experience in immersive audio research, we are keen to rec-
ommend carrying out some level of personalisation of the spatial rendering when per-
forming studies which involve auditory-based or multimodal interactions in AR/VR.
As a baseline, ITDs can easily be customised to match the head circumference of
the specific listener (as mentioned above, this function is already implemented in
a few spatialisers, such as [22,78]). Furthermore, HRTF selection routines, both
perceptual and morphology based, could be very beneficial if carried out before the
experiment, although it is important for the repeatability of such choices to be assessed
with the specific subject (i.e. repeating the selection several times in order to ver-
ify the consistency across the trials, and possibly discard subjects/methods which
do not show a sufficient level of repeatability). Regarding the use of synthesised
HRTFs, until these are validated through extensive perceptual studies our advice
is to use measured ones, possibly coming from the same dataset in order to avoid
measurement-based differences.
In addition to these recommendations, it is important to emphasise that the future
of immersive audio research will need to include studies focusing on different con-
texts (e.g. AR/VR interactions, virtual museum explorations and virtual assistant
avatars), exploring the impact (and need) of HRTF personalisation on complex tasks
such as interpersonal exchanges and distance learning in VR. Furthermore, in order
to ensure a sufficient level of standardisation and consistently advance the achieve-
ments of research in this area, it seems evident that a concerted and coordinated effort
across disciplines and research groups is highly desirable.
Acknowledgements Preparation of the chapter was made possible by support from SONICOM
(www.sonicom.eu), a project that has received funding from the European Union’s Horizon 2020
research and innovation program under grant agreement No. 101017743.
References
1. Algazi, V. R., Duda, R. O., Duraiswami, R., Gumerov, N. A., Tang, Z.: Approximating the
head-related transfer function using simple geometric models of the head and torso. J Acoust
Soc Am 112, 2053–2064 (2002).
2. Algazi, V. R., Duda, R. O., Thompson, D. M., Avendano, C.: The CIPIC HRTF database in Pro-
ceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and
Acoustics (Cat. No. 01TH8575) (2001), 99–102.
3. Ananthabhotla, I., Ithapu, V. K., Brimijoin, W. O.: A framework for designing head-related
transfer function distance metrics that capture localization perception. JASA Express Letters
1, 044401:1–6 (2021).
4. Andreopoulou, A., Begault, D. R., Katz, B. F.: Inter-Laboratory Round Robin HRTF Measure-
ment Comparison. IEEE J Selected Topics in Signal Processing 9, 895–906 (2015).
5. Andreopoulou, A., Katz, B. F. G.: On the use of subjective HRTF evaluations for creating global
perceptual similarity metrics of assessors and assessees in Intl Conf on Auditory Display (2015),
13–20.
6. Andreopoulou, A., Katz, B. F. G.: Investigation on Subjective HRTF Rating Repeatability in
Audio Eng Soc Conv 140 (Paris, June 2016), 9597:1–10.
7. Andreopoulou, A., Katz, B. F.: Identification of perceptually relevant methods of inter-aural
time difference estimation. J Acoust Soc Am 142, 588–598 (2017).
8. Andreopoulou, A., Katz, B. F.: Subjective HRTF evaluations for obtaining global similarity
metrics of assessors and assessees. Journal on Multimodal User Interfaces 10, 259–271 (2016).
9. Aussal, M., Alouges, F., Katz, B. F.: ITD Interpolation and Personalization for Binaural Synthe-
sis Using Spherical Harmonics in Audio Eng Soc UK Conf (York, UK, Mar. 2012), 04:01–10.
10. Bahu, H.: Localisation auditive en contexte de synthèse binaurale nonindividuelle PhD thesis
(Université Pierre et Marie Curie-Paris VI, 2016).
11. Bahu, H., Carpentier, T., Noisternig, M., Warusfel, O.: Comparison of different egocentric
pointing methods for 3D sound localization experiments. Acta Acust 102, 107–118 (2016).
12. Batteau, D. W.: The role of the pinna in human localization. Proceedings of the Royal Society
of London. Series B. Biological Sciences 168, 158–180 (1967).
13. Baumgartner, R., Majdak, P., Laback, B.: Modeling sound-source localization in sagittal planes
for human listeners. J Acoust Soc Am 136, 791–802 (2014).
14. Begault, D. R.: 3-D Sound for Virtual Reality and Multimedia (Academic Press, Cambridge,
1994).
15. Blum, A., Katz, B., Warusfel, O.: Eliciting adaptation to non-individual HRTF spectral cues
with multi-modal training in 7ème Cong de la Soc Française d’Acoustique et 30ème congrès
de la Soc Allemande d’Acoustique (CFA/DAGA) (Strasbourg, 2004), 1225–1226.
16. Bouchara, T., Bara, T.-G., Weiss, P.-L., Guilbert, A.: Influence of vision on short-term sound
localization training with non-individualized HRTF in EAA Spatial Audio Signal Processing
Symp (2019), 55–60.
17. Brinkmann, F., Weinzierl, S.: Comparison of Head-Related Transfer Functions Pre-Processing
Techniques for Spherical Harmonics Decomposition English. in (Audio Engineering Society,
Aug. 2018).
18. Brinkmann, F. et al.: A cross-evaluated database of measured and simulated HRTFs including
3D head meshes, anthropometric features, and headphone impulse responses. J Audio Eng Soc
67, 705–718 (2019).
19. Carlile, S., Balachandar, K., Kelly, H.: Accommodating to new ears: the effects of sensory and
sensory-motor feedback. J Acous Soc America 135, 2002–2011 (2014).
20. Carlile, S., Leong, P., Hyams, S.: The nature and distribution of errors in sound localization by
human listeners. Hearing Research 114, 179–196 (1997).
21. Cuevas-Rodriguez, M., Gonzalez-Toledo, D., Reyes-Lecuona, A., Picinali, L.: Impact of non-
individualised head related transfer functions on speech-in-noise performances within a syn-
thesised virtual environment. J Acoust Soc Am 149, 2573–2586 (2021).
22. Cuevas-Rodríguez, M. et al.: 3D Tune-In Toolkit: An open-source library for real-time binaural
spatialisation. PloS one 14, e0211899 (2019).
23. Deng, Y., Choi, I., Shinn-Cunningham, B., Baumgartner, R.: Impoverished auditory cues limit
engagement of brain networks controlling spatial selective attention. NeuroImage 202, 116151
(2019).
24. Dramas, F., Katz, B., Jouffrais, C.: Auditory-guided reaching movements in the peripersonal
frontal space in Acoustics’08. 9e Congrès Français d’Acoustique of the SFA. 123 (Acoustical
Society of America, 2008), 3723.
25. Engel, I., Goodman, D. F. M., Picinali, L.: Assessing HRTF preprocessing methods for
Ambisonics rendering through perceptual models. en. Acta Acustica 6, 4 (2022).
26. Genuit, K.: A model for the description of outer-ear transmission characteristics PhD thesis
(Rhenish-Westphalian Technical University, Düsseldorf, 1984), 220.
27. Geronazzo, M., Peruch, E., Prandoni, F., Avanzini, F.: Applying a single-notch metric to image-
guided head-related transfer function selection for improved vertical localization. Journal of
the Audio Engineering Society 67, 414–428 (2019).
28. Geronazzo, M., Spagnol, S., Avanzini, F.: Mixed structural modeling of head-related transfer
functions for customized binaural audio delivery in 2013 18th International Conference on
Digital Signal Processing (DSP) (2013), 1–8.
29. Geronazzo, M., Spagnol, S., Avanzini, F.: Do we need individual head-related transfer func-
tions for vertical localization? The case study of a spectral notch distance metric. IEEE/ACM
Transactions on Audio, Speech, and Language Processing 26, 1247–1260 (2018).
30. Geronazzo, M., Spagnol, S., Bedin, A., Avanzini, F.: Enhancing vertical localization with image-
guided selection of non-individual head-related transfer functions in 2014 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014), 4463–4467.
31. Greff, R., Katz, B.: Perceptual evaluation of HRTF notches versus peaks for vertical localisation
in Intl Cong on Acoustics 19 (Madrid, Spain, 2007), 1–6.
32. Greff, R., Katz, B.: Round Robin comparison of HRTF simulation results : preliminary results.
in Audio Eng Soc Conv 123 (New York, USA, 2007), 1–5.
33. Grijalva, F., Martini, L., Florencio, D., Goldenstein, S.: A manifold learning approach for per-
sonalizing HRTFs from anthropometric features. IEEE/ACM Transactions on Audio, Speech,
and Language Processing 24, 559–570 (2016).
34. Guezenoc, C., Seguier, R.: A wide dataset of ear shapes and pinna-related transfer functions
generated by random ear drawings. J Acoust Soc Am 147, 4087–4096 (2020).
35. Gumerov, N. A., O’Donovan, A. E., Duraiswami, R., Zotkin, D. N.: Computation of the head-
related transfer function via the fast multipole accelerated boundary element method and its
spherical harmonic representation. J Acoust Soc Am 127, 370–386 (2010).
36. Hamari, J., Koivisto, J., Sarsa, H.: Does gamification work? A literature review of empirical
studies on gamification in Intl Conf on System Sciences (2014), 3025–3034.
37. Hendrickx, E. et al.: Influence of head tracking on the externalization of speech stimuli for
non-individualized binaural synthesis. J Acoust Soc Am 141, 2011–2023 (2017).
38. Hofman, P. M., Van Riswick, J. G., Van Opstal, A. J.: Relearning sound localization with new
ears. Nature Neuroscience 1, 417–421 (1998).
39. Honda, A. et al.: Transfer effects on sound localization performances from playing a virtual
three-dimensional auditory game. Applied Acoustics 68, 885–896 (2007).
40. Hwang, S., Park, Y., Park, Y.-s.: Modeling and customization of head-related impulse responses
based on general basis functions in time domain. Acta Acustica united with Acustica 94, 965–
980 (2008).
41. Iwaya, Y.: Individualization of head-related transfer functions with tournament-style listening
test: Listening with other’s ears. Acoustical Science & Technology 27, 340–343 (2006).
42. Jin, C. T. et al.: Creating the Sydney York morphological and acoustic recordings of ears
database. IEEE Transactions on Multimedia 16, 37–46 (2013).
43. Jost, T. A., Nelson, B., Rylander, J.: Quantitative analysis of the Oculus Rift S in controlled
movement. Disability and Rehabilitation: Assistive Technology, 1–5 (2019).
44. Kahana, Y.: Numerical modelling of the head-related transfer function PhD thesis (University
of Southampton, 2000).
45. Kahana, Y., Nelson, P. A.: Boundary element simulations of the transfer function of human
heads and baffled pinnae using accurate geometric models. Journal of Sound and Vibration
300, 552–579 (2007).
46. Kahana, Y., Nelson, P. A., Petyt, M., Choi, S.: Boundary element simulation of HRTFs and
sound fields produced by virtual acoustic imaging systems in Audio Engineering Society Con-
vention 105 (1998).
47. Katz, B., Begault, D.: Round robin comparison of HRTF measurement systems : preliminary
results. in Intl Cong on Acoustics 19 (Madrid, Spain, 2007), 1–6.
48. Katz, B., Nicol, R. in Sensory Evaluation of Sound (ed Zacharov, N.) 349–388 (CRC Press,
Boca Raton, 2019).
49. Katz, B. F. G.: Measurement and Calculation of Individual Head-Related Transfer Functions
Using a Boundary Element Model Including the Measurement and Effect of Skin and Hair
Impedance PhD thesis (The Pennsylvania State University, 1998).
50. Katz, B. F. G., Noisternig, M.: A comparative study of interaural time delay estimation methods.
J Acoust Soc Am 135, 3530–3540 (2014).
51. Katz, B. F.: Boundary element method calculation of individual head-related transfer function.
I. Rigid model calculation. J Acoust Soc Am 110, 2440–2448 (2001).
52. Katz, B. F.: Boundary element method calculation of individual head-related transfer function.
II. Impedance effects and comparisons to real measurements. J Acoust Soc Am 110, 2449–2455
(2001).
53. Katz, B. F., Parseihian, G.: Perceptually based head-related transfer function database opti-
mization. J Acoust Soc Am 131, EL99–EL105 (2012).
54. Kim, C., Lim, V., Picinali, L.: Investigation Into Consistency of Subjective and Objective
Perceptual Selection of Non-individual Head-Related Transfer Functions. J Audio Eng Soc 68,
819–831 (2020).
55. Kreuzer, W., Majdak, P., Chen, Z.: Fast multipole boundary element method to calculate head-
related transfer functions for a wide frequency range. J Acoust Soc Am 126, 1280–1290 (2009).
56. Kreuzer, W., Majdak, P., Haider, A.: A boundary element model to calculate HRTFs. Compar-
ison between calculated and measured data in Proceedings of the NAG-DAGA International
Conference 2009 (2009), 196–199.
57. Kumpik, D. P., Kacelnik, O., King, A. J.: Adaptive reweighting of auditory localization cues
in response to chronic unilateral earplugging in humans. J of Neuroscience 30, 4883–4894
(2010).
58. Lopez-Poveda, E. A., Meddis, R.: A physical model of sound diffraction and reflections in the
human concha. J Acoust Soc Am 100, 3248–3259 (1996).
59. Majdak, P., Goupell, M. J., Laback, B.: 3-D localization of virtual sound sources: Effects of
visual environment, pointing method, and training. Attention, Perception, & Psychophysics
72, 454–469 (2010).
60. Majdak, P., Walder, T., Laback, B.: Effect of long-term training on sound localization perfor-
mance with spectrally warped and band-limited head-related transfer functions. J Acous Soc
America 134, 2148–2159 (2013).
61. Mendonça, C.: A review on auditory space adaptations to altered head-related cues. Frontiers
in Neuroscience 8, 219:1–14 (2014).
62. Mendonça, C., Campos, G., Dias, P., Santos, J. A.: Learning auditory space: Generalization
and long-term effects. PloS One 8, 1–14 (2013).
63. Mendonça, C. et al.: On the improvement of localization accuracy with nonindividualized
HRTF-based sounds. J Audio Eng Soc 60, 821–830 (2012).
64. Middlebrooks, J. C.: Individual differences in external-ear transfer functions reduced by scaling
in frequency. J Acoust Soc Am 106, 1480–1492 (1999).
65. Middlebrooks, J. C., Macpherson, E. A., Onsan, Z. A.: Psychophysical customization of direc-
tional transfer functions for virtual sound localization. J Acoust Soc Am 108, 3088–3091
(2000).
66. Molloy, K., Moore, D. R., Sohoglu, E., Amitay, S.: Less is more: latent learning is maximized
by shorter training sessions in auditory perceptual learning. PloS One 7, 1–13 (2012).
67. Morimoto, M., Aokata, H.: Localization cues of sound sources in the upper hemisphere. J
Acous Soc Japan 5, 165–173 (1984).
68. Nicol, R.: Binaural Technology 77 (Audio Engineering Society, New York 2010).
69. Ospina, F. R., Emerit, M., Katz, B. F.: The 3D morphological database for spatial hearing
research of the BiLi project in Proc. of Meetings on Acoustics 23 (Pittsburg, May 2015), 1–17.
70. Otani, M., Ise, S.: Fast calculation system specialized for head-related transfer function based
on boundary element method. J Acoust Soc Am 119, 2589–2598 (2006).
71. Parseihian, G., Katz, B., Conan, S.: Sound effect metaphors for near field distance sonification
in Intl Conf on Auditory Display (Atlanta, June 2012), 6–13.
72. Parseihian, G., Katz, B. F. G.: Morphocons: A New Sonification Concept Based on Morpho-
logical Earcons. J Audio Eng Soc 60, 409–418 (2012).
73. Parseihian, G., Katz, B. F. G.: Rapid head-related transfer function adaptation using a virtual
auditory environment. J Acous Soc America 131, 2948–2957 (2012).
74. Picinali, L., Afonso, A., Denis, M., Katz, B. F.: Exploration of architectural spaces by blind
people using auditory virtual reality for the construction of spatial knowledge. International
Journal of Human-Computer Studies 72, 393–407 (2014).
75. Poirier-Quinot, D., Stitt, P., Katz, B. in Advances in Fundamental and Applied Research on
Spatial Audio (eds Katz, B., Majdak, P.) (InTech, 2022).
76. Poirier-Quinot, D., Katz, B. F.: Assessing the impact of Head-Related Transfer Function indi-
vidualization on task performance: Case of a virtual reality shooter game. J. Audio Eng. Soc
68, 248–260 (2020).
77. Poirier-Quinot, D., Katz, B. F.: On the improvement of accommodation to non-individual
HRTFs via VR active learning and inclusion of a 3D room response. Acta Acustica 5, 1–17
(2021).
78. Poirier-Quinot, D., Katz, B. F.: The Anaglyph binaural audio engine in Audio Engineering
Society Convention 144 (2018).
79. Poirier-Quinot, D., Parseihian, G., Katz, B. F.: Comparative study on the effect of Parameter
Mapping Sonification on perceived instabilities, efficiency, and accuracy in real-time interactive
exploration of noisy data streams. Displays 47, 2–11 (2016).
80. Reichinger, A., Majdak, P., Sablatnig, R., Maierhofer, S.: Evaluation of methods for optical 3-D
scanning of human pinnas in 2013 International Conference on 3D Vision-3DV 2013 (2013),
390–397.
81. Schonstein, D., Katz, B. F.: HRTF selection for binaural synthesis from a database using
morphological parameters in International Congress on Acoustics (ICA) (2010).
82. Schönstein, D., Katz, B. F.: Variability in perceptual evaluation of HRTFs. Journal of the Audio
Engineering Society 60, 783–793 (2012).
83. Seeber, B. U., Fastl, H.: Subjective selection of non-individual head-related transfer functions
in Proceedings of the 2003 Intl Conf on Auditory Display (ICAD) (2003), 259–262.
84. Shin, K. H., Park, Y.: Enhanced vertical perception through head-related impulse response
customization based on pinna response tuning in the median plane. IEICE Transactions on
Fundamentals of Electronics, Communications and Computer Sciences 91, 345–356 (2008).
85. Shukla, R., Stewart, R., Roginska, A., Sandler, M.: User selection of optimal HRTF sets via
holistic comparative evaluation in the Audio Engineering Society Conference on Audio for
Virtual and Augmented Reality (AVAR) 2018 (Audio Engineering Society, Redmond, WA,
USA, 2018).
86. Silzle, A.: Selection and tuning of HRTFs in Audio Eng Soc Conv 112 (2002), 1–14.
87. Simon, L., Zacharov, N., Katz, B. F. G.: Perceptual attributes for the comparison of Head-
Related Transfer Functions. J Acous Soc America 140, 3623–3632 (Nov. 2016).
88. Søndergaard, P., Majdak, P. in The Technology of Binaural Listening (ed Blauert, J.) 33–56
(Springer, Berlin, Heidelberg, 2013).
89. Spagnol, S., Geronazzo, M., Avanzini, F.: On the relation between pinna reflection patterns
and head-related transfer function features. IEEE transactions on audio, speech, and language
processing 21, 508–519 (2012).
90. Steadman, M. A., Kim, C., Lestang, J.-H., Goodman, D. F., Picinali, L.: Short-term effects of
sound localization training in virtual reality. Scientific Reports 9, 1–17 (2019).
91. Stitt, P., Katz, B. F.: Sensitivity analysis of pinna morphology on head-related transfer functions
simulated via a parametric pinna model. J Acoust Soc Am 149, 2559–2572 (2021).
92. Stitt, P., Picinali, L., Katz, B. F. G.: Auditory accommodation to poorly matched non-individual
spectral localization cues through active learning. Scientific Reports 9, 1063:1–14 (2019).
93. Teranishi, R., Shaw, E. A.: External-Ear Acoustic Models with Simple Geometry. J Acoust Soc
Am 44, 257–263 (1968).
94. Trapeau, R., Aubrais, V., Schönwiesner, M.: Fast and persistent adaptation to new spectral cues
for sound localization suggests a many-to-one mapping mechanism. J Acous Soc America 140,
879–890 (2016).
95. Van Wanrooij, M. M., Van Opstal, A. J.: Relearning sound localization with a new ear. J of
Neuroscience 25, 5413–5424 (2005).
96. Voong, T. M., Oehler, M.: Tournament Formats as Method for Determining Best-fitting HRTF
Profiles for Individuals wearing Bone Conduction Headphones in Proceedings of the 23rd
International Congress on Acoustics : integrating 4th EAA Euroregio 2019 : 9–13 September
2019 in Aachen, Germany (eds Ochmann, M., Vorländer, M., Fels, J.) (Berlin, Germany, Sept.
9, 2019), 4841–4847.
97. Wan, Y., Zare, A., McMullen, K.: Evaluating the consistency of subjectively selected head-
related transfer functions (HRTFs) over time in Audio Engineering Society Conference: 55th
International Conference: Spatial Audio (2014).
98. Warusfel, O.: IRCAM Listen HRTF database http://recherche.ircam.fr/equipes/salles/listen.
2003.
99. Wenzel, E. M., Arruda, M., Kistler, D. J., Wightman, F. L.: Localization using nonindividualized
head-related transfer functions. J Acous Soc America 94, 111–123 (1993).
100. Whitton, J. P., Hancock, K. E., Shannon, J. M., Polley, D. B.: Audiomotor perceptual training
enhances speech intelligibility in background noise. Current Biology 27, 3237–3247 (2017).
101. Wightman, F. L., Kistler, D. J.: The dominant role of low-frequency interaural time differences
in sound localization. J Acoust Soc Am 91, 1648–1661 (Mar. 1992).
102. Wightman, F. L., Kistler, D. J.: Resolution of front-back ambiguity in spatial hearing by
listener and source movement. J Acoust Soc Am 105, 2841–2853 (1999).
103. Winn, M. B., Wendt, D., Koelewijn, T., Kuchinsky, S. E.: Best practices and advice for using
pupillometry to measure listening effort: An introduction for those who want to get started.
Trends in hearing 22, 1–32 (2018).
104. Wright, B. A., Zhang, Y.: A review of learning with normal and altered sound-localization
cues in human adults. Intl J of Audiology 45, 92–98 (2006).
105. Xie, B.: Head-Related Transfer Functions and Virtual Auditory Display 2nd ed. (J. Ross
Publishing, Plantation, FL, USA, 2013).
106. Yairi, S., Iwaya, Y., Yôiti, S.: Individualization feature of head-related transfer functions based
on subjective evaluation in 14th Intl Conf on Auditory Display (Paris, 2008).
107. Zacharov, N., Lorho, G.: What are the requirements of a listening panel for evaluating spatial
audio quality? in Proc. Int. Workshop on Spatial Audio and Sensory Evaluation Techniques
(2006).
108. Zagala, F., Noisternig, M., Katz, B. F.: Comparison of direct and indirect perceptual head-
related transfer function selection methods. J Acoust Soc Am 147, 3376–3389 (2020).
109. Zahorik, P., Bangayan, P., Sundareswaran, V.,Wang, K., Tam, C.: Perceptual recalibration in
human sound localization: Learning to remediate front-back reversals. J Acous Soc America
120, 343–359 (2006).
110. Ziegelwanger, H., Kreuzer, W., Majdak, P.: Mesh2HRTF: An open-source software package
for the numerical calculation of head-related transfer functions in 22nd International Congress
on Sound and Vibration (2015).
111. Ziegelwanger, H., Majdak, P., Kreuzer, W.: Numerical calculation of listener-specific head-
related transfer functions and sound localization: Microphone model and mesh discretization.
J Acoust Soc Am 138, 208–222 (2015).
112. Zotkin, D., Hwang, J., Duraiswami, R., Davis, L. S.: HRTF personalization using anthropo-
metric measurements in 2003 IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics (IEEE Cat. No. 03TH8684) (2003), 157–160.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 5
Audio Quality Assessment for Virtual
Reality
Fabian Brinkmann and Stefan Weinzierl
Abstract A variety of methods for audio quality evaluation are available ranging
from classic psychoacoustic methods like alternative forced-choice tests to more
recent approaches such as quality taxonomies and plausibility. This chapter intro-
duces methods that are deemed to be relevant for audio evaluation in virtual and
augmented reality. It details to what extent these methods can be used directly for testing in virtual reality or have to be adapted in specific respects. In addition,
it highlights new areas, for example, quality of experience and presence that arise
from audiovisual interactions and the mediation of virtual reality. After briefly intro-
ducing 3D audio reproduction approaches for virtual reality, the quality that these
approaches can achieve is discussed along with the aspects that influence the quality.
The concluding section elaborates on current challenges and hot topics in the field of
audio quality evaluation and audio reproduction for virtual reality. To bridge the gap between theory and practice, useful resources, software, and hardware for 3D audio production and research are pointed out.
5.1 Introduction
Over the past years, an increasing number of virtual and augmented reality (VR/AR)
applications emerged due to the advent of mobile devices such as smartphones and
head-mounted displays. Audio plays an important role within these applications, one that is by no means restricted to conveying semantic information, for example, through
dialogues or warning sounds. Beyond that, audio holds information about the spa-
ciousness of a scene including the location of sound sources and the reverberance
or size of a virtual environment. In this way, audio can be regarded as a channel
to provide semantic information and spatial information and improve the sense of
presence and immersion at the same time. Due to the key role of audio in VR/AR,
this chapter gives an overview of methods for audio quality assessment in Sect. 5.2,
followed by a brief introduction of audio reproduction techniques for VR/AR in
Sect. 5.3. Readers who are familiar with audio reproduction techniques might skip
Sect. 5.3 and directly continue with Sect. 5.4, which gives an overview of the quality of
existing audio reproduction systems.
5.2 Perceptual Qualities and Their Measurement
Methods and systems for generating virtual and augmented environments can be
understood as a special case of (interactive) audio reproduction systems. Thus, in
principle, all procedures for the perceptual evaluation of audio systems can also
be used for the evaluation of VR systems [6]. These include the procedures for
the evaluation of “Basic Audio Quality”, which are standardized in various ITU
recommendations and focus on the technical system properties and signal processing,
as well as approaches with a wider focus on the listening situation and the presented
audio content, taking into account the “Overall Listening Experience”. In addition,
a number of measures have recently been proposed to more specifically determine
the extent to which technologies for virtual and augmented environments live up to
their claim of providing a convincing equivalent to physical acoustic reality. Finally,
in addition to these holistic measures for evaluating VR and AR, there are a number
of (VR-specific and VR-nonspecific) quality inventories that can be used to perform
a differential diagnosis of VR systems, highlighting the individual strengths and
weaknesses of the system and drawing conclusions for the targeted improvement.
5.2.1 Generic Measures
5.2.1.1 Basic Audio Quality
Since the mid-1990s, the Radiocommunication Sector of the International Telecom-
munication Union (ITU-R) has developed a series of recommendations for the “Sub-
jective assessment of sound quality”. The series includes an overview of the areas
of application of the recommendations with instructions for the selection of the
appropriate standard [35] as well as an overview of “general methods” which are
applied slightly differently in the different standards [36]. They contain instructions
for experimental design, selection of the listening panel, test paradigms and scales,
reproduction devices, and listening conditions up to the statistical treatment of col-
lected data. Originally, these recommendations were mainly used for the perceptual
evaluation of audio codecs, but later, they were also used for the evaluation of multi-
channel reproduction systems and 3D audio techniques. The central construct to be
Fig. 5.1 User interfaces for ABC/HR and MUSHRA tests. Active conditions are indicated by
orange buttons; loop range and current playback position by orange boxes and lines. The ABC/HR
interface shows only one condition but versions with multiple conditions per rating screen are also
possible. If multiple conditions are displayed on a single screen, an additional button to sort the
conditions according to the current ratings might help subjects to establish more reliable ratings
(CC-BY, Fabian Brinkmann)
evaluated by all ITU procedures is “Basic Audio Quality” (BAQ). It can be eval-
uated either by direct scaling or by rating the “impairment” relative to an explicit
or implicit reference and caused by deficits of the transmission system such as a
low-bitrate audio codec or by limitations of the spatial reproduction. By definition
BAQ includes “all aspects of the sound quality being assessed”, such as “timbre,
transparency, stereophonic imaging, spatial presentation, reverberance, echoes, har-
monic distortions, quantisation noise, pops, clicks and background noise” [36, p. 7]. In studies of impairment, listeners are asked “to judge any and all detected differ-
ences between the reference and the object” [34, p. 7]. In this case, the evaluation of
BAQ thus corresponds to a rating of general “similarity” or “difference”.
The most popular standards for BAQ are (cf. Fig. 5.1)
•ITU-R BS. 1116-3:2016 (Methods for the subjective assessment of small impair-
ments in audio systems) [34]. Listeners are asked to rate the difference between
an audio stimulus and a given reference stimulus using a continuous scale with
five labels (“Imperceptible”/“Perceptible, but not annoying”/“Slightly annoying”/
“Annoying”/“Very annoying”) used as “anchors”. Participants are presented with
three stimuli (A, B, C). A is the reference, and B and C are rated, with one of the
two stimuli again being the hidden reference (double-blind triple-stimulus with
hidden reference).
•ITU-R BS.1534 (Method for the subjective assessment of intermediate quality
level of audio systems) [37]. Unlike ITU-R BS. 1116-3, it is a multi-stimulus test
where direct comparisons between the different stimuli are possible. Quality is
rated on a continuous scale with five labels (“Excellent”/“Good”/“Fair”/“Poor”/
“Bad”). Participants are presented with a reference, no more than nine stimuli
under test, and two anchor signals (MUlti-Stimulus test with Hidden Reference
and Anchor, MUSHRA). The standard anchors are a low-pass filtered version of
the original signal with a cut-off frequency of 3.5 kHz (low-quality anchor) and
7 kHz (mid-quality anchor); a filtering sketch is given after this list. Alternatively or additionally, further non-standard anchors can be used; they should resemble the character of the artifacts of the systems being tested and indicate how the systems under test compare to well-known audio quality levels. Possible anchors in the context of spatial audio might be
conventional mono/stereo recordings or non-individual signals. Since listeners
can directly compare the signals under test with the reference and among each
other, more reliable ratings can be expected in situations where stimuli differ
significantly from the reference, but only slightly from each other.
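As an illustration of how such anchors could be generated, the following Python sketch low-pass filters a reference signal at the two standard cut-off frequencies. The filter type and order are design choices not prescribed by the recommendation, and the signal x is a random placeholder for the actual reference.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def mushra_anchor(x, fs, cutoff_hz):
    """Low-pass filter a signal to create a MUSHRA anchor.

    x: mono audio signal, fs: sampling rate in Hz,
    cutoff_hz: 3500 for the low-quality, 7000 for the mid-quality anchor.
    """
    # 8th-order Butterworth low-pass, applied forward and backward
    # (zero phase); filter type and order are design choices.
    sos = butter(8, cutoff_hz, btype="low", fs=fs, output="sos")
    return sosfiltfilt(sos, x)


# Example: generate both standard anchors for a 48 kHz test signal.
fs = 48000
x = np.random.randn(fs)                 # placeholder for the reference signal
low_anchor = mushra_anchor(x, fs, 3500)
mid_anchor = mushra_anchor(x, fs, 7000)
```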
Although BAQ is the standard attribute to be tested in both ITU recommendations,
other attributes are suggested to test more specific aspects of audio systems such as
spatial and timbral qualities. ITU-R BS.1284-2 contains a list of main attributes and
sub-attributes, from which one can choose those suitable for a particular test [36,
Attachment 1]. In this respect, both recommendations are often used only as an
experimental paradigm, but applied to qualities other than BAQ, e.g., those developed
in various taxonomies on the properties of VR systems (see Sect. 5.2.2.4).
A number of issues were raised addressing specific aspects of the ITU recom-
mendations [55]. One pertains to the scale labels being multidimensional, which
could distort the ratings. This can be avoided by using clearly unidimensional labels
at both ends, e.g., “imperceptible”/“very perceptible” for ABC/HR or “good”/“bad”
for MUSHRA and additional unlabeled lines for orientation. Another issue points out
that data from MUSHRA tests often violate assumptions for conducting an Analysis
of Variance (ANOVA), the most common means for statistical analysis of the results.
This can be addressed by using general linear models for the analysis, which are more flexible than ANOVA and impose fewer requirements on the input data [33].
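The following Python sketch illustrates one possible analysis along these lines: a linear mixed-effects model with the system under test as fixed effect and the subject as a random intercept, fitted with statsmodels. The column names and toy data are hypothetical, and [33] may specify a different model structure.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical long-format MUSHRA results: one row per rating.
df = pd.DataFrame({
    "subject": ["s1"] * 3 + ["s2"] * 3 + ["s3"] * 3 + ["s4"] * 3,
    "system":  ["ref", "A", "B"] * 4,
    "rating":  [100, 78, 55, 98, 82, 61, 100, 74, 49, 95, 80, 58],
})

# Linear mixed model: fixed effect for the system under test and a
# random intercept per subject to account for repeated measures.
model = smf.mixedlm("rating ~ C(system)", data=df, groups=df["subject"])
result = model.fit()
print(result.summary())
```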
5.2.1.2 Overall Listening Experience
The construct of “Overall Listening Experience” (OLE) [70] was derived from the
concept of “Quality of Experience”, which in the context of quality management
describes “the degree of delight or annoyance of the user of an application or ser-
vice” [11], considering not only the technical performance of a system but also the
expectations, personality, and current state of the user as influencing factors. In
contrast to listening tests according to the ITU recommendations, the musical content
is thus explicitly part of the judgment that listeners make about the OLE.
A measurement of the OLE can be a useful alternative or supplement to purely system-related evaluations: the difference between different playback systems for music, for example, may very well be audible in a direct comparison, yet hardly relevant for everyday music consumption, especially compared to how much listeners like the music being played. In this respect, an evaluation according to the ITU recommendations may convey a false picture of the general relevance of technical features. This becomes
Fig. 5.2 Results of a listening test (z-standardized scores) of Basic Audio Quality (BAQ) and Over-
all Listening Experience (OLE) for three different spatial audio systems (2.0 stereo, 5.0 surround,
22.2 sound referred to as “3D Audio”). BAQ ratings were given according to ITU-R BS.1534 rela-
tive to the “3D audio” condition as an explicit reference, whereas OLE ratings were given without
a reference stimulus [71, p. 84]
evident, for example, in a direct comparison between BAQ and OLE ratings of spatial
audio systems, where the differences between BAQ ratings are generally larger than
between OLE ratings. In a listening test, both BAQ ratings according to ITU-R BS.
1534 with explicit reference and OLE ratings (“Please rate for each audio excerpt
how much you enjoyed listening to it”) without explicit reference were collected for
three different spatial audio systems (2.0 stereo, 5.0 surround, 22.2 surround [71]).
While the difference between 2.0 and 5.0 was equally visible in BAQ and OLE, the
difference between 5.0 and 22.2 was clearly audible in a direct comparison (BAQ),
but obviously did not result in a significant increase in listening pleasure (OLE,
Fig. 5.2).
5.2.2 VR/AR-Specific Measures
5.2.2.1 Authenticity
A simulation that is indistinguishable from the physical sound field it is intended to
simulate could be termed authentic. The term could be used in a physical sense; then
it would aim at the identity of sound fields, be it the identity of sound pressures in the
ear canal (binaural technology) or the identity of sound fields in an extended spatial
area (sound field synthesis). Since no technical system is currently able to guarantee
such an identity, and since such a physical identity may also not be required for the
users of VR/AR systems, the term authenticity is mostly used in the psychological
sense. In this sense, it denotes a simulation that is perceptually indistinguishable
from the corresponding real sound field [8].
The challenge in determining perceptual authenticity is not to let the presence
of a simulation or the physical reference in the listening test become recognizable
solely through the environment of the presentation, i.e., by wearing headphones
as opposed to listening freely in the physical sound field, or by listening in a studio
environment that does not correspond to the simulated space even purely visually. For
this reason, a determination of the authenticity of loudspeaker-based systems such
as Wave Field Synthesis (WFS) or Higher-Order Ambisonics (HOA) can hardly be
carried out in practice, because even if one were to suppress the visual impression
by means of a blindfold, the listener would have to be brought from the playback
room of the synthesis into the real reference room, which would no longer allow a
direct comparison due to the temporal delay. Setting up a sound field synthesis in the
corresponding physical room, on the other hand, is not an option either, since the room
acoustics of the physical room would influence the sound field of the loudspeaker
synthesis.
A determination of authenticity is simpler for binaural technology systems. By
using open headphones that are largely transparent to the external sound field and
whose influence can possibly be compensated by an equalization filter, a direct com-
parison can be made by switching back and forth between a physical sound source
and its binaural simulation [8]. The influence of the headphones on the external
sound field can be further minimized by using extra-aural headphones suspended a
few centimeters in front of the ear [18]. Such an influence can also come from other
VR devices such as head-mounted displays that are close to the ear canal [27]. An
example of a listening test setup is shown in Fig. 5.3.
As a paradigm for the listening test, classical procedures such as ABX with explicit
reference [12,44] or forced-choice procedures (N-AFC) with non-explicit reference
[21] can be used, which have proven suitable for detecting small differences between
two stimuli. It should be noted that, especially in the case of minor differences, the
presentation mode can have a great influence on the recognition rate, such as whether the two stimuli (simulation and reference) can be heard by the test
Fig. 5.3 Listening test setup
for testing authenticity and
plausibility. For seamless
switching between audio
from the loudspeakers and
their binaural simulation, the
subject is wearing extra-aural
headphones that minimize
distortions of exterior sound
fields. The head position of
the subject is tracked by an
electromagnetic sensor pair
mounted on the top of the
chair and headphones. See
also Sect. 5.4.1.1 (CC-BY,
Fabian Brinkmann)
Fig. 5.4 User interfaces for
testing authenticity with an
ABX test (also termed
2-interval/2-alternative
forced choice, 2i/2AFC)
and testing plausibility with
a yes/no paradigm.
Responses/active conditions
are indicated by orange
buttons; loop range and
current playback position by
orange boxes and lines. In
case of the test for
plausibility, the audio starts
automatically and can only
be heard once (CC-BY,
Fabian Brinkmann)
participants only once or as often as desired [8, p. 1793 f]. An example of a user
interface is given in Fig. 5.4.
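One common way to decide whether a listener reliably detected the simulation in such a forced-choice test is a one-sided binomial test against the 50% guessing rate. The sketch below shows this for hypothetical trial counts using SciPy; the significance criterion of p < 0.05 is an assumption, not a prescription from the cited work.

```python
from scipy.stats import binomtest

n_trials = 24          # number of ABX / 2AFC trials per listener
n_correct = 18         # correct answers given by the listener

# One-sided binomial test against the 50 % guessing probability.
test = binomtest(n_correct, n_trials, p=0.5, alternative="greater")
print(f"detection rate: {n_correct / n_trials:.2f}, p = {test.pvalue:.4f}")

# The simulation is deemed distinguishable from the reference if the
# detection rate is significantly above chance (here, p < 0.05).
if test.pvalue < 0.05:
    print("differences were reliably audible")
else:
    print("no reliable detection of the simulation")
```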
Binaural representations can also be used to make comparisons of physical sound
fields and simulations based on loudspeaker arrays [85]. For this purpose, the mea-
sured or numerically simulated sound field of a loudspeaker array at a given listening
position can be presented in the listening test as a binaural synthesis, thus avoiding
the problems described above when comparing physical and loudspeaker-simulated
sound fields. It should be noted, however, that in this case the simulation (binaural syn-
thesis) of a simulation (sound field synthesis) becomes audible, so it may be difficult
to separate the artifacts of the two methods.
5.2.2.2 Plausibility
While the authenticity of virtual environments can be determined by the (physical
or perceptual) identity of physical and simulated sound fields, plausibility has been
proposed as a measure of the extent to which a simulation is “in agreement with
the listener’s expectation towards a corresponding real event” [47]. Plausibility thus
does not rely on the comparison with an external, presented reference, but on the evaluation against an inner reference that reflects the credibility
of the simulation, based on the listener’s experience and expectations of the internal
structure of acoustic scenes or environments. The operationalization of this construct
thus does not require a comparative evaluation, but a yes–no decision.
By analyzing such yes–no decisions with the statistical framework of signal detec-
tion theory (SDT, [84]), one can separate the response bias, i.e., a general, subjective
tendency to consider stimuli as “real” or “simulated”, from the actual impairments of
the simulation. Signal detection theory is originally a method for determining thresh-
old values. For example, the absolute hearing threshold of sounds can be determined
by the statistical analysis of a 2x2 contingency table in which two correct answers
(sound present and heard, sound absent and not heard, i.e., hits and correct rejections)
and two incorrect answers (sound present and not heard, sound absent and heard,
i.e., misses and false alarms) occur. By contrasting these response frequencies, the
response bias, i.e., a general tendency to mark sounds as “heard,” can be separated
from actual recognition performance. The latter is represented by the sensitivity d′,
which can be converted to a corresponding 2AFC detection rate. A number of at
least 100 yes–no decisions per subject is considered necessary for obtaining stable
individual SDT parameters [40].
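A minimal sketch of this analysis is given below: it computes the sensitivity d′ and the response bias (criterion) from the four cell counts of the contingency table and converts d′ to the corresponding 2AFC proportion correct. The log-linear correction for extreme rates is one of several common conventions and is an assumption here.

```python
from scipy.stats import norm

def sdt_yes_no(hits, misses, false_alarms, correct_rejections):
    """Sensitivity d' and criterion c for a yes/no plausibility test.

    "Simulated" answers to simulated stimuli count as hits,
    "simulated" answers to real stimuli as false alarms.
    """
    # Log-linear correction avoids infinite z-scores for rates of 0 or 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)

    d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)             # sensitivity
    criterion = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))  # response bias
    # Corresponding 2AFC proportion correct: Phi(d' / sqrt(2)).
    p_2afc = norm.cdf(d_prime / 2 ** 0.5)
    return d_prime, criterion, p_2afc

print(sdt_yes_no(hits=70, misses=30, false_alarms=20, correct_rejections=80))
```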
This approach can be applied to the evaluation of virtual realities, in that the
artifacts caused by deficits in the simulation take on the role of a stimulus to be
discovered, and listeners are asked to identify the environment as “simulated” if they
notice them. The prerequisite for such an experiment is, however, that—similar to an
experiment on “authenticity”—one can present both physically “real” and simulated
sound fields without the nature of the stimulus already being recognizable on the basis
of the experimental environment, for example, by providing a visual representation of
the physical sound source also in the simulated case, or by conducting the experiment
with closed or blindfolded eyes.
5.2.2.3 Sense of Presence and Immersion
A central function of VR systems is to create a “sense of presence”, i.e., the feeling of
being or acting in a place, even when one is physically situated in another location and
the sensory input is known to be technically mediated. The concept of presence, also
called “telepresence” in older literature in reference to teleoperation systems used to
manipulate remote physical objects [58], has given rise to its own research direction
and community in the form of presence research, which is organized in societies such
as the International Society for Presence Research (ISPR) and conferences such as
the biennial PRESENCE conference.1
To measure the degree of presence, different questionnaires have been developed.
For an overview see [72]. The instrument of Witmer and Singer [87], one of the most
widely used questionnaires, contains 28 questions such as “How much were you able
to control events?”, “How responsive was the environment to actions that you initiated
(or performed)?”, “How natural did your interactions with the environment seem?”,
or “How completely were all of your senses engaged?”. Analyzing the response pat-
terns in these questionnaires, different dimensions such as “Involvement”, “Sensory
fidelity”, “Adaptation/immersion”, and “Interface quality” have emerged in factor
analytic studies [86].
Other approaches to measuring presence include behavioral measurements. If one
assumes that presence is given if the reactions to a virtual environment correspond
to the behavior in physical environments, then for example, the swaying caused by
1 https://ispr.info (last access 2022/06/17).
a moving visual scene or ducking in response to a flying object can be used as an
indicator for the degree of presence [19]. As a prerequisite for such realistic behavior,
Slater considers two aspects: The sensation of being in a real place (“place illusion”)
and the illusion that the scenario being depicted is actually occurring (“plausibility
illusion”) [75]. Note, however, that “plausibility” is used here, in comparison with
the understanding used in Sect. 5.2.2.2, in a narrower sense with a slightly different
meaning.
A similar idea is behind the use of psychophysiological measures. If the normal
physiological response of a person to a particular situation is replicated in a VR
environment, this can be considered as an indicator of presence. While physiological parameters have been used to evaluate various functions and applications of VR systems [28], they have also been used to measure presence in several studies.
Depending on the scenario presented, the Electroencephalogram (EEG) [5], heart
rate (HR) [14], or skin conductance and heart rate variability [13] were shown to be
indicators of different degrees of presence. The exact correlations, however, seem
to depend very much on the scenario presented in each case, and in any case, com-
parative values from a corresponding real-life stimulus are required to calibrate the
measurement. Breaks in presence (BIPs), i.e., moments in which users become aware of the mediatedness of the VR experience because shortcomings of the system suddenly become obvious, also seem to be associated with physiological responses [76].
In general, these approaches seem to be limited to situations in which physiological
reactions are sufficiently pronounced, such as anger, fear, or stress [54], whereas
reactions are less pronounced when the person is predominantly an observer of a
scene that has little emotional impact. This may be the reason why manipulations
to the level of presence in these studies were almost exclusively realized through
changes to the visual display and user interaction, while physiological parameters
were hardly used to evaluate the degree of presence in acoustic virtual environments.
The sense of presence, long used as a measure for evaluating VR and AR sys-
tems alone, has recently gained increasing attention as a general neuropsychological
phenomenon evolving from biological as well as cultural factors [68]. From the
perspective of evolutionary psychology, the sense of presence has evolved not to
distinguish between real and virtual conditions, but to distinguish the external world
from phenomena attributable to one’s own body and mind. On such a theoretical
basis, it follows that for achieving a high degree of presence not only the sensory plausibility and the naturalness of the interaction but also the meaning and relevance of the scene for the respective user are essential. The degree of presence in a virtual
scene will remain limited if the content is irrelevant to the respective user [66].
Related to the sense of presence, but less consistently used, is the concept of
“immersion”. In some literature, it is treated as an objective property of VR and
AR systems [77]. According to this technical understanding, a 5-channel system is
considered more “immersive” than a two-channel system, simply because it is able
to present a wider range of sound incidence directions to the listener. In other works,
however, immersion is treated as a psychological construct, i.e., a human response to
a technical system [87], shifting the meaning of “immersion” closer to the concept
of presence [74]. Finally, in many works, especially in the field of audio, it remains
unclear whether the reasoning about immersion is on a technical or psychological
level. Chapter 11 discusses this issue in more depth, focusing on audiovisual experiences.
5.2.2.4 Attributes and Taxonomies
With properties such as authenticity, plausibility, or the sense of presence, a global
assessment of VR systems is intended. In order to obtain indications of the strengths
and weaknesses of these systems and to draw appropriate conclusions for improve-
ment, however, a differential diagnosis is required that separately assesses different
qualities of the respective systems. To distinguish these perceptual qualities from
technical parameters of the system that may have an influence on them, the former
are also referred to as “quality features” and the latter as “quality elements” in the context of product-sound quality [38].
For this purpose, different taxonomies for the qualities of virtual acoustic envi-
ronments, 3D audio or spatial audio systems have been developed. Some of these
are based on earlier collections of attributes for sound quality and spatial audio qual-
ity [42] which were clustered in sound families using semantic analyses such as
free categorization or multidimensional scaling (MDS) [43]. Pedersen and Zacharov
(2015) [62] developed a sound wheel to present such a lexicon for reproduced sound.2
The wheel format has a longer tradition in the domain of food quality and sensory
evaluation [60] as a structured and hierarchical form of a lexicon of different sensory
characteristics. The selection of the items and the structure of the wheel in [62] are
based on empirical methods such as hierarchical cluster analysis and measures for
discrimination, reliability, and inter-rater agreement of the individual items.
While the taxonomies mentioned above were developed for spatial audio sys-
tems and product categories such as headphones, loudspeakers, multi-channel sound
in general, others were generated with a stronger focus on virtual acoustic envi-
ronments. Developed by qualitative methods such as expert surveys (DELPHI
method [73]) and expert focus groups [48], they contain between 7 [73] and 48
attributes [48], from which those relevant to the specific experiment can be selected.
Examples of a VR/AR specific taxonomy and a rating interface are shown in
Figs. 5.5 and 5.6.
5.2.3 VR/AR-Specific User Interfaces, Test Procedures,
and Toolkits
While the quality measures introduced so far can theoretically be directly transferred
for testing in VR and AR, there are specific features that should be addressed: The
2 Currently maintained under https://forcetechnology.com/en/articles/gated-content-senselab-
sound-wheel (last access 2022/06/17).
Fig. 5.5 SAQI wheel for the evaluation of virtual acoustic environments, structured into informal
categories (inner ring) and attributes (outer ring). For definitions and sound examples refer to
depositonce.tu-berlin.de/handle/11303/157.2 (CC-BY, Fabian Brinkmann)
Fig. 5.6 User interface for
conducting a SAQI test. The
interface is similar to that of
a MUSHRA test shown in
Fig. 5.1 with the difference
that the current quality to be
rated is given together with
the possibility to show its
definition (info button) and
that the rating scale can also
be bipolar. In any case, zero
ratings indicate no
perceivable difference
(CC-BY, Fabian Brinkmann)
test method and interface, the technical administration of the test, and the effect of
added degrees of freedom on the subjects.
First, most of the test methods and user interfaces were developed to be accessed
on a computer with a mouse as a pointing and clicking device. The rating procedure
and the elements on the user interface might thus not be optimal for testing in VR/AR.
This might be less relevant for simple paradigms such as ABX or yes/no tests but
can certainly become an issue for rating the quality of multiple test conditions.
Two approaches were suggested to account for this. Völker et al. [81] suggested a
modified MUSHRA to simplify the rating interface and make it easier to establish an
order between test conditions, especially if many test conditions are to be compared
against the reference and each other (cf. Fig. 5.7). The idea is to unify playback
and rating by making use of drag and drop actions, where the playback is triggered
when the subject drags a button corresponding to a test condition, and the rating is
achieved by dropping the button on a two-dimensional scale. Ratings obtained with
the modified interface were comparable to those obtained with the classic interface
in terms of test–retest reliability and discrimination ability. At the same time, the
modified interface was preferred by the subjects, and subjects needed less time to
complete the rating task. Note that the Drag and Drop MUSHRA could be easily
adapted for testing quality taxonomies introduced in Sect. 5.2.2.4.
A VR/AR-tailored approach to further simplify the rating procedure and interface
was suggested by Rummukainen et al. [67]. They designed a simple and easy-to-
operate interface, where the subject eliminates the conditions one after another in
the order from worst to best (cf. Fig. 5.8). The elimination constitutes a rank order
Fig. 5.7 Interface of the
Drag and Drop MUSHRA
after [81]. The currently
playing condition is
indicated by the orange
button; the loop range and
playback position by the
orange box and line (CC-BY,
Fabian Brinkmann)
Fig. 5.8 Interface of the
elimination task after [67].
The currently playing
condition is indicated by the
orange button (CC-BY,
Fabian Brinkmann)
between the stimuli from which interval scaled values—similar to Basic Audio Qual-
ity ratings—were obtained by fitting Plackett–Luce models to the ranking vectors.
As with the Drag and Drop MUSHRA, the elimination task could be adapted for
testing against a reference and using taxonomies.
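The sketch below illustrates one way such worth parameters could be estimated: a direct maximum-likelihood fit of the Plackett–Luce model to best-to-worst ranking vectors using SciPy. The rankings are hypothetical toy data, and published studies may rely on dedicated packages and additional regularization.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical rankings from the elimination task: each row lists the
# condition indices from best to worst (4 conditions, 3 trials).
rankings = np.array([[2, 0, 3, 1],
                     [2, 3, 0, 1],
                     [0, 2, 1, 3]])
n_items = 4

def neg_log_likelihood(theta):
    """Negative Plackett-Luce log-likelihood of the observed rankings."""
    nll = 0.0
    for ranking in rankings:
        for i in range(len(ranking) - 1):
            remaining = ranking[i:]          # items not yet chosen
            nll -= theta[ranking[i]] - np.log(np.sum(np.exp(theta[remaining])))
    return nll

# The worth parameters are only defined up to an additive constant,
# so the mean is subtracted after the fit.
res = minimize(neg_log_likelihood, x0=np.zeros(n_items))
worth = res.x - res.x.mean()
print(worth)   # higher values: conditions eliminated later (rated better)
```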
Classic tests of Basic Audio Quality are most often conducted for (static) audio-
only conditions, and a variety of software solutions is available to conduct such
tests [6, Sect. 9.2.3]. In contrast, tests in VR/AR require the experimental control of
complex audiovisual scenes. In addition, the display of rating interfaces might affect
the Quality of Experience (QoE) of interactive environments due to their potentially
negative effect on the perceived presence [65]. An emerging tool to account for these
aspects of AR/VR is the Quality of Experience Evaluation Tool (Q.ExE) currently
developed by Raake et al. [65].
A third VR/AR-specific aspect is the possibility of freely exploring an audiovisual
scene in six degrees of freedom (6DoF). Introducing 6DoF clearly affects the rating
behavior of subjects [67] and might thus be considered problematic at first glance. An
unrestricted 6DoF exploration is, however, the most realistic test condition. While
this might introduce additional variance in the results, it might also be argued that
results are more comprehensive and reflect more aspects of the audiovisual scene
due to free exploration. Whether or not the exploration should be restricted will thus
ultimately depend on the aim of an investigation.
5.3 Audio Reproduction Techniques
Two fundamentally different paradigms can be distinguished in audio reproduction
for VR/AR that can be illustrated with the help of Fig. 5.9. The picture shows a
simple sound field of a point source being reflected by an infinite wall.
The first paradigm is to reproduce the entire sound field in a controlled zone,
which has two advantages. First, multiple listeners can freely explore the sound field
at the same time, and second, the reproduction is already individual as every lis-
tener naturally perceives the sound through their own ears. However, there are three
disadvantages. First, reproducing the entire sound field requires tens or hundreds of
loudspeakers depending on the reproduction algorithm and the size of the listening
area. Second, it requires an acoustically treated environment to avoid detrimental
effects due to reflections from the reproduction room itself. Third, it is often chal-
lenging to achieve a correct reproduction covering the entire hearing range from
approximately 20 Hz to 20 kHz. In the following, this reproduction paradigm will
be referred to as sound field synthesis (SFS).
The second paradigm is to only reproduce the sound field at the listeners’ ears.
The three advantages of this approach are that it can be realized with a single pair
of headphones or loudspeakers, that at least headphone-based reproduction does not
pose any demands on the reproduction room, and that a broad frequency range can
be correctly reproduced. In turn, two disadvantages arise. First, the position and
head orientation of the listeners must be tracked to enable a free exploration of the
Fig. 5.9 Sound field of a
point source reflected by an
infinite wall. The direct and
reflected sound fields are
shown as red and blue circles
and the direct and reflected
sound paths to the listener as
red and blue dashed lines.
The image of the head in
gray denotes the listening
position. (CC-BY, Fabian
Brinkmann)
sound field. Second, the individualization of the ear signals is challenging. Often,
the reproduced signals stem from a dummy head, which can cause artifacts such as
coloration and increased localization errors in case the ears, head, and torso of the
listener differ from the dummy head. This reproduction paradigm will be referred to
as binaural synthesis in the following.
It is interesting to see that the advantages and disadvantages of the two paradigms are exactly complementary, which creates a strong bond between application and reproduction paradigm: whereas binaural synthesis is the obvious option for any application on mobile devices, sound field synthesis is appealing for public or open spaces such as artistic performances and public address systems. The next sections will
introduce the two paradigms in more detail. We focus on technical aspects but start
with brief theoretical introductions to foster a better understanding of the subject as
a whole.
5.3.1 Sound Field Analysis and Synthesis
The idea behind sound field analysis and synthesis (SFA/SFS) is to reproduce a
desired sound field within a defined listening area using a loudspeaker array. The
example in Fig. 5.10 shows this for the simple case of a plane wave traveling in the
normal direction of a linear array.
Two fundamentally different SFA/SFS approaches can be distinguished. Physi-
cally motivated algorithms aim at capturing and reproducing sound fields physically
correct, while perceptually motivated methods aim at capturing and synthesizing
sound field properties that are deemed to be of high perceptual relevance.
5.3.1.1 Sound Field Acquisition and Analysis
Sound field synthesis requires a sound field that should be reproduced and there
are two options for its acquisition: through measurement or simulation. Measured
sound fields can have a high degree of realism and can, for example, be used for
broadcasting concerts, while simulated sound fields offer more flexibility in the
Fig. 5.10 Sound field
synthesis of a plane wave
traveling from bottom to top
(red fat lines) by a linear
point source array (blue
points and blue thin
semi-circles) flush-mounted
into a sound hard wall (gray
line) (CC-BY, Fabian
Brinkmann)
design of the auditory scene and are thus often used in game audio engines (please
refer to Chap. 3 for an introduction to interactive auralization). The description and evaluation of sound field simulation techniques is beyond the scope of this chapter, and
the interested reader is kindly referred to related review articles [10,79].
Sound fields are usually measured through microphone arrays, i.e., spatially dis-
tributed microphones that are in most cases positioned on the surface of a rigid or
imaginary sphere. They can be used to directly record sound scenes such as concerts.
In some cases, however, a direct recording will be limiting as it does not allow the audio content to be changed once the recording is finished. This limitation can be overcome if so-
called spatial room impulse responses (SRIRs) are measured, i.e., impulse responses
that describe the sound propagation between sound sources and each microphone of
the array.
A common method for physically motivated SFA is the plane wave decomposition
(PWD), which applies Fourier Transforms with respect to time and space to the
acquired sound field [64, Chap. 2]. It derives a spatially continuous description of the
analyzed sound field containing information on the times and directions of arriving
plane waves. If the analyzing array has sufficiently many microphones, PWD can
yield a physically correct and complete description of the sound field.
Popular approaches for perceptually motivated SFA are spatial impulse response
rendering (SIRR), directional audio coding (DirAC), and the spatial decomposition
method (SDM) [64,78, Chaps. 4–6]. These approaches use a time–frequency analysis
to extract the direction of arrival and in case of SIRR and DirAC also the residual
diffuseness for each time–frequency slot. The intention is to extract this information from signals recorded with only a few microphones—typically between
4 and 16—and reproduce the signals with an increased resolution using methods
introduced in the following sections. SIRR and SDM only work with SRIRs, while
PWD and DirAC also work with direct recordings. While SDM uses a broadband
frequency analysis and extremely short time windows, the remaining methods use
perceptually motivated time and frequency resolutions. SDM is able to extract a single
prominent reflection per time window while the PWD and higher order realizations
of SIRR and DirAC can detect multiple reflections in each time–frequency slot.
5.3.1.2 Physically Motivated Sound Field Reproduction
The two methods for physically motivated sound field reproduction are wave field
synthesis (WFS, works with linear, planar, rectangular, and cubic loudspeaker arrays)
and near-field compensated higher order Ambisonics (NFC-HOA, works with cir-
cular and spherical arrays) [1]. Both methods can reproduce plane waves and point
sources by filtering and delaying the sounds for each loudspeaker in the array. In the
simple case shown in Fig. 5.10, all loudspeakers play identical signals. Because of
their high computational demand, WFS and NFC-HOA are rarely used with mea-
sured sound fields that consist of hundreds of sources/waves. One possible approach
is to use only a few point sources for the direct sound and early reflections, and a
small number of plane waves for the reverberation.
5.3.1.3 Perceptually Motivated Sound Field Reproduction
The most common methods for perceptually motivated sound field reproduction
are vector-based amplitude panning (VBAP), multiple direction amplitude panning
(MDAP), and Ambisonics panning, which aim at reproducing point-like sources [89,
Chaps. 1, 3, and 4]. VBAP is an extension of stereo panning to arbitrary loudspeaker
array geometries. It uses one to three speakers that are closest to the position of
the virtual source to create a phantom source. MDAP creates a discrete ring of
phantom sources—each realized using VBAP—around the position of the virtual
source to achieve that the perceived source width becomes almost independent from
the position of the virtual source. Ambisonics panning could be thought of as a
beamformer that uses all loudspeakers of the array simultaneously to excite circular
or spherical sound field modes. In this case, the position of the virtual source is given
by the position of the beam. Similar to MDAP, Ambisonics yields virtual sources with
an almost position-independent perceived width. In all cases, the degree to which the
width of the sources can be controlled increases with the number of loudspeakers.
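For illustration, the following sketch computes the VBAP gains for a single loudspeaker triplet following Pulkki's matrix formulation; the selection of the active triplet and the handling of sources outside it are omitted, and the loudspeaker layout is a made-up example.

```python
import numpy as np

def vbap_gains(source_dir, triplet_dirs):
    """Amplitude panning gains for one loudspeaker triplet (3D VBAP).

    source_dir: unit vector towards the virtual source, shape (3,).
    triplet_dirs: unit vectors of the three loudspeakers, shape (3, 3),
                  one loudspeaker per row.
    """
    # Solve p = g L for the gain vector g (row-vector convention).
    gains = source_dir @ np.linalg.inv(triplet_dirs)
    if np.any(gains < 0):
        raise ValueError("source lies outside the loudspeaker triplet")
    # Normalize for constant overall energy.
    return gains / np.linalg.norm(gains)

def sph2cart(azi_deg, ele_deg):
    """Unit vector for azimuth/elevation in degrees (x front, y left, z up)."""
    azi, ele = np.radians(azi_deg), np.radians(ele_deg)
    return np.array([np.cos(ele) * np.cos(azi),
                     np.cos(ele) * np.sin(azi),
                     np.sin(ele)])

# Example: source slightly to the left of and above the frontal loudspeaker.
triplet = np.stack([sph2cart(0, 0), sph2cart(45, 0), sph2cart(20, 45)])
print(vbap_gains(sph2cart(10, 5), triplet))
```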
In many applications, these methods are used as a means to reproduce sound
fields that were analyzed using SIRR, SDM, and DirAC. Two reasons for this are
their computational efficiency and the fact that they are relatively robust against irreg-
ular loudspeaker arrays (non-spherical, missing speakers), which are advantages over
physically motivated approaches. VBAP and MDAP are robust to irregular arrays by
design (they do not pose any demands on the array geometry). This is not generally
true for Ambisonics panning; however, the state-of-the-art All-Round Ambisonics
Decoder (AllRAD, [89, Sect. 4.9.6]), which combines VBAP and Ambisonics pan-
ning, can well handle irregular arrays.
5.3.2 Binaural Synthesis
The fundamental theorem of binaural technology is that recording and reproducing
the sound pressure signals at a listener’s ears will evoke the same auditory perception
as if the listener was exposed to the actual sound field. This is because all acoustic
cues that the human auditory system exploits for spatial hearing are contained in the
ear signals. These cues are interaural time and level differences (ITD, ILD), spectral
cues (SC), and environmental cues. ITD and ILD stem from the spatial separation
of the ears and the acoustic shadow of the head and make it possible to perceive the
position of a source in the lateral dimension (left/right). Spectral cues originate from
direction-dependent filtering of the outer ear and enable us to perceive the source
position in the polar dimension (up/down). The most prominent environmental cue
might be reverberation from which information about the source distance and the
size of a room can be extracted. For more information please refer to Blauert [7]
and to Chap. 4 of this volume.
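As a rough illustration of these cues, the sketch below estimates a broadband ITD from the peak of the interaural cross-correlation and an ILD as an RMS level ratio of one HRIR pair. Onset-based or frequency-dependent estimators are common alternatives; the HRIRs here are synthetic placeholders.

```python
import numpy as np
from scipy.signal import correlate

def itd_ild_from_hrir(hrir_left, hrir_right, fs):
    """Broadband ITD (seconds) and ILD (dB) from one HRIR pair.

    A negative ITD indicates that the left-ear signal leads.
    """
    # ITD: lag of the maximum of the interaural cross-correlation.
    xcorr = correlate(hrir_left, hrir_right, mode="full")
    lags = np.arange(-len(hrir_right) + 1, len(hrir_left))
    itd = lags[np.argmax(np.abs(xcorr))] / fs

    # ILD: broadband RMS level difference between left and right ear.
    ild = 20 * np.log10(np.sqrt(np.mean(hrir_left ** 2)) /
                        np.sqrt(np.mean(hrir_right ** 2)))
    return itd, ild

# Toy example: the right-ear response is a delayed, attenuated copy.
fs = 48000
h_l = np.zeros(256); h_l[10] = 1.0
h_r = np.zeros(256); h_r[34] = 0.5           # 24 samples later, -6 dB
print(itd_ild_from_hrir(h_l, h_r, fs))       # approx. (-0.0005 s, 6 dB)
```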
An example of a binaural processing pipeline with headphone reproduction is
shown in Fig. 5.11. The processed binaural signals are stored or directly streamed
to the listener whereby the signals are selected and/or processed according to the
current position and head orientation of the listener. In any case, a physically cor-
rect simulation requires compensating the recording and reproduction equipment
(loudspeakers, microphones, headphones) to assure an unaltered reproduction of the
binaural signals. These compensation filters are usually separated for signal acquisi-
tion and reproduction to maximize the flexibility of the pipeline. For the same reason,
anechoic or dry audio content is often convoluted with acquired binaural impulse
responses, which makes it possible to change the audio content, without changing
the stored binaural signals. The next sections detail the blocks of the introduced
reproduction pipeline one by one.
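A minimal sketch of the convolution stage of such a pipeline is given below: a dry mono signal is convolved with a left/right binaural (room) impulse response pair for static rendering. Head tracking, filter exchange, and the compensation filters are omitted, and all signals are random placeholders.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_static_binaural(dry, brir_left, brir_right):
    """Convolve dry mono audio with one BRIR pair (static rendering)."""
    left = fftconvolve(dry, brir_left)
    right = fftconvolve(dry, brir_right)
    binaural = np.stack([left, right], axis=-1)
    # Simple normalization to avoid clipping when writing to file.
    return binaural / np.max(np.abs(binaural))

# Toy example with random placeholders for the signal and the BRIRs.
fs = 48000
dry = np.random.randn(fs)                    # 1 s of "anechoic" content
decay = np.exp(-np.linspace(0, 10, fs // 2))
brir_l = np.random.randn(fs // 2) * decay
brir_r = np.random.randn(fs // 2) * decay
out = render_static_binaural(dry, brir_l, brir_r)   # shape (samples, 2)
```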
Fig. 5.11 Example of a headphone-based pipeline for binaural synthesis. Dashed lines indicate
acoustic signals; black lines indicate digital signals; gray lines indicate movements in 6DoF. Hc
denote compensation filters for the recording (yellow) and reproduction equipment (red) (CC-BY, Fabian Brinkmann)
5.3.2.1 Signal Acquisition and Processing
The most basic technique is to directly record sound events—for example a concert—
with a dummy head, i.e., a replica of a human head (and torso) that is equipped with
microphones at the positions of the ear channel entrance or inside artificial ear chan-
nels. This requires a straightforward compensation of the recording microphones by
means of an inverse filter, whereas the sources are considered to be a part of the scene
and thus remain uncompensated. This approach is, however, very inflexible because
the position and orientation of the listener and sources cannot be changed during
reproduction. It is thus more common to measure or simulate spherical sets of head-
related impulse responses (HRIRs) that describe the sound propagation between a
free-field sound source and the listeners ears (cf. [88, Chaps. 2 and 4] and Fig. 5.12).
In this case, the sound source has to be compensated as well. The gain in flexibility
stems from the possibility to use anechoic or dry audio content and select the HRIR
according to the current source and head position of the listener. While HRIRs are
not often directly used because anechoic listening conditions are unrealistic for most
applications, they are essential for room acoustic simulations [80]. Acoustic simula-
tions can be used to obtain binaural room impulse responses (BRIRs) that describe
the sound propagation between a sound source in a reverberant environment and the
listener's ears. BRIRs can also be measured, thereby increasing the degree of real-
ism at the cost of increasing the effort to measure BRIRs for multiple positions and
orientations of the listener to enable listener movements during playback.
Fig. 5.12 HRIR measurement system at the Technical University of Berlin with details of the
positioning procedure using cross-line lasers. During the measurement, the subjects are wearing in-ear
microphones, are sitting on the chair in the center of the loudspeaker array, and are continuously
rotated to measure a full spherical HRIR data set. In addition, the wire frames on the floor are
covered with absorbing material (CC-BY, Fabian Brinkmann)
5.3.2.2 Head Tracking
Tracking the head position of the listener is required for dynamic binaural reproduc-
tion, i.e., a reproduction that accounts for movements of the listener by providing
binaural signals according to the angle and distance between the source and the lis-
tener’s head. While it will be sufficient for some applications to only track the head
orientation, the general VR/AR case requires six degrees of freedom (6DoF, i.e.,
translation and rotation in x, y, and z).
In general, two tracking approaches exist. Relative tracking systems track the
position of the listener with respect to a potentially unknown starting point, while
absolute tracking systems establish a world coordinate system within which the
absolute position of the listener is tracked. Relative systems usually use inertial
measurement units (IMU) to derive the listener position from combined sensing of
a gyroscope, an accelerometer, and possibly a magnetometer. Absolute systems can
use optical tracking by deriving the listener position from images of a single or
multiple (infrared) cameras, or GPS data.
Artifact-free rendering requires a tracking precision of 1◦ and 1 cm [32, 46], and
a total system latency of about 50 ms [45]. Note that a significantly lower latency
of about 15 ms is required for rendering visual stimuli in AR applications [39]. A
challenge for relative tracking systems is to control long-term drift of the IMU unit,
while visual occlusion is problematic for optical absolute tracking systems.
5.3.2.3 Reproduction with Headphones
Headphone reproduction requires a compensation of the headphone transfer function
(HpTF) by means of an inverse filter to deliver the binaural signals to the listener’s ear
without introducing additional coloration. However, the design of the inverse filter is
not straightforward. Two aspects are problematic. First, the HpTF considerably varies
across listeners and headphone models, which may require the use of listener and
model-specific compensation filters depending on the demands of the application.
Second, the low-frequency response and the center frequency and depth of high-
frequency notches in the HpTF strongly depend on the fit of the headphone and may
considerably change if the listener re-positions the headphones (cf. Fig. 5.13). To
account for the variance, the average HpTF can be used to design the inverse filter,
and the filter gain at low and high frequencies can be restricted using regularized
inversion [24,46]. Once calculated, the static headphone filter can be applied to the
binaural signals by means of convolution.
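The following sketch illustrates such a regularized inversion in the frequency domain, with a regularization term that is small inside the pass band and large outside it to limit the filter gain at very low and high frequencies. The regularization profile and band limits are assumptions; practical designs typically add minimum-phase processing and careful time-domain windowing.

```python
import numpy as np

def regularized_inverse_filter(hptf_avg, fs, beta_inside=0.01,
                               beta_outside=1.0, f_low=50, f_high=16000):
    """Regularized inverse of an averaged headphone transfer function.

    hptf_avg: complex single-sided spectrum of the averaged HpTF.
    beta is small inside [f_low, f_high] and large outside, which
    restricts the filter gain at very low and high frequencies.
    """
    n_bins = len(hptf_avg)
    freqs = np.linspace(0, fs / 2, n_bins)
    beta = np.where((freqs >= f_low) & (freqs <= f_high),
                    beta_inside, beta_outside)
    # Tikhonov-regularized inversion: H_inv = conj(H) / (|H|^2 + beta).
    return np.conj(hptf_avg) / (np.abs(hptf_avg) ** 2 + beta)

# Example: invert a synthetic HpTF with a notch around 9 kHz.
fs, n_bins = 48000, 1025
freqs = np.linspace(0, fs / 2, n_bins)
hptf = 1 - 0.9 * np.exp(-((freqs - 9000) / 800) ** 2)
h_inv = regularized_inverse_filter(hptf.astype(complex), fs)
```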
In addition to this static convolution, a dynamic convolution is often required to
render the current HRIR or BRIR. Since real-time audio processing works on blocks
of audio, this is simply achieved by using the current HRIR as long as the listener
does not move. If the listener moves, the past and current HRIR are both convolved
simultaneously and a cross fade with the length of one audio block is applied between
the two [82].
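A simplified sketch of this cross-fade is shown below: the current block is convolved with both the previous and the new HRIR and linearly faded over the block length. The handling of the filter tails via overlap-add, which a real-time engine would require, is omitted.

```python
import numpy as np
from scipy.signal import fftconvolve

def crossfade_block(block, hrir_old, hrir_new):
    """Convolve one block with the previous and the new HRIR and cross-fade.

    The fade spans the block length; the filter tail is taken entirely
    from the new HRIR. Overlap-add of tails across blocks is omitted.
    """
    out_old = fftconvolve(block, hrir_old)
    out_new = fftconvolve(block, hrir_new)
    n = len(block)
    fade_in = np.concatenate([np.linspace(0.0, 1.0, n),
                              np.ones(len(out_new) - n)])
    fade_out = 1.0 - fade_in
    return fade_out * out_old + fade_in * out_new

# Toy example: switch between two pure delays within one 512-sample block.
block = np.random.randn(512)
hrir_old = np.zeros(128); hrir_old[10] = 1.0
hrir_new = np.zeros(128); hrir_new[20] = 0.8
out = crossfade_block(block, hrir_old, hrir_new)
```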
Fig. 5.13 Headphone transfer functions of subject 6 from the HUTUBS HRTF database for the left ear of a Sennheiser HD650 headphone [9], shown as amplitude in dB over frequency in Hz. Gray lines show the effect of re-positioning. Black lines show the averaged HpTF (CC-BY, Fabian Brinkmann)
5.3.2.4 Reproduction with Loudspeakers
While delivering binaural signals through headphones is the most obvious solution
due to the one-to-one correspondence between the two ears and two speakers of the
headphone, two approaches for transaural reproduction using loudspeakers are also
available.
The first approach uses only two loudspeakers. In analogy to headphone reproduc-
tion, there is a one-to-one correspondence between the ear signals and speakers, and
the filter for the left loudspeaker compensates for the transfer function between the
speaker and the left ear. In contrast to headphone reproduction, however, this requires
an additional filter for cross-talk cancellation (CTC) between the right speaker and
the left ear (the filters for the right ear work accordingly). This requires an iterative
design of the compensation filters for all possible positions of the head with respect
to the loudspeakers and thus a dynamic convolution already for the compensation
filters [51]. Optionally, more loudspeakers can be used to optimize the system for
different listening positions or frequency ranges.
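As a rough sketch of the underlying filter design, the snippet below computes a regularized inverse of the 2x2 matrix of loudspeaker-to-ear transfer functions per frequency bin. Dynamic updating for head movements, causality constraints, and the exact filter design used in [51] are beyond this illustration.

```python
import numpy as np

def ctc_filters(h_ll, h_lr, h_rl, h_rr, beta=0.005):
    """Regularized cross-talk cancellation filters for two loudspeakers.

    h_xy: complex spectra of the transfer function from loudspeaker x
    to ear y (arrays over frequency bins). Returns a (bins, 2, 2) array
    that approximately inverts the plant matrix per frequency bin.
    """
    n_bins = len(h_ll)
    filters = np.zeros((n_bins, 2, 2), dtype=complex)
    for k in range(n_bins):
        # Plant matrix: rows are ears, columns are loudspeakers.
        H = np.array([[h_ll[k], h_rl[k]],
                      [h_lr[k], h_rr[k]]])
        # Tikhonov-regularized right inverse limits excessive filter gains.
        filters[k] = H.conj().T @ np.linalg.inv(H @ H.conj().T + beta * np.eye(2))
    return filters

# Toy example: random plant spectra for 257 frequency bins.
rng = np.random.default_rng(1)
h = rng.standard_normal((4, 257)) + 1j * rng.standard_normal((4, 257))
C = ctc_filters(*h)      # shape (257, 2, 2)
```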
The second approach uses linear or circular loudspeaker arrays. Here, the idea is
to shoot two narrow audio beams in the direction of the listener’s ears. Because the
beams concentrate most of their energy towards the listener’s ears, a high separa-
tion between the left and right ear beams can be achieved depending on the array
geometry [20]. In this case, a one-to-one correspondence is established between the
two beams and the ears, and cross-talk compensation is not required if the beams
are sufficiently narrow. In this case, a dynamic convolution is required to update the
beamformers according to the listener’s position.
5.3.3 Binaural Reproduction of Synthesized Sound Fields
It is worth noting that SFS approaches can be combined with binaural reproduc-
tion, either by virtualizing the loudspeaker array with an array of HRIRs or through
binaural processing stages that build upon the sound field analysis (c.f., [2], [64,
Sect. 6.4.2] and [89, Sect. 4.11]). This makes binaural reproduction the prime frame-
work for rendering spatial audio in AR/VR and SFS a versatile tool within the frame-
work: First, SFS makes it possible to efficiently render binaural signals for arbitrary
head orientations from a single SRIR (might require pre-processing to achieve a rea-
sonable quality as detailed in Sect. 5.4.3). Second, SFS makes it possible to include
listener movements (translation)—to a limited extent—and thus enables rendering
with 6DoF. The realization of 6DoF rendering depends on the sound field representa-
tion, which strongly differs across SFS approaches. However, the general idea is similar
in many cases. Head rotations can be realized by an inverse rotation of the sound
field. For perceptually motivated SFS methods, translation can be realized by manip-
ulating the directions and times of arrival that were obtained through SFA according
to the listener’s movements (e.g., [41]). The possibility of realizing translation with
physically motivated SFS approaches and measured sound fields is, however, rather
limited as this would require arrays with hundreds if not thousands of microphones.
5.4 System Performance
This section details the quality that can be achieved with the different reproduction
paradigms, starting with binaural synthesis. This is the most common approach, and
if it is used in combination with SFS, it also limits the maximum achievable
quality of the SFS.
5.4.1 Binaural Synthesis
The authenticity and plausibility of a reproduction system are without a doubt the
most integral and comprehensive quality measures and are thus discussed first. How-
ever, it is also important to shed light on the relevance of individual components in
the reproduction pipeline. While there are many small pieces that contribute to the
overall quality, the most relevant might be the individualization of binaural signals,
head tracking, and audiovisual stimulation, which are discussed separately.
5.4.1.1 Authenticity and Plausibility
Headphone-based individual dynamic binaural synthesis can be authentic if reverber-
ant environments and real-life signals, such as speech, are simulated. For this typical
use case, 66% of the subjects in Brinkmann et al. [8] could not hear any differ-
ences between a real loudspeaker and its binaural simulation (cf. Fig. 5.14, bottom).
However, differences such as coloration become audible if simulating anechoic envi-
ronments or artificial noise signals. Remaining differences stem from accumulated
measurement errors in the range of 1 dB mostly related to the positioning of the
subject and the in-ear microphones during the experiment (cf. Fig. 5.14, top). Clearly,
these differences can be detected more easily with steady broadband signals such
as noise. The effect of reverberance might be twofold. First, the reverberation might
be able to mask audible coloration in the direct sound, and second, reverberant parts
of the BRIR might be less prone to coloration artifacts because measurement errors
could cancel across reflections arriving from multiple directions.
Loudspeaker-based individual binaural synthesis by means of CTC can be authen-
tic in anechoic reproduction rooms [59]. However, the quality drastically decreases
if the CTC system is set up in reverberant environments, thus limiting the usability
of this approach. The decrease in quality is caused by undesired reflections from the
reproduction room that can not be compensated in practice due to uncertainties in
the exact position of the listener [69].
Non-individual dynamic binaural synthesis is not authentic but can be plausible,
i.e., matching the listeners expectation towards the acoustic environment. This means
that differences between a real sound field and a non-individual simulation are audible
in a direct comparison, but they are not large enough for the simulation to be detected
Fig. 5.14 Results of the test for authenticity. Top: Range of differences between the sound field of the real and virtual frontal loudspeakers across head-above-torso orientations (amplitude in dB over frequency in Hz). Data was measured at the blocked ear canal entrance and is shown as 12th (light blue) and 3rd octave (dark blue) smoothed magnitude spectra. Bottom: 2-Alternative Forced Choice detection rates (percentage and number of correct answers) for all participants, two audio contents (noise, speech), source positions in front (0◦) and to the left (90◦), and three different acoustical environments (anechoic, dry, wet; cf. Fig. 5.3). The size of the dots and the numbers next to them indicate how many participants scored identical results. Results on or above the dashed line are significantly above chance, indicating that differences between simulated and real sound fields were reliably audible. 50% correct answers denotes guessing (CC-BY, Fabian Brinkmann)
as such in an indirect comparison. Although the plausibility was only shown for
headphone-based reproduction of reverberant environments [47,63], it is reasonable to
assume that this also holds for the simulation of anechoic environments and loudspeaker-
based reproduction in anechoic environments. Remaining differences between real
sound fields and binaural simulations are discussed in the following section.
An example setup for testing authenticity and plausibility is shown in Fig. 5.14. It
is important to note that authentic simulations can only be achieved under carefully
controlled laboratory conditions. Otherwise, the placement of the headphones will
already introduce audible artifacts that would be hard to control in any consumer
application [61]. It can, however, be assumed that such artifacts are irrelevant for
the vast majority of VR/AR applications, where plausibility is a sufficient quality
criterion.
5.4.1.2 Effect of Individualization
Binaural signals (binaural recordings, HRIRs, BRIRs) are highly individual, i.e.,
they differ across listeners due to different shapes of the listeners' ears, heads, and
bodies. As a consequence, listening to non-individual binaural signals decreases the
audio quality and can be thought of as listening through someone else’s ears. While
the decrease in quality could already be seen in the integral measures authenticity
and plausibility, this section will look at differences in more detail.
The most discussed degradation caused by non-individual signals is increased
uncertainty in source localization [57]. Using individual head-related transfer func-
tions (HRTFs, the frequency domain HRIRs), median root mean squared localiza-
tion errors are approximately 27° for the polar angle, which denotes the up/down
source position, and 15° for the lateral angle, which denotes the left/right position.
Quadrant errors, which are a measure for front–back and up–down confusions (and
mixtures thereof), occur in only 4% of the cases. A drastic increase of the quadrant
error by a factor of 5 to about 20% and the polar error by a factor of 1.5 to about 40°
can be observed if using non-individual signals. Because source localization in the
polar dimension relies on high-frequency cues in the binaural signal, the increased
errors can be attributed to differences in ear shapes, which have the strongest influ-
ence on binaural signals at high frequencies. The lateral error increases by only 2°.
In this case, the auditory system exploits interaural cues (ITD, ILD) for localiza-
tion, which stem from the overall head shape. The fact that head shapes differ less
between listeners than ear shapes explains the relatively small changes in this case.
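For orientation, the sketch below shows how such statistics can be computed from paired target and response directions given in interaural-polar coordinates (lateral and polar angle in degrees). It is a simplified illustration rather than the exact evaluation procedure of [57]: quadrant errors are counted as polar deviations larger than 90°, and the polar RMS error is taken over the remaining trials.

```python
import numpy as np

def localization_metrics(target, response):
    """Localization statistics from arrays of (lateral, polar) angles in degrees."""
    target = np.asarray(target, dtype=float)
    response = np.asarray(response, dtype=float)
    lateral_error = response[:, 0] - target[:, 0]
    # wrap polar differences to [-180, 180) before thresholding
    polar_diff = (response[:, 1] - target[:, 1] + 180.0) % 360.0 - 180.0
    quadrant = np.abs(polar_diff) > 90.0          # front-back / up-down confusions
    return {
        "lateral_rms": float(np.sqrt(np.mean(lateral_error ** 2))),
        "polar_rms": float(np.sqrt(np.mean(polar_diff[~quadrant] ** 2))),
        "quadrant_error_rate": float(np.mean(quadrant)),
    }

# two example trials: the second response is a front-back confusion
print(localization_metrics([(0, 0), (30, 45)], [(5, 10), (25, 220)]))
```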
Whereas localization might be one of the most important properties of audio
in virtual acoustic realities, it is by far not the only aspect that degrades due to
non-individual signals. An extensive qualitative analysis is shown in Fig. 5.15. The
results were obtained with pulsed pink noise as audio content in a direct comparison
between a frontal loudspeaker and headphone-based dynamic binaural syntheses
using the setup shown in Fig. 5.3. Apart from qualities related to the scene geometry
(localization, externalization, etc.), considerable degradations can also be observed
for aspects related to the tone color. In sum, this also led to a larger overall difference
Fig. 5.15 Perceived differences between a real sound field and the individual (blue, left) and non-
individual (red, right) dynamic binaural simulation thereof, rated on attributes grouped into the
categories GENERAL, TONE COLOR, TONALNESS, ROOM, GEOMETRY, TIME, DYNAMICS,
and ARTIFACTS. Results are pooled across an anechoic, dry, and wet acoustic environment. The
horizontal lines show the medians, the boxes the interquartile ranges, and the vertical lines the
minimum and maximum perceived differences. Scale labels were omitted for clarity and can be
found in [48] (CC-BY, Fabian Brinkmann)
and subjects rated the non-individual simulation to be less natural and clear than its
individual counterpart. As a result, the individual simulation was generally preferred
(attribute liking); however, the presence was not affected. Because the similarity
between the individual BRIRs and the non-individual BRIRs used in the test depends
on the listener, the results for non-individual synthesis have considerably higher
variance (indicated by the interquartile ranges).
Differences for individual binaural synthesis are small compared to non-individual
synthesis. In this case, noteworthy differences only remain for the tone color. These
differences stem from measurement uncertainties that arise mostly due to positioning
inaccuracies of the subjects and in-ear microphones. As mentioned above, these
differences become inaudible if using speech signals instead of pulsed noise.
Individualization is not only important for HRIRs and BRIRs but also for the
headphone compensation (HpC). The examples above either used fully individ-
ual (individual HRIRs/BRIRs and HpC) or fully non-individual (non-individual
HRIRs/BRIRs and HpC) simulations. Combinations of these cases were investigated
by Engel et al. [15] and Gupta et al. [26]. As expected, fully individual simulations
always have the highest quality, and considerable degradations can be observed if
using individual signals with a non-individual HpC. If an individual HpC is not
feasible, differences between individual and non-individual signals were only sig-
nificant for the source direction but not for the perceived distance, coloration, and
overall similarity. In any case, at least a non-individual HpC should be used because
differences are the largest for simulations without HpC.
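As background, headphone compensation filters are commonly obtained by regularized inversion of a measured headphone transfer function, where the regularization limits the gain at deep spectral notches (see [24] for an automatic choice of the parameter). The snippet below is a minimal frequency-domain sketch of this idea; the fixed regularization constant and the absence of any windowing or minimum-phase processing are simplifying assumptions.

```python
import numpy as np

def headphone_compensation_filter(hp_ir: np.ndarray, beta: float = 1e-2) -> np.ndarray:
    """Regularized spectral inversion of a measured headphone impulse response."""
    H = np.fft.rfft(hp_ir)
    H_inv = np.conj(H) / (np.abs(H) ** 2 + beta)  # beta bounds the gain at notches
    return np.fft.irfft(H_inv, n=len(hp_ir))      # time-domain compensation filter
```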
Many individualization approaches are available that mitigate the detrimental
effects of non-individual signals to a certain degree [25]. However, they demand
additional action from the listener to obtain individual or individualized signals. It
is thus worth noting—and discussed in the next sections—that head tracking and
visual stimulation are two means to mitigate some effects that do not require actions
from the listener.
5.4.1.3 Effect of Head Tracking
Without head tracking, the auditory scene will move if the listeners move their head,
which is a very unnatural behavior for most VR/AR applications. Head-tracked
dynamic simulations in which the auditory scene remains stable during head move-
ments have thus become the standard. Besides the general improvement of the sense
of presence and immersion, this has at least two more benefits.
First, localization errors for non-individual signals decrease if head tracking is
enabled [52]. While the lateral localization errors remain largely unaffected, front–
back confusion completely disappears if the listeners rotate their head by 32° or more
to the left or right. This can be explained by movement-induced dynamic changes
in the binaural signals. As listeners move their head to the left, the left ear moves
away from the source if it is in front, and the right ear moves towards it. Because
this behavior would be exactly reversed for a source behind the listener, the auditory
system is able to resolve the front–back confusion through the head motion. Up–
down confusion can be resolved in analogy through head nodding to the left or right.
Additionally, the elevation error decreases by a third for head rotations of 64° to the
left or right. This can be explained by the fact that dynamic changes in the binaural
signals are largest for a frontal source and almost disappear for a source above and
below the listener.
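This resolution of front-back confusion can be illustrated with a simple spherical-head ITD approximation. The head radius and the sine-law ITD model below are generic textbook assumptions rather than values from the cited studies; the point is only that a 32° rotation to the left changes the ITD in opposite directions for a frontal and a rear source.

```python
import numpy as np

HEAD_RADIUS = 0.0875    # m, assumed average head radius
SPEED_OF_SOUND = 343.0  # m/s

def itd(azimuth_deg: float) -> float:
    """Simple spherical-head ITD for a horizontal-plane source.
    Positive azimuth = source towards the left ear -> positive ITD."""
    return 2.0 * HEAD_RADIUS / SPEED_OF_SOUND * np.sin(np.radians(azimuth_deg))

def head_relative(source_az_deg: float, head_yaw_deg: float) -> float:
    """Source azimuth relative to the head after turning left by head_yaw_deg,
    wrapped to [-180, 180)."""
    return (source_az_deg - head_yaw_deg + 180.0) % 360.0 - 180.0

for label, source_az in (("front", 0.0), ("rear", 180.0)):
    change = itd(head_relative(source_az, 32.0)) - itd(head_relative(source_az, 0.0))
    print(f"{label}: ITD change for a 32 deg left rotation = {change * 1e6:+.0f} us")
```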
The second benefit pertains to the externalization of non-individual virtual
sources [31]. While sources to the side are well externalized even with non-individual
signals, sources to the front and rear were often reported to be perceived as being
inside the head. The most likely reason for this is that signals for sources close to
the median plane are similar for the left and right ears. In contrast, the ear signals
differ in time and level for sources to the side. These differences stem from the spa-
tial separation of the ears and the acoustic shadow of the head and might provide
the auditory system with evidence of the presence of an external source. If listeners
perform large head rotations to the left and right, dynamic binaural cues are induced
and the externalization of frontal and rear sources significantly increases.
Despite the positive effects of head tracking, it has to be kept in mind that listeners
will not always perform large head movements just because they can. The actual
benefit might thus often be smaller than reported above. However, dynamic cues
that are similar to those of head movements can also be induced by a moving source,
which was shown to have a similarly positive effect on externalization [30]. An effect
of source movements for localization has not yet been extensively investigated. For
the case of distance localization, it was already shown that active self-motion is more
efficient than passive self-motion and source motion [22].
5.4.1.4 Effect of Visual Stimulation
Because VR/AR applications usually provide congruent audiovisual signals, it is
worth considering the effect of visual stimulation on the audio quality. Interestingly—
and in contrast to head tracking—visual stimulation can have positive and negative
effects.
The possibly most important positive aspect is the ventriloquism effect, which
describes the phenomenon that a fused audiovisual event is perceived at the loca-
tion of the visual stimuli even if the position of the auditory event deviates from
that of the visual event. Median thresholds below which fusion appears are approx-
imately 15° in the horizontal plane and 45° in the median plane if presenting a
realistic stereoscopic 3D video of a talker [29]. Comparing this to localization errors
reported in Sect. 5.4.1.2, it can be hypothesized that localization errors will drasti-
cally decrease if not completely disappear even for non-individual binaural synthesis
due to audiovisual fusion and the ventriloquism effect if a source is visible and in
the field of view. It has to be kept in mind, however, that the degree of realism of
the visual stimulation—termed compellingness in [29]—affects the strength of the
ventriloquism effect. Thus, fusion thresholds can decrease for less realistic visual
stimulation.
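As a rough rule of thumb derived from the thresholds reported in [29], an application could check whether a given audiovisual offset is likely to be fused. The hard thresholds below are a deliberate simplification, since the actual fusion limit depends on the content and the compellingness of the visual stimulus.

```python
def likely_fused(azimuth_offset_deg: float, elevation_offset_deg: float) -> bool:
    """Crude audiovisual fusion check using the median thresholds from [29]:
    about 15 deg in the horizontal plane and 45 deg in the median plane."""
    return abs(azimuth_offset_deg) <= 15.0 and abs(elevation_offset_deg) <= 45.0
```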
Quality degrading effects can occur if the (expected) acoustics of the visually
presented room do not match the acoustics of the auditorily presented room—an
effect termed room divergence. This effect is especially relevant for AR applications
where listeners can naturally explore real audiovisual environments to which artifi-
cial auditory or audiovisual events are added. However, room divergence can also
appear in VR applications for example due to badly parameterized room acoustic
simulations. Room divergence has not been extensively researched to date, but it was
already shown that it can affect distance perception and externalization [23,83].
While degradations with respect to these qualities might as well be mitigated by the
ventriloquism effect [56], the room divergence might also affect higher level qualities
such as plausibility and presence.
5.4.2 Sound Field Synthesis
The discussion of SFA/SFS is limited to perceptually motivated approaches because
they are predominantly used in VR/AR applications. In-depth evaluations of phys-
ically motivated approaches were, for example, conducted by Wierstorf [85] and
Erbes [17].
5.4.2.1 Vector-Based and Ambisonics Panning
The most important quality factor for loudspeaker-based reproduction approaches is
the number of loudspeakers L. In the case of Ambisonics, there is a strict dependency
between L and the achievable spatial resolution, which is determined by the so-called
Ambisonics order N: an order-N reproduction requires L ≥ (N + 1)² loudspeakers.
Intuitively, the spatial resolution increases with increasing Ambisonics order. For the
amplitude panning methods, the fluctuation of the perceived source width across source
positions (VBAP) and the minimally achievable source width that is independent of the
source position (MDAP) decrease with increasing L.
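Assuming the common criterion L ≥ (N + 1)², the maximum order that a given array can reproduce follows directly; the small helper below makes this relation explicit.

```python
import math

def max_ambisonics_order(num_loudspeakers: int) -> int:
    """Highest Ambisonics order N whose (N + 1)^2 channels can be decoded
    to the given number of loudspeakers (L >= (N + 1)^2)."""
    return math.isqrt(num_loudspeakers) - 1

for L in (4, 9, 16, 25, 36):
    print(L, "loudspeakers -> maximum order", max_ambisonics_order(L))
```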
Both approaches—vector-based and Ambisonics panning—have distinct disad-
vantages at very low orders N ≤ 2, i.e., for arrays consisting of only about four to
nine loudspeakers. In this case, Ambisonics and MDAP have a rather limited spatial
resolution and Ambisonics additionally exhibits a dull sound color. For VBAP, on
the other hand, the source width heavily depends on the position of the virtual source.
Using state-of-the-art Ambisonics decoders, the differences between the approaches
decrease at orders N ≥ 3, i.e., for arrays consisting of 16 loudspeakers or more.
For such arrays, all methods are able to produce virtual sources whose width and
loudness are independent of the source position. For an in-depth discussion of these
properties the interested reader is referred to Zotter and Frank [89, Chaps. 1 and 3]
and Pulkki et al. [64, Chap. 5].
5.4.2.2 SIRR, SDM, and DirAC
Different versions of SIRR and DirAC have been proposed over the past years. The
two most advanced versions are the so-called Virtual Microphone DirAC, which
improved the rendering of diffuse sound field components over the original DirAC
version, and higher order DirAC/SIRR, which make it possible to estimate more
than one directional component for each time frame to improve the rendering of
challenging acoustic scenes [53,64, Chaps. 5 and 6]. For an array consisting of
16 loudspeakers that are set up in acoustically treated environments (anechoic or
very dry), SIRR and DirAC can achieve a high audio quality of about 80–90% on a
MUSHRA-like rating scale (cf. Sect. 5.2.1.1). Best results are obtained for idealized
microphone array signals, i.e., if the SIRR/DirAC input signals are synthetically
Fig. 5.16 Perceived differences between a reference and order-limited binaural renderings of micro-
phone array recordings, rated from No to Huge difference, for the algorithms BEMA, Raw, SHF,
SHF+SAF, Tap.+SHF, and MagLS at orders N = 3, 5, and 7. For details refer to [49] (CC-BY, Tim
Lübeck)
generated instead of recorded with a real microphone. Using a real microphone
array decreased the audio quality by about 10% on average.
Similar audio qualities were obtained for SDM [78] and binaural SDM [2]. The
latter study showed that binaural SDM has a plausibility score similar to sound
fields emitted by real loudspeakers. Although the plausibility score differs from the
definition of plausibility in Sect. 5.2.2.2, it is reasonable to assume that SDM—and
also SIRR and DirAC—can be plausible, but not authentic.
So far, perceptual evaluations were conducted in acoustically treated listening
rooms and it is plausible to expect that the quality decreases with an increasing
degree of reverberation in the listening environment. Moreover, a comprehensive
comparative evaluation of SIRR and SDM is missing to date and existing studies
sometimes used test conditions that might have favored one approach over the others.
SIRR, SDM, and DirAC might be the most common, but by far not the only
methods for perceptually motivated SFS. Broader overviews are, for example, given
by Pulkki et al. [64, Chap. 4] and Zotter and Frank [89, Sect. 5.8].
5.4.3 Binaural Reproduction of Synthesized Sound Fields
As mentioned before, SFS approaches can be reproduced via headphones if virtual-
izing the loudspeaker array with a set of HRTFs. The virtualization is uncritical if
the number of virtual loudspeakers can be freely selected, which often is the case for
SIRR, SDM, and DirAC. The situation is more difficult, however, for Ambisonics
signals which are typically order limited to 1 ≤ N ≤ 7. The challenge in this case
is to derive an Ambisonics version of the HRTF data set with the same order restric-
tion. Without specifically tailored algorithms, an order of N ≈ 35 is required for
an authentic Ambisonics representation of HRTFs and simply restricting the order
causes clearly audible artifacts [3].
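The order restriction translates directly into channel counts, because an order-N Ambisonics signal carries (N + 1)² spherical-harmonic channels. The short calculation below contrasts the typical transmission orders with the roughly 35th order needed for an authentic HRTF representation.

```python
def sh_channels(order: int) -> int:
    """Number of spherical-harmonic channels of an order-limited Ambisonics signal."""
    return (order + 1) ** 2

print([sh_channels(n) for n in range(1, 8)])  # 4 ... 64 channels for 1 <= N <= 7
print(sh_channels(35))                        # 1296 channels for N = 35
```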
A variety of methods have been proposed to mitigate these artifacts. This com-
prises a global spectral equalization with or without windowing (tapering) of the
spherical harmonics coefficients or a separate treatment of the HRTF phase by means
of (frequency-dependent) time alignment or finding an optimal phase that reduces
errors in the HRTF magnitude [3,89, Sect. 4.11]. A comparative study of these algo-
rithms was conducted by Lübeck et al. [49]. As shown in Fig. 5.16, the differences
between a reference and binaural renderings are small already for N = 3, at least for
the best algorithms.
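To give an impression of one of these mitigation strategies, the sketch below applies a half-cosine fade ('tapering') to the highest spherical-harmonic orders of an SH-domain HRTF set. The window shape, the number of tapered orders, and the placeholder variable hrtf_sh are illustrative assumptions; the algorithms compared in [49] differ in their exact formulations.

```python
import numpy as np

def order_taper(max_order: int, taper_orders: int = 2) -> np.ndarray:
    """Per-SH-channel gains that fade out the highest orders with a half-cosine window."""
    gains = np.ones(max_order + 1)
    fade = 0.5 * (1.0 + np.cos(np.linspace(0.0, np.pi, taper_orders + 2)[1:-1]))
    gains[max_order - taper_orders + 1:] = fade
    # expand one gain per order to the 2n + 1 channels of that order
    return np.repeat(gains, [2 * n + 1 for n in range(max_order + 1)])

# hrtf_sh: placeholder for SH-domain HRTFs of shape ((N + 1)**2, num_bins)
# tapered = order_taper(N)[:, None] * hrtf_sh
```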
Another benefit of headphone reproduction is that different reproduction tech-
niques can be combined to fine-tune the trade-off between perceptual quality and
computational efficiency. One possible solution is to use HRTFs with a high spatial
resolution for direct sound rendering (high computational cost, high quality) com-
bined with Ambisonics-based rendering of reverberant components (cost and quality
adjustable by means of the SH order) [16]. This exploits the fact that the spatial res-
olution of the auditory system is higher for the direct sound than for reverberant
components [50].
5.5 Conclusion
Section 5.2 gave an overview of existing quality measures for evaluating 3D audio
content and it became apparent that the underlying concepts can also be used to assess
audio quality in audiovisual virtual reality. Good suggestions were made to adapt
the application of these measures for AR/VR by simplifying the associated rating
interfaces and/or adapting methods for the statistical analysis. Open questions in this
field mainly seem to relate to the higher level constructs of QoE and presence. It will
be interesting to see how these can be measured with less intrusive user interfaces
or—in the best case—with indirect physiological or psychological measures. If such
methods were established, it would also be possible to further investigate how
far these higher level constructs are affected by specific aspects of audio quality.
Sections 5.3 and 5.4 introduced selected approaches for generating 3D audio for
AR/VR and reviewed their quality. The current best practice of using non-individual
binaural synthesis with compensated headphones for audio reproduction can gener-
ate plausible simulations and can significantly benefit from additional information
provided by 3D visual content. Recent advances in signal processing fostered the
combination of SFS and binaural reproduction. This improved the efficiency—a key
factor for enabling 3D audio rendering in mobile applications—without introducing
significant quality degradations. One current hot topic in the combination of SFS and
binaural reproduction is clearly 6DoF rendering. Many algorithms were suggested
for this; however, their development and even more so their perceptual evaluation
are still under investigation in the majority of cases. The interested reader may have
a look at recent articles as a starting point for discovering this field (e.g., [4,41]).
A second hot topic is the individualization of binaural technology. The effects of
individualization were discussed and it was shown that this makes it possible to cre-
ate simulations that are perceptually identical to a real sound field. Approaches for
individualization were, however, not detailed and the interested reader is referred to
the overview of Guezenoc and Renaud [25].
From the user perspective, it is worth noting that an increasing pool of software and
hardware is available for 3D audio reproduction.3 State-of-the-art audio processing
and reproduction methods are available as plug-ins that can easily be integrated into
the production workflow as well as in toolboxes that can be used for further research
and product development. This is complemented by VR/AR-ready hardware such as
microphone arrays as well as head-mounted displays and headphones with built-in
head trackers.
References
1. Ahrens, J.: Analytic methods of sound field synthesis 1st Edition (eds Möller, S., Küpper, A.,
Raake, A.) (Springer, Heidelberg, Germany, 2012).
2. Amengual Garí, S. V., Arend, J. M., Calamia, P. T., Robinson, P. W.: Optimizations of the
Spatial Decomposition Method for Binaural Reproduction. J. Audio Eng. Soc. 68, 959–976
(Dec. 2020).
3. Arend, J. M., Brinkmann, F., Pörschmann, C.: Assessing Spherical Harmonics Interpolation of
Time-Aligned Head-Related Transfer Functions. J. Audio Eng. Soc. 69, 104–117 (Feb. 2021).
4. Arend, J. M., Garí, S. V. A., Schissler, C., Klein, F., Robinson, P. W.: Six-Degrees-of-Freedom
Parametric Spatial Audio Based on One Monaural Room Impulse Response. Journal of the
Audio Engineering Society 69, 557–575 (July 2021).
5. Athif, M. et al.: Using Biosignals for Objective Measurement of Presence in Virtual Reality
Environments in 2020 42nd Annual International Conference of the IEEE Engineering in
Medicine & Biology Society (EMBC) (2020), 3035–3039.
6. Bech, S., Zacharov, N.: Perceptual audio evaluation. Theory, method and application (John Wiley &
Sons, West Sussex, England, 2006).
7. Blauert, J.: Spatial Hearing. The psychophysics of human sound localization Revised (MIT
Press, Cambridge, Massachusetts, 1997).
8. Brinkmann, F., Lindau, A., Weinzierl, S.: On the authenticity of individual dynamic binaural
synthesis. J. Acoust. Soc. Am. 142, 1784–1795 (Oct. 2017).
9. Brinkmann, F. et al.: A cross-evaluated database of measured and simulated HRTFs including
3D head meshes, anthropometric features, and headphone impulse responses. J. Audio Eng.
Soc. 67, 705–718 (Sept. 2019).
10. Brinkmann, F. et al.: A round robin on room acoustical simulation and auralization. J. Acoust.
Soc. Am. 145, 2746–2760 (Apr. 2019).
11. Brunnström, K. et al.: Qualinet white paper on definitions of quality of experience in 5th
Qualinet meeting (Novi Sad, Serbia, 2013).
12. Burstein, H.: Approximation formulas for error risk and sample size in abx testing. Journal of
the Audio Engineering Society 36, 879–883 (1988).
3 A list of available tools can, for example, be found at https://www.audio-technology.info/ under
the Resources section of the Binaural Technology chapter (last access 2022/06/17).
13. Deniaud, C., Honnet, V., Jeanne, B., Mestre, D.: An investigation into physiological responses
in driving simulators: An objective measurement of presence in 2015 Science and Information
Conference (SAI) (2015), 739–748.
14. Dey, A., Phoon, J., Saha, S., Dobbins, C., Billinghurst, M.: Neurophysiological Effects of
Presence in Calm Virtual Environments in 2020 IEEE Conference on Virtual Reality and 3D
User Interfaces Abstracts and Workshops (VRW) (2020), 744–745.
15. Engel, I., Alon, D. L., Robinson, P. W., Mehra, R.: The Effect of Generic Headphone Compen-
sation on Binaural Renderings in AES International Conference on Immersive and Interactive
Audio (Audio Engineering Society, York, UK, Mar. 2019).
16. Engel, I., Henry, C., Amengual Garí, S. V., Robinson, P. W., Picinali, L.: Perceptual implications
of different Ambisonics-based methods for binaural reverberation. J. Acoust. Soc. Am. 149,
895–910 (Feb. 2021).
17. Erbes, V.: Wave field synthesis in a listening room. Doctoral Thesis (University of Rostock,
Rostock, Germany, Aug. 2020).
18. Erbes, V., Schultz, F., Lindau, A., Weinzierl, S.: An extraaural headphone system for optimized
binaural reproduction in Fortschritte der Akustik -DAGA 2012 (Darmstadt, Germany, Mar.
2012), 313–314.
19. Freeman, J., Avons, S. E., Meddis, R., Pearson, D. E., IJsselsteijn, W.: Using behavioral realism
to estimate presence: A study of the utility of postural responses to motion stimuli. Presence:
Teleoperators & Virtual Environments 9, 149–164 (2000).
20. Gálvez, M. F. S., Menzies, D., Fazi, F. M.: Dynamic Audio Reproduction with Linear Loud-
speaker Arrays. J. Audio Eng. Soc. 67, 190–200 (Apr. 2019).
21. Gelfand, S. A.: Hearing: An introduction to psychological and physiological acoustics (CRC
Press, 2017).
22. Genzel, D., Schutte, M., Brimijoin, W. O., MacNeilage, P. R., Wiegrebe, L.: Psychophysical
evidence for auditory motion parallax. Proceedings of the National Academy of Sciences of
the United States of America 115, 4264–4269 (Apr. 2018).
23. Gil-Carvajal, J. C., Cubick, J., Santurette, S., Dau, T.: Spatial Hearing with Incongruent Visual
or Auditory Room Cues. Scientific Reports 6, 37342 EP (Nov. 2016).
24. Gomez-Bolaños, J., Mäkivirta, A., Pulkki, V.: Automatic regularization parameter for head-
phone transfer function inversion. J. Audio Eng. Soc. 64, 752–761 (Oct. 2016).
25. Guezenoc, C., Séguier, R.: HRTF Individualization: A Survey in 145th AES Convention (New
York, NY, USA, Oct. 2018), Paper 10129.
26. Gupta, R., Ranjan, R., He, J., Gan, W.-S.: Study on differences between individualized and
non-individualized hear-through equalization for natural augmented listening. In: AES Conf.
on Headphone Technology, San Francisco, CA, USA (Aug. 2019).
27. Gupta, R., Ranjan, R., He, J., Woon-Seng, G.: Investigation of effect of VR/AR headgear on
Head related transfer functions for natural listening in Audio Engineering Society Conference:
2018 AES International Conference on Audio for Virtual and Augmented Reality (2018).
28. Halbig, A., Latoschik, M. E.: A Systematic Review of Physiological Measurements, Factors,
Methods, and Applications in Virtual Reality. Frontiers in Virtual Reality 2, 89 (2021).
29. Hendrickx, E., Paquier, M., Koehl, V., Palacino, J.: Ventriloquism effect with sound stimuli
varying in both azimuth and elevation. J. Acoust. Soc. Am. 138, 3686–3697 (2015).
30. Hendrickx, E. et al.: Improvement of Externalization by Listener and Source Movement Using
a “Binauralized” Microphone Array. J. Audio Eng. Soc. 65, 589–599 (July 2017).
31. Hendrickx, E. et al.: Influence of head tracking on the externalization of speech stimuli for
non-individualized binaural synthesis. J. Acoust. Soc. Am. 141, 2011–2023 (Mar. 2017).
32. Hiekkanen, T., Mäkivirta, A., Karjalainen, M.: Virtualized listening tests for loudspeakers. J.
Audio Eng. Soc. 57, 237–251 (2009).
33. Hox, J. J.: Multilevel Analysis. Techniques and Applications Second (ed Marcoulides, G. A.)
(Routledge, New York, Hove, 2010).
34. ITU-R BS.1116-3: Methods for the subjective assessment of small impairments in audio sys-
tems (ITU, Geneva, Switzerland, 2015).
35. ITU-R BS.1283-2: Guidance for the selection of the most appropriate ITU-R Recommenda-
tion(s) for subjective assessment of sound quality (ITU, Geneva, Switzerland, 2019).
36. ITU-R BS.1284-2: General methods for the subjective assessment of sound quality (ITU,
Geneva, Switzerland, 2019).
37. ITU-R BS.1534-3: Methods for the subjective assessment of intermediate quality level of audio
systems (ITU, Geneva, Switzerland, 2015).
38. Jekosch, U.: Basic Concepts and Terms of Quality. Acta Acustica united with Acustica 90, 999–1006
(2004).
39. Jerald, J., Whitton, M.: Relating Scene-Motion Thresholds to Latency Thresholds for Head-
Mounted Displays in 2009 IEEE Virtual Reality Conference (Mar. 2009), 211–218.
40. Kadlec, H.: Statistical properties of d′ and β estimates of signal detection theory. Psychological
Methods 4, 22 (1999).
41. Kentgens, M., Jax, P.: Comparison of Methods for Plausible Sound Field Translation in
Fortschritte der Akustik - DAGA 2021 (Vienna, Austria, Aug. 2021), 302–305.
42. Le Bagousse, S., Colomes, C., Paquier, M.: State of the art on subjective assessment of spatial
sound quality in Audio Engineering Society Conference: 38th International Conference: Sound
Quality Evaluation (2010).
43. Le Bagousse, S., Paquier, M., Colomes, C.: Families of sound attributes for assessment of
spatial audio in 129th AES Convention (2010), Convention-Paper.
44. Leventhal, L.: Type 1 and type 2 errors in the statistical analysis of listening tests. Journal of
the Audio Engineering Society 34, 437–453 (1986).
45. Lindau, A.: The perception of system latency in dynamic binaural synthesis in NAG/DAGA
2009, International Conference on Acoustics (Rotterdam, Netherlands, 2009), 1063–1066.
46. Lindau, A., Weinzierl, S.: On the spatial resolution of virtual acoustic environments for head
movements on horizontal, vertical and lateral direction in EAA Symposium on Auralization
(Espoo, Finland, June 2009).
47. Lindau, A., Weinzierl, S.: Assessing the plausibility of virtual acoustic environments. Acta
Acust. united Ac. 98, 804–810 (Sept. 2012).
48. Lindau, A. et al.: A Spatial Audio Quality Inventory (SAQI). Acta Acust. united Ac. 100,
984–994 (Sept. 2014).
49. Lübeck, T., Helmholz, H., Arend, J. M., Pörschmann, C., Ahrens, J.: Perceptual Evaluation of
Mitigation Approaches of Impairments due to Spatial Undersampling in Binaural Rendering
of Spherical Microphone Array Data. J. Audio Eng. Soc. 68, 428–440 (June 2020).
50. Lübeck, T., Pörschmann, C., Arend, J. M.: Perception of direct sound, early reflections, and
reverberation in auralizations of sparsely measured binaural room impulse responses in AES Int.
Conf. Audio for Virtual and Augmented Reality (AVAR) (Aug. 2020).
51. Majdak, P., Masiero, B., Fels, J.: Sound localization in individualized and non-individualized
crosstalk cancellation systems. J. Acoust. Soc. Am. 133, 2055–2068 (Apr. 2013).
52. McAnally, K. I., Martin, R. L.: Sound localization with head movement: implications for 3-d
audio displays. Frontiers in Neuroscience 8, 210 (2014).
53. McCormack, L., Pulkki, V., Politis, A., Scheuregger, O., Marschall, M.: Higher-Order Spatial
Impulse Response Rendering: Investigating the Perceived Effects of Spherical Order, Dedicated
Diffuse Rendering, and Frequency Resolution. J. Audio Eng. Soc. 68, 338–354 (May 2020).
54. Meehan, M., Insko, B., Whitton, M., Brooks Jr, F. P.: Physiological measures of presence in
stressful virtual environments. Acm transactions on graphics (tog) 21, 645–652 (2002).
55. Mendonça, C., Delikaris-Manias, S.: Statistical Tests with MUSHRA Data in 144th AES Con-
vention (Milan, Italy, May 2018), Paper 10006.
56. Mendonça, C., Mandelli, P., Pulkki, V.: Modeling the perception of audiovisual distance:
Bayesian causal inference and other models. PLoS ONE 11, e0165391 (2016).
57. Middlebrooks, J. C.: Virtual localization improved by scaling nonindividualized external-ear
transfer functions in frequency. J. Acoust. Soc. Am. 106, 1493–1510 (Sept. 1999).
58. Minsky, M.: Telepresence. Omni, 45–51 (1980).
59. Moore, A. H., Tew, A. I., Nicol, R.: An initial validation of individualised crosstalk cancellation
filters for binaural perceptual experiments. J. Audio Eng. Soc. 58, 36–45 (Jan. 2010).
60. Noble, A. C. et al.: Modification of a standardized system of wine aroma terminology. American
journal of Enology and Viticulture 38, 143–146 (1987).
61. Paquier, M., Koehl, V.: Discriminability of the placement of supra-aural and circumaural head-
phones. Applied Accoustics 93, 130–139 (2015).
62. Pedersen, T. H., Zacharov, N.: The development of a sound wheel for reproduced sound in
Audio Engineering Society Convention 138 (2015).
63. Pike, C., Melchior, F., Tew, T.: Assessing the plausibility of non-individualised dynamic binaural
synthesis in a small room in AES 55th International Conference (Helsinki, Finland, 2014).
64. Parametric time-frequency domain spatial audio First (eds Pulkki, V., Delikaris-Manias, S.,
Politis, A.) (Wiley, Hoboken, NJ, USA, 2018).
65. Raake, A., Rummukainen, O. S., Habets, E. A. P., Robotham, T., Singla, A.: QoEvaVE - QoE
Evaluation of Interactive Virtual Environments with Audiovisual Scenes in Fortschritte der
Akustik - DAGA 2021 (Vienna, Austria, Aug. 2021), 1332–1335.
66. Riva, G., Waterworth, J. A., Waterworth, E. L.: The layers of presence: a bio-cultural approach
to understanding presence in natural and mediated environments. CyberPsychology & Behavior
7, 402–416 (2004).
67. Rummukainen, O. et al.: Audio Quality evaluation in virtual reality: Multiple stimulus ranking
with behaviour tracking in AES Int. Conf. on Audio for Virtual and Augmented Reality (AVAR)
(Redmond, USA, Aug. 2018).
68. Sanchez-Vives, M. V., Slater, M.: From presence to consciousness through virtual reality.
Nature Reviews Neuroscience 6, 332–339 (2005).
69. Schlenstedt, G., Brinkmann, F., Pelzer, S., Weinzierl, S.: Perceptual evaluation of transaural
binaural synthesis under consideration of the playback room [German: Perzeptive Evaluation
transauraler Binauralsynthese unter Berücksichtigung des Wiedergaberaums] in Fortschritte
der Akustik - DAGA 2016 (Aachen, Germany, Mar. 2016), 561–564.
70. Schoeffler, M., Herre, J.: About the different types of listeners for rating the overall listening
experience in Proceedings of the ICMC|SMC (Athens, Greece, 2014), 886–892.
71. Schoeffler, M., Silzle, A., Herre, J.: Evaluation of spatial/3D audio: Basic audio quality versus
quality of experience. IEEE Journal of Selected Topics in Signal Processing 11, 75–88 (2016).
72. Schwind, V., Knierim, P., Haas, N., Henze, N.: Using presence questionnaires in virtual reality
in Proceedings of the 2019 CHI conference on human factors in computing systems (2019),
1–12.
73. Silzle, A.: Quality taxonomies for auditory virtual environments in Audio Engineering Society
Convention 122 (2007).
74. Slater, M.: Measuring presence: A response to the Witmer and Singer presence questionnaire.
Presence 8, 560–565 (1999).
75. Slater, M.: Place illusion and plausibility can lead to realistic behaviour in immersive virtual
environments. Phil. Trans. R. Soc. B 364, 3549–3557 (2009).
76. Slater, M., Brogni, A., Steed, A.: Physiological responses to breaks in presence: A pilot study
in Presence 2003: The 6th annual international workshop on presence 157 (2003).
77. Slater, M., Wilbur, S.: A framework for immersive virtual environments (FIVE): Speculations
on the role of presence in virtual environments. Presence: Teleoperators & Virtual Environments
6, 603–616 (1997).
78. Tervo, S., Pätynen, J., Kuusinen, A., Lokki, T.: Spatial decomposition method for room impulse
responses. J. Audio Eng. Soc. 61, 17–28 (Jan. 2013).
79. Välimäki, V., Parker, J., Savioja, L., Smith, J. O., Abel, J.: More Than 50 Years of Artificial
Reverberation in 60th Int. AES Conf. DREAMS (Dereverberation and Reverberation of Audio,
Music, and Speech) (Leuven, Belgium, Feb. 2016).
80. Välimäki, V., Parker, J. D., Savioja, L., Smith, J. O., Abel, J. S.: Fifty Years of Artificial
Reverberation. IEEE Transactions on Audio, Speech, and Language Processing 20, 1421–1448
(July 2012).
81. Völker, C., Bisitz, T., Huber, R., Kollmeier, B., Ernst, S. M. A.: Modifications of the MUlti
stimulus test with Hidden Reference and Anchor (MUSHRA) for use in audiology. Int. J.
Audiology (2016).
82. Wefers, F.: Partitioned convolution algorithms for real-time auralization PhD thesis (RWTH
Aachen University, Aachen, Germany, Sept. 2014).
83. Werner, S., Klein, F., Mayenfels, T., Brandenburg, K.: A summary on acoustic room divergence
and its effect on externalization of auditory events in 8th Int. Conf. Quality of Multimedia
Experience (QoMEX) (Lisbon, Portugal, June 2016).
84. Wickens, T. D.: Elementary Signal Detection Theory (Oxford University Press, Oxford et al.,
2002).
85. Wierstorf, H.: Perceptual assessment of sound field synthesis Doctoral Thesis (Technical Uni-
versity of Berlin, Berlin, Germany, Sept. 2014).
86. Witmer, B. G., Jerome, C. J., Singer, M. J.: The factor structure of the presence questionnaire.
Presence: Teleoperators & Virtual Environments 14, 298–312 (2005).
87. Witmer, B. G., Singer, M. J.: Measuring presence in virtual environments: A presence ques-
tionnaire. Presence 7, 225–240 (1998).
88. Xie, B.: Head-related transfer function and virtual auditory display Second (J. Ross Publishing,
Plantation, FL, USA, 2013).
89. Zotter, F., Frank, M.: Ambisonics. A practical 3D audio theory for recording, studio production,
sound reinforcement, and virtual reality (Springer Open, Cham, Switzerland, 2019).
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Part III
Sonic Interactions
Chapter 6
Spatial Design Considerations
for Interactive Audio in Virtual Reality
Thomas Deacon and Mathieu Barthet
Abstract Space is a fundamental feature of virtual reality (VR) systems, and more
generally, human experience. Space is a place where we can produce and transform
ideas and act to create meaning. It is also an information container. When work-
ing with sound and space interactions, making VR systems becomes a fundamen-
tally interdisciplinary endeavour. To support the design of future systems, designers
need an understanding of spatial design decisions that impact audio practitioners’
processes and communication. This chapter proposes a typology of VR interactive
audio systems, focusing on their function and the role of space in their design. Spa-
tial categories are proposed to be able to analyse the role of space within existing
interactive audio VR products. Based on the spatial design considerations explored
in this chapter, a series of implications for design are offered that future research can
exploit.
6.1 Introduction
Technologies like virtual reality (VR) offer many ways of using space that could
benefit creative audio production and immersive experience applications. Using VRs
affordances for embodied interaction and spatial user interfaces, new forms of spatial
expression can be explored. Running parallel to VR research efforts in sonic interac-
tion in virtual environments (SIVE), much of sonic practice exists as applied design,
either as music making tools [110], experiential products [106], or games [102].
Commercial work is influenced by academia, but it is also based on broader pro-
fessional constituencies and practices not related to sound and music interaction
design.
T. Deacon (B
)
Media and Arts Technology CDT, Queen Mary University of London, London, United Kingdom
M. Barthet
Centre for Digital Music, Queen Mary University of London, London, United Kingdom
e-mail: m.barthet@qmul.ac.uk
© The Author(s) 2023
M. Geronazzo and S. Serafin (eds.), Sonic Interactions in Virtual Environments,
Human–Computer Interaction Series, https://doi.org/10.1007/978-3-031-04021-4_6
Much of VR design practice is communicated as professional dialogues, such
as platform or technology best practice guides [120,121], or reviews of “lessons-
learned” in industrial settings [105,122]. Within these professional dialogues, previ-
ous research, new technological capabilities, and commercial user research are col-
lected together to inform communities on how to best support users and task domains.
For the field of SIVE, and sound and music computing (SMC) more broadly, there
is still work to be done to bridge commercial practice and academic endeavours.
Despite recent works [6,77], there is a paucity of design recommendations and anal-
ysis regarding how to build spaces, interfaces, and spatial interactions with sound.
For the potential of VR to be unlocked as a creative medium, multi and interdis-
ciplinary work must be undertaken to bring together the disciplines that touch on
space, interaction, and sound.
Studying how people make immersive tools, in commercial and academic settings,
requires a means of framing how spatial design decisions impact users. This brings up
two problems: what role do commercial artefacts have in broadening research under-
standing, and how is relevant knowledge generated from such products? Objects,
prototypes, and artefacts create a context for forming new understanding [46]. By
analysing an artefact design, research can discover (recover and invent) requirements
to create technological propositions related to domain-specific concerns [82]. This is
because an artefact collects designers judgements about specific design spaces [33],
for instance how to solve interaction problems, and what aspects are of priority to
users at different points in an activity. However, this means we cannot recover the
needs of design by directly questioning the users alone. A broader research picture
is needed, one that integrates action with tools, users, and reflection on devices.
So, to develop an understanding for future design interventions, research should
gather diverse data to understand the existing practice and perceived professional
constituencies.1
Section 6.2 sets out the problem of space in more detail, highlighting important
contributions to the design of VR sound and music interaction systems. Section 6.2
also describes the suitability of typologies to spatial analysis for this research. Fol-
lowing on from this, Sect. 6.3.1 outlines the approach taken to the design review
and typology, indicating how relevant work was identified, selected, and coded.
Section 6.3 sets out a typology of interactive audio systems in VR, and presents case
studies of spatial design in the field. Section 6.5 looks across analyses and offers ways
to understand the design space of VR for SMC. Based on findings and reflections,
Sect. 6.6 proposes actionable design outcomes for further research, then Sect. 6.7
draws the work to a close.
1Prototypes are any representation of a design idea, regardless of medium, and an artefact is a
product or interactive system created for a design intervention/experiment [46].
6.2 Background
6.2.1 Terminology
This chapter analyses the spatial design of interactive audio systems (IAS) in VR.
IAS refers to any sound and music computing system that involves human interac-
tion that can modify the state of the sound or music system; however, we do not
review information-only auditory displays or audio-rendering technologies. While
both auditory displays and rendering technologies do include interactivity in their
operation, this chapter is interested in the use of interactive sound as the primary
function in the VR application, rather than when sound is used as an information
medium or renderer of spatial sounds without interactive feedback beyond head rota-
tion. No doubt there are significant overlaps in theory and application that would
be valuable to explore, but trying to address all aspects in one chapter requires a
different focus.
The following research areas pertain to spatial interaction with user interfaces
(UI)s:
•Spatial user interface (SUI): Human-computer interaction (HCI) with 3D or 2D
UI that is operated through spatial interaction, graphically or otherwise [59].
•Three-dimensional user interface (3DUI): A UI that involves 3D interaction [16].
•Distributed user interface (DUI): UIs that are distributed across devices, users, or
spatial access points [89].
There are also many terms to describe virtual spaces used for sound and music;
in particular, this research is concerned with immersive VR technology, following
the definition provided in [6]:
•Virtual—to be a virtual reality, the reality must be simulated (e.g. computer-
generated).
•Immersive—to be a virtual reality, the reality must give its users the sensation of
being surrounded by a world.
•Interactive—to be a virtual reality, the reality must allow its users to affect the
reality in some meaningful way.
The term VR can refer to the hardware systems for delivering immersive experiences
and to refer to the immersive experiences themselves. Hardware systems can include
commercial head-mounted display (HMD) technology, such as Oculus or HTC Vive,
through to complex stereographic projection-based Cave Automatic Virtual Environ-
ments (CAVEs) [12]. The key thing is that in these immersive environments the visual
system and interaction capacities are mediated through technological means. In the
case of social virtual reality (SVR), described in Chap. 8 of this volume, commu-
nication layers (speech, posture, and gesture) may or may not be mediated through
technological means, for instance co-located users may share a virtual world via
HMD but speech communication is unmediated. Or remote SVR users’ communica-
tion could be completely mediated by avatar representations and voice over internet
protocol (VoIP) technology.
6.2.2 Standing on the Shoulders of Giants, but Which Ones?!
SMC and SIVE are linked to the larger research field of HCI, so it is common practice
to adopt HCI research findings on how best to design systems. Below, Sect. 6.2.2.1
describes two examples of how interaction methods are used in the design of VR
for IAS. But as research in VR for SMC has developed, researchers have needed to
define and collect design principles specific to sound and music in VR, this work is
reviewed in Sect. 6.2.2.2.
6.2.2.1 Adapting Existing VR HCI Frameworks to Audio System
Design
To establish a dialogue around spatial considerations, there is a need to adopt find-
ings from other VR HCI disciplines. But as with the adoption of HCI evaluation
frameworks within new interfaces for musical expression (NIME) [78,91,98], crit-
ical understanding of the target domain (SMC) needs to be established [70,81]. For
instance, making expressive systems for musical creation or sonic experiences has
different design requirements than usability engineering [42], or demonstrations of
interaction techniques [8]. This is not to say that usability engineering is not impor-
tant, but rather the goal of design and evaluation needs to expand to include sonic
aesthetic qualities for audio-first spatial scenarios.
Selection and Manipulation Techniques
Object selection and manipulation is fundamental to VR environments where users
perform spatial tasks [52]. At a basic level, there are two main categories that describe
3D interaction for VR: Direct and indirect interaction techniques [5]. Object manipula-
tion examples of direct and indirect techniques can be seen in Fig. 6.1.
Fig. 6.1 Selection and manipulation mechanics in VR
Direct
interaction refers to having ‘virtual hands’; similar to touching and grabbing objects
in the real world. A benefit of direct interaction is that control maps virtual tasks iden-
tically with real tasks, resulting in more natural interaction [5]. Indirect interaction
refers to virtual pointing; like using a laser pointer (ray-casting) that can pick up and
drop objects in space. Indirect interaction lets users select objects beyond their area
of reach and requires relatively less physical movement. Overcoming the physical
constraints of the real world provides substantial benefits for the design of virtual
spaces, as the arrangement of elements can expand beyond body-scaled interaction.
Across both direct and indirect mechanics, interaction should be rapid, accurate,
error proof, easy to understand and control, and aim for low levels of fatigue [5].
Depending on how they are designed, both direct and indirect interactions enable
spatial transformations of objects, including rotation, scaling, and translation.
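As a minimal illustration of the indirect (ray-casting) mechanic, the function below tests whether a pointing ray hits a spherical proxy volume around an object. The function name and the use of a bounding sphere are illustrative choices; real VR toolkits typically add angular tolerance, snapping, and occlusion handling on top of such a test.

```python
import numpy as np

def ray_hits_sphere(origin, direction, center, radius) -> bool:
    """Does the ray origin + t * direction (t >= 0) intersect a sphere around center?"""
    origin, direction, center = (np.asarray(v, dtype=float) for v in (origin, direction, center))
    d = direction / np.linalg.norm(direction)
    to_center = center - origin
    t = float(np.dot(to_center, d))            # closest approach along the ray
    closest = np.linalg.norm(to_center - max(t, 0.0) * d)
    return closest <= radius

print(ray_hits_sphere([0, 0, 0], [0, 0, -1], [0.1, 0, -2.0], 0.25))  # True: pointing at it
```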
In adapting this research to sound and music interfaces, we must ask how tech-
niques impact musical processes and practices. For example, [13] describes the trade-
offs designers make when picking different control systems for virtual reality music
instrument (VRMIs). Work that has received less attention in SMC includes how
to design for some of the unique properties of VR media. The affordances of VR
expand into non-real interaction, so there is a fuzzy middle ground between direct
and indirect interaction. For instance, the Go-Go technique enlarges a user’s limbs
to be able to ‘touch’ distal objects [74]. In broader VR research, techniques like
the Go-Go are described under the term homuncular flexibility [93]; the ability to
augment proprioceptive perception of action capacity in VR, adapting interaction to
include novel bodies that have extra appendages or appendages capable of atypical
movements. An example of this type of research into IAS can be found in [27],
where magical indirect interaction was implemented to have audio control objects
float towards the user based on pinch actions (via Leap Motion sensor attached to
the HMD).
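The Go-Go mapping itself is compact enough to state directly: within a threshold distance the virtual hand follows the real hand one-to-one, and beyond it the reach grows quadratically [74]. The threshold and gain values below are illustrative placeholders rather than the parameters of any particular implementation.

```python
def gogo_virtual_hand_distance(real_dist: float, threshold: float = 0.45, gain: float = 10.0) -> float:
    """Go-Go arm-extension mapping: linear within `threshold` metres, then the
    virtual reach grows quadratically so that distal objects can be 'touched'."""
    if real_dist < threshold:
        return real_dist
    return real_dist + gain * (real_dist - threshold) ** 2

print(gogo_virtual_hand_distance(0.3), gogo_virtual_hand_distance(0.7))  # 0.3 vs 1.325
```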
User Interface Elements
Reviewing 3DUI for immersive music production interfaces, [11] proposes three
categories of representation for sound processes and parameters: virtual sensors like
buttons and sliders, dynamic/reactive widgets, and spatial structures; Fig. 6.2 provides
examples. These different representation categories provide a set of design templates
for audio production SUIs. For instance, fine-grained individual parameter control
may be better suited to sensor devices with precise control relationships. Whereas, if
spatio-visual feedback is required about an audio process being applied, a dynamic
widget is a suitable device to explore. Spatial structures can be used to represent
sequencers and relationships between parameters; as Sect. 6.4 indicates later, several
VR audio systems use these to represent either modular synthesis units or whole
musical sequencers.
6.2.2.2 Audio-Specific Design Frameworks
Design for IASs in VR is a developing field, surfacing the potential for new forms
of sound and music experience [20]. But the opportunities and constraints of VR
require critical analysis. For instance, embodied interfaces may offer benefits in
productivity and creative expression [62], but we still do not know if the same effects
are gathered by embodied interfaces in VR. Alongside this gap, there are gaps in
design understanding, with only a few design frameworks addressing how to create
VR interfaces and interactions for sound and music [6,11,77]. Across these works, a
deep level of design analysis around the fundamentals of perception, technology, and
action is prevalent. But, in terms of design knowledge to aid designers conceptualising
space, and the construction of audio interactions and experiences in it, information is
limited. Below is a review of the spatial aspects implicated in the design guidelines
of existing VR music system research.
Reviewing VRMI case studies, Serafin et al. outline nine principles to guide
design, focusing on immersive visualisation from performers’ viewpoint [77]. Design
principles support design focus on levels of abstraction, immersion, and imagination.
Their review of works features many examples of hybrid virtual-physical systems
and also highlights that VRMI are well suited to multi-process instruments given SUI
affordances. Regarding system design, their principles offer robust advice for musical
performance, but there is a lack of detail on how to go about designing different types
of spaces and interactions. For instance, within the principles, an emphasis is put on
making experiences social, but no guidance is provided on the design or evaluation
of social experience in VR. However, aspects of the case studies do draw attention
to spatial factors: menu design can ‘cloud’ the performance space; in large
interfaces, the mixture of control device and interface design means arm movements
and travel distances can be tiring; and the inclusion of physical control systems
supports natural, body-based interaction.
Fig. 6.2 Types of spatial UI for sound processes. Images from Leap Motion VR UI design sprint,
reproduced with permission from owner, Ultraleap Limited
Addressing Artful Design for VR sound interaction, Atherton and Wang describe
a series of design lenses with subordinate principles using case study analysis [6].
Their work focuses on the idea of creating totally immersive sonic VR. A central
concept of their work is the difference between designing for doing as distinct from
being in VR: “doing is taking action with a purpose; intentionally acting to achieve
an intended outcome. In contrast, we define being as the manner in which we inhabit
the world around us” [6]. Expanding on [77]’s suggestion to exploit the ‘magical’
opportunities of VR, Atherton and Wang highlight that designers should experiment
with virtual physics, scale and user perspective, and time; however, these seem
be general principles for VR interaction rather than sound-specific opportunities.
Within their discussions spatial concepts emerge, for instance, designers can phase
levels of interactivity to create different spaces for action in a scene. An actionable
design idea relating to this is to guide gaze attention throughout a space related to
narrative elements: if designers want people to stop doing and slow down, they can put
something in the sky above them, as it is not an ideal place to work or interact. Atherton and Wang
highlight that designers need to determine different languages of interaction. Design
concepts should move beyond functional language towards things that map well
to sonic expressions, e.g. instead of physical descriptors like speed of movement
and gravity on an object, an interaction language would be intensity and weight
and weightlessness. For Atherton and Wang, play, and particularly social play, is
a synthesis of doing and being, as it is both an activity and a state. Designers can
support play by:
1. lowering users’ inhibitions and encouraging them to play;
2. engaging users in diverse movement;
3. allowing users to be silly;
4. making opportunities for discovery in virtual space.
Related to play and interaction, on the social level, designers should provide sub-
spaces within larger worlds and engineer collective interaction scenarios.
6.2.3 Typologies and Spatial Analysis
A typology is a classification of individual units within a set of categories that are
useful for a particular purpose. Typologies support the evaluation of a number of
different indicators in an integrated manner, based on the identification of relevant
links or themes. Within architecture, design typologies are a common method of
spatio-visual analysis [24,72]. The teaching of architectural systems uses an ordered
set of types to define areas of interlocking design [22], for instance, in Fig. 6.3 the
concept of form is described using a series of types and representative examples.
But typologies can also represent ‘spatial qualities’ regarding interaction, see
Fig. 6.4 where different creative spaces (meeting rooms, maker spaces) can possess
positive and negative attributes for certain activities (socially inviting or separating,
playful or serious) [84]. It is this interpretive layer within a set of similar objects that
makes typologies a valuable analysis method. We can step out from just the formal
representation of space and shape and ask: how does this form or behaviour impact
human needs and experience?
Compared to a systematic literature review, a design typology includes references
to artefacts regardless of whether they have received formal user evaluation or received
previous research analysis. The reasoning is that much of the work happening in
the VR music field is happening outside academia, so rather than reflecting design
parameters only within previous academic dialogues, design understanding should
also be based on practice.
Compared to a taxonomy, typology is preferred for this work, as the separation of
types is non-hierarchical and potentially multi-faceted. Classification is done accord-
ing to structural features, common characteristics, or other forms of patterns across
instances. Within a typology, there is no implicit or explicit hierarchy connecting
different research artefacts and products in VR. Also depending on the granularity
of the type suggested, a single artefact may exist within two types simultaneously.
Using typologies, themes of significance can be traced across systems; these patterns
may describe best practices, observe patterns in interaction, explain good designs,
or capture experience or insight so that other people can reuse these solutions.
6.3 Design Analysis
6.3.1 Methodology
As a formal process the typology was built upon identification, selection, and coding
of audio-visual virtual spaces.
Identification: Literature gathering was achieved by parsing VR examples from
the Musical XR literature dataset. Practice and product examples were gathered
across the first author’s thesis research period using search engines, internet forums,
interviews, and social media [25].2
Fig. 6.3 Example of a spatial typology of form within architecture, adapted from [22]
2 https://github.com/lucaturchet/Musical_XR_publication_database.
Fig. 6.4 Example of a spatial typology within design, taken from [84]. Reprinted from Design
Studies, 56, Thoring et al., Creative environments for design education and practice: A typology of
creative spaces, 54–83, Copyright (2018), with permission from Elsevier and Katja Thoring
Selection: Findings were assessed for relevance to the analysis. Cases were
included on the basis of the following criteria: (1) Is the system based on immersive
VR technology via an HMD? (2) Is the primary function or design intention of the
artefact related to sound or music?
Coding: A form of deductive and inductive thematic coding was undertaken,
based upon thematic analysis [17]. An inductive approach involves allowing the
data to determine your themes, whereas a deductive approach involves coming to
the data with some preconceived themes you expect to find reflected there, based
on theory or existing knowledge. For this research, the deductive element was the
setting of top-level coding categories (UI, Space Use, Social Engagement, Skill
Level, Interactions) that probe how a VR IAS was constructed, the questions used
are available in Table 6.1. The inductive coding reflects themes within the deductive
categories based on the interface designs. Coding sources involved: use of the
VR system where possible; review of online video sources; analysis of images; and
review of documentation and published literature. In each activity, notes and open-
coding were undertaken on system design using qualitative data analysis software.
Table 6.1 Coding system developed for typology. Bold codes indicate deductive code categories,
italics are inductive themes
Code: Description

UI: What are the types of UI exploited in the VR interface?
  Screen-like: 3D or 2D UI is used in VR that behaves like a standard screen menu or workspace
  3D Objects: 3D UI is used for information and action
  None: No conventional UI or SUI is provided to users, such as an open world terrain or an
  external musical controller
  Physical: No functional UI or SUI is offered inside the VE, but an external hardware musical
  controller is used
Space Use: How is space used in this device?
  Sonic: The positions of people or objects in space have an impact on sound processing; space is
  a functional element of the sound design process
  Visual: Interactive visual feedback is provided based on the positions or orientation in space of
  people or objects
Social Engagement: What number of users was the application designed to support?
  Solo: Single-user spaces with no intelligent-agent interaction
  Collaborative: Multi-user, or single/multi user with intelligent-agent interactions
  Collective: Massive multiplayer environments, both human and agent-based
Skill Level: Was the system designed for novices, experts or both?
  Novice
  Expert
  NA: No formal user study conducted
Interactions: What is the flow of action and the related system response?
  Sonic-Visual: Coupling between sound and visual features, where sound changes visual features
  Visual-sonic: Coupling between visual and auditory information, where the visual information
  changes sound properties
  Sonic-sonic: Audio input used to control system features that relate to sound
After this, the deductive sweep was undertaken where the sources, open-codings and
notes were reviewed in the context of each deductive category, and this resulted in
the inductive themes that can be found in Table 6.1.
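To make the coding scheme concrete, the sketch below shows how a single artefact might be recorded against the deductive categories and inductive themes of Table 6.1; the artefact name and its theme assignments are hypothetical illustrations, not entries from the actual dataset.

```python
# Hypothetical coding record for one VR IAS artefact, following Table 6.1.
# Deductive categories are the keys; inductive themes are the values.
artefact_coding = {
    "name": "ExampleSequencerVR",          # illustrative, not in the dataset
    "UI": ["Screen-like", "3D Objects"],   # mixed 2D panels and 3DUI widgets
    "Space Use": ["Sonic", "Visual"],      # position affects sound and visuals
    "Social Engagement": "Collaborative",  # multi-user or user + agent
    "Skill Level": "Novice",
    "Interactions": ["Visual-sonic"],      # visual changes drive sound changes
}

def matches(coding, category, theme):
    """Check whether an artefact is coded with a given theme in a category."""
    value = coding.get(category, [])
    return theme in (value if isinstance(value, list) else [value])

if __name__ == "__main__":
    print(matches(artefact_coding, "Space Use", "Sonic"))  # True
```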
6.3.2 Typology of Virtual Reality Interactive Audio Systems
Here a typology of VR IASs is proposed, delineating how different systems function
overall and how space is used in their design. The referencing of work in this section
differentiates between commercial products and academic publications, using two
different reference sections for clarity. The typology is split into two broad categories
within which VR products and research are discussed:
1. Type of Experience/Application—here we collate instances of products and
research by their function as a sound and music system in VR.
2. Role of Space—in this phase we look across the different types of systems to
suggest how the design of space can be categorised.
6.3.2.1 Type of Experience
Most implementations of interactive VR sound and music systems fall into one or
several of the categories in the subsequent list. Many cited products have no formal
user testing results available.
•Audio-Visual Performance Environment: Audience-oriented systems for play-
back or live performance of compositions with audio-visual interactions [14,51,
101,109]. For audience-oriented systems, interactivity is related to being part of
a social group of spectators, rather than being able to interact sonically.
•Augmented Virtuality (AV): A VR HMD acts as a visual output modality along-
side physical controllers or smart objects, creating an AV system [34,43,100].
This descriptor excludes augmented reality (AR) technologies, such as HoloLens,
as the visual overlay effect is considered different to the total re-representation of
visual stimuli that occurs in VR [99].
•Collaborative: Some form of collaborative interaction occurs in the VR audio sys-
tem (human or agent-based). The interaction must be to directly make sound/music
together [12,25,51,63,103,110,119], rather than more presentational systems
like an audience cohabiting with performers in a virtual shared space; denoted
by the Audio-Visual Performance Environments category. Examples and design
considerations are described in Sect. 6.4.
•Conductor: Controlling audio-visual playback characteristics of a pre-existing
composition [51,117].
•Control Surface: VR as a visual and interactive element to manipulate an existing
digital audio workstation’s (DAW) functionality, e.g., Reaper [104].
•Generative Music System: Partial or total algorithmic music composition, where
the sound is experienced in VR space, and/or controlled by spatial interaction in
VR [57,116].
•Learning Interface: VR systems to support the learning of music, either as per-
formance tutoring, theory, or general concepts in music such as genre [48].
•Music Game: Systems where gameplay is oriented around the player’s interactions
with a musical score or individual songs. A good example is Beat Saber [102], the
best-selling VR game of all time at the time of publication.
•Narrative and Soundscape: Pieces that integrate interactive audio in virtual real-
ity [85,116].
•Physics Interaction: Physics-based sonic interaction systems [27,106].
•Sandbox3: Designed like visual programming languages for digital sound
synthesis—such as Pure Data, Max/MSP, and VCVRack—these VR sandboxes
use patching together of modules to create sound [112–114].
•Sequencer: Drum and music sequencers in VR. As sequencing is a common feature
of many musical applications, this category refers to interfaces that are either just
a sequencer or use sequencing somewhere within their interaction design [27,63,
103,110,112,119].
•Spatial Audio Controller: Mixer style control of spatial audio characteristics of
sources and effects [9,25,27,43,69,90,104].
•Sounding Object: Virtual object manipulation with parametric sound output [67,
68].
•Scientific Instrument: VR systems designed to test an audio or interaction
tool/feature; a good example is a VR-based binaural spatialisation evaluation sys-
tem [35,73].
•VR DAW: Virtual audio environment, multi-process 3D interfaces for creation and
manipulation of audio. An important feature is the recording of either audio or perfor-
mance data from real-time interaction. Interface abstraction and control metaphors
may differ significantly from conventional desktop DAWs [12,27,88,103,110,119].
•VRMI: Virtual modelling and representations of existing acoustic instruments or
synthesis methods [9,12,19,31,34,51,56,61,66,68,71,80,110,114,118].
Overlaps and Contrasts
Due to the broad design scopes of some systems, an artefact can appear in mul-
tiple categories, or exist in a space between two categories. For instance, [51] is
in Audio-Visual Performance Environments, Collaborative, Conductor, and VRMI.
While [12] is technically a VR DAW, the audio and interaction design concept
is highly idiosyncratic, so it becomes closer to a VRMI. The following statements
intend to clarify any issues regarding overlaps in terminology.
•Sounding Objects vs. Physics Interaction: Both types refer to physics-based inter-
actions; sounding objects are when the mesh structures of objects are the source
of sound generation/control (e.g. scanned synthesis of an elastic mesh), whereas
physics interactions include collision-based interactions for sound generation or
use of physics systems to control single or multiple audio features (e.g. parameters
or spatialisation). The interested reader might refer to Chap. 2 for more details on
these topics.
•VRMI vs. Sandbox: While both can refer to synthesis methods, sandboxes are
specifically modular construction environments, whereas synthesis methods in
VRMIs would be a closed form of synthesiser e.g. playing a DX7 emulator in
virtual reality.
3 Category name and description sourced from [4].
6.3.2.2 Role of Space
Many of the systems outlined above offer novel interaction methods coupled with
3D visualisation. Looking at how space is used in VR music and audio systems
provides a different way to group research and design contributions. For simplicity,
the following categories are presented as discrete areas, but dimensions would also
be suitable (i.e. systems could belong to several categories, see [15,39] for examples
of dimension-based classification for digital musical instruments (DMIs)).
Space as a holder of elements for musical input/sonic control The most domi-
nant form of spatial design is to use space as a container for interactive elements
that either produce sound or control sound in some way. Within this category,
key differences are whether menu-based SUI is used [103], or more object-based
3DUI is exploited [12]; this is discussed further in the next section. Other works
include: [19,27,31,56,61,63,66–69,71,88,100,104,109,110,112–114,118].
Space as a medium of sonic experience In these sorts of systems, space is woven
into every aspect of user experience or system design. For instance, in [9], the sonic
operation of the VR system makes no sense if users do not engage in collaborative
spatial behaviours. In this category, the relationship of spatial interaction to
system feedback can be predominantly passive, like a recorded soundscape [85],
or fully interactive, like an audio-visual arts piece that maps spatial input to output
modalities [90]. In some cases, visual space may only be a supporting medium for
a spatial sonic experience [85]. It is worth noting that spatial audio controllers
are not instantly considered as part of this category. As spatial audio controllers
deal with controlling and manipulating elements, they are considered to be part of
the Space as a holder of elements for musical input/sonic control category. Rather,
this category holds experiences where spatiality is more intrinsically involved in
the interaction between elements and user experience, whereas in a controller
system it is a functional relationship. Other works include: [43,57,80,106].
Space as a visual resource to enhance musical performance In this category,
space is primarily used for its visual and spatial representation opportunities
rather than as a direct control system or as an intrinsic part of the sonic expe-
rience derived from the system. Designers use space as an extra layer to a music
performance or system, for example, this can be to:
1. Present performers with enhanced visual feedback related to their playing
of a musical instrument [34];
2. Provide a space for an audience to contribute to a collective experience of
musical performance [14]; or
3. Use space as a place for an audience to convene for a music performance in
VR [101,109].
6.4 Spatial Design Analysis Case Studies
The state of the art in VR audio production and immersive musical experiences
includes single-user and collaborative approaches. In the following case studies,
the spatial and social design decisions are discussed; noting that each of the sys-
tems serves different purposes as musical experiences. Our motivation is to further
detail design typology categories, by understanding and comparing the decisions VR
designers make. Reviews are broken into four areas: single-user systems, collabora-
tive systems, collective systems, and spatial audio production systems. We focus
on these areas, only within immersive music and interactive sound
production, so that design comparisons and implications can have some level of
shared context. We chose the field of immersive music as a point of shared interest
between academia and industry. But it would be valuable to probe design decisions
comparatively between broader fields of SIVE design, for instance, auditory display
and sound production systems; however, this would be a different contribution.
6.4.1 Single-User Systems
Figure 6.5 shows the Music Room [118], an instrument space containing multiple
VRMIs that are designed to be played with the VR controllers, following a DAW-
like workflow of perform and record, then arrange and edit. Instruments include a
drum-kit, laser harp, pedal steel guitar and a chord harp. The spatial setup mimics a
Fig. 6.5 Single-user VR spatial design considerations A—music room instrument space, with drum
kit instrument being used and the recording panel UI visible, displaying previously recorded data
Fig. 6.6 Single-user VR spatial design considerations B—sandboxes, node-edge structures and
modular systems
conventional studio. In Fig. 6.5 we can see spatial 2D graphical user interfaces (GUIs)
presenting recorded information and menu function, while 3DUI objects are used
to represent instruments, and a 360 photograph of a real studio provides the visual
backdrop. A design decision of the space was to situate all instruments in a circle
around the user, presumably to be able to play all the VRMIs in a small physical
space. Two areas are utilised for the UI: action space and display space. The action
space is for the VRMIs, and the display space, further away from the user, provides
a conventional GUI. To be able to interact with the distant GUI, laser pointers are
used.
Sound Stage [114] (Fig. 6.6a) and Mux [113] are modular instrument-building
Sandboxes in VR. Users can define their own systems and perform music through
those systems. Both are multi-process VRMIs designed for room-scale interaction.
In these systems, a user surrounds themselves with modules and reactive widgets, and
‘patches’ them up using VR controllers. While stimulating and highly interactive,
the resulting virtual spaces can be complex and messy spatial arrangements (author’s
opinion); Fig. 6.6a shows an example of a sound system made with Mux, highlighting
the spatial-visual complexity. One possible reason arrangements become complex is
because spatial organisation is arbitrary and user-defined. A novel spatial feature is
that speaker scale controls source loudness, and this turns a slider or number UI into
a 3DUI interaction process.
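As a rough illustration of how such a scale-to-loudness mapping could work, the sketch below converts a speaker object’s uniform scale into a gain value; the scale range and linear dB mapping are assumptions for illustration, not the actual behaviour of Sound Stage or Mux.

```python
import math

def scale_to_gain_db(scale, min_scale=0.2, max_scale=3.0,
                     min_db=-40.0, max_db=0.0):
    """Map a speaker object's uniform scale to a gain in decibels.

    The scale is clamped to [min_scale, max_scale] and mapped linearly
    onto [min_db, max_db]; all ranges here are illustrative assumptions.
    """
    s = max(min_scale, min(max_scale, scale))
    t = (s - min_scale) / (max_scale - min_scale)
    return min_db + t * (max_db - min_db)

def db_to_amplitude(db):
    """Convert decibels to a linear amplitude multiplier."""
    return 10.0 ** (db / 20.0)

if __name__ == "__main__":
    for scale in (0.2, 1.0, 3.0):
        db = scale_to_gain_db(scale)
        print(f"scale={scale:.1f} -> {db:+.1f} dB (amp {db_to_amplitude(db):.3f})")
```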
LyraVR [112] and Drops [106] are two examples of Sandbox systems
that build the temporal behaviour of the composition using spatial relationships.
Figure 6.6b and c show LyraVR, a musical ‘playground’ where users build music
sequences in space to create audio-visual compositions. The node-based sequencer
allows the creation of units in free space. Although aimed at single users, such an inter-
action and playback method would be scalable to collaborative systems. Drops is a
VR ‘rhythm garden’, where a user creates musical patterns using the interaction of
objects and simulated gravity. The system requires setting up object nodes (‘eggs’)
that release ‘marbles’ that make a sound when they strike other surfaces—the size
and release frequency of marbles can be manipulated by the user. By adding more
surfaces and modifying planes of movement for marbles, the musical composition
is built using a ‘physical’ design process. In LyraVR, Mux, and Sound Stage, users
interact with sound elements via spatial node-edge structures, and this gains a level
of immediacy for musical changes at the cost of spatial-visual complexity. But the
embodied control of temporal musical behaviour via the arbitrary positioning of 3DUI
does create an experimental creative process driven by interaction in space.
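To make this ‘physical’ composition process more concrete, the following minimal sketch shows a gravity-driven marble triggering a sound event on impact; the mappings from marble size and impact speed to pitch and velocity are assumptions for illustration, not Drops’ actual implementation.

```python
from dataclasses import dataclass

GRAVITY = -9.81  # m/s^2, downward

@dataclass
class Marble:
    y: float        # height above the surface plane (m)
    vy: float       # vertical velocity (m/s)
    size: float     # radius-like size parameter, larger = lower pitch (assumed)

def step(marble, surface_y, dt):
    """Advance one physics step; return a sound event dict on impact, else None."""
    marble.vy += GRAVITY * dt
    marble.y += marble.vy * dt
    if marble.y <= surface_y and marble.vy < 0.0:
        impact_speed = -marble.vy
        event = {
            # Assumed mappings: bigger marbles -> lower pitch,
            # faster impacts -> louder notes (clamped to MIDI-style 0-127).
            "pitch": max(36, int(84 - marble.size * 24)),
            "velocity": min(127, int(impact_speed * 20)),
        }
        marble.y = surface_y
        marble.vy = -marble.vy * 0.5   # simple bounce with energy loss
        return event
    return None

if __name__ == "__main__":
    m = Marble(y=1.5, vy=0.0, size=1.0)
    for _ in range(200):
        e = step(m, surface_y=0.0, dt=0.01)
        if e:
            print("note:", e)
```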
6.4.2 Collaborative Systems
Block Rocking Beats [103], LeMo [63], and Polyadic [25] are collaborative music
making (CMM) Sequencers. However, the systems have different approaches to
spatial design for collaborative interaction. LeMo and Polyadic are the only
collaborative systems in this review that have undergone formal user studies [25,63,
64].
Block Rocking Beats, Fig. 6.7a and b, enables avatar-based (head and hands only)
remote collaborative music production in a virtual sound studio for up to three peo-
ple. The space is modelled like a futuristic studio, adapting a conventional layout
of production equipment areas and multiple screens. The environment provides a
sequencer interface for each user while project information is displayed on a single
large screen within the environment, and this provides some level of shared visual
information. Additionally, reactive systems alter environment appearance in sync
with music created. As a spatial layout, users’ positions are fixed in the space, a few
meters from each other in a semi-circle facing the front screen. The layout limits the
capacity to view each other’s workspaces and may inhibit forms of mutual monitor-
ing. Regarding avatar design, the characters are highly stylised, and the ‘hand’
representation takes the form of a tapered wand. The taper enlarges the usable
sequencer area, since at a normal button scale the controller would hit multiple
buttons at once.
LeMo allows two co-located users to engage in avatar-based CMM in VR, using a variety
of sequencer instruments [63–65]. Depending on experimental condition, different
Fig. 6.7 Collaborative VR music production interfaces
spatial features would be activated, such as private workspace areas and spatially
reactive loudness. Studies of LeMo evaluated visual and sonic workspace design,
based on the concept of public and private territory, developing design implications
for SVR; for detailed findings please consult Chap. 8 of this volume. Setting aside the
experimental findings, as a spatial design, compared to Block Rocking Beats and
Polyadic, LeMo allows users to move and rotate their workspaces to accommodate
social interaction around the task of music making, commonly using face-to-face or
side-by-side arrangements (see Fig. 6.7e). A novel design feature of note is that SUI
sequencers can be minimised into ‘bubbles’ to rearrange space. As these sounds are
spatially located, the bubble acts as both a UI and an audio object. Additionally, the
inclusion of 3D drawing as a communication medium enables a variety of annotation
behaviours. Like Block Rocking Beats and Polyadic, avatar design was rudimentary
offering a head with gaze direction, however, the use of Leap Motion as the input
device enables more detailed hand representations. These were used for functional
input and social communication, e.g. waving and pointing.
Polyadic enables collaborative composition of drum loops to accompany backing
tracks for two co-located participants [25]. The system is designed to be instantiated
in two user interface media, VR and Desktop. The design motivation of Polyadic was
to compare VR and Desktop media concerning usability, creativity support, and col-
laboration. In order to create a fair comparison of media, constraints were imposed
on the design of both media types. This limited the design of features to only use
control methods that could work equally across both conditions, namely a standard
step sequencer with per step volume and timing control. In the VR condition, the
environment uses fixed placement of 3DUI sequencers made up of virtual sensor
buttons and sliders, see Fig. 6.7f. Low fidelity avatars were utilised to allow rudi-
mentary social cues. Avatars used a sphere head with ‘sunglasses’ to indicate gaze
direction and two smaller spheres to indicate hands, enabling simple spatial refer-
encing. Additionally, each user’s workspace and interface actions were replicated
within the other users’ environment, enabling referencing and looking at what the
other is doing.
EXA [110], Fig. 6.7d, is a collaborative Instrument Space where multiple users
can compose, record, and perform music using instruments of their own design. EXA
differs from the previous examples as users input musical sequence information in
real time using drum-like instruments, rather than pressing sequencer buttons. Once
sequences are made they can be edited using menus and button presses. Similar to
LeMo, EXA allows users to freely organise their workspace in line with collaborative
needs. Also, the custom design of VRMIs introduces idiosyncratic uses of space in
order to perform each VRMIs. Like others, EXA utilises simple head and hands
avatars.
6.4.3 Collective Systems
The following reviews are special cases, social VR platforms designed for musical
experiences, pictured in Fig. 6.8. As predominantly music visualisations in VR, there
is limited sonic interactivity for users. So the focus is on how these spaces act as
collective social experiences in VR. For broader discussion of music visualisation in
XR, see [92]. While not sound production platforms in themselves, the experience
of a collective engagement in VR, related to audio-visual performance, is an area
of immersive entertainment where new production tools and design experience are
required.
The WaveVR [101], Fig. 6.8a, is a cross-platform social VR experience, like going
to a ‘gig’ in VR. Artists can use it to make audio-visual experiences for audiences
across the world. As a virtual space, the shared focus of a stage is used for most
performances, but the virtual space is reconfigured for each ‘gig’; similar to different
theatre performances all taking place on the same stage. In one instance, music
toy spaces were designed for the audience to interact with musical compositions;
these took the form of objects that change the level of audio effects based on spatial
position or touch interaction. As the objects cannot all be controlled by one person,
this creates a collective ‘remix’ of the content [111]. For further reviews of some
individual ‘gigs’ in The WaveVR see [6].
Volta is an immersive experience creation and broadcasting system [108]. Perfor-
mances are rendered in VR using artists’ existing tools and workflows, such as
parameter mapping a DAW to drive visual feedback systems. In addition to the
VR performance, a mixed reality (MR) experience is also broadcast to streaming
platforms like Youtube and Twitch. Volta differs from The WaveVR in its production
method for the artist. In The WaveVR, developing a performance environment can
take a development team months to build, at significant cost. Volta cuts down
production time by integrating existing tools with spatial experience design templates
(e.g. particle systems), into a streamlined production process for real-time virtual
performance environments.4
6.4.4 Spatial Audio Production Systems
In the following review of spatial audio production systems in VR, all systems use
binaural spatial sound presented over headphones (Chaps. 3 and 4 provide an effec-
tive introduction to such audio technology). It is possible for some of the systems
(DearVR Spatial Connect,ObjectsVR) to be used with speaker arrays but the design
implications of this are not considered in this review.
Addressing spatial audio production, both the Invoke [25] and DearVR Spatial
Connect [104] systems allow users to record motion in VR to control sound objects.
4 The first author supported the design of early prototypes of Volta XR; interested readers can review
the design development at https://thefuturehappened.org/Volta.
Fig. 6.8 Collective music experience spaces in VR
The main functional difference between the systems is that DearVR Spatial Connect
uses a DAW to host the audio session with the VR system acting as a control layer
for spatial and FX automation, while Invoke is a self-contained collaborative spatial
audio mixing system. The systems also differ in their design approach to space and
sonic interaction.
Figure 6.9a shows Invoke, a collaborative system that focuses on expressive spatial
audio production using voice as an input method. The system utilises a mixture of
direct and indirect spatial interaction to record spatial-sonic relationships. A Voice
Drawing feature allows for the specification of spatio-temporal sonic behaviour in a
Fig. 6.9 VR spatial audio production systems
continuous multimodal interaction. Voice input is recorded as loudness automation,
while a drawn trajectory controls the location of the spatialised audio over time. Using
an automated process, the trajectory is segmented into a Bézier curve with multiple
control points for further collaborative manipulation. The UI design uses a mixture
of 3DUI (audio objects, trajectories) and semi-transparent ‘screens-in-space’ (hand
menus, world-space menus). Spatially, users can navigate the virtual space using
teleport functionality; all menus travel with the user when they teleport. Invoke is
the only system in this review to implement more detailed avatar design, each user
is represented by a body, head and arms, utilising additional sensors on each user to
provide accurate body-to-avatar positioning. This enabled detailed forms of social
interaction and spatial awareness [25].
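The sketch below illustrates, under stated assumptions, the two ingredients of such a Voice Drawing interaction: a frame-wise RMS curve standing in for the loudness automation, and a uniform downsampling of the drawn path standing in for the Bézier segmentation; neither function reflects Invoke’s actual implementation.

```python
import math

def rms_loudness(samples, frame_size=1024):
    """Frame-wise RMS of a mono voice signal, as a simple loudness automation curve."""
    curve = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        curve.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return curve

def simplify_trajectory(points, n_control=8):
    """Downsample a drawn 3D trajectory to a few control points.

    Uniform index sampling is a stand-in for the Bézier segmentation used in
    Invoke; the number of control points here is an arbitrary assumption.
    """
    if len(points) <= n_control:
        return list(points)
    step = (len(points) - 1) / (n_control - 1)
    return [points[round(i * step)] for i in range(n_control)]

if __name__ == "__main__":
    # Fake inputs: a quiet-then-loud voice signal and a circular hand path.
    voice = [0.01 * math.sin(i * 0.1) for i in range(4096)] + \
            [0.5 * math.sin(i * 0.1) for i in range(4096)]
    path = [(math.cos(t * 0.1), 1.5, math.sin(t * 0.1)) for t in range(200)]
    print("loudness frames:", [round(v, 3) for v in rms_loudness(voice)])
    print("control points:", simplify_trajectory(path))
```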
Figure 6.9b shows DearVR Spatial Connect, a professional spatial audio produc-
tion application. The system uses an indirect interaction method to control objects in
space; a laser pointer controls position while the VR controller thumb-stick controls
distance from the centre. The design of the surrounding space adds no features beyond
the interface panels and 3DUI (e.g. sound sources), as users commonly project a 360
video into the production space. Also, the user is ‘pinned’ to the centre of the space,
again in line with the rendering perspective of spatial audio for 360 video. One issue
of the central design is a lack of perspective on multiple objects that may be dis-
tant from the centre. Also, fatigue and motion noise (distant objects ‘wobble’ more
spatially) impact control of objects at a distance (dependent on input device design
and user-based ergonomic factors like strength and motor control) [5]. In contrast,
Invoke does not constrain users to the central listening position when
mixing audio objects; users can freely teleport around to gain different sonic and
visual/interaction perspectives. This is important as the spatio-temporal mixing of
sound creates a complex field of trajectories and sound objects [25].5
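A minimal sketch of this kind of indirect placement is given below: a pointer direction plus a thumb-stick-driven distance yields an object position around the central listening point. The axis conventions, speeds, and ranges are assumptions, not DearVR Spatial Connect’s actual control code.

```python
import math

def pointer_to_position(yaw_deg, pitch_deg, distance):
    """Place a sound object along a laser-pointer ray from the listening centre.

    yaw/pitch give the pointer direction; distance is driven by the thumb-stick.
    Axis conventions and units here are illustrative assumptions.
    """
    yaw = math.radians(yaw_deg)
    pitch = math.radians(pitch_deg)
    x = distance * math.cos(pitch) * math.sin(yaw)
    y = distance * math.sin(pitch)
    z = distance * math.cos(pitch) * math.cos(yaw)
    return (x, y, z)

def nudge_distance(distance, thumbstick_y, dt, speed=2.0,
                   min_d=0.5, max_d=20.0):
    """Adjust the object's distance from the centre with the thumb-stick."""
    d = distance + thumbstick_y * speed * dt
    return max(min_d, min(max_d, d))

if __name__ == "__main__":
    d = 3.0
    d = nudge_distance(d, thumbstick_y=1.0, dt=0.016)   # push stick forward
    print(pointer_to_position(yaw_deg=45.0, pitch_deg=10.0, distance=d))
```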
ObjectsVR is a system for expressive interaction with spatial sound objects. The
system provides spatio-temporal interaction with electronic music using 3D geo-
metric shapes and a series of novel interaction mappings; examples can be seen in
Fig. 6.10. User hand control is provided via a Leap Motion, and the experience is
rendered using an HMD. As a spatial audio control system, object positions were a
mixture of direct manipulation and ‘magical’ physics-based interaction. Users could
pick up and throw sounds around the space, but an orbiting mechanic meant that
sound objects would always move back within grabbing distance. A novel spatial
feature of this environment was the use of contextual UI when users grabbed certain
objects. When a user grabbed objects that had 3D mappings, a 3D grid of points
would appear to provide relative positioning guidance. When released, the grid fades
away. System design and evaluation investigate users’ natural exploration and probe
the formation of understanding needed to interact creatively in VR; full details of the
evaluation can be found in [27].6
5 The first author participated in formal beta testing of the DearVR Spatial Connect product.
6 ObjectsVR was a single-user system designed and tested by the first author during a research
internship at a VR experience design firm.
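A minimal sketch of this contextual-UI behaviour is shown below: a reference grid fades in while a mapped object is grabbed and fades out on release. The fade rate and linear fade are assumptions, not ObjectsVR’s implementation.

```python
def update_grid_alpha(alpha, grabbing, has_3d_mapping, dt, fade_rate=4.0):
    """Fade a 3D reference grid in while a mapped object is grabbed, out otherwise.

    A minimal sketch of the contextual UI behaviour described above; the fade
    rate and linear fade are illustrative assumptions.
    """
    target = 1.0 if (grabbing and has_3d_mapping) else 0.0
    if alpha < target:
        alpha = min(target, alpha + fade_rate * dt)
    else:
        alpha = max(target, alpha - fade_rate * dt)
    return alpha

if __name__ == "__main__":
    alpha = 0.0
    for frame in range(10):
        alpha = update_grid_alpha(alpha, grabbing=True, has_3d_mapping=True, dt=0.016)
    print(round(alpha, 2))   # grid has faded most of the way in (~0.64)
```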
Fig. 6.10 ObjectsVR interface user interaction examples
6.5 Discussion and Implications
6.5.1 Spatial Design Considerations
Consolidating the reviews of products and research, a series of design parameters
emerge.
Complexity of spatial representation
Based on the analysis of Sandbox systems (Mux and Sound Stage), it is suggested
that an unrestricted patching metaphor may be too visually complex for applications
like collaborative audio production in VR. Also, systems that build the timing of
compositions in space, LyraVR and Drops, suffer from spatial-visual complexity issues.
Similar to visual programming languages [36], when all points of state-change are
presented in one space (a low level of abstraction), the information becomes diffuse,
and errors may become more frequent. Also, when space is used for functional
relationships, like musical time, visual design cannot bracket the visual complexity
without the design of abstractions. Related to these issues, the impact of these design
features is unknown for collaborative systems. Future research could design systems
to observe spatial organisation patterns undertaken by users to make sense of, and
work with, arrangements.
Screens-in-space and workspace zones
For certain information (selection menus, settings, note sequences), systems use
either conventional 2D information presentation in a floating screen (Music Room,
Block Rocking Beats, EXA, DearVR Spatial Connect), or attempt to redesign infor-
mation using forms of 3DUI (Lyra, Mux). Also, as described in the Music Room
analysis, space can be delineated into different action or information presentation
spaces. The decision to locate functionality in screens or more novel 3DUI is an
important one for collaborative systems, as each different method offers different
access points and levels of shared visual information for collaboration. For instance,
in LeMo, each SUI could be minimised into a bubble for easy arrangement and organ-
isation. A temptation of VR design could be to embody all interaction in ‘physical’
3DUI, such as novel interaction widgets or spatially multiplexed 3DUI (see Fig. 6.2).
But this could result in added spatio-visual complexity like in Sandbox systems, to
deal with this there would be a need for contextual interaction layers (e.g. when I put
a cube here it’s different from when I put it there), or function navigation using but-
ton combinations on controllers (VR 3D modelling software does this [107]). Another
impact of using entirely 3DUI is that it could limit the amount of shared visual infor-
mation, as arrangements of ‘physical’ objects naturally obscure each other. However,
3DUI may provide more access points to embodied collaboration.
Level of acoustic spatial freedom
Related to spatial audio, the ability to move from the centre position is a key design
decision that needs to be made, especially for collaborative audio production soft-
ware. For single-user apps, being able to manipulate arrangements away from the
sweet spot is of value. For collaborative apps, multiple users located at the sweet
spot would severely impact normal social interaction.
Workspace organisation
For workspace organisation, it should be considered whether fixed or movable UI
is preferred for certain audio production tasks. For instance, LeMo, EXA, and Invoke
each utilised methods for users to reorganise the SUI, while artefacts like Block
Rocking Beats and Polyadic did not.
Control, Play and Expression
Designers should consider how playful they make spatial audio experiences, or
whether specific control and sound automation is the design target. For instance,
in the ObjectsVR system spatial audio objects had ‘magical’ interaction; contrast-
ing this, DearVR Spatial Connect emulates DAW automation. What is missing here
are more examples of user experience in mixed systems, and environments to play-
fully explore spatial sound interactions with levels of direct control and serendip-
ity. Related to making experience of control more expressive, integrating different
modalities provides opportunities to expand on the DAW control paradigm, such as
in Invoke.
Egocentric spatial design
Related to the previous two features, some systems (e.g. Mux, Music Room) tend
towards egocentric spatial patterns, with devices and elements situated around the
user, oriented to one spatial viewpoint. While making sense for an individual applica-
tion, these forms of design decisions need to be carefully considered in collaborative
systems.
Avatar Design
An issue of importance to collaborative systems is avatar design and the spatial
behaviours that they enable. For instance, inside LeMo, the use of the Leap Motion
compared to standard VR controllers enabled more detailed forms of hand gesturing.
Within HCI, work has already begun to evaluate avatars based on the constraints of
commercial VR [53]. What this area should focus on is moving beyond the so-called
Minimalist Immersion in VR using only simplistic avatar design. Within Invoke, the
avatar design utilised a more detailed body representation, offering beneficial charac-
teristics for social space awareness, as users can interpret gaze and body orientations
along with hand gestures. This highlights an important area of further research for
collaborative and collective systems, where there should be detailed evaluations of
the avatar designs’ impact on music production activities.7
6.5.2 Role of Space and Interaction
Comparing the separation of the Role of Space with previous research on the space
of interaction [75], similarities emerge. River and MacTavish analyse space, time
and information concepts within HCI across a series of paradigms [75]:
•Media Spaces [86]—media types
•Windows, icons, menus, pointer (WIMP) [47]—user space management
•Tangible user interface (TUI) [44]—space-body-thing interaction
•Reality-based interaction (RBI) [49]—emerging embodied interaction styles
•Information spaces [10]—interaction trajectories and navigation of information
•Proxemic interactions [37]—social spatial relationships
The key spatial dimensions that emerge are:
Dimension 1 Media and Space Management ↔ Meaning through interaction
Dimension 2 Personal and physical ↔ Social and behavioural
Dimension 1 describes the difference between conventional GUI design (e.g.
WIMP) versus approaches using space and the embodiment of technology (e.g. RBI).
Dimension 1 relates to the previous analysis on the Role of Space (Sect. 6.3.2.2):
•Space as a holder of elements for musical input/sonic control
•Space as a medium of sonic experience
•Space as a visual resource to enhance musical performance
Dimension 2 highlights how space influences personal and social interactions. This
is because information is distributed across technologies and is also embedded into
contextual spaces, from immediate personal space through to social groups and larger
collective social interaction spaces. Looking at these ideas together, a framework of
research emerges for VR IAS spatial design. The functional uses of space in VR IAS
relate to traditional understanding in the design of media types, user space man-
agement, and TUI, while space as a medium of sonic experience can benefit from
7 Preprint available at https://hal.archives-ouvertes.fr/hal-03099274.
Fig. 6.11 Spatial experience design in VR IAS Venn diagram
research in the areas of RBI and information spaces. Finally, proxemic interaction
can inform things like social spaces for musical enhancement. But this doesn’t go far
enough. What needs to be included in space for interactive audio is an understanding
of architectural space. This is because VR designers must make important decisions
regarding space as an element of user experience. Regarding social aspects, as high-
lighted earlier in Fig. 6.4 [84], we can design space for functions, activities and for
their spatial quality. We must design spaces for intimate individual action, shareable
group interaction, and visibility and safety in large collective action spaces. Acousti-
cally the sorts of choices we make here matter too. For example, using simple voice
chat algorithms could make voice intelligibility poor and yield something similar
to ‘zoom fatigue’ [7]. Instead, we can utilise spatially aware audio communications
to deliver intelligible audio for each user in an area of space [60], a commercial
approach to this already exists that can handle hundreds of listener-sources across a
space [115].
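As a minimal sketch of this idea, the function below computes per-speaker voice gains from listener-speaker distance using a simple rolloff with a cutoff, so nearby voices stay intelligible while distant ones fade out; the rolloff law and distance values are illustrative assumptions rather than the algorithms described in [60] or [115].

```python
import math

def voice_gains(listener_pos, speaker_positions, ref_dist=1.0, max_dist=15.0):
    """Per-speaker gains for spatially aware voice chat.

    Inverse-distance rolloff with a hard cutoff; all parameters here are
    illustrative assumptions, not a specific product's algorithm.
    """
    gains = {}
    for speaker_id, pos in speaker_positions.items():
        dist = math.dist(listener_pos, pos)
        if dist >= max_dist:
            gains[speaker_id] = 0.0
        else:
            gains[speaker_id] = min(1.0, ref_dist / max(dist, ref_dist))
    return gains

if __name__ == "__main__":
    listener = (0.0, 1.7, 0.0)
    speakers = {"alice": (1.0, 1.7, 0.5), "bob": (12.0, 1.7, 9.0)}
    print(voice_gains(listener, speakers))   # alice audible, bob cut off
```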
We suggest that spaces need elevated priority in our VR design and evaluation
practices. To support this process, we suggest three top-level spatial categories that
need to be addressed through interdisciplinary design work: spaces/places, inter-
faces, interactions. Visualised in Fig. 6.11, some of the elements discussed in this
chapter are positioned within the different design spaces; for instance, VR selection
and manipulation techniques sit between interfaces and interactions. For brevity,
only the category of spaces/places is discussed in detail below, as previous research
within interfaces and interactions is already well documented in this chapter and other
research [6,12,77]. The categories scaffold future design by drawing together top-
ics, theories, and previous art. Addressing elements that overlap with spaces/places
in Fig. 6.11, we can use the Venn structure to ask new questions about the interaction
of spaces in feature design. For instance, context-aware on-body UI refers to the idea
that if we have more specific spaces for interaction we can also tune the needs of UI
to be relevant to that moment in space and time. The notion of putting it on our body,
like a virtual smart watch, means that this design element is part of interfaces,
interactions, and spaces/places. Implicit in such simple categories is the equalising
of spaces as a design concern alongside more thoroughly investigated work like spa-
tial interfaces and spatial interaction. Fully describing such a framework is beyond
the capacity of this chapter; instead, it is offered as a proposition for the research
field to further explore together.
6.5.2.1 Spaces/Places
Spaces are the architectural layouts and areas that form features of a virtual envi-
ronment used for sound and music activities in VR. An example of a space can be
seen in Fig. 6.12. In that figure a central production area is enclosed in a grid/cage
structure, bounding it off from the wider spatial setting of floating ‘sand-dunes’ and
night sky. But what does it mean to design for experience within space, and how
does this relate to an IAS? Borrowing from human geography and architecture [22,
87], some spatial concepts to consider are:
1. Boundaries;
2. Form and space;
3. Organisations and arrangements;
4. Circulation (i.e. movement through space);
5. Proportion and scale;
6. Principles and metaphors (e.g. Symmetry, Hierarchy, Rhythm).
Places are spaces with fixed or emergent social meaning [32]. We can aim to
design the spatial qualities of spaces, for instance, the typology of [84] in Fig. 6.4,
gives designers ways to conceptualise creative spaces. We can ask, what is the space
type (e.g. personal or collaborative), and what is the intended spatial quality (e.g.
knowledge processor or process enabler)? Then we can ask, within those bound-
aries, what are other spatial characteristics, e.g. comfort, sound, sight, spaciousness,
movement, aliveness/animus?
As architecture, human geography, and interior design are such deep disciplines,
interdisciplinary work needs to be done here to produce a dialogue around the design
of space for sonic and musical expression. One area of mutual influence to consider is
the design of immersive installations that involve technology to alter user experience.
VR can learn from techniques and theories in this area [3], as well as be used to
prototype systems for physical installation.
Fig. 6.12 Example of a VR IAS space, the Invoke artefact’s spatial audio composition area
6.6 Research Directions and Opportunities
6.6.1 Embodied Motion Design
Echoing the design principles within Atherton and Wang’s work [6], motion, embod-
iment and play are important design spaces to explore. However, human motion and
spatial analysis is not a new discipline for computing and technology, with dedicated
research communities such as the International Conference on Movement and Computing
(MOCO) and the ACM SIGGRAPH Conference on Motion, Interaction and Games
(MIG). Within these existing dialogues, the role of embodiment is a central topic
of design [83] (see Chap. 7 for further details). What would differ in virtual spaces
is a form of synthesis, or symbiosis, between visual and proprioceptive embodi-
ments. The plural is intentional, as virtual environments may introduce the idea that
embodiment is not a fixed state, with avatars and motion feedback being augmented
by the virtual setting. A research problem in this area is determining appropriate
vocabularies for low-level and high-level motion so that systems of motion analysis
and mapping can be utilised in an informed way. But the difficulty in VR IAS is that
systems will often need to utilise data from only the headset and controllers, where
many previous approaches have been developed using high-resolution motion cap-
ture data [29]. Also, motion design is not just a single person experience. Take for
instance dancing in a crowd. Research into virtual togetherness through joint embod-
ied action is a rich direction for collaborative and collective systems to explore [40].
6.6.2 Designing for Collaborative Sound and Music in
Virtual Reality
There is a paucity of design and evaluation frameworks addressing social experiences
in sound and music VR, although work is ongoing in this area, for instance, Men
and Bryan-Kinns’ chapter in this volume (Chap. 8). To address the gap in design
knowledge for VR, design perspectives from other embodied CMM and HCI research
provide valid considerations for the design of SVR. The following integration of
research from other fields intends to offer SMC actionable research directions to
support collaboration in VR.
Adapting Tangible User Interface Research
An area of potential influence on spatial design for social VR is to look at how
TUIs are designed to support spatial collaboration. For example, [65]’s research on
CMM in VR shows similar results to co-located CMM using TUIs [96], regarding the
design of public and private workspaces. When designing TUIs for co-located CMM,
spatial orientation and configuration are important design areas. The Hitmachine is
a tangible music-making tool for children, focused on creating and understanding
collective interaction experiences [38]. To understand interactions with devices like
the Hitmachine, there is a need to design social interactions and technology together.
For designers, this means specifying and evaluating how people distribute attention,
share attention, dialogue, and engage in collective action. To analyse designs in
context, spatial formations of peoples’ positions and orientations can be analysed
to understand different constructions of social play in CMM [38]. Observations of
social engagement around Hitmachine found that the configuration of space (people,
furniture, and music interfaces) altered the level of social interaction. Also, regarding
the design of space in VR, research findings from VR CMM resemble the results
from the Hitmachine analysis [64]: how spatial encounters are set up for music
interaction impacts social interaction. So, to design collective interaction spaces, how
basic spatial partitions are implemented matters.
Another TUI design principle of relevance is to provide multiple access points to
a collaborative task [45,76]. This means devising multiple spatial ways for different
users to act on the same object, creating a form of DUI. Research suggests that giving
participants more access points to a collaborative task improves how equitable
participation is [76]. Increasing tangibility is also said to improve participation.
This is because users can complement what each other are doing in spatial tasks,
using space as an organiser of the shared activity [76]. Adapting tangibility to VR
means designing the affordances of objects appropriately to allow collective spatial
interaction, while keeping in mind that we can move beyond some of the constraints
embedded in physical reality. A good example of this is in VR Sandboxes. In physical
reality, physics governs layout patterns of blocks whereas in VR elements can be
placed in any part of 3D space. This in turn impacts the design of modules and how users
connect them [6]. But as mentioned previously, idiosyncratic design patterns within
Sandboxes may need additional support for collaboration, and this is where previous
TUI work could be integrated [97].
Collectively, these similarities suggest that as a form of spatial collaboration, VR
CMM can benefit from other non-VR research findings regarding spatial interaction
to design systems. But, directly importing collaborative design concepts from other
media should be done carefully, and thoroughly evaluated for any differences in
results across media (see [25] for a media comparison study focusing on this).
Designing for Embodiment in Collaboration
Embodied spatial input and avatar representation are key features of VR for support-
ing intimacy [54], awareness and coordination [41], and control [1]. Spatial media,
such as VR, has the capacity for visual and spatial abstraction of UI, something
needed for the complex requirements of expert music production [28]. The follow-
ing examples highlight some specific opportunities to support spatial collaboration.
Augmented Object Interaction The affordances of embodied interaction in SUI
offer possibilities to transform how joint action on complex digital objects can
occur [1,2,8,21,55].
Awareness Support Embodied control and spatial representation in VR can ame-
liorate mutual understanding issues in shared workspaces compared to other
media [79]; support informal awareness to co-ordinate actions given shared visual
information [30]; provide pointer mechanisms that support referencing of con-
tent and environmental objects [23,94,95]; allow for the recording of embodied
motion, as a form of embodied memory within an environment [58,63]; provide
novel mechanisms for the division of labour and workspace organisation [64].
Spatial Problems Space is a powerful organiser of human memory and can change
how we solve problems [18,50], and VR, compared to WIMP systems, is sug-
gested to alter problem-solving strategy in spatial tasks [50].
These considerations have in common an influence on the interaction space in col-
laboration. This suggests that the collaborative process in sound and music production
could be improved by designing support for augmented interaction and awareness.
For example, in a common studio environment, a shared screen (or set of
screens), a keyboard and mouse, and a mixing desk with dedicated audio outboard gear
are typically the tools in the hands of audio producers. In contrast, in an embodied VR interface, the
possible interaction space can centre around collaborative spaces where functionality
is engineered to support mutual access and modification, adapting levels of visibility
and position based on collaborative needs.
6.6.3 Spatial Audio Production for Immersive Entertainment
VR provides an ostensibly promising environment for spatial audio production; it
is an example of a professional workflow that could benefit from further research
into interaction methods in VR. The spatial nature of the technology, and action
in it, could address problems encountered when making audio compositions
in space (e.g. transformation of spatial reference frames between self and audience)
[25,26]. Regarding the previous analysis, a highly significant research area would be
the management of complexity in the information design of spatial representation.
The impact of these improvements would be felt within fields such as immersive
entertainment, where spatial audio technologies allow the engineering of sound-
scapes that represent real or imagined sonic worlds, using the location of sounds
in space as a critical component of audience experience. In particular, there is an
under-explored research opportunity in VR to enable more collaborative practice
for spatial audio production. This addresses a need in professional audio production
communities that look to make content for immersive entertainment.8
6.7 Conclusion
Much of how we design VR is based on borrowed design principles. We import ideas
from other disciplines and hope they ‘fit’. But to capitalise on any opportunities for
enhanced expression and new forms of sonic entertainment presented by VR, we
must set out how we design, what that involves, and what that excludes. Given such
a broad focus embedded in the concept of space, the first goal of any schematic
representation of design types and guidelines is to find suitable descriptors to collect
the features relevant to domains of research. For researchers, this means setting out
the design rationale behind systems clearly, so that over time we can understand the
emerging practice and propose novel directions. This research offered the beginning
of this process for the design of IAS for VR, setting out the different functional types
both research and commercial interests pursue while reflecting on the way space is
implicated in their design. This provides a framework for spatial design, highlighting
a set of actionable areas for future design research. From our perspective, a key
missing piece is guidance about how to design spatial social experiences in VR
for engagement with sound and music. We need to define the transitions between
individual, collaborative and collective interaction when it comes to audio interaction.
A stepping stone towards filling this gap is more research into avatar design for SIVE, as to
start assessing spatial transitions in social activity we need to understand virtual
embodiment as the vessel that affords basic social communication beyond speech.
Looking forward, we should begin to think about what it means to be an immersive
application designer that is audio-first. Realising that such practice will need to integrate
concepts from acoustics, architecture, phenomenology, HCI and SMC calls us
to think about transdisciplinary pedagogical models to support development in the
field.
8 Narrative and physical experiences that engage an audience member in a fictional world, for
instance immersive VR theatre production.
References
1. Aguerreche, L., Duval, T., Lécuyer, A.: Comparison of Three Interactive Techniques for
Collaborative Manipulation of Objects in Virtual Reality in CGI (Computer Graphics Inter-
national) (Singapore, 2010).
2. Aguerreche, L., Duval, T., Lécuyer, A.: Reconfigurable Tangible Devices for 3D Virtual
Object Manipulation by Single or Multiple Users. Proceedings of the 17th ACM Symposium
on Virtual Reality Software and Technology, 227–230 (2010).
3. Akpan, I., Marshall, P., Bird, J., Harrison, D.: Exploring the effects of space and place on
engagement with an interactive installation in ACM Conference on Human Factors in Com-
puting Systems (CHI) (ACM Press, New York, New York, USA, Apr. 2013), 2213.
4. Andersson, N., Erkut, C., Serafin, S.: Immersive Audio Programming in a Virtual Reality
Sandbox. In Audio Engineering Society Conference: 2019 AES International Confer-
ence on Immersive and Interactive Audio (Audio Engineering Society, Mar. 2019).
5. Argelaguet, F., Andujar, C.: A survey of 3D object selection techniques for virtual environ-
ments. Computers and Graphics (Pergamon) 37, 121–136 (2013).
6. Atherton, J., Wang, G.: Doing vs. Being: A philosophy of design for artful VR. Journal of
New Music Research 49, 35–59 (2020).
7. Bailenson, J.N.: Nonverbal Overload: A Theoretical Argument for the Causes of Zoom Fatigue.
Technology, Mind, and Behavior 2 (Feb. 23, 2021).
8. Baron, N.: Collaborative Constraint: UI for Collaborative 3D Manipulation Operations in
IEEE Symposium on 3D User Interfaces (2016), 269–270.
9. Barrass, S., Barrass, T.: Musical creativity in collaborative virtual environments. Virtual Real-
ity 10, 149–157 (2006).
10. Benyon, D., Höök, K., Nigay, L.: Spaces of Interaction in Proceedings of the 2010 ACM-
BCS Visions of Computer Science Conference (BCS Learning & Development Ltd., Swindon,
GBR, Apr. 2010), 1–7.
11. Berthaut, F.: 3D interaction techniques for musical expression. Journal of New Music Research
49, 60–72 (2020).
12. Berthaut, F., Desainte-Catherine, M., Hachet, M.: DRILE: an immersive environment for
hierarchical live-looping in Proceedings of the International Conference on New Interfaces
for Musical Expression (NIME) (2010), 192–197.
13. Berthaut, F., Hachet, M., Desainte-Catherine, M.: Piivert: Percussion-based interaction for
immersive virtual environments. IEEE Symposium on 3D User Interfaces (3DUI), 15–18
(2010).
14. Berthaut, F., Martinez, D., Hachet, M.: Reflets : Combining and Revealing Spaces for Musical
Performances. Proceedings of the International Conference on New Interfaces for Musical
Expression, 116–120 (2015).
15. Birnbaum, D., Fiebrink, R., Malloch, J., Wanderley, M. M.: Towards a dimension space for
musical devices in Proceedings of the International Conference on New Interfaces for Musical
Expression (NIME) (2005), 192–195.
16. Bowman, D. A. et al.: New Directions in 3D User Interfaces. International Journal 5, 3–14
(2006).
17. Braun, V., Clarke, V.: Using thematic analysis in psychology. Qualitative research in psychol-
ogy 3, 77–101 (2006).
18. Burgess, N.: Spatial memory: how egocentric and allocentric combine. Trends in Cognitive
Sciences 10, 551–557 (2006).
19. Cabral, M. et al.: Crosscale: A 3D virtual musical instrument interface in 2015 IEEE Sympo-
sium on 3D User Interfaces (3DUI) (Mar. 2015), 199–200.
20. Çamcı, A., Hamilton, R.: Audio-first VR: New perspectives on musical experiences in virtual
environments. Journal of New Music Research 49, 1–7 (2020).
21. Chénéchal, M. L., Lacoche, J.: When the Giant meets the Ant An Asymmetric Approach
for Collaborative and Concurrent Object Manipulation in a Multi-Scale Environment. IEEE
Symposium on 3D User Interfaces, 277–278 (2016).
6 Spatial Design Considerations for Interactive Audio in Virtual Reality 213
22. Ching, F. D. K.: Architecture: Form, Space, & Order Fourth edition (Wiley, Hoboken, New
Jersey, 2015).
23. Cockburn, A., Quinn, P., Gutwin, C., Ramos, G., Looser, J.: Air pointing: Design and evalu-
ation of spatial target acquisition with and without visual feedback. International Journal of
Human-Computer Studies 69, 401–414 (2011).
24. Colquhoun, A.: Typology and Design Method. Perspecta, 71–74 (1969).
25. Deacon, T.: Shaping Sounds in Space: Exploring the Design of Collaborative Virtual Reality
Audio Production Tools PhD thesis (Queen Mary University of London, 2020).
26. Deacon, T., Bryan-Kinns, N., Healey, P. G., Barthet, M.: Shaping sounds: The role of gesture
in collaborative spatial music composition in Creativity and Cognition (ACM, San Diego,
2019), 121–132.
27. Deacon, T., Stockman, T., Barthet, M. in Bridging People and Sound: 12th International
Symposium, CMMR 2016, Sáo Paulo, Brazil, July 5–8, 2016, Revised Selected Papers (eds
Aramaki, M., Kronland-Martinet, R., Ystad, S.) vol 10525, 192–216 (Springer International
Publishing, Cham, 2017).
28. Duignan, M., Noble, J., Biddle, R.: Abstraction and Activity in Computer- Mediated Music
Production. Computer Music Journal 34, 22–33 (2010).
29. Durupinar, F., Kapadia, M., Deutsch, S., Neff, M., Badler, N. I.: PERFORM: Perceptual
Approach for Adding OCEAN Personality to Human Motion Using Laban Movement Anal-
ysis. ACM Transactions on Graphics 36, 6:1–6:16 (Oct. 2016).
30. Ens, B. et al.: Revisiting collaboration through mixed reality: The evolution of groupware.
Computer Supported Cooperative Work 131, 81–98 (2019).
31. Fillwalk, J.: ChromaChord : A Virtual Musical Instrument in 2015 IEEE Symposium on 3D
User Interfaces, 3DUI 2015 - Proceedings (2015), 201–202.
32. Gardair, C., Healey, P. G. T., Welton, M.: Performing places. Proceedings of the 8th ACM
conference on Creativity and cognition - C&C ’11, 51 (2011).
33. Gaver, W. W.: What Should We Expect From Research Through Design? In ACM Conference
on Human Factors in Computing Systems (CHI) (2012), 937–946.
34. Gelineck, S., Böttcher, N., Martinussen, L., Serafin, S.: Virtual Reality Instruments capable
of changing Dimensions in Real-time in Enactive (2005).
35. Geronazzo, M. et al.: The Impact of an Accurate Vertical Localization with HRTFs on Short
Explorations of ImmersiveVirtual Reality Scenarios in 2018 IEEE International Symposium
on Mixed and Augmented Reality (ISMAR) (Oct. 2018), 90–97.
36. Green, T. R. G., Petre, M.: Usability Analysis of Visual Programming Environments: A
’Cognitive Dimensions’ Framework. Journal of Visual Languages and Computing 7, 131–
174 (1996).
37. Greenberg, S., Marquardt, N., Ballendat, T., Diaz-Marino, R., Wang, M.: Proxemic Interac-
tions: The New Ubicomp? Interactions 18, 42–50 (Jan. 2011).
38. Grønbæk, J. E. et al.: Designing for Children’s Collective Music Making: How Spatial Ori-
entation and Configuration Matter in Nordic Conference on Human-Computer Interaction
(NordiCHI) (2016), 23–27.
39. Hattwick, I., Wanderley, M. M.: A Dimension Space for Evaluating Collaborative Musical
Performance Systems in Proceedings of the International Conference on New Interfaces for
Musical Expression (NIME) (2012), 429–432.
40. Himberg, T., Laroche, J., Bigé, R., Buchkowski, M., Bachrach, A.: Coordinated Interpersonal
Behaviour in Collective Dance Improvisation: The Aesthetics of Kinaesthetic Togetherness.
en. Behavioral Sciences 8, 23 (Feb. 2018).
41. Hindmarsh, J., Fraser, M., Heath, C., Benford, S., Greenhalgh, C.: Object focused interaction
in collaborative virtual environments. ACM Transactions on Computer-Human Interaction 7,
477–509 (2000).
42. Hix, D., Gabbard, J. L. in Handbook of Virtual Environments chap. 28 (2014).
43. Honigman, C.: The Third Room : A 3D Virtual Music Paradigm in Proceedings of the Inter-
national Conference onNewInterfaces for Musical Expression (NIME) (2011).
214 T. Deacon and M. Barthet
44. Hornecker, E., Buur, J.: Getting a grip on tangible interaction in ACM Conference on Human
Factors in Computing Systems (CHI) (2006), 437.
45. Hornecker, E., Marshall, P., Rogers, Y.: From Entry and Access - How Shareability Comes
About in Designing pleasurable products and interfaces (2007).
46. Houde, S., Hill, C.: What do prototypes prototype? Handbook of Human Computer Interac-
tion, 1–16 (1997).
47. Hutchings, D. R., Stasko, J.: Revisiting Display Space Management: Understanding Current
Practice to Inform next-Generation Design in Proceedings of Graphics Interface 2004 (Cana-
dian Human-Computer Communications Society, Waterloo, CAN, May 2004), 127–134.
48. Innocenti, E. D. et al.: Mobile Virtual Reality for Musical Genre Learning in Primary Educa-
tion. en. Computers & Education 139, 102–117 (Oct. 2019).
49. Jacob, R. et al.: Reality-based interaction: a framework for post-WIMP interfaces in ACM
Conference on Human Factors in Computing Systems (CHI) (2008), 201–210.
50. Jin, Y., Lee, S.: Designing in virtual reality: a comparison of problem-solving styles between
desktop and VR environments. Digital Creativity 6268 (2019).
51. Jung, B., Hwang, J., Lee, S., Kim, G. J., Kim, H.: Incorporating Co-Presence in Distributed-
Virtual Music Environment in Proceedings of theACMSymposium onVirtualReality Software
and Technology (Association for Computing Machinery, New York, NY, USA, Oct. 2000),
206–211.
52. Jung, J. et al.: A Review on Interaction Techniques in Virtual Environments. Proceedings of
the 2014 International Conference on Industrial Engineering and Operations Management,
1582–1590 (2014).
53. Kolesnichenko, A., McVeigh-Schultz, J., Isbister, K.: Understanding Emerging Design Prac-
tices for Avatar Systems in the Commercial Social VR Ecology in Proceedings of the 2019
on Designing Interactive Systems Conference (Association for Computing Machinery, New
York, NY, USA, June 2019), 241–252.
54. Kolkmeier, J., Vroon, J., Heylen, D.: Interacting with virtual agents in shared space: Single and
joint effects of gaze and proxemics. Lecture Notes in Computer Science (including subseries
Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10011 LNAI,
1–14 (2016).
55. Lages, W.: Ray, Camera, Action ! A Technique for Collaborative 3D Manipulation. IEEE
Symposium on 3D User Interfaces, 277–278 (2016).
56. Lages, W., Nabiyouni, M., Tibau, J., Bowman, D. A.: Interval Player: Designing a virtual
musical instrument using in-air gestures in 2015 IEEE Symposium on 3D User Interfaces,
3DUI 2015 - Proceedings (2015), 203–204.
57. Le Groux, S., Manzolli, J., Verschure, P. F. J.: VR-RoBoser: Real-Time Adaptive Sonifica-
tion of Virtual Environments Based on Avatar Behavior in Proceedings of the International
Conference on New Interfaces for Musical Expression (NIME) (2007).
58. Lilija, K., Pohl, H., Hornbæk, K.: Manipulation Who Put That There? Temporal Naviga-
tion of Spatial Recordings by Direct Manipulation in CHI Conference on Human Factors in
Computing Systems (Association for Computing Machinery, 2020).
59. Lubos, P., Bruder, G., Ariza, O., Steinicke, F.:Touching the Sphere: Leveraging Joint-Centered
Kinespheres for Spatial User Interaction. Proceedings of the ACM Symposium on Spatial User
Interaction (SUI’16), 13–22 (2016).
60. Lugasi, M., Rafaely, B.: Speech Enhancement Using Masking for Binaural Reproduction of
Ambisonics Signals. IEEE/ACM Transactions on Audio, Speech, and Language Processing
28, 1767–1777 (2020).
61. Mäki-patola, T., Laitinen, J., Kanerva, A., Takala, T.: Experiments with virtual reality instru-
ments in Proceedings of the International Conference on New Interfaces for Musical Expres-
sion (NIME) (2005), 11–16.
62. Melchior, F., Pike, C., Brooks, M., Grace, S.: Sound Source Control in Spatial Audio Systems
in Audio Engineering Society Convention (Rome, Italy, 2013).
63. Men, L., Bryan-Kinns, N.: LeMo: Supporting Collaborative Music Making in Virtual Reality.
IEEE 4TH VR Workshop SIVE (2018).
6 Spatial Design Considerations for Interactive Audio in Virtual Reality 215
64. Men, L., Bryan-Kinns, N.: LeMo: Exploring virtual space for collaborative creativity in Cre-
ativity and Cognition (ACM, San Diego, USA, June 2019), 71–82.
65. Men, L., Bryan-Kinns, N., Bryce, L.: Designing spaces to support collaborative creativity in
shared virtual environments. PeerJ Computer Science 5, e229 (2019).
66. Moore, A. G., Howell, M. J., Stiles, A. W., Herrera, N. S., McMahan, R. P.: Wedge: A musical
interface for building and playing composition-appropriate immersive environments in 2015
IEEE Symposium on 3D User Interfaces (3DUI) (Mar. 2015), 205–206.
67. Mulder, A., Fels, S. S., Mase, K.: Mapping virtual object manipulation to sound variation.
IPSJ Sig Notes 97, 63–68 (1997).
68. Mulder, A., Fels, S. S., Mase, K.: Design of Virtual 3D Instruments for Musical Interaction.
Graphics Interface, 76–83 (1999).
69. Naef, M., Collicott, D.: A VR Interface for Collaborative 3D Audio Performance in Pro-
ceedings of the International Conference on New Interfaces for Musical Expression (NIME)
(2006).
70. O’Modhrain, S.: A Framework for the Evaluation of Digital Musical Instruments. en. Com-
puter Music Journal 35, 28–42 (Mar. 2011).
71. Palumbo, M., Zonta, A.,Wakefield, G.: Modular reality: Analogues of patching in immersive
space. Journal of New Music Research 49, 8–23 (2020).
72. Plowright, P. D.: Revealing Architectural Design: Methods, Frameworks and Tools (Rout-
ledge, 2014).
73. Poirier-Quinot, D., Katz, B.: Assessing the Impact of Head-Related Transfer Function Indi-
vidualization on Task Performance: Case of a Virtual Reality Shooter Game. en. Journal of
the Audio Engineering Society 68, 248–260 (May 2020).
74. Poupyrev, I., Billinghurst, M., Weghorst, S., Ichikawa, T.: The Go-Go Interaction Technique:
Non-Linear Mapping for Direct Manipulation in VR. Proc. UIST ’96 (ACM Symposium on
User Interface Software and Technology), 79–80 (1996).
75. River, J., MacTavish, T.: Research through Provocation: A Structured Prototyping Tool Using
Interaction Attributes of Time, Space and Information. The Design Journal 20, S3996–S4008
(July 2017).
76. Rogers, Y., Lim, Y.-k., Hazlewood,W. R., Marshall, P.: Equal Opportunities: Do Shareable
Interfaces Promote More Group Participation Than Single User Displays? Human-Computer
Interaction 24, 79–116 (2009).
77. Serafin, S., Erkut, C.,Kojs, J., Nilsson,N. C.,Nordahl, R.: VirtualReality Musical Instruments:
State of the Art, Design Principles, and Future Directions. Computer Music Journal 40, 22–40
(2016).
78. El-shimy, D., Cooperstock, J. R.: User-Driven Techniques for the Design and Evaluation of
New Musical Interfaces. Computer Music journal 39, 28–46 (2015).
79. Smith, H. J., Neff, M.: Communication Behavior in Embodied Virtual Reality in ACM Con-
ference on Human Factors in Computing Systems (CHI) (2018).
80. Snook, K. et al.: Concordia: A musical XR instrument for playing the solar system. Journal
of New Music Research 49, 88–103 (2020).
81. Stowell, D., Robertson, A., Bryan-Kinns, N., Plumbley, M. D.: Evaluation of live human-
computer music-making: Quantitative and qualitative approaches. International Journal of
Human-Computer Studies 67, 960–975 (2009).
82. Suchman, L., Trigg, R., Blomberg, J.: Working artefacts: ethnomethods of the prototype. The
British Journal of Sociology 53, 163–179 (2002).
83. Svanæs, D.: Interaction Design for and with the Lived Body : Some Implications of Merleau-
Ponty ’ s Phenomenology. ACM Transactions on Computer-Human Interaction 20, 1–30
(2013).
84. Thoring, K., Desmet, P., Badke-Schaub, P.: Creative Environments for Design Education and
Practice: A Typology of Creative Spaces. en. Design Studies 56, 54–83 (May 2018).
85. Trommer, M.: Points Further North: An acoustemological cartography of non-place. Journal
of New Music Research 49, 73–87 (2020).
216 T. Deacon and M. Barthet
86. Trumbo, J.: The Spatial Environment in Multimedia Design: Physical, Conceptual, Perceptual,
and Behavioral Aspects of Design Space. Design Issues 13, 19–28 (1997).
87. Tuan, Y.-F.: Space and Place: The Perspective of Experience. 4, 513 (1978).
88. Valbom, L., Marcos, A.: Wave: Sound and music in an immersive environment. Computers
& Graphics 29, 871–881 (2005).
89. Vanderdonckt, J.: Distributed user interfaces: how to distribute user interface elements across
users, platforms, and environments. Proc. of XI Interacción, 20–32 (2010).
90. Wakefield, G., Smith,W.: Cosm : a Toolkit for Composing Immersive Audio- Visual Worlds
of Agency and Autonomy in Proceedings of the International Computer Music Conference
(ICMC) (2011).
91. Wanderley, M. M., Orio, N.: Evaluation of Input Devices for Musical Expression : Borrowing
Tools from HCI. Computer Music Journal 26, 62–76 (2002).
92. Weinel, J. in Technology, Design and the Arts-Opportunities and Challenges 209–227
(Springer, Cham, 2020).
93. Won, A. S., Bailenson, J. N., Lanier, J. in Emerging Trends in the Social and Behavioral
Sciences 1–16 (2015).
94. Wong, N., Gutwin, C.: Where are you pointing? Proceedings of the 28th international con-
ference on Human factors in computing systems - CHI ’10, 1029 (2010).
95. Wong, N., Gutwin, C.: Support for Deictic Pointing in CVEs : Still Fragmented after All
These years? in Computer Supported Cooperative Work (2014), 1377–1387.
96. Xambó, A., Laney, R., Dobbyn, C., Jordá, S. P.: Multi-touch interaction principles for collab-
orative real-time music activities: towards a pattern language. Proc. of ICMC’11, 403–406
(2011).
97. Xambó, A. et al.: Exploring Social Interaction With a Tangible Music Interface. Interacting
with Computers 28 (2016).
98. Young, G., Murphy, D.: HCI Models for Digital Musical Instruments: Methodologies for
Rigorous Testing of Digital Musical Instruments. International Symposium on Computer
Music Multidisciplinary Research (CMMR) (2015).
99. Zhou, F., Dun, H. B. L., Billinghurst, M.: Trends in augmented reality tracking, interaction and
display: A review of ten years of ISMAR. Proceedings - 7th IEEE International Symposium
on Mixed and Augmented Reality 2008, ISMAR 2008, 193–202 (2008).
100. Zielasko, D. et al.: Cirque des Bouteilles : The Art of Blowing on Bottles in 2015 IEEE
Symposium on 3D User Interfaces, 3DUI 2015 - Proceedings (2015), 209–210.
Products and Grey Literature
101. Arrigo, A., Lemke, A.: Wave http://wavexr.com/. Austin, TX, USA, 2016.
102. Beat Saber - VR Rhythm Game Beat Games. 2019.
103. Block Rocking Beats http://blockrockingbeats.com/. 2016.
104. DearVR Spatial Connect https://www.dearvr.com/products/dearvrspatial- connect. Düssel-
dorf, Germany, 2018.
105. Designing For Virtual Reality en. https://www.ustwo.com/blog/designing-for-virtual-
reality/. 2015.
106. Drops https://drops.garden/. 2018.
107. Gravity Sketch https://www.gravitysketch.com/. Aug. 2017.
108. Kane, A.: Volta https://volta- xr.com/. 2021.
109. Kane, A.: Volta https://www.voltaaudio.com. London, UK, 2019.
110. Kinstner, Z.: EXA: The Infinite Instrument https://store.steampowered.com/app/606920/
EXA_The_Infinite_Instrument/. Grand Rapids, Michigan, USA, 2017.
111. Lee, J., Strangeloop: The Lune Rouge Experience The WaveVR. 2017.
6 Spatial Design Considerations for Interactive Audio in Virtual Reality 217
112. LyraVR http://lyravr.com/. 2018.
113. Mux https://store.steampowered.com/app/673970/MuX/ http:// playmux.com/. 2017.
114. Olson, L., Havok, R., Ozil, G., Fish, R.: Soundstage VR https://github.com/googlearchive/
soundstagevr. 2017.
115. Spatial Audio API https://www.highfidelity.com/ . 2021.
116. The Garden https://www.biomecollective.com/the-garden. Dundee, UK, 2019.
117. The Last Maestro https://www.maestrogames.com/. 2021.
118. The Music Room http://www.musicroomvr.com/. 2016.
119. Tranzient https://www.aliveintech.com. 2019.
120. Virtual Reality Best Practices en-US. https://docs.unrealengine.com/en-US/
SharingAndReleasing/XRDevelopment/VR/DevelopVR/ContentSetup/ index.html.
121. VR Best Practice https://learn.unity.com/tutorial/vr-bestpractice. 2017.
122. VR Design : Best Practices en-US. http://blog.dsky.co/2015/07/30/vr-design-best-practices/.
July 2015.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 7
Embodied and Sonic Interactions
in Virtual Environments: Tactics
and Exemplars
Sophus Béneé Olsen, Emil Rosenlund Høeg, and Cumhur Erkut
Abstract As the next generation of active video games (AVG) and virtual reality (VR) systems enter people's lives, designers may wrongly aim for an experience decoupled from bodies. However, both AVG and VR clearly afford opportunities to bring experiences, technologies, and users' physical and experiential bodies together, and to study and teach these open-ended relationships of enaction and meaning-making in the framework of embodied interaction. Without such a framework, aesthetic pleasure, lasting satisfaction, and enjoyment would be impossible to achieve in designing sonic interactions in virtual environments (SIVE). In this chapter, we introduce this framework and focus on design exemplars that come from a soma design ideation workshop and from balance rehabilitation. Within the field of physiotherapy, the development of new conceptual interventions with a more patient-centered approach is still scarce, but it has huge potential for overcoming some of the challenges facing health care. We indicate how tactics such as making space, subtle guidance, defamiliarization, and intimate correspondence have informed the exemplars, both in the workshop and in our ongoing physiotherapy case. Implications of these tactics and design strategies, for our own design as well as for general practitioners of SIVE, are outlined.
S. B. Olsen · E. R. Høeg · C. Erkut (B)
Multisensory Experience Lab, Aalborg University Copenhagen, Copenhagen, Denmark
e-mail: cer@create.aau.dk
S. B. Olsen
e-mail: sbol13@student.aau.dk
E. R. Høeg
e-mail: erh@create.aau.dk
© The Author(s) 2023
M. Geronazzo and S. Serafin (eds.), Sonic Interactions in Virtual Environments,
Human–Computer Interaction Series, https://doi.org/10.1007/978-3-031-04021-4_7
7.1 Introduction
I felt that there was an opportunity to create a new design discipline, dedicated to creating
imaginative and attractive solutions in a virtual world, where one could design behaviors,
animations, and sounds as well as shapes. This would be the equivalent of industrial design
but in software rather than three-dimensional objects. Like industrial design, the discipline
would start from the needs and desires of the people who use a product or service, and
strive to create designs that would give aesthetic pleasure as well as lasting satisfaction and
enjoyment [17].
Thus spoke the IDEO founder Bill Moggridge in his book Designing Interactions (2007), on inventing the term "interaction design". The field Sonic Interaction
Design was initially concerned with the aesthetic pleasure, lasting satisfac