Tools for the production of spatial audio within BINCI

André Kruh-Elendt¹, André Fiebig¹, Roland Sottek¹, and Julien De Muynke²
¹HEAD acoustics GmbH, 52134 Herzogenrath, Germany,
Email: {andre.kruh-elendt, andre.fiebig, roland.sottek}@head-acoustics.de
²Eurecat, Centre Tecnològic de Catalunya, Multimedia Technologies Group, 08005 Barcelona, Spain,
Email: julien.demuynke@eurecat.org
The research leading to these results has received funding from the European Union’s Horizon 2020
research and innovation programme under grant agreement No 732130 – BINCI project
Introduction
With the recent introduction of support for immersive audiovisual technologies by some of the major content sharing platforms, such as YouTube and Facebook, and the establishment of standards for the efficient encoding and transmission of spatial audio [1], interest in the composition and use of these audio formats has extended beyond the academic and research fields to include artists, sound engineers and professional content creators. Among the existing spatial audio formats, binaural technology has the advantage of being suited to the reproduction of immersive audio content across multiple platforms and devices, requiring only a pair of headphones. Nonetheless, support for spatial audio reproduction using loudspeaker arrays and interoperability between formats are still expected.
With this in mind, the need for an integrated and flexible solution for the production of spatial audio becomes clear. The development of such a set of tools, and their incorporation into established audio production workflows, are the main goals of the BINCI project. In this paper, an overview of the tools being developed is given. Furthermore, an HRTF selection process and the use of room impulse responses recorded with spherical microphone arrays are highlighted as relevant techniques adopted in the project.
Spatial audio production tools
Within BINCI, two main software modules are being developed, each one targeting specific user groups and tasks in the production and delivery of spatial audio content. A brief overview of these tools is given below, while a more detailed description can be found in [2].
The Binaural Home Studio (BHS) is a suite of production and post-production tools targeting sound engineers and creators involved in the process of designing, editing and mixing audio with Digital Audio Workstations (DAWs) in typical studio environments.
The BHS is composed of four main components: the Audio Processing Server, an ambisonics-based processing unit performing encoding, manipulation and rendering operations on the audio input; a set of DAW Plug-ins, which provide a simple graphical interface to control the processing performed in the server; the Virtual Sound Card, serving as a multichannel audio interface between the DAW and the audio processing server; and a Visualizer, which gives graphical feedback to the end-user about the current spatial configuration of the sound scene and can be used to synchronize the produced audio with 360° video for audiovisual applications.
The use of a modular structure, with plugins controlling the background processing server, allows a seamless integration of the spatial encoding and rendering engines into the workflows typically used for the production of stereo and multichannel content. Sound source panning and clustering, and the application of modulation and audio effects, are some of the sound manipulations known from typical audio production environments which have been extended to spatial audio in the set of BINCI tools.
Fig. 1: Example of DAW plugins developed as part of BINCI
The Binaural Player (BP) is a cross-platform dynamic binaural rendering module designed for the reproduction of ambisonics sound scenes. The BP is designed primarily for playback of content created with the BHS; nonetheless, B-Format audio files and streams are also accepted as input. This ensures compatibility with applications and platforms which support spatial audio formats.
Fig. 2: Structure and main components of the Binaural Home Studio
By using ambisonics as an intermediate format, both in the BHS and the BP, sound scene manipulations, such as rotations, can be performed efficiently. Furthermore, rendering audio to a three-dimensional loudspeaker array is possible, extending the flexibility for creators and end-users to monitor and listen to the created content.
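To illustrate why rotations are inexpensive in the ambisonics domain, the following minimal sketch rotates a first-order B-Format frame about the vertical axis. It is not taken from the BINCI tools: the function names, the channel order (W, X, Y, Z), the axis convention (X to the front, Y to the left, Z upwards) and the 1/√2 weighting of W are assumptions made for the example.

```python
import math


def encode_first_order(sample, azimuth_deg):
    """Encode a mono sample located in the horizontal plane as (W, X, Y, Z).

    Assumed convention: X front, Y left, Z up, W weighted by 1/sqrt(2).
    """
    az = math.radians(azimuth_deg)
    return (sample / math.sqrt(2.0),
            sample * math.cos(az),
            sample * math.sin(az),
            0.0)


def rotate_yaw(frame, angle_deg):
    """Rotate a first-order frame counter-clockwise (seen from above).

    Only X and Y are mixed; W and Z are unaffected by a yaw rotation.
    """
    w, x, y, z = frame
    a = math.radians(angle_deg)
    return (w,
            x * math.cos(a) - y * math.sin(a),
            x * math.sin(a) + y * math.cos(a),
            z)


def compensate_head_yaw(frame, head_yaw_deg):
    """Counter-rotate the scene so that sources stay fixed in the room."""
    return rotate_yaw(frame, -head_yaw_deg)
```

Under these assumptions, a source encoded at 30° and rotated by 60° yields the same frame as a source encoded at 90°. The cost per sample is a single 2×2 matrix product, independent of the number of sources in the scene.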
All tools developed within BINCI use dynamic binaural rendering (i.e. head-tracking) by default and apply headphone equalization whenever a matching filter for the headphone model in use is available. By combining these techniques with the use of high-resolution HRTFs, an HRTF selection process and the application of acoustic room information, localization errors are reduced and the realism of binaural scenes is improved.
HRTF selection and individualization
The BHS and BP modules perform binaural rendering using a database of high spatial resolution HRTFs (2.5° in both azimuth and elevation) measured for 25 adults. Head and torso dimensions extracted according to [3] for 79 individuals were used in the selection of the 25 measured subjects. The selection process was based on an iterative clustering intended to maximize the variance of the extracted geometrical dimensions among the selected adults. The extracted anthropometric features of the selected subjects accompany each of the corresponding HRTF sets. In addition to the measured HRTFs, the potential for modelling HRTFs based on analytical models as described in [4] is currently being evaluated as a solution for obtaining HRTFs for children.
To provide a more realistic or plausible binaural experience for the listener, a fast HRTF selection and individualization procedure is included as part of the tools developed in BINCI. The individualization procedure consists of a pre-selection step, where a reduced number of HRTF sets is presented to the end-user based on the input of easily measurable head dimensions, and an interaural time difference (ITD) adjustment step based on the method described in [5], where the ITDs of the pre-selected HRTFs can be adjusted until a stable sound source is perceived at a given location. Finally, the selected and adjusted HRTF set is exported for use in the binaural rendering process.
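The two steps can be sketched as follows. This is only an illustration under stated assumptions, not the procedure implemented in BINCI: the feature names, the Euclidean distance used for ranking, the example values and both function names are hypothetical, and the ITD adjustment is reduced to a single multiplicative scaling factor.

```python
import math


def preselect_hrtf_sets(database, listener, count=3):
    """Return the IDs of the `count` HRTF sets closest to the listener.

    `database` maps a subject ID to a dict of head dimensions; `listener`
    holds the same keys. Euclidean distance is an assumption of this sketch.
    """
    def distance(features):
        return math.sqrt(sum((features[key] - listener[key]) ** 2
                             for key in listener))

    ranked = sorted(database, key=lambda subject: distance(database[subject]))
    return ranked[:count]


def scale_itds(itds_us, factor):
    """Scale a list of ITD values (in microseconds) by a common factor."""
    return [itd * factor for itd in itds_us]


# Hypothetical head dimensions in centimetres (not measured data).
example_db = {
    "S01": {"head_width": 14.0, "head_depth": 18.0},
    "S02": {"head_width": 15.5, "head_depth": 19.5},
    "S03": {"head_width": 16.5, "head_depth": 21.0},
    "S04": {"head_width": 15.0, "head_depth": 19.0},
}
example_listener = {"head_width": 15.2, "head_depth": 19.2}
candidates = preselect_hrtf_sets(example_db, example_listener, count=2)
```

In an interactive setting, the scaling factor would be changed by the listener while a head-tracked source is played back, and the factor at which the source is perceived as stable would be kept.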
Fig. 3: Extraction of anthropometric parameters for one of the 79 subjects
The described ITD scaling and adjustment procedure can be applied to ITD values extracted from the HRIRs using either the threshold method or the minimum-phase cross-correlation method [6]. Alternatively, ITD values can be modelled for any azimuth and elevation angle using the provided head dimensions, thereby eliminating discontinuities and exaggerated values for certain positions. Further investigations in the project will provide information about the perceptual differences and possible advantages of either of the two ITD computation approaches, singling out one approach for the final individualization procedure.
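Both ways of obtaining ITD values can be sketched in a few lines. The example below is a simplified illustration, not the project's implementation: the onset threshold of -20 dB relative to the peak, the sign convention and the function names are assumptions, and the classical Woodworth spherical-head formula stands in for the head-dimension-based model, which is not specified here.

```python
import math


def onset_index(hrir, threshold_db=-20.0):
    """Index of the first sample whose magnitude reaches the threshold.

    The threshold is relative to the absolute peak of the response.
    """
    peak = max(abs(value) for value in hrir)
    limit = peak * 10.0 ** (threshold_db / 20.0)
    return next(i for i, value in enumerate(hrir) if abs(value) >= limit)


def itd_threshold(hrir_left, hrir_right, sample_rate_hz):
    """ITD in seconds; positive when the sound reaches the left ear first."""
    delay_samples = onset_index(hrir_right) - onset_index(hrir_left)
    return delay_samples / sample_rate_hz


def itd_woodworth(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Woodworth spherical-head ITD in seconds for |azimuth| <= 90 degrees.

    Horizontal plane only; the default head radius is an assumed value.
    """
    theta = math.radians(azimuth_deg)
    return head_radius_m / speed_of_sound * (theta + math.sin(theta))
```

A modelled curve of this kind is smooth by construction, whereas onset detection on measured HRIRs can jump between neighbouring positions when the contralateral response contains several peaks of similar height.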
Fig. 4: Comparison of modelled and extracted ITD values for one subject and all measured azimuth positions in the horizontal plane
Spatial Room Impulse Responses
One important feature of the spatial audio production tools developed in BINCI is the capability to simulate different environments (i.e. rooms). This is achieved by convolving the audio input from the DAW with a filter describing the acoustic properties of the selected environment for different source-receiver configurations, a so-called room impulse response (RIR). To increase compatibility with the spatial (ambisonics) format in use, sets of RIRs have been recorded using spherical (ambisonics) microphone arrays as so-called Spatial Room Impulse Responses (SRRs). Some examples of ambisonics RIRs have already been published in [7].
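The convolution step can be illustrated with a short sketch in which a dry mono signal is convolved with each channel of a multichannel impulse response, giving one reverberated ambisonics channel per component. The function names and the toy coefficients are hypothetical, and a direct time-domain convolution is used for readability only; a real-time renderer would more likely rely on partitioned FFT convolution.

```python
def convolve(signal, impulse_response):
    """Direct (time-domain) linear convolution of two sample lists."""
    if not signal or not impulse_response:
        return []
    output = [0.0] * (len(signal) + len(impulse_response) - 1)
    for n, sample in enumerate(signal):
        for k, coefficient in enumerate(impulse_response):
            output[n + k] += sample * coefficient
    return output


def apply_spatial_rir(dry_signal, srir_channels):
    """Convolve a mono signal with every channel of a spatial RIR."""
    return [convolve(dry_signal, channel) for channel in srir_channels]


# Toy two-channel response with made-up coefficients, for illustration only.
wet = apply_spatial_rir([1.0, 0.5], [[1.0, 0.0, 0.25], [0.0, 0.5, 0.0]])
```

Because the result is itself an ambisonics signal, it can be rotated and decoded in the same way as any other sound scene in the processing chain.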
The filters are thereby separated into directional components, each of which contains an RIR corresponding to a specific direction in space. The number of components depends on the ambisonics order of the SRRs, which in turn depends on the model of ambisonics microphone used for the measurements: the Soundfield DSF-2 MKII and Sennheiser Ambeo microphones are of order 1, the Zylia ZM-1 is of order 3 and the MH Acoustics Eigenmike is of order 4.
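For full-sphere ambisonics of order N, the number of components is (N + 1)². The short sketch below applies this relation to the orders listed above; the function name and the dictionary are introduced for illustration only.

```python
def ambisonic_channel_count(order):
    """Number of components of a full-sphere ambisonics signal of this order."""
    if order < 0:
        raise ValueError("the ambisonics order must be non-negative")
    return (order + 1) ** 2


# Orders as stated in the text above.
microphone_orders = {
    "Soundfield DSF-2 MKII": 1,
    "Sennheiser Ambeo": 1,
    "Zylia ZM-1": 3,
    "MH Acoustics Eigenmike": 4,
}
components = {name: ambisonic_channel_count(order)
              for name, order in microphone_orders.items()}
```

An order-1 SRR set thus holds 4 directional components per source-receiver configuration, an order-3 set 16 and an order-4 set 25.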
The main advantage of using ambisonics RIRs (SRRs) over omnidirectional RIRs is that the acoustic properties of the room are reproduced in accordance with the listener’s head orientation with respect to the room. As an example, a very asymmetric room such as a long and narrow corridor has very different echo patterns along the X and the Y axis. Depending on whether the user is looking along the X or the Y axis, the series of echoes coming from the sides (90° and 270°) and from the front and the back (0° and 180°) should differ. The head-tracking-based dynamic binaural synthesis achieved by the BINCI tools allows for a continuous change of orientation of the RIR as the listener’s head rotates.
It should be noted that a convenient way to store, exchange and use SRR data is proposed in [8] as a new convention of the SOFA file format [9].
Summary
An overview of the current development status of the spatial audio production tools developed in BINCI has been given. Some of the most relevant technologies employed in the project and the advantages provided to members of the creative industry have been highlighted. Further improvements and integration of the software solutions developed in BINCI are the next steps in the project. Furthermore, the current version of the production tools is being used by selected content creators in order to evaluate its usability in creative environments and to demonstrate its capabilities to end-users. To this end, the Fundació Juan Miró in Barcelona, the Alte Pinakothek in Munich and the St. Andrews Castle in St. Andrews have agreed to serve as demonstration sites for BINCI, illustrating the potential benefit of binaural content for museums and providing an option for usability tests at a large scale.
Acknowledgements
The authors would like to thank all partners involved in the BINCI project for the productive and cooperative work. Additionally, we thank Andreas Herweg, Frederic Allion and Matthias Reffgen for their support in the development of software prototypes, the optimization of computation models and the execution of measurements, respectively.
The work presented in this document is part of BINCI, a
Horizon 2020 innovation project funded by the European
Union under the grant agreement No. 732130.
References
[1] J. Herre, J. Hilpert, A. Kuntz, and J. Plogsties, “MPEG-H 3D Audio – The New Standard for Coding of Immersive Spatial Audio,” IEEE Journal of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 770–779, 2015.
[2] A. Garriga, M. E. Fuenmayor, and M. Caballero, “Binaural tools for 3D audio production at home,” NEM Summit, pp. 1–3, 2017.
[3] V. R. Algazi, R. O. Duda, and D. M. Thompson, “The CIPIC HRTF Database,” Signal Processing, no. October, pp. 99–102, 2001.
[4] R. Sottek and K. Genuit, “Physical modeling of individual HRTFs (head related transfer functions),” DAGA 1999, 1999.
[5] A. Lindau, J. Estrella, and S. Weinzierl, “Individualization of dynamic binaural synthesis by real time manipulation of the ITD,” Proc. of the 128th AES Convention, no. January 2010, 2010.
[6] B. F. G. Katz and M. Noisternig, “A comparative study of interaural time delay estimation methods,” J. Acoust. Soc. Am., vol. 135, no. 6, 2014.
[7] “The openair database,” http://www.openairlib.net.
[8] A. Pérez and J. De Muynke, “Ambisonics Directional Room Impulse Response as a new Convention of the Spatially Oriented Format for Acoustics,” Engineering Brief, AES Convention 2018, to be published.
[9] “SOFA (spatially oriented format for acoustics),” https://www.sofaconventions.org.