Article

Object-Based Spatial Audio: Concept, Advantages, and Challenges


Abstract

One of the primary objectives of modern audiovisual media creation and reproduction techniques is realistic perception of the delivered content by the consumer. Spatial audio techniques in general attempt to deliver the impression of an auditory scene in which the listener can perceive the spatial distribution of the sound sources as if they were present in the actual scene. Advances in spatial audio capturing and rendering techniques have led to a new concept of audio delivery which not only aims to present the listener with a realistic auditory scene just as captured, but also gives the producer and/or the listener more control over the delivered scene. This is made possible by the ability to control the attributes of the individual sound objects that appear in the delivered scene. In this section, this so-called object-based approach is introduced, together with the key features that distinguish it from conventional spatial audio production and delivery techniques. The related applications and technologies are also introduced, followed by the limitations and challenges that this approach faces. © 2014 Springer Science+Business Media New York. All rights reserved.


... In an object-based broadcasting chain, however, there is the possibility for content to be adapted at the receiver end so as to account for such factors. This is achieved by representing audio content as separate objects with corresponding metadata and then rendering these objects at the receiver end [11] (see Section 1.2). With object-based audio, it may therefore be possible to adapt an audio mix at the point of consumption so as to improve the listening experience when in noisy environments. ...
... Object-based audio is a method of representing audio content as separate elements (or 'objects') with corresponding temporal, positional and other/semantic metadata, which are then rendered at the receiver end. This paradigm is linked to advances in spatial audio reproduction [8,11,14] as, unlike traditional channel-based methods, the reconstruction of a virtual sound scene can be optimised to a given reproduction setup or listening environment [23]. Along with the advantages it brings to spatial audio reproduction, object-based audio offers possibilities for greater personalisation, interaction and adaptation of content [1,18]. ...
Article
Full-text available
Mobile devices enable users to consume media with audio content in a wide range of contexts, with environmental noise being present in many of these. Several methods exist that aim to improve the experience of mobile listening by utilising information about the environmental noise, such as volume and dynamic range adaptation. This paper explores a fundamentally different approach to improving the mobile listening experience by using the object-based audio paradigm, where individual audio sources are mixed in response to each specific listening context. Three experimental studies, containing both quantitative and qualitative aspects, are presented which investigate whether environmental noise influences preference of background-foreground audio object balance in a mix. The results indicate that environmental noise can influence the preferred audio mix and that the nature of the adaptations made is dependent upon both audio content and user. Additionally, qualitative analysis provides an understanding of the role of environmental noise on preferred audio mix. It is believed that the content adaptation method explored in this paper is a simple yet useful tool for adapting content to suit both the context and the user.
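The adaptation idea described above can be pictured with a small sketch. The following Python snippet is a hypothetical, minimal illustration (not taken from the cited work) of mixing foreground and background objects with a balance that depends on a measured environmental noise level; the object structure, the noise measure and the gain rule are all assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical audio object: samples plus simple metadata.
class AudioObject:
    def __init__(self, samples, role, gain_db=0.0):
        self.samples = samples      # mono PCM, float in [-1, 1]
        self.role = role            # e.g. "foreground" (speech) or "background"
        self.gain_db = gain_db      # authored gain

def render_mix(objects, noise_level_db):
    """Sum objects, attenuating background objects as ambient noise rises.

    The 0.5 dB-per-dB rule and the 60 dB reference are illustrative
    assumptions, not values from the cited studies.
    """
    mix = np.zeros_like(objects[0].samples)
    for obj in objects:
        gain_db = obj.gain_db
        if obj.role == "background":
            gain_db -= 0.5 * max(0.0, noise_level_db - 60.0)
        mix += obj.samples * (10.0 ** (gain_db / 20.0))
    return np.clip(mix, -1.0, 1.0)

# Example: keep speech prominent over ambience in a 72 dB environment.
fs = 48_000
speech = AudioObject(0.1 * np.random.randn(fs), "foreground")
ambience = AudioObject(0.1 * np.random.randn(fs), "background")
output = render_mix([speech, ambience], noise_level_db=72.0)
```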
... Depending on the application, users on the rendering side have permission to control the audio objects. That is, the users can adjust the scene description before the final rendering [10]. Thus, the object distribution or characteristics in the auditory scene can be changed based on user interaction. ...
Article
Full-text available
Object-based audio techniques provide more flexibility and convenience for personalized rendering under various playback configurations. Many methods have been proposed to encode and transmit multiple audio objects at a low bit-rate. However, the recovered audio objects have severe frequency aliasing distortion, which degrades the immersive sound quality. This paper describes a new structure to reduce every object's aliasing distortion. In this method, we extract residual and gain parameters of all objects after an N-step operation and use singular value decomposition to compress the residual matrices. The residual matrices can compensate for aliasing distortion at the decoding stage. Moreover, we find a proper ordering strategy experimentally to determine the object coding order, because it affects the final decoded quality. From the experimental results, the energy sorting strategy is chosen as the best ordering strategy, and the residual information bit-rate can be reduced from 14.11 kbps per object to 5.87 kbps per object. Compared with previous studies, our method achieves better performance in objective and subjective experiments. The proposed N-step residual compensating structure reduces every object's aliasing distortion better than the state-of-the-art methods.
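As general background to the residual-compression idea named in this abstract, the sketch below shows truncated SVD applied to a residual matrix and its reconstruction. The matrix shapes and the rank choice are assumptions for illustration; this is not the paper's N-step structure.

```python
import numpy as np

def compress_residual(residual, rank):
    """Truncated SVD of a residual matrix (e.g. frequency bins x time frames).

    Keeping only `rank` singular components gives a compact representation
    that a decoder could use to compensate reconstruction error.
    Illustrative only.
    """
    U, s, Vt = np.linalg.svd(residual, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank, :]

def decompress_residual(U, s, Vt):
    return (U * s) @ Vt

rng = np.random.default_rng(0)
residual = rng.standard_normal((1024, 64))      # hypothetical dimensions
U, s, Vt = compress_residual(residual, rank=8)
approx = decompress_residual(U, s, Vt)
err = np.linalg.norm(residual - approx) / np.linalg.norm(residual)
print(f"relative reconstruction error: {err:.3f}")
```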
... Another approach is object-based audio. This is a method of representing sound as separate elements (or "objects") with corresponding temporal, positional and other/semantic metadata, so that the objects can be rendered with a large degree of flexibility at the user's end (Herre et al., 2015;Kim, 2014;Melchior, Churnside, and Spors, 2012). For example, instead of mixing a source to a certain loudspeaker channel, the source object is transmitted with positional metadata which the renderer on the user's end can use to reproduce the intended source position. ...
Thesis
Full-text available
The next generation of audio reproduction technology has the potential to deliver immersive and personalised experiences to the user; multichannel with-height loudspeaker arrays and binaural techniques offer 3D audio experiences, whereas object-based techniques offer possibilities of adapting content to suit the system, context and user. A fundamental process in the advancement of such technology is perceptual evaluation. It is crucial to understand how listeners perceive new technology in order to drive future developments. This thesis explores the experience provided by next generation audio technology by taking a quality of experience (QoE) approach to evaluation. System, context and human factors all influence QoE, and in this thesis three case studies are presented to explore the role of these categories of influence factors (IFs) in the context of next generation audio evaluation. Furthermore, these case studies explore suitable methods and approaches for the evaluation of the QoE of next generation audio with respect to its various IFs. Specific contributions delivered from these individual studies include a subjective comparison between soundbar and discrete surround sound technology, the application of the Open Profiling of Quality method to the field of audio evaluation, an understanding of both how and why environmental noise influences preferred audio object balance, an understanding of how the influence of technical audio quality on overall listening experience is related to a range of psychographic variables, and an assessment of the impact of binaural processing on overall listening experience. When considering these studies as a whole, the research presented here contributes the thesis that to effectively evaluate the perceived quality of next generation audio, a QoE mindset should be taken that considers system, context and human IFs.
Article
Full-text available
This paper presents a series of experiments to determine a categorization framework for broadcast audio objects. Object-based audio is becoming an ever more important paradigm for the representation of complex sound scenes. However, there is a lack of knowledge regarding object-level perception and cognitive processing of complex broadcast audio scenes. As categorization is a fundamental strategy in reducing cognitive load, knowledge of the categories utilized by listeners in the perception of complex scenes will be beneficial to the development of perceptually based representations and rendering strategies for object-based audio. In this study, expert and non-expert listeners took part in a free card sorting task using audio objects from a variety of different types of program material. Hierarchical agglomerative clustering suggests that there are seven general categories, which relate to sounds indicating actions and movement, continuous and transient background sound, clear speech, non-diegetic music and effects, sounds indicating the presence of people, and prominent attention-grabbing transient sounds. A three-dimensional perceptual space calculated via multidimensional scaling suggests that these categories vary along dimensions related to the semantic content of the objects, the temporal extent of the objects, and whether the object indicates the presence of people.
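To give a sense of the analysis pipeline named in this abstract (hierarchical agglomerative clustering followed by multidimensional scaling), here is a generic Python sketch applied to a synthetic card-sort dissimilarity matrix. The data, number of objects, linkage method and cluster count are all invented for illustration and do not reproduce the study.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.manifold import MDS

# Hypothetical dissimilarity matrix: fraction of participants who did NOT
# sort each pair of audio objects into the same pile (20 objects).
rng = np.random.default_rng(1)
n = 20
d = rng.uniform(0.1, 1.0, size=(n, n))
dissim = (d + d.T) / 2.0
np.fill_diagonal(dissim, 0.0)

# Hierarchical agglomerative clustering on the condensed distance vector.
Z = linkage(squareform(dissim), method="average")
labels = fcluster(Z, t=7, criterion="maxclust")   # e.g. seven categories

# Three-dimensional perceptual space via metric MDS.
coords = MDS(n_components=3, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)
print(labels)
print(coords.shape)   # (20, 3)
```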
Book
Full-text available
The field of spatial hearing has exploded in the decade or so since Jens Blauert's classic work on acoustics was first published in English. This revised edition adds a new chapter that describes developments in such areas as auditory virtual reality (an important field of application that is based mainly on the physics of spatial hearing), binaural technology (modeling speech enhancement by binaural hearing), and spatial sound-field mapping. The chapter also includes recent research on the precedence effect that provides clear experimental evidence that cognition plays a significant role in spatial hearing. The remaining four chapters in this comprehensive reference cover auditory research procedures and psychometric methods, spatial hearing with one sound source, spatial hearing with multiple sound sources and in enclosed spaces, and progress and trends from 1972 (the first German edition) to 1983 (the first English edition)—work that includes research on the physics of the external ear, and the application of signal processing theory to modeling the spatial hearing process. There is an extensive bibliography of more than 900 items.
Conference Paper
Full-text available
SpatDIF, the Spatial Sound Description Interchange Format, is an ongoing collaborative effort offering a semantic and syntactic specification for storing and transmitting spatial audio scene descriptions. The SpatDIF core is a lightweight minimal solution providing the most essential set of descriptors for spatial sound scenes. Additional descriptors are introduced as extensions, expanding the namespace and scope with respect to authoring, scene description, rendering and reproduction of spatial audio. A general overview of the specification is provided, and two use cases are discussed, exemplifying SpatDIF's potential for file-based pieces as well as real-time streaming of spatial audio information.
Article
Full-text available
The introduction of new techniques for audio reproduction such as HRTF-based technology, wave field synthesis and higher-order Ambisonics is accompanied by a paradigm shift from channel-based to object-based transmission and storage of spatial audio. Not only is the separate coding of source signal and source location more efficient considering the number of channels used for reproduction by large loudspeaker arrays, it also opens up new options for a user-controlled interactive sound field design. This article describes the need for a common exchange format for object-based audio scenes, reviews some existing formats with potential to meet some of the requirements and finally introduces a new format called Audio Scene Description Format (ASDF) and presents the SoundScape Renderer, an audio reproduction software which implements a draft version of the ASDF.
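As a purely hypothetical illustration of the kind of information an object-based scene description carries (this is not the ASDF or SpatDIF syntax, and every field name below is an assumption), a minimal serialized scene might look like the following Python sketch.

```python
import json

# Hypothetical object-based scene description: each source keeps its own
# signal reference plus positional and temporal metadata, leaving the
# rendering to the reproduction system. Field names are illustrative only.
scene = {
    "name": "example_scene",
    "sources": [
        {"id": "dialogue", "file": "dialogue.wav",
         "position": {"x": 0.0, "y": 2.0, "z": 0.0},
         "gain_db": 0.0, "start_s": 0.0},
        {"id": "ambience", "file": "ambience.wav",
         "position": {"x": -3.0, "y": 1.0, "z": 0.0},
         "gain_db": -6.0, "start_s": 0.0},
    ],
}

with open("scene.json", "w") as f:
    json.dump(scene, f, indent=2)
```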
Article
Full-text available
The acoustics in auditoria are determined by the properties of both the direct sound and the later arriving reflections. If electroacoustic means are used to repair disturbing deficiencies in the acoustics, one has to cope with unfavorable side effects such as localization problems and artificial impressions of the reverberant field (electronic flavor). To avoid those side effects, the concept of electroacoustic wave front synthesis is introduced. The underlying theory is based on the Kirchhoff-Helmholtz integral. In this new concept the wave fields of the sound sources on stage are measured by directive microphones; next they are electronically extrapolated away from the stage, and finally they are re-emitted in the hall by one or more loudspeaker arrays. The proposed system aims at emitting wave fronts that are as close as possible to the real wave fields. Theoretically, there need not be any differences between the electronically generated wave fields and the real wave fields. By using the image source concept, reflections can be generated in the same way as direct sound.
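For reference, the Kirchhoff-Helmholtz integral underlying this approach can be written in one common frequency-domain form (notation chosen here for illustration; sign conventions for the boundary normal vary between texts):

```latex
P(\mathbf{x},\omega) = \oint_{\partial V} \left[
  G(\mathbf{x}\,|\,\mathbf{x}_0,\omega)\,
  \frac{\partial P(\mathbf{x}_0,\omega)}{\partial n}
  - P(\mathbf{x}_0,\omega)\,
  \frac{\partial G(\mathbf{x}\,|\,\mathbf{x}_0,\omega)}{\partial n}
\right] \mathrm{d}S(\mathbf{x}_0),
\qquad
G(\mathbf{x}\,|\,\mathbf{x}_0,\omega)
  = \frac{e^{-\mathrm{j}\frac{\omega}{c}\lvert\mathbf{x}-\mathbf{x}_0\rvert}}
         {4\pi\,\lvert\mathbf{x}-\mathbf{x}_0\rvert}
```

In words: the sound pressure at any point inside a source-free volume V is fully determined by the pressure and its normal gradient on the boundary of V, which is why a loudspeaker array along (part of) that boundary can, in principle, re-create the interior wave field.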
Article
Full-text available
This paper gives HRTF magnitude data in numerical form for 43 frequencies between 0.2 and 12 kHz, the average of 12 studies representing 100 different subjects. However, no phase data is included in the tables; group delay simulation would need to be included in order to account for ITD. In 3-D sound applications intended for many users, we might want to use HRTFs that represent the common features of a number of individuals. But another approach might be to use the features of a person who has desirable HRTFs, based on some criteria. (One can sense a future 3-D sound system where the pinnae of various famous musicians are simulated.) A set of HRTFs from a good localizer (discussed in Chapter 2) could be used if the criterion were localization performance. If the localization ability of the person is relatively accurate or more accurate than average, it might be reasonable to use these HRTF measurements for other individuals. The Convolvotron 3-D audio system (Wenzel, Wightman, and Foster, 1988) has used such sets, particularly because elevation accuracy is affected negatively when listening through a bad localizer's ears (see Wenzel et al., 1988). It is best when any single nonindividualized HRTF set is psychoacoustically validated using a statistical sample of the intended user population, as shown in Chapter 2. Otherwise, the use of one HRTF set over another is a purely subjective judgment based on criteria other than localization performance. The technique used by Wightman and Kistler (1989a) exemplifies a laboratory-based HRTF measurement procedure where accuracy and replicability of results were deemed crucial. A comparison of their techniques with those described in Blauert (1983), Shaw (1974), Mehrgardt and Mellert (1977), Middlebrooks, Makous, and Gree...
Article
A variety of trade-offs seem to be possible in HRTF processing for binaural audio. With careful equalization the timbral quality of sounds can be improved without totally destroying the accuracy of spatial imaging, for example. While using individually measured HRTFs may be an ideal scenario, there is some evidence that good spatial quality can be obtained using selected HRTFs that are different from one's own. This could reduce the complexity of system implementation while still delivering acceptable results to the majority of listeners.
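The basic operation behind such HRTF processing is a convolution of the source signal with a left/right pair of head-related impulse responses. The sketch below is a minimal, assumption-laden Python illustration; the dummy impulse responses stand in for measured HRIRs and carry no real spatial information.

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_render(mono, hrir_left, hrir_right):
    """Render a mono source at one fixed direction by convolving it with a
    left/right HRIR pair. The HRIRs here are placeholders; in practice they
    come from a measured (individual or generic) HRTF set.
    """
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)

fs = 48_000
mono = 0.1 * np.random.randn(fs)          # one second of noise as a test source
hrir_l = np.zeros(256); hrir_l[0] = 1.0   # dummy impulse responses, not real HRIRs
hrir_r = np.zeros(256); hrir_r[8] = 0.7   # crude delay/level difference for illustration
stereo = binaural_render(mono, hrir_l, hrir_r)
```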
Article
There are a number of different cues the ear-brain combination uses to determine the position of a sound source. Although there may be other, more subtle mechanisms, those we will be most concerned with as recording engineers are: 1. The time of arrival of the wave front of a sound event at the ears, or more specifically, the difference in arrival times at the two ears. A sound source anywhere on a line from due front, through due above, to due back (the median plane) will have its wave front arrive at the two ears simultaneously. Move the source away from this line and one ear will begin to receive the wave front before the other. This is known as the Interaural Time Delay or ITD. This effect is only usable up to a frequency where the wavelength of the sound approaches twice the distance between the ears; above that, it provides only ambiguous cues. 2. Sound from a source to the left of the head, for example, will arrive directly at the left ear, but will have to travel around the head (it is diffracted around it) and over a greater distance to reach the right ear. It will thus be quieter at the right ear than the left, both as a result of the screening effect of the head and, to a lesser extent, the extra distance travelled. 3. The shape of the head and the external part of the ears results in a frequency-dependent response which varies with sound position. This is known as the Head Related Transfer Function or HRTF. For positions where ILDs or ITDs give ambiguous or nonexistent differences between the ear signals (such as median-plane positions), this is the main positional sensing mechanism. A sound source not placed symmetrically with respect to the two ears will further result in a different response at each ear.
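As a concrete illustration of the first cue, a classic rigid-sphere approximation (Woodworth's formula) gives the ITD for a source at azimuth theta, with head radius a and speed of sound c:

```latex
\mathrm{ITD}(\theta) \approx \frac{a}{c}\,(\theta + \sin\theta),
\qquad 0 \le \theta \le \frac{\pi}{2}
```

With a of roughly 0.0875 m and c of roughly 343 m/s, this predicts a maximum ITD of about 0.65 ms for a source directly to one side of the head; the exact value for a given listener depends on head size and is only approximated by the spherical model.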
Article
In 2007 the ISO/MPEG Audio standardization group started a new work item on efficient coding of sound scenes comprising a multitude of simultaneous audio objects by parametric coding techniques. Finalized in the summer of 2010, the resulting MPEG 'Spatial Audio Object Coding' (SAOC) specification allows the representation of such scenes at bit rates commonly used for coding of mono or stereo sound. At the decoder side, each object can be interactively rendered, supporting applications like user-controlled music remixing and spatial teleconferencing. This paper summarizes the results of the standardization process, provides an overview of MPEG SAOC technology, and illustrates its performance by means of the results of the recent verification tests. The test includes operation modes for several typical application scenarios that take advantage of object-based processing.
Article
ISO/IEC 13818-7:2006 specifies MPEG-2 Advanced Audio Coding (AAC), a multi-channel audio coding standard that delivers higher quality than is achievable when requiring MPEG-1 backwards compatibility. It provides ITU-R "indistinguishable" quality at a data rate of 320 kbit/s for five full-bandwidth channel audio signals. ISO/IEC 13818-7:2006 also supplements information on how to utilize the bandwidth extension technology (SBR) specified in ISO/IEC 14496-3 in conjunction with MPEG-2 AAC.
Article
We present an overview of the AudioBIFS system, part of the Binary Format for Scene Description (BIFS) tool in the MPEG-4 International Standard. AudioBIFS is the tool that integrates the synthetic and natural sound coding functions in MPEG-4. It allows the flexible construction of soundtracks and sound scenes using compressed sound, sound synthesis, streaming audio, interactive and terminal-dependent presentation, three-dimensional (3-D) spatialization, environmental auralization, and dynamic download of custom signal-processing effects algorithms. MPEG-4 sound scenes are based on a model that is a superset of the model in VRML 2.0, and we describe how MPEG-4 is built upon VRML and the new capabilities provided by MPEG-4. We discuss the use of the structured audio orchestra language, the MPEG-4 SAOL, for writing downloadable effects, present an example sound scene built with AudioBIFS, and describe the current state of implementations of the standard.
SpatDIF: principles, specification, and examples. In: The 9th sound and music computing conference
  • N Peters
  • T Lossius
  • JC Schacher
Information technology - Computer graphics and image processing - The Virtual Reality Modeling Language (VRML) - Part 1: Functional specification and UTF-8 encoding
  • ISO/IEC