Conference Paper

The Effect of Immersive Audio Rendering on Listeners’ Emotional State

Article
Full-text available
Immersive audio has received significant attention in the past decade. The emergence of a few groundbreaking systems and technologies (Dolby Atmos, MPEG-H, VR/AR, AI) has contributed to reshaping the landscape of this field, accelerating the mass-market adoption of immersive audio. This review serves as a quick recap of some immersive audio background and the end-to-end workflow, covering audio capture, compression, and rendering. The technical aspects of object audio and ambisonics are explored, as well as other related topics such as binauralization, virtual surround, and upmix. Industry trends and applications are also discussed, where user experience ultimately decides the future direction of immersive audio technologies.
Article
Full-text available
Nowadays, smartphones and laptops equipped with cameras have become an integral part of our daily lives. The pervasive use of cameras enables the collection of an enormous amount of data, which can be easily extracted through video image processing. This opens up the possibility of using technologies that until now had been restricted to laboratories, such as eye-tracking and emotion analysis systems, to analyze users' behavior in the wild, during the interaction with websites. In this context, this paper introduces a toolkit that takes advantage of deep learning algorithms to monitor users' behavior and emotions, through the acquisition of facial expression and eye gaze from the video captured by the webcam of the device used to navigate the web, in compliance with the EU General Data Protection Regulation (GDPR). Collected data are potentially useful to support user experience assessment of web-based applications in the wild and to improve the effectiveness of e-commerce recommendation systems.
Technical Report
Full-text available
Scene Based Audio is a set of technologies for 3D audio that is based on Higher Order Ambisonics. HOA is a technology that allows for accurate capturing, efficient delivery, and compelling reproduction of 3D audio sound fields on any device, such as headphones, arbitrary loudspeaker configurations, or soundbars. We introduce SBA and we describe the workflows for production, transport and reproduction of 3D audio using HOA. The efficient transport of HOA is made possible by state-of-the-art compression technologies contained in the MPEG-H Audio standard. We discuss how SBA and HOA can be used to successfully implement Next Generation Audio systems, and to deliver any combination of TV, VR, and 360° video experiences using a single audio workflow.
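As a concrete illustration of the ambisonic representation underlying SBA, a mono source can be encoded into the four first-order B-format channels from its direction alone. This is a minimal first-order sketch using the traditional Furse-Malham W-weighting; the higher-order channels and MPEG-H transport described in the report are not shown.

```python
import math

def encode_foa(sample, azimuth_deg, elevation_deg):
    """Encode a mono sample into first-order B-format (W, X, Y, Z)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2)                  # omnidirectional, -3 dB (FuMa)
    x = sample * math.cos(az) * math.cos(el)   # front-back figure-of-eight
    y = sample * math.sin(az) * math.cos(el)   # left-right figure-of-eight
    z = sample * math.sin(el)                  # up-down figure-of-eight
    return w, x, y, z

# A source straight ahead lands entirely in W and X:
w, x, y, z = encode_foa(1.0, azimuth_deg=0.0, elevation_deg=0.0)
```

Higher orders add further spherical-harmonic channels in the same spirit, which is what keeps the format independent of the loudspeaker layout until the rendering stage.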
Article
Full-text available
Room response equalization aims at improving the sound reproduction in rooms by applying advanced digital signal processing techniques to design an equalizer on the basis of one or more measurements of the room response. This topic has been intensively studied in the last 40 years, resulting in a number of effective techniques facing different aspects of the problem. This review paper aims at giving an overview of the existing methods following their historical evolution, and discussing pros and cons of each approach with relation to the room characteristics, as well as instrumental and perceptual measures. The review is concluded by a discussion on emerging topics and new trends.
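At the core of many of the equalizer designs surveyed here is an inverse filter computed from a measured response. A minimal frequency-domain sketch with regularization might look like the following (the `beta` value is an illustrative placeholder, not taken from any specific method in the review):

```python
import cmath

def dft(x):
    """Naive DFT, sufficient for a short measured response."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def regularized_inverse(h, beta=1e-3):
    """Regularized spectral inverse of an impulse response h:
    H_inv(k) = conj(H(k)) / (|H(k)|^2 + beta).
    beta keeps the inverse bounded at deep spectral notches, trading
    complete equalization there for robustness to measurement position."""
    return [hk.conjugate() / (abs(hk) ** 2 + beta) for hk in dft(h)]

# An ideal (flat) response inverts to an almost-flat spectrum:
inv = regularized_inverse([1.0, 0.0, 0.0, 0.0])
```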
Conference Paper
Full-text available
Binaural audio is growing in relevance due to an increase in mobile media consumption (and therefore headphone listening) and due to an increase in the popularity of immersive technology as a whole. With this comes an increasing need to evaluate the quality of experience provided by binaural audio. Overall listening experience (OLE) is an affective measure used in the evaluation of audio and can be regarded as the 'quality of experience' in the context of audio consumption. In this paper, a web-based study is presented that investigates the influence of binaural audio on OLE. Results show that, for the stimuli and participants used, binaural processing influences OLE in a small but significant way compared to stereo headphone reproduction.
Article
Full-text available
Recursive Ambiophonic Crosstalk Elimination (RACE), implemented as a VST plug-in, convolved from an impulse response, or purchased as part of a TacT Audio or other home audiophile product, properly reproduces all the ITD and ILD data sequestered in most standard two- or multichannel media. Ambiophonics is so named because it is intended to be the replacement for 75-year-old stereophonics and 5.1 in the home, car, or monitoring studio, but not in theaters. The response curves show that RACE produces a loudspeaker binaural soundfield with no audible colorations, much like Ambisonics or Wavefield Synthesis. RACE can do this starting with most standard two-, four- or five-channel CD/LP/DVD media, or even better, 2- or 4-channel recordings made with an Ambiophone, using one or two pairs of closely spaced loudspeakers. The RACE stage can easily span up to 170° for two-channel orchestral recordings or 360° for movie/electronic-music surround sources. RACE is not sensitive to head rotation, and listeners can nod, recline, stand up, lean sideways, move forward and back, or sit one behind the other. As in 5.1, off-center listeners can easily localize the center dialog even though no center speaker is ever needed.
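The recursive cross-feed at the heart of RACE can be sketched in a few lines. The delay and attenuation values below are illustrative placeholders, not taken from any product or from the paper:

```python
def race(left, right, delay=3, gain=0.85):
    """Recursive cross-feed sketch of RACE: each output channel subtracts
    a delayed, attenuated copy of the *other output*, so every cancellation
    signal is itself cancelled a few samples later, producing the decaying
    alternating train that collapses interaural crosstalk."""
    n = len(left)
    out_l, out_r = [0.0] * n, [0.0] * n
    for i in range(n):
        cross_l = out_r[i - delay] if i >= delay else 0.0
        cross_r = out_l[i - delay] if i >= delay else 0.0
        out_l[i] = left[i] - gain * cross_l
        out_r[i] = right[i] - gain * cross_r
    return out_l, out_r

# An impulse in the left channel spawns a decaying cancellation train
# that alternates between the two loudspeakers:
out_l, out_r = race([1.0] + [0.0] * 9, [0.0] * 10)
```

Because the cancellation is recursive rather than a single-shot filter, the scheme stays effective over a comparatively wide listening area, which matches the paper's claims about listener movement.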
Article
Full-text available
A methodology is introduced for smoothing the complex transfer function of measured responses using well-established or arbitrary fractional-octave profiles, based on a novel time-frequency mapping framework. A corresponding impulse response is also analytically derived, having reduced complexity but conforming to perceptual principles. The relationship between complex smoothing and traditional power spectral smoothing is also presented.
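A minimal sketch of the fractional-octave idea, assuming a plain rectangular frequency-proportional window rather than the paper's specific profiles:

```python
def octave_smooth(H, fraction=3):
    """Fractional-octave smoothing of a complex transfer function H
    (a list of DFT bins): each bin is replaced by the mean of a window
    whose width grows in proportion to frequency, here +/- 1/(2*fraction)
    octave with a rectangular profile."""
    half = 2.0 ** (1.0 / (2 * fraction))   # half-window frequency ratio
    out = []
    for k in range(len(H)):
        lo = max(0, int(k / half))
        hi = min(len(H) - 1, int(k * half))
        window = H[lo:hi + 1]
        out.append(sum(window) / len(window))
    return out

# Smoothing preserves an already-flat complex response:
flat = octave_smooth([1 + 0j] * 16)
```

Averaging the complex bins (rather than magnitudes) is what distinguishes complex smoothing from power spectral smoothing: phase detail is smoothed along with magnitude.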
Article
Full-text available
This paper discusses the psychoacoustical background and the computational issues involved in the real-time implementation of a complete Ambiophonics reproduction system based on binaural technology. Ambiophonics, which requires only two media channels, evolved from previously known approaches such as the reproduction of binaural recordings over closely spaced loudspeakers through cross-talk cancellation, and the reconstruction of hall ambience by convolution from suitable impulse responses. The equations for the design of the digital filter coefficients are derived with regard to the many possible kinds of pre-existing recordings (binaural, sphere, ORTF, M/S), and their implementation on available hardware and software platforms are described. The authors suggest psychoacoustic explanations for the perceived audible performance, and describe the first results of a comparative listening test, evaluating the realism of three periphonic surround reproduction systems: Stereo Dipole, Ambisonics and Ambiophonics.
Article
Full-text available
Wave field synthesis is a spatial sound field reproduction technique aiming at authentic reproduction of auditory scenes. Its theoretical foundation was developed almost 20 years ago and has been improved considerably since then. Most of the original work on wave field synthesis is restricted to reproduction in a planar listening area using linear loudspeaker arrays. Extensions like arbitrarily shaped distributions of secondary sources and three-dimensional reproduction in a listening volume have not been discussed in a unified framework so far. This paper revisits the theory of wave field synthesis and presents a unified theoretical framework covering arbitrarily shaped loudspeaker arrays for two- and three-dimensional reproduction. The paper additionally gives an overview of the artifacts arising in practical setups and briefly discusses some extensions to the traditional concepts of WFS.
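The delay-and-gain core of a WFS driving function for a point source can be sketched as follows. This is the 2.5D case with the sqrt(jk) prefilter and tapering window omitted, and the geometry is purely illustrative:

```python
import math

def wfs_point_source_drivers(src, speakers, c=343.0):
    """Delay-and-gain core of a 2.5D WFS driving function for a point
    source behind the array: each secondary source (loudspeaker) fires
    after the propagation delay r/c with an amplitude falling off as
    1/sqrt(r), so the superposed wavefronts approximate the source's."""
    drivers = []
    for sp in speakers:
        r = math.dist(src, sp)           # source-to-loudspeaker distance
        drivers.append({"delay_s": r / c, "gain": 1.0 / math.sqrt(r)})
    return drivers

# Virtual source 1 m behind one end of a two-element array segment:
drv = wfs_point_source_drivers((0.0, -1.0), [(0.0, 0.0), (0.0, 3.0)])
```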
Conference Paper
Full-text available
In 2000, the Cohn-Kanade (CK) database was released for the purpose of promoting research into automatically detecting individual facial expressions. Since then, the CK database has become one of the most widely used test-beds for algorithm development and evaluation. During this period, three limitations have become apparent: 1) While AU codes are well validated, emotion labels are not, as they refer to what was requested rather than what was actually performed, 2) The lack of a common performance metric against which to evaluate new algorithms, and 3) Standard protocols for common databases have not emerged. As a consequence, the CK database has been used for both AU and emotion detection (even though labels for the latter have not been validated), comparison with benchmark algorithms is missing, and use of random subsets of the original database makes meta-analyses difficult. To address these and other concerns, we present the Extended Cohn-Kanade (CK+) database. The number of sequences is increased by 22% and the number of subjects by 27%. The target expression for each sequence is fully FACS coded and emotion labels have been revised and validated. In addition to this, non-posed sequences for several types of smiles and their associated metadata have been added. We present baseline results using Active Appearance Models (AAMs) and a linear support vector machine (SVM) classifier using a leave-one-out subject cross-validation for both AU and emotion detection for the posed data. The emotion and AU labels, along with the extended image data and tracked landmarks will be made available July 2010.
Article
Full-text available
In this comprehensive study, algorithms for upmixing, downmixing, and joint up/downmixing are examined and compared. Five upmixing algorithms based on signal decorrelation and reverberation are employed to convert two-channel stereo signals to five-channel signals. For downmixing, methods ranging from mixing with simple gain adjustment to more sophisticated head related transfer function (HRTF) filtering and crosstalk cancellation system (CCS) are utilized to downmix the center channel and the surround channels into the available two frontal loudspeakers. For situations where only two-channel content and loudspeakers are available, a number of up/downmixing schemes are used to simulate a virtual surround environment. Emphasis of comparison is placed on two consumer electronic products: a 5.1 home theater system and a dual-loudspeaker MP3 handset. The effect of loudspeaker spacing on rendering performance is examined. Listening tests are conducted to compare the processing methods in terms of three levels of subjective indices. The results are processed using a Multivariate ANalysis Of VAriance (MANOVA) to justify the statistical significance, followed by a multiple regression analysis to correlate the auditory preference with various timbral and spatial attributes.
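The simplest of the downmix methods mentioned, plain gain adjustment, can be sketched as follows. The -3 dB coefficients follow the common ITU-R BS.775 convention and are an assumption, not the paper's exact values:

```python
import math

def downmix_51_to_stereo(l, r, c, lfe, ls, rs):
    """Gain-based 5.1-to-stereo downmix: the center and surround channels
    are folded into the front pair at -3 dB (1/sqrt(2)); the LFE channel
    is discarded, as is common in simple downmix chains."""
    g = 1.0 / math.sqrt(2)
    left = [l[n] + g * c[n] + g * ls[n] for n in range(len(l))]
    right = [r[n] + g * c[n] + g * rs[n] for n in range(len(r))]
    return left, right

# A center-only signal appears equally (and 3 dB down) in both outputs:
L, R = downmix_51_to_stereo([0.0], [0.0], [1.0], [0.0], [0.0], [0.0])
```

The HRTF and crosstalk-cancellation downmixes studied in the paper replace these fixed gains with direction-dependent filters, at considerably higher computational cost.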
Chapter
Reviewing work from diverse scientific fields, this chapter approaches the human aesthetic response to reproduced audio as a process of attraction and efficient (“fluent”) processing for certain auditory stimuli that can be associated with listener pleasure (valence) and attention (arousal), provided that they conform to specific semantic and contextual principles, either derived from perceived signal features or from top-down cognitive processes. Recent techniques for room-related loudspeaker-based presentation of auditory scenes, especially via multichannel reproduction, further extend the options for manipulating the source signals to allow the rendering of virtual sources beyond the frontal azimuth angles and to enhance the listener envelopment. Hence, such methods increase arousal and valence and contribute additional factors to the listeners’ aesthetic experience for reproduced natural or virtual scenes. This chapter also examines the adaptation of existing models of aesthetic response to include listeners’ aesthetic assessments of spatial-audio reproduction in conjunction with present and evolving methods for evaluating the quality of such audio presentations. Given that current sound-quality assessment methods are usually strongly rooted in objective, instrumental measures and models, which intentionally exclude the observers’ emotions, preferences and liking (hedonic response), the chapter also proposes a computational model structure that can incorporate aesthetic functionality beyond or in conjunction with quality assessment.
Chapter
Current immersive audio technologies offer today’s content producers more creative and recreative possibilities than ever before. These systems can provide an enhanced sense of realism to accompany video, a multi-sensory immersive experience with video games, or rich, luscious musical soundscapes. The ability to place sound, not just in front, but 360° around the listeners, or even above and below them, drastically expands the acoustic space within which a recording engineer or mixer can work and create.
Article
Automated affective computing in the wild setting is a challenging problem in computer vision. Existing annotated databases of facial expressions in the wild are small and mostly cover discrete emotions (aka the categorical model). There are very limited annotated facial databases for affective computing in the continuous dimensional model (e.g., valence and arousal). To meet this need, we collected, annotated, and prepared for public distribution a new database of facial emotions in the wild (called AffectNet). AffectNet contains more than 1,000,000 facial images collected from the Internet by querying three major search engines with 1250 emotion-related keywords in six different languages. About half of the retrieved images were manually annotated for the presence of seven discrete facial expressions and the intensity of valence and arousal. AffectNet is by far the largest database of facial expression, valence, and arousal in the wild, enabling research in automated facial expression recognition in two different emotion models. Two baseline deep neural networks are used to classify images in the categorical model and predict the intensity of valence and arousal. Various evaluation metrics show that our deep neural network baselines can perform better than conventional machine learning methods and off-the-shelf facial expression recognition systems.
Article
During the past decades, spatial reproduction of audio signals has evolved from simple two-channel stereo to surround sound (e.g. 5.1 or 7.1) and, more recently, to 3D sound including height speakers, such as 9.1 or 22.2. With an increasing number of speakers, greater spatial fidelity and listener envelopment are expected. This paper reviews popular methods for subjective assessment of audio. Moreover, it provides an experimental evaluation of the subjective quality provided by these formats, contrasting the well-known basic audio quality (BAQ) type of evaluation with the more recent evaluation of overall listening experience (OLE). Commonalities and differences in findings between both assessment approaches are discussed. The results of the evaluation indicate that 3D audio enhances BAQ as well as OLE over both stereo and surround sound. Furthermore, the BAQ- and OLE-based assessments turned out to deliver consistent and reliable results.
Conference Paper
Crowd sourcing has become a widely adopted scheme to collect ground truth labels. However, it is a well-known problem that these labels can be very noisy. In this paper, we demonstrate how to learn a deep convolutional neural network (DCNN) from noisy labels, using facial expression recognition as an example. More specifically, we have 10 taggers to label each input image, and compare four different approaches to utilizing the multiple labels: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss. We show that the traditional majority voting scheme does not perform as well as the last two approaches that fully leverage the label distribution. An enhanced FER+ data set with multiple labels for each face image will also be shared with the research community.
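The difference between a majority-vote target and a full label distribution can be sketched on a toy example. The class names and tag counts below are invented for illustration, not taken from the FER+ data set:

```python
import math
from collections import Counter

def majority_label(tags):
    """Hard target: the single most frequent crowd tag."""
    return Counter(tags).most_common(1)[0][0]

def label_distribution(tags, classes):
    """Soft target: the empirical distribution over all crowd tags."""
    counts = Counter(tags)
    return [counts[c] / len(tags) for c in classes]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy loss between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

classes = ["happy", "neutral", "surprise"]
tags = ["happy"] * 6 + ["neutral"] * 3 + ["surprise"] * 1   # 10 taggers

hard = [1.0 if c == majority_label(tags) else 0.0 for c in classes]
soft = label_distribution(tags, classes)   # keeps the taggers' ambiguity
```

Training against `soft` rewards a network that reproduces the taggers' disagreement, instead of forcing a single, possibly noisy, winner as `hard` does.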
Article
A study with two experiments was carried out where participants rated the overall listening experience while listening to music excerpts. When rating the overall listening experience, participants are asked to take every aspect into account which seems important to them, including song, lyrics, mood and audio quality. In the first experiment participants rated music excerpts which were played back through three different single-/multi-channel systems (mono, stereo and surround). The analysis of the results revealed that the single-/multi-channel system had a significant effect on the overall listening experience. A second experiment was conducted to investigate the influence of the listening room on the overall listening experience, where the results showed that the listening room had a non-significant effect on the ratings.
Article
Former studies have shown that up- and down-mix algorithms have a significant effect on ratings of audio quality. The question arises whether this significant effect is also verifiable when it comes to rating the overall listening experience of music. When listeners rate the overall listening experience, they are allowed to take everything into account that is important to them for enjoying a listening experience. An experiment was conducted where 25 participants rated the overall listening experience while listening to music that was artistically mixed and up- and down-mixed by six algorithms. The results show that there are no significant differences between the artistic mixes and the up- and down-mix algorithms except for two mixing algorithms which served as "lower anchors" and had a significant negative effect on the ratings.
Article
In recent years, new developments have led to an increasing number of virtual reality (VR)-based experiments, but little is known about their validity compared to real-world experiments. To this end, an experiment was carried out which compares responses given in a real-world environment to responses given in a VR environment. In the experiment, thirty participants rated the overall listening experience of music excerpts while sitting in a cinema and a listening booth being in a real-world environment and in a VR environment. In addition, the VR system that was used to carry out the sessions in the VR environment is presented in detail. Results indicate that there are only minor statistically significant differences between the two environments when the overall listening experience is rated. Furthermore, in the real-world environment, the ratings given in the listening booth were slightly higher than in the cinema.
Article
A methodology is introduced for smoothing the complex transfer function of measured responses using well-established or arbitrary fractional-octave profiles, based on a novel time-frequency mapping framework. A corresponding impulse response, also derived analytically, has reduced complexity but conforms to perceptual principles. The relationship between complex smoothing and traditional power spectral smoothing is also discussed.
Book
Explores the principles and practical considerations of spatial sound recording and reproduction. Particular emphasis is given to the increasing importance of multichannel surround sound and 3D audio, including binaural approaches, without ignoring conventional stereo. The enhancement of spatial quality is arguably the only remaining hurdle to be overcome in pursuit of high quality sound reproduction. The rise of increasingly sophisticated spatial sound systems presents an enormous challenge to audio engineers, many of whom are confused by the possibilities and unfamiliar with standards, formats, track allocations, monitoring configurations and recording techniques. The author provides a comprehensive study of the current state of the art in spatial audio, concentrating on the most widely used approaches and configurations. Anyone wishing to expand their understanding of these cutting-edge technologies will want to own this book.
  • Keep up to date with the latest technology: 3D audio, surround sound and multichannel recording techniques
  • Enables you to cut through the common confusions and problems associated with the subject
  • Further your knowledge with this comprehensive study of the very latest spatial audio technologies and techniques
Conference Paper
Algorithms for upmixing, downmixing, and joint up/downmixing are examined. Two upmixing algorithms are employed to convert two-channel stereo signals to five-channel signals. For downmixing, methods ranging from mixing with simple gain adjustment to more sophisticated Head Related Transfer Function (HRTF) filtering and Crosstalk Cancellation System (CCS) are utilized to downmix the center channel and the surround channels into the available two frontal loudspeakers. For situations where only two-channel content and loudspeakers are available, a number of up/downmixing schemes are used to simulate a virtual surround environment. Emphasis of comparison is placed on a dual-loudspeaker MP3 handset. Listening tests are conducted to compare the processing methods in terms of three levels of subjective indices. The results are processed using a Multivariate ANalysis Of VAriance (MANOVA) to justify the statistical significance.
An experimental verification of localization in two-channel stereo
  • E Benjamin
E. Benjamin, "An experimental verification of localization in two-channel stereo," in Proc. of the 121st Audio Engineering Society Convention. Audio Engineering Society, 2006.
Ambiophonics: The synthesis of concert hall sound fields in the home
  • Glasgal
Mediapipe: A framework for building perception pipelines
  • C Lugaresi
  • J Tang
  • H Nash
  • C Mcclanahan
  • E Uboweja
  • M Hays
C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, and M. Grundmann, "Mediapipe: A framework for building perception pipelines," arXiv preprint arXiv:1906.08172, 2019.
Immersive sound: the art and science of binaural and multi-channel audio.
  • A Roginska
  • P Geluso
A. Roginska and P. Geluso, Immersive sound: the art and science of binaural and multi-channel audio. Taylor & Francis, 2017.