AUDIO-VISUAL PERCEPTION OF OMNIDIRECTIONAL VIDEO FOR VIRTUAL REALITY
APPLICATIONS
Fang-Yi Chao, Cagri Ozcinar, Chen Wang, Emin Zerman, Lu Zhang,
Wassim Hamidouche, Olivier Deforges, Aljosa Smolic
Univ Rennes, INSA Rennes, CNRS, IETR - UMR 6164, F-35000 Rennes, France,
V-SENSE, School of Computer Science and Statistics, Trinity College Dublin, Ireland.
ABSTRACT
Ambisonics, which constructs a sound distribution over the full
viewing sphere, improves immersive experience in omnidirectional
video (ODV) by enabling observers to perceive the sound directions.
Thus, human attention could be guided by audio and visual stim-
uli simultaneously. Numerous datasets have been proposed to in-
vestigate human visual attention by collecting eye fixations of ob-
servers navigating ODV with head-mounted displays (HMD). How-
ever, there is no such dataset analyzing the impact of audio infor-
mation. In this paper, we establish a new audio-visual attention
dataset for ODV with mute, mono, and ambisonics. The user be-
havior, including visual attention corresponding to sound source lo-
cations, viewing navigation congruence between observers, and fixa-
tion distributions in these three audio modalities, is studied based on
video and audio content. From our statistical analysis, we prelimi-
narily found that, compared to only perceiving visual cues, perceiv-
ing visual cues with salient object sound (e.g., a human voice, the siren of
an ambulance) could draw more visual attention to the objects making
sound and guide viewing behaviour when such objects are not in the
current field of view. The more in-depth interactive effects between
audio and visual cues in mute, mono and ambisonics still require fur-
ther comprehensive study. The dataset and developed testbed in this
initial work will be publicly available with the paper to foster future
research on audio-visual attention for ODV.
Index Terms—Ambisonics, omnidirectional video, virtual re-
ality (VR), visual attention, audio-visual saliency.
1. INTRODUCTION
With recent technological advancements in virtual reality (VR) sys-
tems, omnidirectional video (ODV), also known as 360° video, is an
increasingly important multimedia representation to provide a high-
quality immersive VR experience. The audio-visual representation
of ODV is typically captured with omnidirectional microphone and
camera systems. The audio part of ODV can be represented by spa-
tial audio, e.g., ambisonics, which is a description of a 3D spatial au-
dio scene. The ambisonics format encodes the directional properties
of the sound field to four or more fixed audio channels. The visual
part of the ODV signal is typically stored in 2D planar representa-
tions such as equirectangular projection (ERP) to be compatible with
the existing video technology systems. Thanks to its immersive and
interactive nature, ODV can be used in different applications such as
entertainment and education.
Although technical aspects of ODV have been widely investi-
gated for different applications, many research questions are still
open in the context of audio-visual perception of ODV. (This publication
has emanated from research conducted with the financial support of Science
Foundation Ireland (SFI) under Grant Number 15/RP/27760.) The need for
understanding and anticipating human behavior while watching ODVs in VR
is essential for optimizing VR systems, such as
streaming [1] and rendering [2]. Towards this aim, recent visual at-
tention/saliency research activities and studies for ODV have set a
fundamental background for understanding users’ behavior in VR
systems. Most research works have investigated users’ behavior
with subjective experiments and developed algorithms for predicting
users’ visual attention. However, they have focused on visual cues
only. Specifically, audio-visual perception of ODVs is highly over-
looked in the literature. Creating immersive VR experiences requires
full spherical audio-visual representation of ODV. In particular, the
spatial aspect of audio might also play an important role in inform-
ing the viewers about the location of objects in the 360° environ-
ment [3], guiding visual attention in ODV films [4], and achieving
presence with head-mounted displays (HMDs). To this end, in spite
of the existing evidence on the correlation between audio and visual
cues and their joint contribution to our perception [5], to date, most
user behavior studies and algorithms for prediction of visual atten-
tion neglect audio cues, and consider visual cues as the only source
of attention. The lack of understanding of the audio-visual percep-
tion of ODV raises interesting research questions for the multimedia
community, such as: How does ODV with and without audio affect
users’ attention?
To understand the auditory and visual perception of ODV, in this
work, we investigated users’ audio-visual attention using ODV with
three different audio modalities, namely, mute, mono, and ambison-
ics. We first designed a testbed for gathering users’ viewport center
trajectories (VCTs), created a dataset with a diverse set of audio-
visual ODVs, and conducted subjective experiments for each ODV
with mute, mono, and ambisonics modalities. We analyzed visual
attention in ODV with mute modality and audio-visual attention in
ODV with mono and ambisonics modalities by investigating the cor-
relation of visual attention and sound source locations, the consis-
tency of viewing paths between observers, and distribution of vi-
sual attention in the three audio modalities. An ODV with ambison-
ics provides not only auditory cues but also the direction of sound
sources, while mono only provides the magnitude of auditory cues.
Users only perceive the loudness of the audio without audio direc-
tion in mono modality. Our new dataset includes VCTs and visual
attention maps from 45 participants (15 for each audio modality),
and our developed testbed will be available with this paper¹. To the
best of our knowledge, this dataset with such audio-visual analysis is
the first to address the problem of audio-visual perception of ODV.
We expect that this initial study will be beneficial for future research
on understanding and anticipating human behavior in VR.
The rest of the paper is organized as follows. Section 2 discusses
the related literature on visual attention studies for ODV. Section 3
¹https://v-sense.scss.tcd.ie/research/360audiovisualperception/
describes the technical details of subjective experiments and post-
processing, and Section 4 presents our analysis. Finally, Section 5
concludes the paper.
2. RELATED WORK
Although visual attention has been widely investigated for ODV in
recent years, audio-visual perception of ODV has not been studied
much for VR. Here, we briefly review recent ODV perception re-
search, in particular, visual attention/saliency studies and algorithms
for modeling the visual attention of ODV. For a comprehensive lit-
erature review on the analysis of ODV visual attention, we refer the
reader to [6].
Analysis of visual attention of ODV based on eye and head
tracking datasets aims to identify the most salient regions of ODVs.
In particular, eye or head movements determine the areas of ODVs
which are salient for users. For instance, David et al. [7] estab-
lished a dataset for head and eye movements to understand how
users consume ODV. Their study investigates the impact of the lon-
gitudinal starting position when watching ODV. Several algorithms
have been proposed for modeling visual attention of ODV. In partic-
ular, the Salient360! Grand challenges at ICME 2017-2018 fostered
the development of saliency prediction models for ODV by provid-
ing benchmark platforms and datasets [7]. Also, Zhang et al. [8]
presented a large-scale eye-tracking dataset using only sport-related
ODV. Their analysis demonstrates that salient objects (e.g., appear-
ance and motion of the object) easily attract the viewer’s atten-
tion. Ozcinar and Smolic [9] analyzed content consumption of
ODV viewed in HMDs. Their results prove that the quantity of
fixations depends on the motion complexity of ODV. Furthermore,
Nasrabadi et al. [10] investigated the impact of the type of camera
motion and the number of moving objects in ODV using HMD nav-
igation trajectories. Their results also reveal that users tend to look
at moving targets in ODV. In the studies mentioned above, the audio
signal was discarded from ODVs during subjective experiments. For
realism and presence in VR, experiences should be multi-modal, but
none of the above-mentioned perception studies proposes an audio-
visual ODV dataset nor performs an analysis of human behavior on
audio-visual ODV.
Recently, there has been increasing interest in the audio-visual
aspects of ODV. For example, Rana et al. [11] and Morgado et
al. [12] focused on generating ambisonics for ODV using differ-
ent modeling strategies. They concentrate on utilizing texture and
mono audio of ODV, predicting the location of the audio sources
and encoding ambisonics. Also, Senocak et al. [13] proposed a
unified end-to-end deep convolutional neural network for predicting
the location of sound sources using an attention mechanism that is
guided by sound information. Furthermore, a recent work conducted
by Tavakoli et al. [14] proposed DAVE to investigate the applica-
bility of audio cues in conjunction with visual ones in predicting
saliency maps for standard 2D video using deep neural networks.
Min et al. [15] proposed a multi-modal framework which fuses spa-
tial, temporal and audio saliency maps for standard 2D video with
high audio-visual correspondence. Their results show that the audio
signal contributes significantly to standard video saliency prediction.
However, to the best of our knowledge, neither audio-visual percep-
tion analysis nor audio-visual saliency prediction algorithms exist
for ODV.
3. SUBJECTIVE EXPERIMENTS AND POST-PROCESSING
3.1. Design of testbed
We developed a JavaScript-based testbed that allows us to play
ODVs with three different modalities (i.e., mute, mono, and ambisonics)
while recording VCTs of participants for the whole duration of the
experiment.

Fig. 1: Schematic diagram of the designed testbed.

The testbed was implemented using
three JavaScript libraries, namely three.js [16], WebXR [17],
and JSAmbisonics [18]. The libraries of three.js and
WebXR enable the creation of fully immersive ODV experiences in
a browser, allowing us to use an HMD with a web browser. The
JSAmbisonics facilitated spatial audio experiences for ODVs
with its real-time spatial audio processing functions (i.e., non-
individual head-related transfer functions based on spatially oriented
format for acoustics). The developed testbed can record VCTs with-
out the need for eye-tracking devices, which is adequate for many
VR use cases. As shown in Fig. 1, the developed testbed
records participants’ VCTs with the current time-stamp, name of
ODV, and audio modality. At the front-end of the testbed, a .json
file of a given set of ODVs is first loaded as the playlist file, and
a given video is played while the recorded data is stored at the
back-end of the testbed with the refresh rate of the device’s graph-
ics card. The HTTP server was implemented at the back-end us-
ing an Apache web server with the MySQL database, where the
audio-related (e.g., mute, mono, and ambisonics), sensor-related
(e.g., viewing direction), and user-related (e.g., user ID, age, and
gender) data are stored in the database.
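To make the stored records concrete, the following is a minimal sketch of what a single recorded VCT sample might look like at the back-end; the field names and value formats are illustrative assumptions rather than the exact schema of the testbed database.

```python
# Hypothetical example of one recorded viewport-center sample. Field names
# are illustrative assumptions, not the exact schema of the testbed database.
vct_sample = {
    "user_id": 7,                     # user-related data (age and gender stored separately)
    "odv_name": "04_CoronationDay",   # name of the displayed ODV (hypothetical naming)
    "audio_modality": "ambisonics",   # one of "mute", "mono", "ambisonics"
    "timestamp_ms": 12345,            # playback time of the sample in milliseconds
    "theta_deg": 212.4,               # viewport-center longitude, 0 <= theta < 360
    "phi_deg": 95.1,                  # viewport-center latitude, 0 <= phi <= 180
}
```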
3.2. Methodology
To equalize the number of VCTs per audio modality for each ODV,
and to ensure that each participant watches each ODV content only
once, three playlists were prepared. Each playlist included a training
and four test ODVs per audio modality, so there were three training
ODVs and twelve test ODVs in total. The ODVs with three different audio
modalities, namely, mute, mono, and ambisonics, and three content
categories were allocated to three playlists, respectively, and equal
numbers of participants were distributed to the three playlists. The
playing order of the test ODVs for each playlist was randomized
before starting each subjective test.
Task-free viewing sessions were performed in our subjective ex-
periments. All the participants were wearing an HMD, sitting in a
swivel chair, and asked to explore the ODVs without any specific
intention. In the experiments, we used an Oculus Rift consumer
version as HMD, Bose QuietComfort noise-canceling headphones,
and Firefox Nightly as web browser. During the test, VCTs were
recorded as coordinates of longitude (0° ≤ θ < 360°) and latitude
(0° ≤ Φ ≤ 180°) on the viewing sphere. We fixed the starting position
of each viewing as the center point (θ = 180° and Φ = 90°) at the
beginning of every ODV display. A 5-second rest period showing
a gray screen was included between two successive ODVs to avoid
eye fatigue and motion sickness. The total duration of the experi-
ments was about 10 minutes. During experiments, participants were
alone in the environment to avoid any influence by the presence of
an instructor.
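As a small illustration of how the recorded coordinates are used in the later analysis, the sketch below maps a viewport center (θ, Φ) in degrees to a pixel position on a 3840 × 1920 ERP frame, following the convention described above (θ = 180° and Φ = 90° at the image center); the helper name and the rounding choice are ours.

```python
def vct_to_erp_pixel(theta_deg, phi_deg, width=3840, height=1920):
    """Map a recorded viewport center (theta, phi) in degrees to ERP pixel
    coordinates, assuming longitude 0 <= theta < 360 maps to columns and
    latitude 0 <= phi <= 180 maps to rows."""
    col = int(round(theta_deg / 360.0 * width)) % width
    row = min(int(round(phi_deg / 180.0 * height)), height - 1)
    return row, col

# The fixed starting position maps to the center of the ERP frame.
print(vct_to_erp_pixel(180.0, 90.0))  # -> (960, 1920)
```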
3.3. Materials
Our dataset contains 15 monoscopic ODVs (three training and 12
testing) with first-order ambisonics in 4-channel B-format (W, X, Y,
and Z) collected from YouTube.

Table 1: Description of the ODVs in our dataset.

Category      ODV ID  Name           Fps  YouTube ID   Selected Segment
Conversation  Train   VoiceComic     24   5h95uTtPeck  00:30:10 – 00:55:10
              01      TelephoneTech  30   idLVnagjl s  00:32:00 – 00:57:00
              02      Interview      50   ey9J7w98wlI  02:21:20 – 02:40:10
              03      GymClass       30   kZB3KMhqqyI  00:50:00 – 01:15:00
              04      CoronationDay  25   MzcdEI-tSUc  09:10:00 – 09:35:00
Music         Train   Chiaras        30   Bvu9m ZX60   00:12:15 – 00:37:15
              05      Philarmonic    25   8ESEI0bqrJ4  00:40:00 – 01:05:00
              06      GospelChoir    25   1An41lDIJ6Q  00:09:10 – 00:34:10
              07      Riptide        60   6QUCaLvQ 3I  00:00:00 – 00:25:00
              08      BigBellTemple  30   8feS1rNYEbg  02:54:26 – 03:19:26
Environment   Train   Skatepark      30   gSueCRQO 5g  00:00:00 – 00:25:00
              09      Train          30   ByBF08H-wDA  00:20:10 – 00:45:10
              10      Animation      30   fryDy9YcbI4  00:01:00 – 00:26:00
              11      BusyStreets    30   RbgxpagCY c  02:16:18 – 02:39:20
              12      BigBang        25   dd39herpgXA  00:00:00 – 00:25:00

In our experiment, ODVs in mute
modality were produced by removing all audio channels, and ODVs
in mono modality were produced by mixing four audio channels into
one channel which can be distributed equally in left and right head-
phones. All ODVs have 4K resolution (3840 × 1920) in ERP format,
and each segment is 25 s long. We divided the ODVs into three cat-
egories, namely, Conversation, Music, and Environment, depending
on their audio-visual cues in a pilot test with two experts. The cat-
egory of Conversation presents a person or several people talking,
the category Music features people singing or playing instruments,
while the category Environment includes background sound such as
noise of crowds, vehicle engines and horns on the streets. Table 1
summarizes the main characteristics of ODVs used in our dataset,
where Train denotes the training set in each category, and Fig. 2
presents examples of each ODV. Also, Fig. 3 illustrates the visual
diversity of each ODV in terms of spatial and temporal information
measures [19], SI and TI, respectively. Each ODV is re-projected
to cubic faces for computation of SI and TI to prevent effects from
serious geometric distortion along latitude in ERP, as suggested by
De Simone et al. [20].
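For reference, the sketch below gives a minimal SI/TI computation following the P.910 definitions (SI as the temporal maximum of the standard deviation of the Sobel-filtered frame, TI as the temporal maximum of the standard deviation of frame differences). It assumes the input frames have already been re-projected from ERP to cube faces, as suggested in [20]; the re-projection itself is not shown.

```python
import cv2
import numpy as np

def si_ti(frames):
    """Compute SI and TI (ITU-T P.910 [19]) from a sequence of grayscale
    frames given as float arrays, here assumed to be cube faces re-projected
    from the ERP frames as suggested in [20]."""
    si_values, ti_values = [], []
    prev = None
    for frame in frames:
        gx = cv2.Sobel(frame, cv2.CV_64F, 1, 0, ksize=3)
        gy = cv2.Sobel(frame, cv2.CV_64F, 0, 1, ksize=3)
        si_values.append(np.std(np.hypot(gx, gy)))   # spatial information per frame
        if prev is not None:
            ti_values.append(np.std(frame - prev))   # temporal information per frame pair
        prev = frame
    return max(si_values), (max(ti_values) if ti_values else 0.0)
```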
3.4. Participants
Forty-five participants were recruited in this subjective experiment.
Each ODV with each modality was viewed by 15 participants, and
each participant viewed each ODV only once. These participants
were aged between 21 and 40 years with an average of 27.3 years,
and sixteen of them were female. Eight of them were familiar with
VR, and the others were naïve viewers. All were screened and re-
ported normal or corrected-to-normal visual and auditory acuity, and
24 participants wore glasses during the experiment.
3.5. Post-processing
We recorded the VCTs of each participant in our subjective tests. As
human eyes tend to look straight ahead [21] and head movement fol-
lows eye movement to preserve the eye resting position, similar to
previous works [9, 10, 22, 23], we also consider the viewport center
as an approximate gaze position for visual attention estimation. As
observers do not see the complete 360° view at a glance, but only
the content in the viewport, the VCT provides important information
for ODV applications, such as streaming [1, 22].
For analyzing visual attention, only fixations in raw scan-path data
collected from VCTs are identified with the density-based spatial clus-
tering of applications with noise (DBSCAN) algorithm [24]. We define a fixation as a particu-
lar location where successive gaze positions remain almost unmoved
for at least 200 ms. To ignore the minor involuntary head movement
and to reduce the sensitivity to noise, similar to previous ODV visual
attention studies [9, 23,25, 26], we utilized the DBSCAN algorithm
to filter noisy fixation points.
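A minimal sketch of this fixation-detection step is given below. The DBSCAN parameters and the use of a plain Euclidean distance on (θ, Φ), which ignores the wrap-around of longitude, are illustrative assumptions rather than the exact settings used for the dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_fixations(points_deg, timestamps_ms, eps_deg=1.0, min_ms=200):
    """Identify fixations in a raw VCT scan-path with DBSCAN [24]:
    points_deg is an (N, 2) array of (theta, phi) samples in degrees and
    timestamps_ms the corresponding playback times. Clusters lasting less
    than min_ms, and samples labeled as noise, are discarded."""
    points = np.asarray(points_deg, dtype=float)
    times = np.asarray(timestamps_ms, dtype=float)
    labels = DBSCAN(eps=eps_deg, min_samples=5).fit_predict(points)
    fixations = []
    for label in set(labels) - {-1}:              # -1 marks noisy samples
        idx = np.where(labels == label)[0]
        duration = times[idx].max() - times[idx].min()
        if duration >= min_ms:                    # keep clusters lasting >= 200 ms
            fixations.append(points[idx].mean(axis=0))
    return np.array(fixations)
```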
Fig. 2: Examples for each ODV used in subjective experiments.
Rows from top to bottom belong to the categories Conversation,
Music, and Environment, respectively.

Fig. 3: SI and TI [19, 20] for each ODV used in subjective experiments.
Each color denotes one category: Conversation, Music, and Environment.
After detecting all fixations of each ODV, we estimated a fixa-
tion map for t-th sec by gathering fixations of all the participants of
a given ODV. Then, a dynamic visual attention map (i.e., saliency
map) of each ODV was generated by applying a Gaussian filter to
its corresponding fixation map sequence. A Gaussian filter with σ
visual angle was used to spread the fixation points to account for the
gradually decreasing acuity from the foveal vision towards the pe-
ripheral vision. Based on the fact that gaze shifts smaller than 10°
can occur without the corresponding head movement [21], we set σ
to 5° according to the 68–95–99.7 rule of the Gaussian distribution [23].
As only the viewport plane is displayed to an observer in HMD, we
applied the Gaussian filter on the projected viewport plane rather
than the entire ERP image.
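The sketch below illustrates this smoothing step in a simplified form: fixations are accumulated into a per-second fixation map and spread with a Gaussian of σ = 5°. For simplicity it applies the filter directly on the ERP grid (with wrap-around along longitude), whereas the paper applies it on the projected viewport plane; the grid size and boundary handling are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def attention_map_erp(fixations_deg, width=3840, height=1920, sigma_deg=5.0):
    """Build a visual attention (saliency) map by spreading fixation points
    (theta, phi) in degrees with a Gaussian of sigma = 5 degrees. Simplified
    sketch: the filter is applied on the ERP grid, not the viewport plane."""
    fixation_map = np.zeros((height, width), dtype=np.float64)
    for theta, phi in fixations_deg:
        row = min(int(phi / 180.0 * height), height - 1)
        col = int(theta / 360.0 * width) % width
        fixation_map[row, col] += 1.0
    sigma_px = sigma_deg / 360.0 * width            # degrees -> pixels along longitude
    saliency = gaussian_filter(fixation_map, sigma=sigma_px, mode="wrap")
    return saliency / saliency.max() if saliency.max() > 0 else saliency
```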
4. ANALYSIS
In the following, we analyze the observer behavior while watching
ODVs with three audio modalities (mute, mono, and ambisonics) in
three content categories.
4.1. Do audio source locations attract attention of users?
To analyze the effect of audio information on visual attention when
audio and visual stimuli are presented simultaneously, we measure
to what extent visual attention corresponds to areas with audio sources
under three audio modalities. We generate an audio energy map
(AEM), representing the audio energy distribution with a frame-by-
frame heat map. In AEM, the energy distribution is calculated with
the help of given audio directions in four channels (W, X, Y, Z) in
ambisonics as proposed in [12]. We then estimate normalized scan-
path saliency (NSS) [27] to quantify the number of fixations that
overlap with the distribution of audio energy via AEM. NSS is a
widely-used saliency evaluation metric. It is sensitive to false posi-
tives and relative differences in saliency across the image [28].
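To make this evaluation step concrete, the sketch below pairs a naive audio energy map with the NSS computation. The AEM here is only a rough directional-energy approximation obtained by decoding the four B-format channels towards each direction of a coarse ERP grid; the channel ordering/normalization convention, grid resolution, and time window are assumptions, and the AEM used in the paper follows [12].

```python
import numpy as np

def audio_energy_map(w, x, y, z, height=90, width=180):
    """Rough AEM approximation for one time window: decode the first-order
    B-format channels (1-D arrays over the window) towards every direction of
    a coarse ERP grid and average the squared decoded signal. The channel
    convention is an assumption; the paper builds its AEM following [12]."""
    phi = np.linspace(0.0, np.pi, height)          # latitude, 0..180 degrees
    theta = np.linspace(0.0, 2.0 * np.pi, width)   # longitude, 0..360 degrees
    elevation = np.pi / 2.0 - phi
    aem = np.zeros((height, width))
    for i, el in enumerate(elevation):
        for j, az in enumerate(theta):
            s = w + x * np.cos(el) * np.cos(az) + y * np.cos(el) * np.sin(az) + z * np.sin(el)
            aem[i, j] = np.mean(s ** 2)
    return aem

def nss(saliency_map, fixation_rows, fixation_cols):
    """Normalized Scanpath Saliency [27]: normalize the map to zero mean and
    unit standard deviation, then average its values at fixation locations."""
    normalized = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-12)
    return float(np.mean(normalized[fixation_rows, fixation_cols]))
```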
Fig. 4: Mean and 95% confidence interval of the normalized scan-
path saliency (NSS) of fixations falling in sound source areas under
three audio modalities. ** marks a statistically significant difference
(SSD) between two modalities.

Fig. 4 illustrates the mean and 95% confidence intervals, computed
by bootstrapping, of the per-user NSS for each modality of the ODVs.
A higher NSS score indicates that more fixations are attracted to the
audio source areas given by the AEM, whereas negative scores indicate
that most fixations do not correspond to the audio source areas.
Numerical results show that either the ambisonics or the mono case
obtains a greater NSS score than the mute case. From Fig. 4, we observe
that users may tend to follow audio stimuli (especially human voice)
in categories conversation and music while they tend to look around
in general regardless of the background sound in category environ-
ment. Notably, the two ODVs (ODV 06, 07) in the category music
feature singing humans, while the others (ODV 05, 08) contain hu-
mans playing instruments. However, in category conversation, ODV
02 obtains almost equal NSS scores in three audio modalities, which
shows that visual attention could also be affected by the interaction
of visual stimuli and audio stimuli depending on contents. In the cat-
egory environment, ODV 10 and ODV 12 have similar NSS scores,
while ODV 09 and ODV 11 show some differences. It appears that only
ODV 11, which features an ambulance driving through with its siren on,
obtains much higher NSS in ambisonics and mono than in mute. This
shows that hearing the siren and its direction catches more attention
than only seeing the ambulance.
To understand the significance of the NSS results, we per-
formed a statistical analysis with a Kruskal-Wallis H Test following a
Shapiro-Wilks normality test which rejects the hypothesis of normal-
ity of variables. Statistically significant difference (SSD) between
two modalities is detected by the Dunn-Bonferroni non-parametric
post hoc method. The pairs with an SSD are marked with ** in Fig. 4,
which shows that three ODVs in category conversation, two ODVs in
category music, and one ODV in category environment obtain an SSD
between mute and mono, and between mute and ambisonics. The
statistical significance analysis results are in line with our observa-
tions above. Furthermore, only one ODV has SSD between mono
and ambisonics, which demonstrates that perceiving the direction of
sound (i.e., ambisonics) might not catch more attention than only
perceiving the loudness of sound without directions (i.e., mono) in
most of the ODVs.
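The statistical procedure above can be sketched as follows, assuming per-user NSS scores are available for each modality of one ODV. scikit_posthocs is an assumed third-party helper for the Dunn post hoc test, and the exact grouping of the data in the original analysis may differ.

```python
import numpy as np
from scipy import stats
import scikit_posthocs as sp   # assumed helper package for Dunn's post hoc test

def compare_modalities(nss_mute, nss_mono, nss_ambi, alpha=0.05):
    """Compare per-user NSS scores across the three audio modalities:
    Shapiro-Wilk normality check, Kruskal-Wallis H test, and a Dunn post hoc
    test with Bonferroni correction (pairs with an SSD, marked ** in Fig. 4)."""
    groups = [np.asarray(nss_mute), np.asarray(nss_mono), np.asarray(nss_ambi)]
    normality_rejected = any(stats.shapiro(g).pvalue < alpha for g in groups)
    h_stat, p_value = stats.kruskal(*groups)
    pairwise_p = sp.posthoc_dunn(groups, p_adjust="bonferroni")
    return {"normality_rejected": normality_rejected,
            "kruskal_p": p_value,
            "pairwise_p": pairwise_p}
```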
For a visual comparison, Fig. 6 presents AEMs and fixations of
two ODVs for each category. In this example, we show an ODV
for each category (ODV 04, 06, 11) that receives statistically sig-
nificantly higher NSS in ambisonics, and the other (ODV 02, 05,
10) receives almost equal NSS or negative NSS under three modal-
ities. Looking at the figures, we can see that fixations are widely
distributed along the horizon under mute modality and are more con-
centrated in the AEMs under the ambisonics modality. We can see that
ODVs 04, 06, and 11, which obtain higher NSS in mono and ambisonics,
feature talking or singing people or an ambulance with a siren outside
the central field of view, which attract visual attention through object
audio cues. However, in ODVs 02, 05, and 10, we observe that visual cues
(e.g., human faces, moving objects, and a fast-moving camera) have
more effect than audio cues on the distribution of fixations.

Fig. 5: Mean and 95% confidence interval of IOC based on NSS of
each ODV with three audio modalities. ** marks a statistically signif-
icant difference (SSD) between two modalities.

For ex-
ample, as seen in ODV 02, three human faces are very close to one
another in the center of the ODV, and the users focused on the area of
faces in all three modalities. In ODV 05, a moving object, which is a
conductor in the center of an orchestra, has a more substantial contri-
bution to visual attention than audio cues. Furthermore, in ODV 10,
we see that the participants paid attention to the direction of camera
motion regardless of the sound source location.
4.2. Do observers have similar viewing behavior in mute, mono,
and ambisonics?
Observers’ viewing behavior could exhibit considerable variance
when consuming ODVs. Viewing trajectories might be more con-
sistent with one another when observers perceive audio (i.e., mono)
or audio direction (i.e., ambisonics). To investigate this, we esti-
mate inter-observer congruence (IOC) [29], which is a characteri-
zation of fixation dispersion between observers viewing the same
content. A higher IOC score represents lower dispersion, implying
higher viewing congruence. NSS is used here to compute IOC as
suggested in [30] to compare the fixations of each individual with
the rest of observers. Statistical analysis was also conducted with
the same methods as mentioned in Section 4.1.
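A minimal leave-one-out sketch of this IOC computation is given below; representing fixations as ERP pixel positions and the map-building callable are assumptions of this illustration.

```python
import numpy as np

def ioc_nss(fixations_per_observer, saliency_from_fixations):
    """NSS-based inter-observer congruence: for each observer, compare their
    fixations, given as (row, col) positions on the ERP grid, against a
    saliency map built from the fixations of all remaining observers.
    saliency_from_fixations is any callable that turns a list of fixations
    into a 2-D map (e.g., Gaussian smoothing of a fixation map)."""
    scores = []
    for i, own in enumerate(fixations_per_observer):
        others = [f for j, obs in enumerate(fixations_per_observer)
                  if j != i for f in obs]
        smap = saliency_from_fixations(others)
        normalized = (smap - smap.mean()) / (smap.std() + 1e-12)
        rows, cols = zip(*own)
        scores.append(float(np.mean(normalized[list(rows), list(cols)])))
    return float(np.mean(scores)), scores
```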
Fig. 5 illustrates the mean and 95% confidence intervals of the IOC
scores of each ODV in the three modalities, with SSD marked as **.
The figure shows that significant differences appear between the
without-sound (i.e., mute) and with-sound (i.e., mono or ambisonics)
cases; in particular, only in 4 out of 12 ODVs do we observe a
statistically significant difference between these two cases. Moreover,
we observe that an object's sound guides visual attention towards that
object when observers do not see it in the current field of view. For example,
in category conversation, ODVs 03 and 04, which feature people talking
behind the viewing center, receive significantly higher IOC with mono
or ambisonics than with mute, while the other two ODVs (01 and 02),
which feature talking people in the front that can be seen at the
beginning of the ODV display, show no significant differences among
the three audio modalities. Similarly, in category music, ODV 06, which
features people taking turns singing around the viewing center, receives
significantly higher IOC in ambisonics, as it informs observers of the
direction of a singing person unseen in the current field of view.
However, ODV 07, which has singing people in the front, and ODVs 05 and
08, which feature instrument playing, show no significant IOC differences
among the three audio modalities. In category environment, ODV 11,
featuring an ambulance with its siren on driving from right to left,
obtains significantly higher IOC with both mono and ambisonics than
with mute, while the other ODVs (09, 10, and 12), which have background
sound from vehicle engines or crowds on the street, show no significant
differences among the three audio modalities.

Fig. 6: A sample thumbnail frame with its AEM and fixations for each
ODV, where red represents the AEM and orange, blue, and pink denote
fixations recorded under mute, mono, and ambisonics modality,
respectively. A frame for each ODV ID from left to right: 02, 04, 06,
08, 09, and 10.

Fig. 7: Distribution of fixations and AEM over longitude for ODV 04 and
10. Orange, blue, and pink denote fixations recorded under mute, mono,
and ambisonics modality, respectively. In each polar sub-figure, the
longitude value of the ERP and its (normalized) number of fixations are
represented by the angle and the radius of the polar plot, respectively.
The distribution of the AEM is shown in red.

This demon-
strates that perceiving object audio cues and the corresponding direc-
tion guides visual attention and increases consistency of viewing pat-
terns between observers, when that object is not in the current field
of view. Comparing the IOC scores between mono and ambisonics,
we can see that the latter does not always receive higher scores in our
subjective experiments. It shows that hearing the direction of sound
(i.e., ambisonics) does not necessarily increase the consistency of viewing
patterns between observers, compared to only hearing the loudness
of sound (i.e., mono).
4.3. Does sound affect observers’ navigation?
To study the impact of perceiving audio (i.e., mono) and audio di-
rection (i.e., ambisonics) on visual attention, we estimated the over-
all fixation distributions and overall AEM of all the frames. In most
of the cases as shown in Fig. 6, the distribution of fixations for the
ODVs with ambisonics modality is more concentrated. Fig. 7 shows
the distribution of fixations and AEM in longitude of ODV 04, 10
with three modalities. This figure shows that, in ODV 04, the par-
ticipants follow the direction of the object audio in the ambisonics case,
where the main actors talking behind the observers attract visual
attention in the crowded scene. In
contrast, in ODV 10, the fixation distributions of three modalities are
similar to each other and unrelated to the audio information. This
is due to visual saliency of the fast moving camera, where most of
visual attention corresponds to the direction of camera motion.
From our analyses in Sections 4.1, 4.2, and 4.3, we can gener-
ally conclude that when salient audio (e.g., a human voice or a siren)
is presented, it catches visual attention more than when only visual cues
are presented. On the other hand, in some cases with salient vi-
sual cues (e.g., human faces, moving objects, and a moving camera),
audio and visual information interactively affect visual attention. In
addition, perceiving sounds and sound directions of salient objects
can guide visual attention and achieve higher IOC, if these objects
are not in the current field of view.
Although this study reveals several initial findings, more stud-
ies are required to address the open research questions raised by
this work. In particular, “does the direction of sound lead to higher
viewing congruence than mono sound?” and “does the direction of
sound guide visual attention more than mono sound?” remain
unconfirmed due to the limited number of participants. For this purpose,
we plan to conduct more comprehensive subjective experiments (in
terms of number of participants and diverse ODVs), and we plan to
further investigate these questions with statistical tests.
5. CONCLUSION
This paper studied audio-visual perception of ODVs in mute, mono,
and ambisonics modalities. First, we developed a testbed that can
play ODVs with multiple audio modalities while recording users’
VCTs at the same time, and created a new audio-visual dataset con-
taining 12 ODVs with different audio-visual complexity. Next, we
collected users’ VCTs in subjective experiments, where each ODV
had three different audio modalities. Finally, we statistically ana-
lyzed the viewing behavior of participants while consuming ODVs.
This is, to the best of our knowledge, the first user behavior analysis
for ODV viewing with mute, mono, and ambisonics modalities.
Our results show that, in most cases, visual attention disperses
widely when viewing ODVs without sound (i.e., mute), and concen-
trates on salient regions when viewing ODVs with sound (i.e., mono
and ambisonics). In particular, salient audio cues, such as human
voices and sirens, and salient visual cues, such as human faces, mov-
ing objects, and fast-moving cameras, have more impact on visual
attention of participants. Regarding audio cues, the nature of the
sound (e.g., informative content, frequency changes, performance
timing, audio ensemble) may also play a role in how it gets noticed.
We leave the aforementioned as future work to further foster the
study of audio-visual attention in ODV. We expect that this initial work,
which provides a testbed, a dataset from subjective experiments, and
an analysis of user behavior, will contribute to the community and
encourage more in-depth research in the future.
6. REFERENCES
[1] Cagri Ozcinar, Julian Cabrera, and Aljosa Smolic, “Visual
attention-aware omnidirectional video streaming using opti-
mal tiles for virtual reality,” IEEE Journal on Emerging and
Selected Topics in Circuits and Systems, vol. 9, no. 1, March
2019.
[2] Konrad Tollmar, Pietro Lungaro, Alfredo Fanghella Valero,
and Ashutosh Mittal, “Beyond foveal rendering: smart eye-
tracking enabled networking (SEEN),” in ACM SIGGRAPH
2017 Talks. 2017.
[3] Dingzeyu Li, Timothy R Langlois, and Changxi Zheng,
“Scene-aware audio for 360° videos,” ACM Transactions on
Graphics (TOG), vol. 37, no. 4, 2018.
[4] Colm O Fearghail, Cagri Ozcinar, Sebastian Knorr, and Aljosa
Smolic, “Director’s cut - Analysis of aspects of interactive
storytelling for VR films,” in International Conference for In-
teractive Digital Storytelling (ICIDS) 2018, 2018.
[5] Erik Van der Burg, Christian NL Olivers, Adelbert W
Bronkhorst, and Jan Theeuwes, “Audiovisual events capture
attention: Evidence from temporal order judgments,” Journal
of Vision, vol. 8, no. 5, 2008.
[6] Mai Xu, Chen Li, Shanyi Zhang, and Patrick Le Callet, “State-
of-the-art in 360° video/image processing: Perception, assess-
ment and compression,” IEEE Journal of Selected Topics in
Signal Processing, pp. 1–1, 2020.
[7] Erwan J. David, Jesús Gutiérrez, Antoine Coutrot,
Matthieu Perreira Da Silva, and Patrick Le Callet, “A
dataset of head and eye movements for 360° videos,” in
Proceedings of the 9th ACM Multimedia Systems Conference,
2018, MMSys ’18.
[8] Ziheng Zhang, Yanyu Xu, Jingyi Yu, and Shenghua Gao,
“Saliency detection in 360° videos,” in Proceedings of the Eu-
ropean Conference on Computer Vision (ECCV ’18), 2018.
[9] Cagri Ozcinar and Aljosa Smolic, “Visual attention in omnidi-
rectional video for virtual reality applications,” in 2018 Tenth
International Conference on Quality of Multimedia Experience
(QoMEX). IEEE, 2018.
[10] Afshin Taghavi Nasrabadi, Aliehsan Samiei, Anahita Mahzari,
Ryan P. McMahan, Ravi Prakash, Mylène C. Q. Farias, and
Marcelo M. Carvalho, “A taxonomy and dataset for 360°
videos,” in Proceedings of the 10th ACM Multimedia Systems
Conference, 2019, MMSys ’19.
[11] Aakanksha Rana, Cagri Ozcinar, and Aljosa Smolic, “Towards
generating ambisonics using audio-visual cue for virtual real-
ity,” in IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), May 2019.
[12] Pedro Morgado, Nuno Vasconcelos, Timothy Langlois, and
Oliver Wang, “Self-supervised generation of spatial audio for
360 video,” in Advances in Neural Information Processing Sys-
tems, 2018.
[13] Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang,
and In So Kweon, “Learning to localize sound sources in
visual scenes: Analysis and applications,” arXiv preprint
arXiv:1911.09649, 2019.
[14] Hamed R. Tavakoli, Ali Borji, Esa Rahtu, and Juho Kannala,
“DAVE: A deep audio-visual embedding for dynamic saliency
prediction,” CoRR, vol. abs/1905.10693, 2019.
[15] Xiongkuo Min, Guangtao Zhai, Jiantao Zhou, Xiao-Ping
Zhang, Xiaokang Yang, and Xinping Guan, “A multimodal
saliency model for videos with high audio-visual correspon-
dence,” IEEE Transactions on Image Processing, vol. 29, pp.
3805–3819, 2020.
[16] “JavaScript 3D library. https://threejs.org/,” Jan 2020.
[17] “WebXR device api specification,”
https://github.com/immersive-web/webxr, Jan 2020.
[18] “JSAmbisonics,” https://github.com/polarch/JSAmbisonics, Jan
2020.
[19] ITU-T, “Subjective video quality assessment methods for
multimedia applications,” ITU-T Recommendation P.910, Apr
2008.
[20] Francesca De Simone, Jesús Gutiérrez, and Patrick Le Callet,
“Complexity measurement and characterization of 360-degree
content,” in Electronic Imaging, Human Vision and Electronic
Imaging, 2019.
[21] Otto-Joachim Grüsser and Ursula Grüsser-Cornehls, “The
sense of sight,” in Human Physiology, Robert F. Schmidt and
Gerhard Thews, Eds. Springer Berlin Heidelberg, Berlin, Hei-
delberg, 1983.
[22] Xavier Corbillon, Francesca De Simone, and Gwendal Simon,
“360 degree video head movement dataset,” in Proceedings of
the 8th ACM on Multimedia Systems Conference, 2017, MM-
Sys’17.
[23] Ana De Abreu, Cagri Ozcinar, and Aljosa Smolic, “Look
around you: Saliency maps for omnidirectional images in
VR applications,” in 2017 Ninth International Conference on
Quality of Multimedia Experience (QoMEX), May 2017.
[24] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei
Xu, “A density-based algorithm for discovering clusters in
large spatial databases with noise,” in Proceedings of the Sec-
ond International Conference on Knowledge Discovery and
Data Mining. 1996, AAAI Press.
[25] Anh Nguyen, Zhisheng Yan, and Klara Nahrstedt, “Your atten-
tion is unique: Detecting 360-degree video saliency in head-
mounted display for head movement prediction,” in Proceed-
ings of the 26th ACM international conference on Multimedia,
2018.
[26] Anh Nguyen and Zhisheng Yan, “A saliency dataset for 360-
degree videos,” in Proceedings of the 10th ACM Multimedia
Systems Conference, New York, NY, USA, 2019, MMSys ’19,
Association for Computing Machinery.
[27] Robert J. Peters, Asha Iyer, Laurent Itti, and Christof Koch,
“Components of bottom-up gaze allocation in natural images,”
Vision Research, vol. 45, no. 18, 2005.
[28] Zoya Bylinskii, Tilke Judd, Aude Oliva, Antonio Torralba, and
Fredo Durand, “What do different evaluation metrics tell us
about saliency models?,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 41, no. 3, Mar. 2019.
[29] Olivier Le Meur, Thierry Baccino, and Aline Roumy, “Predic-
tion of the inter-observer visual congruency (IOVC) and ap-
plication to image ranking,” in Proceedings of the 19th ACM
International Conference on Multimedia, New York, NY, USA,
2011, MM ’11, Association for Computing Machinery.
[30] Alexandre Bruckert, Yat Hong Lam, Marc Christie, and Olivier
Le Meur, “Deep learning for inter-observer congruency predic-
tion,” in IEEE International Conference on Image Processing
(ICIP), 2019, pp. 3766–3770.