Spherical Harmonics for Saliency Computation and Virtual Cinematography in 360° Videos
Ruofei Du, Student Member, IEEE and Amitabh Varshney, Fellow, IEEE
(A) The input 360° video frame. (B) Saliency map by Itti et al.'s model. (C) Saliency map by our SSR model.

Fig. 1. This paper presents an efficient GPU-driven pipeline for computing saliency maps of 360° videos using spherical harmonics (SH). (A) shows an input frame from a 360° video. (B) shows the saliency map computed by the classic Itti et al.'s model in 104.46 ms on the CPU. (C) shows the saliency map computed by our spherical spectral residual (SSR) model in 21.34 ms on the CPU and 10.81 ms on the GPU. In contrast to the classic models for images in rectilinear projections, our model is formulated in the SO(2) space. Therefore, it remains consistent in challenging cases such as horizontal clipping, spherical rotations, and equatorial bias in 360° videos.
Abstract — Omnidirectional videos, or 360° videos, have exploded in popularity due to the recent advances in virtual reality head-mounted displays (HMDs) and cameras. Despite the 360° field of regard (FoR), almost 90% of the pixels are outside a typical HMD's field of view (FoV). Hence, understanding where users are more likely to look plays a vital role in efficiently streaming and rendering 360° videos. While conventional saliency models have shown robust performance over rectilinear images, they are not formulated to handle equatorial bias, horizontal clipping, and spherical rotations in 360° videos. In this paper, we present a novel GPU-driven pipeline for saliency computation and virtual cinematography in 360° videos using spherical harmonics (SH). By analyzing the spherical harmonics spectrum of the 360° video, we extract the spectral residual by accumulating the SH coefficients between a low band and a high band. Our model outperforms the classic Itti et al.'s model in timing by 5× to 13× and runs at over 60 FPS for 4K videos. Further, our interactive computation of spherical saliency can be used for saliency-guided virtual cinematography in 360° videos. We formulate a spatiotemporal model to ensure large saliency coverage while reducing the camera movement jitter. Our pipeline can be used in processing, navigating, and streaming 360° videos in real time.
Index Terms — spherical harmonics, virtual reality, visual saliency, 360° videos, omnidirectional videos, perception, Itti model, spectral residual, GPGPU, CUDA
1 INTRODUCTION
With recent advances in consumer-level virtual reality (VR) head-mounted displays (HMDs) and panoramic cameras, omnidirectional videos are becoming ubiquitous. These 360° videos are becoming a crucial medium for news reports, live concerts, remote education, and social media. One of the most significant benefits of omnidirectional videos is immersion: users have a sustained illusion of presence in such scenes. Nevertheless, despite the rich omnidirectional visual information, most of the content is outside the field of view (FoV) of the head-mounted displays, as well as of the human eyes. The binocular vision system of human eyes can only interpret a 114° FoV horizontally and a 135° FoV vertically [19]. As a result, over 75% of the pixels in a 360° video are not being perceived. Furthermore, as shown in Table 1, almost 90% of the pixels are beyond the FoV of the current generation of consumer-level VR HMDs¹.
Ruofei Du and Amitabh Varshney are affiliated with the Augmentarium at the University of Maryland Institute for Advanced Computer Studies (UMIACS) and the Department of Computer Science at the University of Maryland, College Park. Emails: {ruofei, varshney}@umiacs.umd.edu

Manuscript received xx xxx. 201x; accepted xx xxx. 201x. Date of Publication xx xxx. 201x; date of current version xx xxx. 201x. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org. Digital Object Identifier: xx.xxxx/TVCG.201x.xxxxxxx

¹ Data sources: the official websites of Oculus, HTC Vive, and Samsung, and the blog posts https://goo.gl/eBqpvm and https://goo.gl/n7Vji3

Therefore, predicting where humans will look, i.e., saliency detection, has great potential over a wide range of applications, such as:

- efficiently compressing and streaming high-resolution panoramic videos under poor network conditions [11],
- salient object detection in panoramic images and videos [40],
- information overlay in panoramic images [13] and videos [38], and for augmented reality displays,
- directing the user's viewpoint to salient objects that are outside the user's current field of view, as well as automatic navigation and synopsis of 360° videos [20, 36, 41, 42].
Saliency of regular images and videos has been studied thoroughly since Itti et al.'s work [22]. Previous research has also investigated mesh saliency [28], volume saliency [26], and light-field saliency [31]. However, unlike classic images, which are stored in rectilinear or gnomonic projections, most panoramic videos are stored in equirectangular projections. Consequently, classic saliency models may not work for 360° videos due to the following challenges, as further shown in Figure 6:

- Horizontal clipping may slice a salient object into two parts on the left and right edges, which may cause a false negative result.
- Spherical rotation may distort the non-salient objects near the north and south poles, which may cause a false positive result.
- Equatorial bias is not formulated in the classic saliency detectors.

Visual Medium            Horizontal FoV   Vertical FoV   Ratio Beyond FoV
Human Eyes               114°             135°           76.25%
HTC Vive, Oculus Rift    85°              95°            87.53%
Samsung Gear VR          75°              85°            90.16%
Google Cardboard         65°              75°            92.48%

Table 1. This table shows the comparison of the approximate binocular field of view of human eyes, as well as the current generation of consumer-level head-mounted displays.
In this paper, we address three interrelated research questions: (a) how should we formulate saliency in the SO(2) space with spherical harmonics, (b) how should we speed up the computation by discarding the low-frequency information, and (c) how should we automatically and smoothly navigate 360° videos with saliency maps? To investigate these questions, we present a novel GPU-driven pipeline for saliency computation and navigation based on spherical harmonics (SH), as shown in Figure 1.
In Section 3, we present the preprocessing for computing the SH coefficients for representing the 360° videos. Our pipeline pre-computes a set of the Legendre polynomials and SH functions and stores them in GPU memory. We adopt the highly parallel prefix sum algorithm to integrate the feature maps of the downsampled 360° frames into 15 bands of spherical harmonics coefficients on the GPU.
In Section 4, we introduce the Spherical Spectral Residual (SSR) model. Inspired by the spectral residual approach, we define SSR as the accumulation of the SH coefficients between a low band and a high band. This model reveals the multi-scale saliency maps in the spherical spectral domain and reduces the computational cost by discarding the low bands of SH coefficients. In our experimental results, it outperforms Itti et al.'s model by 5× to 13× in timing and runs in real time at over 60 frames per second for 4K videos.
In Section 5, as a proof of concept, we propose and implement a saliency-guided virtual cinematography system for navigating 360° videos. We formulate a spatiotemporal model to ensure large saliency coverage while reducing the camera movement jitter.
The main contributions of our work are:

- formulating saliency natively and directly in the special orthogonal group SO(2) space using the spherical harmonics coefficients, without converting the image to R²,
- reducing the computational cost and formulating the spherical saliency using the spectral residual model with spherical harmonics,
- devising a saliency-guided virtual cinematography system for automatic navigation in 360° videos, and
- implementing a GPU-driven real-time pipeline for computing saliency maps in 360° videos.
2 RELATED WORK
Our work builds upon a rich literature of prior art on saliency detection,
as well as spherical harmonics.
2.1 Visual Saliency
Visual saliency has been investigated in ordinary images [18, 22],
videos [17], giga-pixel images [21], 3D meshes [28], volumes [27], and
light fields [31]. Here, we mainly focus on image and video saliency.
A region is considered salient if it has perceptual differences from
the surrounding areas that are likely to draw visual attention. Prior
research has designed bottom-up [22, 33, 43, 46], top-down [15,23, 32],
and hybrid models for constructing a saliency map of images (see the
review by Zhao et al. [52]). The bottom-up models combine low-level
image features from multi-scale Gaussian pyramids or Fourier spectrum.
Top-down models usually use machine learning strategies and take
advantage of higher-level knowledge such as context or specific tasks
for saliency detection. Recently, hybrid models using convolutional
neural networks [30, 35, 44, 47, 51,53] have emerged to improve the
accuracy of saliency prediction.
One of the most pivotal algorithms for saliency detection is Itti et
al.’s model [22]. This model computes the center-surround differences
of multi-level Gaussian pyramids of the feature maps, which include
intensity, color contrast, and orientations, as conspicuity maps. It
further combines the conspicuity maps with non-linear combination
methods and a winner-take-all network. Another influential algorithm
is the spectral residual approach devised by Hou and Zhang [18]. This
model computes the visual saliency by the difference of the original
and smoothed log-Fourier spectrum of the image.
However, both approaches assume that the input data are rectilinear images and would not produce consistent results for spherical images under horizontal clipping or spherical rotation. Inspired by these two approaches, we formulate the spherical spectral residual model in the SO(2) space. By efficiently evaluating the SH coefficients between two bands, our model can be easily implemented on the GPU and achieves spherical consistency.
In addition to Itti et al.'s model and the spectral residual model, Bruce et al. [8] learn a set of sparse codes from example images to evaluate the saliency of new inputs. Wang et al. [50] use random graph walks on image pixels to compute image saliency. Goferman et al. [15] consider visual organization and high-level features such as human faces in saliency computation.

Nonetheless, all of these prior approaches only work for rectilinear images. Our paper, as far as we are aware, is the first to apply spherical harmonics for saliency analysis of 360° videos. The work presented in this paper is inspired by Itti et al.'s model [22] and the spectral residual approach [18] used for image and video saliency.
2.2 Spherical Harmonics
Fig. 2. This figure shows the first five bands of spherical harmonics functions. Blue indicates positive real values, and red indicates negative real values. Our code and visualization can be viewed online interactively at https://shadertoy.com/view/4dsyW8. This demo is built on Íñigo Quílez's prior work.
Spherical harmonics are a complete set of orthogonal functions on the sphere (as visualized in Figure 2), and thus may be used to represent functions defined on the surface of a sphere. In visual computing, spherical harmonics have been widely applied to various domains and applications: indirect lighting [3, 16], volume rendering [5], 3D sound [1, 37], and 3D object retrieval [9, 45]. As for lighting, previous work in computer graphics has applied spherical harmonics to calculate global illumination and ambient occlusion [39], refraction [14], scattering [4], as well as precomputed radiance transfer [29].

Fig. 3. The spectral residual maps between different bands of spherical harmonics. The number along the horizontal axis indicates the high band Q, while the vertical axis indicates the low band P (both ranging from 1 to 15). Note that the saliency maps within or close to the orange bounding box successfully detect the two people in the frame.
To the best of our knowledge, we are the first to apply spherical
harmonics for saliency detection in panoramic images and videos.
3 COMPUTING THE SPHERICAL HARMONICS COEFFICIENTS
Spherical harmonics coefficients are usually computed using Monte Carlo integration over the sphere [16]. 360° videos are mostly stored in equirectangular projections, where each texture coordinate (u, v) corresponds to a spherical coordinate (θ, φ) with θ = πv and φ = 2πu. Therefore, we can directly integrate over the scalar fields of the feature maps by using precomputed spherical harmonics at each texture coordinate. Hence, the computation of the spherical harmonics coefficients is reduced to a prefix sum problem on the GPU, which is efficiently solved by the Blelloch Scan algorithm. Finally, we also show that we can downsample the panoramic image to N × M pixels while maintaining a small error in the resulting spherical harmonics coefficients. We further show that, for L bands of SH coefficients, the computational complexity is O(L² log MN) on the GPU.
3.1 Evaluating SH Functions
To efficiently extract the spherical harmonics coefficients from the 360° videos, we precompute the SH functions at each spherical coordinate (θ, φ) of the input panorama of N × M pixels. Since the values in the feature maps, which are used to define the intensity and color contrast, are positive and real, we compute only the real-valued SH functions, also known as the tesseral spherical harmonics, as shown in Figure 2.
The SH functions, Y_l^m(θ, φ), are orthonormal to each other, and defined in terms of the associated Legendre polynomials P_l^m as follows:

Y_l^m(\theta, \phi) =
\begin{cases}
\sqrt{2}\, K_l^m \cos(m\phi)\, P_l^m(\cos\theta), & m > 0 \\
K_l^0\, P_l^0(\cos\theta), & m = 0 \\
\sqrt{2}\, K_l^m \sin(-m\phi)\, P_l^{-m}(\cos\theta), & m < 0
\end{cases}    (1)

where 0 ≤ l ≤ L is the band index, m is the order within the band, and −l ≤ m ≤ l. The P_l^m are the associated Legendre polynomials:
P_l^l = (-1)^l (2l-1)!!\, (1 - x^2)^{l/2}
P_l^{l-1} = x\, (2l-1)\, P_{l-1}^{l-1}
P_l^m = \frac{x\,(2l-1)}{l-m}\, P_{l-1}^m - \frac{l+m-1}{l-m}\, P_{l-2}^m    (2)
K_l^m is a scaling factor that normalizes the functions:

K_l^m = \sqrt{\frac{2l+1}{4\pi} \cdot \frac{(l-|m|)!}{(l+|m|)!}}    (3)
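For concreteness, the following C++ sketch (a minimal, single-threaded version, not the authors' released code) evaluates the real-valued (tesseral) SH basis of Equation 1, using the Legendre recurrences of Equation 2 and the normalization factor of Equation 3.

```cpp
#include <cmath>
#include <cstdio>

// Associated Legendre polynomial P_l^m(x), m >= 0, via the recurrences of Eq. 2.
double legendreP(int l, int m, double x) {
    double pmm = 1.0;                                    // P_m^m = (-1)^m (2m-1)!! (1-x^2)^{m/2}
    if (m > 0) {
        double somx2 = std::sqrt((1.0 - x) * (1.0 + x));
        double fact = 1.0;
        for (int i = 1; i <= m; ++i) { pmm *= -fact * somx2; fact += 2.0; }
    }
    if (l == m) return pmm;
    double pmmp1 = x * (2.0 * m + 1.0) * pmm;            // P_{m+1}^m = x (2m+1) P_m^m
    if (l == m + 1) return pmmp1;
    double pll = 0.0;                                    // upward recurrence in the band index
    for (int ll = m + 2; ll <= l; ++ll) {
        pll = (x * (2.0 * ll - 1.0) * pmmp1 - (ll + m - 1.0) * pmm) / (ll - m);
        pmm = pmmp1;
        pmmp1 = pll;
    }
    return pll;
}

// Normalization factor K_l^m of Equation 3.
double shK(int l, int m) {
    int am = m < 0 ? -m : m;
    double denom = 4.0 * M_PI;
    for (int i = l - am + 1; i <= l + am; ++i) denom *= i;   // (l+|m|)! / (l-|m|)!
    return std::sqrt((2.0 * l + 1.0) / denom);
}

// Real spherical harmonic Y_l^m(theta, phi) of Equation 1.
double shY(int l, int m, double theta, double phi) {
    const double sqrt2 = std::sqrt(2.0);
    if (m == 0) return shK(l, 0) * legendreP(l, 0, std::cos(theta));
    if (m > 0)  return sqrt2 * shK(l, m) * std::cos(m * phi) * legendreP(l, m, std::cos(theta));
    return sqrt2 * shK(l, -m) * std::sin(-m * phi) * legendreP(l, -m, std::cos(theta));
}

int main() {
    for (int l = 0; l < 3; ++l)                          // print the first three bands
        for (int m = -l; m <= l; ++m)
            std::printf("Y_%d^%d(pi/3, pi/4) = %f\n", l, m, shY(l, m, M_PI / 3.0, M_PI / 4.0));
    return 0;
}
```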
3.2 Evaluating SH Coefficients
To compute the SH coefficients of the 360° videos, we first extract feature maps such as the intensity and color contrast, inspired by Itti et al.'s model [22] and the SaliencyToolbox [46]. The intensity is calculated from the red, green, and blue channels of each frame (r, g, b) according to [46]:

I = (r, g, b)^T \cdot (0.2126, 0.7152, 0.0722)    (4)

We also define the red-green (RG) and blue-yellow (BY) contrast for each pixel as follows:

RG = \frac{r - g}{\max(r, g, b)}    (5)

BY = \frac{b - \min(r, g)}{\max(r, g, b)}    (6)
For each feature map, the SH coefficients consist of L² values for L bands. In the equirectangular representation of the 360° videos, we assume that each feature f_{i,j} at the coordinate (i, j), 0 ≤ i < N, 0 ≤ j < M, represents the mean value f(θ_{i+0.5}, φ_{j+0.5}) over the solid angle centered at (θ_{i+0.5}, φ_{j+0.5}), where θ_i and φ_j are defined as:

\theta_i = \frac{\pi i}{N}, \quad \phi_j = \frac{2\pi j}{M}    (7)

Fig. 4. The reconstructed images using the first 15 bands of spherical harmonics coefficients extracted from the video frame.
Therefore, for the m-th element of a specific band l, we evaluate the SH coefficients of the feature map f as:

c_l^m = \int_{(\theta,\phi)\in S} f(\theta, \phi)\, Y_l^m(\theta, \phi)\, \sin\theta\, d\theta\, d\phi
      = \frac{2\pi}{M} \sum_{i=1}^{N} \sum_{j=1}^{M} f_{i,j}\, Y_l^m(\theta_{i+0.5}, \phi_{j+0.5})\, |\cos\theta_{i+1} - \cos\theta_i|    (8)

Let

H_{i,j} = \frac{2\pi}{M}\, Y_l^m(\theta_{i+0.5}, \phi_{j+0.5})\, |\cos\theta_{i+1} - \cos\theta_i|    (9)

so that we have

c_l^m = \sum_{i=1}^{N} \sum_{j=1}^{M} f_{i,j}\, H_{i,j}    (10)

Hence, for a given dimension of the input frames, we can precompute the terms H_{i,j} and store them in a lookup table. The integration of the SH coefficients is then reduced to a conventional prefix sum problem.
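The following C++ sketch illustrates our reading of this lookup-table formulation: H_{i,j} is precomputed once per (l, m) pair and frame size (Equation 9), and each coefficient then reduces to a plain weighted sum (Equation 10), shown here as a serial stand-in for the GPU prefix sum. The shY() helper is the one sketched after Equation 3.

```cpp
#include <cmath>
#include <vector>

double shY(int l, int m, double theta, double phi);  // see the listing after Eq. 3

// Precompute H_{i,j} for a single (l, m) pair on an N x M equirectangular grid.
std::vector<double> precomputeH(int l, int m, int N, int M) {
    std::vector<double> H(N * M);
    for (int i = 0; i < N; ++i) {
        double theta = M_PI * (i + 0.5) / N;                        // theta_{i+0.5}
        double dcos  = std::fabs(std::cos(M_PI * (i + 1.0) / N) -
                                 std::cos(M_PI * i / N));           // |cos theta_{i+1} - cos theta_i|
        for (int j = 0; j < M; ++j) {
            double phi = 2.0 * M_PI * (j + 0.5) / M;                // phi_{j+0.5}
            H[i * M + j] = (2.0 * M_PI / M) * shY(l, m, theta, phi) * dcos;
        }
    }
    return H;
}

// c_l^m = sum_{i,j} f_{i,j} * H_{i,j}; on the GPU this sum is aggregated with
// the Blelloch scan, here it is a serial reduction for clarity.
double shCoefficient(const std::vector<float>& f, const std::vector<double>& H) {
    double c = 0.0;
    for (size_t k = 0; k < f.size(); ++k) c += f[k] * H[k];
    return c;
}
```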
3.3 Implementation Details
On the CPU-driven pipeline, we use OpenMP to accelerate the evaluation of SH coefficients with 12 threads. On the GPU-driven pipeline, we take advantage of the Blelloch Scan algorithm [6] with CUDA 9 to efficiently aggregate the SH coefficients with 2048 kernels on an NVIDIA GTX 1080. The Blelloch Scan algorithm computes the cumulative sum of N numbers in O(log N) parallel steps. Therefore, our algorithm runs in O(L² log MN) for the L² coefficients.
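For illustration, here is a textbook serial rendering of the work-efficient Blelloch scan, not the CUDA kernel used in the pipeline: the up-sweep builds partial sums in O(log N) parallel steps, and the down-sweep turns them into an exclusive prefix sum. The array length is assumed to be a power of two.

```cpp
#include <vector>

void blellochScan(std::vector<double>& a) {
    const size_t n = a.size();
    // Up-sweep (reduce) phase: after this, a[n-1] holds the total sum.
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t k = 2 * stride - 1; k < n; k += 2 * stride)    // parallel on the GPU
            a[k] += a[k - stride];
    // Down-sweep phase: convert the partial sums into an exclusive prefix sum.
    a[n - 1] = 0.0;
    for (size_t stride = n / 2; stride >= 1; stride /= 2)
        for (size_t k = 2 * stride - 1; k < n; k += 2 * stride) {  // parallel on the GPU
            double t = a[k - stride];
            a[k - stride] = a[k];
            a[k] += t;
        }
}
```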
Finally, we show the reconstructed image f′, obtained from bands 1 to 15 of the SH coefficients of the regular RGB color maps, in Figure 4, using the following equation:

f'(\theta, \phi) = \sum_{l=0}^{L} \sum_{m=-l}^{l} c_l^m\, Y_l^m(\theta, \phi)    (11)

Note that the low-band SH coefficients capture the background information, such as the sky and mountains, while the high-band SH coefficients capture the details, such as the parachutists.
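A C++ sketch of this reconstruction (Equation 11) is given below. It assumes the usual flat indexing l² + l + m for the coefficient of band l and order m, which is our convention for the sketch rather than one stated in the paper; the L bands l = 0 .. L−1 then occupy L² entries.

```cpp
#include <cmath>
#include <vector>

double shY(int l, int m, double theta, double phi);  // see the listing after Eq. 3

// Reconstruct an N x M equirectangular map from L bands of SH coefficients.
std::vector<float> reconstruct(const std::vector<double>& coeff, int L, int N, int M) {
    std::vector<float> img(N * M, 0.0f);
    for (int i = 0; i < N; ++i) {
        double theta = M_PI * (i + 0.5) / N;
        for (int j = 0; j < M; ++j) {
            double phi = 2.0 * M_PI * (j + 0.5) / M;
            double v = 0.0;
            for (int l = 0; l < L; ++l)
                for (int m = -l; m <= l; ++m)
                    v += coeff[l * l + l + m] * shY(l, m, theta, phi);  // Eq. 11
            img[i * M + j] = static_cast<float>(v);
        }
    }
    return img;
}
```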
4 SPHERICAL SPECTRAL RESIDUAL MODEL
With the spherical harmonics coefficients, we present a novel approach to compute saliency for spherical 360° videos using the idea of spherical spectral residuals (SSR).
4.1 Spherical Spectral Residual Approach
As shown in Figure 4, spherical harmonics bands can be used to compute the contrast directly across multiple scales in the frequency space. In the SO(2) space, we define the spherical spectral residual (SSR) as the difference between the higher bands (up to Q) of SH coefficients and the lower bands (up to P) of SH coefficients:

R(\theta, \phi) = \sum_{l=0}^{Q} \sum_{m=-l}^{l} c_l^m\, Y_l^m(\theta, \phi) - \sum_{l=0}^{P} \sum_{m=-l}^{l} c_l^m\, Y_l^m(\theta, \phi)
              = \sum_{l=P+1}^{Q} \sum_{m=-l}^{l} c_l^m\, Y_l^m(\theta, \phi)    (12)

in which the Y_l^m(θ, φ) are pre-computed from the associated Legendre polynomials in the preprocessing stage. The SSR represents the salient part of the scene in the spectral domain and serves as a compressed representation using spherical harmonics.
To reduce estimation errors and for better visual quality, we square the spectral residual and smooth the spherical saliency map using a Gaussian:

S(\theta, \phi) = G(\sigma) * [R(\theta, \phi)]^2    (13)

where G(σ) is a Gaussian filter with standard deviation σ (σ = 5 for the results presented in this paper).
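The following C++ sketch evaluates the squared spherical spectral residual of Equations 12 and 13 directly from a coefficient vector; the band limits P and Q and the flat coefficient indexing follow the earlier listings, and the Gaussian smoothing G(σ) is left to a standard image-space filter (e.g., OpenCV's cv::GaussianBlur).

```cpp
#include <cmath>
#include <vector>

double shY(int l, int m, double theta, double phi);  // see the listing after Eq. 3

// R(theta, phi)^2 accumulated over bands P+1 .. Q only; coeff must hold at
// least (Q+1)^2 entries, indexed as l*l + l + m.
std::vector<float> spectralResidual(const std::vector<double>& coeff,
                                    int P, int Q, int N, int M) {
    std::vector<float> ssr(N * M);
    for (int i = 0; i < N; ++i) {
        double theta = M_PI * (i + 0.5) / N;
        for (int j = 0; j < M; ++j) {
            double phi = 2.0 * M_PI * (j + 0.5) / M;
            double r = 0.0;
            for (int l = P + 1; l <= Q; ++l)                  // discard the low bands
                for (int m = -l; m <= l; ++m)
                    r += coeff[l * l + l + m] * shY(l, m, theta, phi);
            ssr[i * M + j] = static_cast<float>(r * r);       // squared residual (Eq. 13)
        }
    }
    return ssr;                                               // smooth with G(sigma) afterwards
}
```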
We show the SSR results of the intensity channel for all pairs of the lower band P and the higher band Q in Figure 3. As P increases, low-frequency information such as the sky and mountains is filtered out. The spectral residual results within and close to the orange bounding box reveal the salient objects, such as the two people.
4.2 Temporal Saliency
In addition to the intensity and color features, we further extract temporal saliency in the spherical harmonics domain.

For the SH coefficients extracted from the three feature maps, we maintain two sliding temporal windows to estimate the temporal contrast. The smaller window w0 stores the more recent SH coefficients from the feature maps, and the larger window w1 stores the SH coefficients over a longer term. For each frame, we calculate the estimated SH coefficients \bar{c}_l^m and \bar{\bar{c}}_l^m from the two windows, using probability density functions from the Gaussian distribution (|w0| = 5, |w1| = 25, σ = 7.0). We use a formulation similar to the spatial saliency to measure the spherical spectral residual between the two temporal windows:

R(F_{temporal}, \theta, \phi) = \sum_{l=P+1}^{Q} \sum_{m=-l}^{l} \left( \bar{\bar{c}}_l^m - \bar{c}_l^m \right) Y_l^m(\theta, \phi)    (14)

We further apply Equation 13 to compute the smoothed temporal saliency maps.
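The paper does not spell out exactly how the Gaussian windows weight past frames, so the sketch below is one plausible reading: each window forms a Gaussian-weighted average of the per-frame SH coefficient vectors, and the per-coefficient difference is then expanded over bands P+1 to Q as in Equation 14. Since the residual is squared in Equation 13, the sign of the difference does not matter.

```cpp
#include <algorithm>
#include <cmath>
#include <deque>
#include <vector>

// history.front() is the most recent frame's coefficient vector; the deque is
// assumed to be non-empty and every entry has the same length.
std::vector<double> windowedCoefficients(const std::deque<std::vector<double>>& history,
                                         size_t windowSize, double sigma) {
    size_t n = std::min(windowSize, history.size());
    std::vector<double> mean(history.front().size(), 0.0);
    double wsum = 0.0;
    for (size_t t = 0; t < n; ++t) {
        double w = std::exp(-0.5 * (t * t) / (sigma * sigma));  // Gaussian weight on frame age
        for (size_t k = 0; k < mean.size(); ++k) mean[k] += w * history[t][k];
        wsum += w;
    }
    for (double& v : mean) v /= wsum;
    return mean;
}

// Per-coefficient difference between the long and short windows, fed into
// Equation 14 (|w0| = 5, |w1| = 25, sigma = 7.0).
std::vector<double> temporalDelta(const std::deque<std::vector<double>>& history) {
    std::vector<double> shortTerm = windowedCoefficients(history, 5, 7.0);
    std::vector<double> longTerm  = windowedCoefficients(history, 25, 7.0);
    for (size_t k = 0; k < longTerm.size(); ++k) longTerm[k] -= shortTerm[k];
    return longTerm;  // expand with shY() over bands P+1..Q to obtain R(F_temporal)
}
```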
4.3 Saliency Maps with Nonlinear Normalization
Following Itti et al. [22], we apply the non-linear normalization operator N(·) to all six saliency maps: the intensity, red-green, and blue-yellow contrasts, both static and temporal. This operator globally promotes maps that contain a few peak responses and suppresses maps with a large number of peaks:
S = \frac{1}{M} \sum_{i=1}^{M} \mathcal{N}(S(F_i))    (15)
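As a reference for N(·), here is a simplified C++ variant of the normalization described in [22], not the exact operator: the map is scaled to [0, 1] and multiplied by (M − m̄)², where M is the global maximum and m̄ is the mean of the other local maxima; local maxima are detected with a simple 4-neighborhood test here.

```cpp
#include <algorithm>
#include <vector>

// In-place simplified N(.) on an N x M saliency map stored row-major.
void nonlinearNormalize(std::vector<float>& s, int N, int M) {
    float maxv = *std::max_element(s.begin(), s.end());
    if (maxv <= 0.0f) return;
    for (float& v : s) v /= maxv;                       // scale to [0, 1]
    double sum = 0.0;
    int count = 0;
    for (int i = 1; i + 1 < N; ++i)
        for (int j = 1; j + 1 < M; ++j) {
            float v = s[i * M + j];
            if (v > s[(i - 1) * M + j] && v > s[(i + 1) * M + j] &&
                v > s[i * M + j - 1] && v > s[i * M + j + 1] && v < 1.0f) {
                sum += v;                               // local maxima other than the global peak
                ++count;
            }
        }
    float mbar = count > 0 ? static_cast<float>(sum / count) : 0.0f;
    float gain = (1.0f - mbar) * (1.0f - mbar);         // (M - m_bar)^2 with M = 1
    for (float& v : s) v *= gain;                       // promote maps with few strong peaks
}
```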
After the non-linear normalization, we linearly combine all the saliency maps into the final saliency map. Empirically, we choose Q = 15 and P = 7. The final composed result is shown at the bottom left corner of Figure 3, as well as in the accompanying video.

Fig. 5. The visual comparison between Itti et al.'s model and our SSR model on six input videos: Parachute (1920 × 1080), Grassland (1920 × 1080), Office (4096 × 2048), Winter Outdoor (4096 × 2048), Spring Outdoor (7680 × 3840), and Night (7680 × 3840). Note that while the results are visually similar, our SSR model is 5× to 13× faster than Itti et al.'s model.

Fig. 6. This figure shows the comparison between Itti et al.'s model and our SSR model with horizontal translation and spherical rotation in the 360° video frame. White circles indicate the false negative results from the Saliency Toolbox, and orange ones indicate the false positive results. Meanwhile, the results from our SSR model remain consistent, regardless of horizontal clipping and spherical rotation. Note that we use custom shaders to transform the spherical images, compute the saliency, and apply the inverse transformation for intuitive visualization.
4.4 Comparison Between Itti et al.'s Model and Our SSR Model
As shown in Figure 6, our SSR model produces visually better results than Itti et al.'s model. In addition, our experimental results below compare the classic Itti et al.'s model and our model.

We use six videos from Insta360² and 360Rize³. The video resolutions vary from 1920 × 1080 to 7680 × 3840 pixels.
Resolution     Itti et al. (CPU)   SSR (CPU)   SSR (GPU)
1920 × 1080    104.46 ms           21.34 ms    10.81 ms
4096 × 2048    314.94 ms           48.18 ms    13.20 ms
7680 × 3840    934.26 ms           69.53 ms    26.58 ms

Table 2. Average timing per frame for Itti et al.'s model and our spherical spectral residual (SSR) model.
The experiments are conducted on a workstation with an NVIDIA GTX 1080 and an Intel Xeon E5-2667 2.90 GHz CPU with 32 GB RAM. Both Itti et al.'s model and the SSR model are implemented in C++ and OpenCV. The GPU version of the SSR model is developed using CUDA 8.0. We measure the average timing of saliency computation, as well as the visual results, for both Itti et al.'s model and our SSR model. Note that the timings do not include the uploading time for each frame from system memory to GPU memory. We believe that our algorithms would map well to products such as NVIDIA DrivePX⁴, in which videos are directly loaded onto the GPU memory.

We measure the average computational cost of the initial 600 frames across three resolutions: 1920 × 1080, 4096 × 2048, and 7680 × 3840, as shown in Table 2. All frames are preloaded into the CPU memory to eliminate the I/O overhead. Both the CPU and GPU versions of our SSR model outperform the classic Itti et al.'s model, with speedups ranging from 4.8× to 13.4×, depending on the resolution. We show the example input and the output from both models in Figure 5.

² Insta360: https://www.insta360.com
³ 360Rize: http://www.360rize.com
⁴ NVIDIA DrivePX: https://NVIDIA.com/en-us/self-driving-cars/drive-px
5 SALIENCY-GUIDED VIRTUAL CINEMATOGRAPHY
With advances in 360° video cameras and network bandwidth, more events are being live-streamed as high-resolution 360° videos. Nevertheless, while the user is watching a 360° video in a typical commodity HMD, almost 90 percent of the video is beyond the user's field of view, as shown in Table 1. Therefore, automatically controlling the path of the virtual camera (virtual cinematography) becomes a vital challenge for streaming and navigating 360° videos in real time. Inspired by the prior work on camera path selection and interpolation [2, 7, 10, 12, 24, 25, 34, 36, 42], we investigate how saliency maps can guide automatic camera control for 360° videos.

First, we compute the overall saliency maps by linearly combining the saliency maps based on intensity, color, and motion, and then performing the non-linear normalization, as introduced in the previous section.

However, for 360° videos, the most salient objects may vary from frame to frame, due to varying occlusions, colors, and self-movement. As a result, an approach that relies on just tracking the most salient object may incur rapid camera motion and, worse still, may induce motion sickness in virtual reality.

In this section, we propose a spatiotemporal optimization model for the virtual camera's discrete control points and further employ a spline interpolation amongst the control points to achieve smooth camera navigation.
5.1 Optimization of the Virtual Camera’s Control Points
To estimate the virtual camera's control points, we formulate an energy function E(C) in terms of the camera location C = (θ, φ). The energy function

E(C) = \lambda_{saliency} \cdot E_{saliency}(C) + \lambda_{temporal} \cdot E_{temporal}(C)    (16)

consists of a saliency coverage term E_saliency and a temporal motion term E_temporal, thus taking both saliency coverage and temporal smoothness into consideration.

Fig. 7. This figure shows the interpolation amongst the global maxima of the saliency maps in the spherical space. The yellow dots show the discrete optimal locations using the energy function, and the blue dots show the interpolation using the spherical spline curve with C² continuity.
5.1.1 Saliency Coverage Term
This spatial term E_saliency penalizes the saliency that falls beyond the field of view. For a specific virtual camera location C, this term is written as:

E_{saliency}(C) = \frac{\sum_{\theta,\phi} S(\theta, \phi) \cdot \left(1 - O(C, \theta, \phi)\right)}{\sum_{\theta,\phi} S(\theta, \phi)}    (17)

where O(C, θ, φ) indicates whether an arbitrary spherical point (θ, φ) is observed by the camera centered at the location C:

O(C, \theta, \phi) =
\begin{cases}
1, & (\theta, \phi) \text{ is observed by the virtual camera at } C \\
0, & \text{otherwise}
\end{cases}    (18)

Thus, E_saliency(C) measures the fraction of the saliency that falls beyond the field of view of the virtual camera centered at C. To reduce the computation, we evaluate the saliency coverage term over 2048 points (θ, φ) that are uniformly distributed over the sphere.
5.1.2 Temporal Motion Term
For the i-th frame in the sequence of discrete control points, E_temporal measures the temporal motion of the virtual camera as follows:

E_{temporal}(C_i) =
\begin{cases}
\lVert C_{i-1} - C_i \rVert_2, & i \geq 1 \\
0, & i = 0
\end{cases}    (19)
5.1.3 The Optimization Process
Based on this spatiotemporal model, we evaluate the energy function over 32 × 64 pairs of discrete (θ, φ). This process is highly parallel and can be efficiently implemented on the GPU. For each frame, we compute the optimal camera point as follows:

\mathring{C} = \arg\min_{C} E(C)    (20)

In this way, we extract a sub-sequence of discrete spherical coordinates Seq = \{\mathring{C}_i \mid \mathring{C}_i = (\phi_i, \theta_i)\} of the optimal camera locations in the saliency maps every K frames (K = 5 in our examples). Since these locations are discrete and sampled at a lower frame rate, we further perform spline interpolation with C² continuity.
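A brute-force C++ sketch of this per-frame optimization is shown below. It is illustrative only: the fieldOfViewContains() predicate is a hypothetical helper that approximates the HMD's rectangular viewport by a circular angular radius, and the temporal term uses a naive squared difference of the spherical coordinates; both are assumptions layered on Equations 16 to 20 rather than details given in the paper.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

struct Direction { double theta, phi; };

// Assumption: a point is "observed" (Eq. 18) if its angular distance to the
// camera center is below half of an effective (circular) field of view.
bool fieldOfViewContains(const Direction& c, const Direction& p, double halfFovRad) {
    double cosAngle = std::sin(c.theta) * std::sin(p.theta) * std::cos(c.phi - p.phi) +
                      std::cos(c.theta) * std::cos(p.theta);
    return std::acos(std::min(1.0, std::max(-1.0, cosAngle))) < halfFovRad;
}

// samples and saliency have the same length (~2048 uniformly distributed sphere points).
Direction bestCameraPoint(const std::vector<Direction>& samples,
                          const std::vector<double>& saliency,
                          const Direction& previous,
                          double lambdaSaliency, double lambdaTemporal,
                          double halfFovRad) {
    double totalS = 0.0;
    for (double s : saliency) totalS += s;
    Direction best{0.0, 0.0};
    double bestE = std::numeric_limits<double>::max();
    for (int i = 0; i < 32; ++i)
        for (int j = 0; j < 64; ++j) {                                // 32 x 64 candidate grid
            Direction c{M_PI * (i + 0.5) / 32.0, 2.0 * M_PI * (j + 0.5) / 64.0};
            double uncovered = 0.0;
            for (size_t k = 0; k < samples.size(); ++k)
                if (!fieldOfViewContains(c, samples[k], halfFovRad))
                    uncovered += saliency[k];                         // saliency beyond the FoV (Eq. 17)
            double eSal = totalS > 0.0 ? uncovered / totalS : 0.0;
            double dTheta = c.theta - previous.theta;
            double dPhi = c.phi - previous.phi;
            double eTmp = dTheta * dTheta + dPhi * dPhi;              // temporal motion (Eq. 19)
            double e = lambdaSaliency * eSal + lambdaTemporal * eTmp; // Eq. 16
            if (e < bestE) { bestE = e; best = c; }                   // Eq. 20
        }
    return best;
}
```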
5.2 Interpolation of Quaternions
To achieve superior interpolation over a sphere, we convert the spherical coordinates to quaternions:

Q(\theta, \phi) = \left(0, \sin\theta \cos\phi, \sin\theta \sin\phi, \cos\theta\right)    (21)

We use spherical spline curves with C² continuity to compute a smooth trajectory of the camera cruise path over the quaternions. For an arbitrary timestamp x, we need to compute the interpolated spherical coordinates S_i(x). We denote t_i as the most recent timestamp to x, which corresponds to the i-th video frame, and Q(θ_i, φ_i) as the corresponding quaternion. Hence, we compute the interpolated quaternion Q_i(x) as follows:

Q_i(x) = \nabla^2 Q_i \frac{(x - t_{i-1})^3}{6 h_i} + \nabla^2 Q_{i-1} \frac{(t_i - x)^3}{6 h_i} + \left(\frac{Q_i}{h_i} - \frac{\nabla^2 Q_i\, h_i}{6}\right)(x - t_{i-1}) + \left(\frac{Q_{i-1}}{h_i} - \frac{\nabla^2 Q_{i-1}\, h_i}{6}\right)(t_i - x)    (22)

where \nabla^2 Q_i is the second derivative of the spline at the i-th frame's timestamp t_i, and h_i = t_i - t_{i-1}. Figure 7 shows the locations of the global maxima, as well as the interpolated spline path over the sphere.
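The sketch below maps a spherical coordinate to the quaternion of Equation 21 and evaluates one cubic-spline segment per Equation 22, assuming the second derivatives at the control points (here d2Prev and d2Cur, hypothetical names) have already been solved from the usual tridiagonal system that enforces C² continuity.

```cpp
#include <array>
#include <cmath>

using Quat = std::array<double, 4>;

// Eq. 21: spherical coordinate to a pure quaternion.
Quat toQuaternion(double theta, double phi) {
    return {0.0,
            std::sin(theta) * std::cos(phi),
            std::sin(theta) * std::sin(phi),
            std::cos(theta)};
}

// Evaluate the spline on [t_{i-1}, t_i] at time x (Eq. 22), component-wise.
Quat splineSegment(const Quat& qPrev, const Quat& qCur,
                   const Quat& d2Prev, const Quat& d2Cur,
                   double tPrev, double tCur, double x) {
    double h = tCur - tPrev;            // h_i = t_i - t_{i-1}
    double a = x - tPrev;               // (x - t_{i-1})
    double b = tCur - x;                // (t_i - x)
    Quat out{};
    for (int k = 0; k < 4; ++k)
        out[k] = d2Cur[k] * a * a * a / (6.0 * h) +
                 d2Prev[k] * b * b * b / (6.0 * h) +
                 (qCur[k] / h - d2Cur[k] * h / 6.0) * a +
                 (qPrev[k] / h - d2Prev[k] * h / 6.0) * b;
    return out;                         // renormalize the vector part before use if needed
}
```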
5.3 Evaluation of the SpatioTemporal Optimization Model
We compare our method with a MaxCoverage model, which simply places the camera at the position with the maximal coverage of the saliency map. We evaluate the temporal motion terms for the same video sequence and plot the data in Figure 8.
Fig. 8. Quantitative comparison between the MaxCoverage model and the SpatioTemporal Optimization model. We plot the temporal motion of the virtual camera (in diagonal degrees) against the frame number across 360 frames. Compared with the MaxCoverage model, the SpatioTemporal Optimization model significantly reduces the temporal jitter.
The quantitative evaluation, as well as the accompanying video, validates that the SpatioTemporal Optimization model reduces the temporal jitter of the camera motion compared to the MaxCoverage model for virtual cinematography in 360° videos.
6 CONCLUSION AND FUTURE WORK
In this paper, we have presented a novel GPU-driven pipeline which employs spherical harmonics to directly compute saliency maps for 360° videos in the SO(2) space. In contrast to the traditional methods, our method remains consistent for challenging cases like horizontal clipping, spherical rotations, and equatorial bias, and is 5× to 13× faster than the classic Itti et al.'s model.

We demonstrate the application of using spherical harmonics saliency to automatically control the path of the virtual camera. We present a novel spatiotemporal optimization model to maximize the spatial saliency coverage and minimize the temporal jitter of the camera motion.

In the future, we plan to further develop our SSR model for stereoscopic saliency detection [49] in 360° videos. We aim to collect a large-scale dataset with stereo 360° videos and human eye-tracking data. Another future direction is to generate hyper-lapse rectilinear videos [48] from 360° videos, using a variant of our virtual cinematography model.

We would like to open source our toolbox for computing spherical harmonics from 360° videos, saliency maps from the SSR model, and virtual cinematography. We believe the spherical representation of saliency maps will inspire more research to think beyond the rectilinear space. We envision our techniques being widely used for live streaming of events, video surveillance of public areas, as well as templates for directing the camera path in immersive storytelling. Future research may explore how to naturally place 3D objects with spherical harmonics irradiance in 360° videos, how to employ spherical harmonics for foveated rendering in 360° videos, and the potential of compressing and streaming 360° videos with spherical harmonics.
REFERENCES
[1]
T. D. Abhayapala and D. B. Ward. Theory and design of high order
sound field microphones using spherical microphone array. In 2002 IEEE
International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), vol. 2, pp. II–1949. IEEE, 2002. doi: 10.1109/ICASSP.2002.
5745011
[2]
P. Alfeld, M. Neamtu, and L. L. Schumaker. Fitting scattered data on
sphere-like surfaces using spherical splines. Journal of Computational
and Applied Mathematics, 73(1-2):5–43, 1996.
[3]
R. Basri and D. W. Jacobs. Lambertian reflectance and linear sub-
spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence,
25(2):218–233, 2003. doi: 10. 1109/TPAMI.2003.1177153
[4]
M. Billeter, E. Sintorn, and U. Assarsson. Real-time multiple scattering
using light propagation volumes. In Proceedings of the 2012 Symposium
on Interactive 3D Graphics (I3D), pp. 119–126. ACM, 2012. doi: 10.
1145/2159616.2159636
[5]
S. Bista, J. Zhuo, R. P. Gullapalli, and A. Varshney. Visualization of brain
microstructure through spherical harmonics illumination of high fidelity
spatio-angular fields. IEEE Transactions on Visualization and Computer
Graphics, 20(12):2516–2525, 2014. doi: 10.1109/TVCG.2014. 2346411
[6]
G. E. Blelloch. Scans as primitive parallel operations. IEEE Transactions
on Computers, 38(11):1526–1538, 1989. doi: 10.1109/12.42122
[7]
J. Bloomenthal. Calculation of reference frames along a space curve.
Graphics gems, 1:567–571, 1990.
[8] N. D. Bruce and J. K. Tsotsos. Saliency, attention, and visual search: An
information theoretic approach. Journal of Vision, 9(3):5–5, 2009.
[9]
D.-Y. Chen, X.-P. Tian, Y.-T. Shen, and M. Ouhyoung. On visual similarity
based 3d model retrieval. Computer Graphics Forum, 22(3):223–232,
2003. doi: 10. 1111/1467-8659.00669
[10]
M. Christie, P. Olivier, and J.-M. Normand. Camera control in computer
graphics. Computer Graphics Forum, 27(8):2197–2218, 2008. doi: 10.
1145/1665817.1665820
[11]
X. Corbillon, G. Simon, A. Devlic, and J. Chakareski. Viewport-adaptive
navigable 360-degree video delivery. In 2017 IEEE International Confer-
ence on Communications (ICC), pp. 1–7. IEEE, 2017.
[12]
B. M. Dennis and C. G. Healey. Assisted navigation for large information
spaces. In Proceedings of the Conference on Visualization ’02, pp. 419–
426. IEEE Computer Society, 2002. doi: 10.1109/VISUAL.2002.1183803
[13]
R. Du and A. Varshney. Social street view: blending immersive street
views with geo-tagged social media. In Proceedings of the 21st Inter-
national Conference on Web3D Technology, pp. 77–85, 2016. doi: 10.
1145/2945292.2945294
[14]
O. Génevaux, F. Larue, and J.-M. Dischler. Interactive refraction on complex static geometry using spherical harmonics. In Proceedings of the 2006 Symposium on Interactive 3D Graphics and Games (I3D), pp. 145–152. ACM, 2006. doi: 10.1145/1111411.1111438
[15]
S. Goferman, L. Zelnik-Manor, and A. Tal. Context-aware saliency detec-
tion. IEEE Transactions on Pattern Analysis and Machine Intelligence,
34(10):1915–1926, 2012. doi: 10. 1109/TPAMI.2011.272
[16]
R. Green. Spherical harmonic lighting: The gritty details. In Archives of
the Game Developers Conference, vol. 56, p. 4. GDC, 2003.
[17]
C. Guo and L. Zhang. A novel multiresolution spatiotemporal saliency
detection model and its applications in image and video compression.
IEEE Transactions on Image Processing, 19(1):185–198, 2010. doi: 10.
1109/TIP.2009. 2030969
[18]
X. Hou and L. Zhang. Saliency detection: A spectral residual approach.
In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
IEEE, 2007. doi: 10. 1109/CVPR.2007. 383267
[19]
I. P. Howard and B. J. Rogers. Binocular vision and stereopsis. Oxford
University Press, USA, 1995. doi: 10.1093/acprof:oso/9780195084764.
001.0001
[20]
H.-N. Hu, Y.-C. Lin, M.-Y. Liu, H.-T. Cheng, Y.-J. Chang, and M. Sun.
Deep 360 pilot: Learning a deep agent for piloting through 360 sports
video. In CVPR, p. 3, 2017.
[21]
C. Y. Ip and A. Varshney. Saliency-assisted navigation of very large
landscape images. IEEE Transactions on Visualization and Computer
Graphics, 17(12):1737–1746, 2011. doi: 10.1109/TVCG.2011. 231
[22]
L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention
for rapid scene analysis. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 20(11):1254–1259, 1998. doi: 10.1109/34.730558
[23] Y. Jia and M. Han. Category-independent object-level saliency detection.
In Proceedings of the IEEE International Conference on Computer Vision,
pp. 1761–1768. IEEE, 2013. doi: 10. 1109/ICCV.2013. 221
[24]
N. Joubert, M. Roberts, A. Truong, F. Berthouzoz, and P. Hanrahan. An
interactive tool for designing quadrotor camera shots. ACM Transactions
on Graphics (TOG), 34(6):1–11, 2015. doi: 10.1145/2816795.2818106
[25]
A. Khan, B. Komalo, J. Stam, G. Fitzmaurice, and G. Kurtenbach. HoverCam: interactive 3D navigation for proximal object inspection. In Proceedings of the 2005 Symposium on Interactive 3D Graphics and Games (I3D), pp. 73–80. ACM, 2005. doi: 10.1145/1053427.1053439
[26]
Y. Kim and A. Varshney. Saliency-guided enhancement for volume visu-
alization. IEEE Transactions on Visualization and Computer Graphics,
12(5):925–932, 2006. doi: 10. 1109/TVCG.2006. 174
[27]
Y. Kim, A. Varshney, D. W. Jacobs, and F. Guimbretière. Mesh saliency and human eye fixations. ACM Transactions on Applied Perception (TAP), 7(2):12, 2010. doi: 10.1145/1670671.1670676
[28]
C. H. Lee, A. Varshney, and D. W. Jacobs. Mesh saliency. ACM Transac-
tions on Graphics (TOG), 24(3):659–666, 2005. doi: 10.1145/1073204.
1073244
[29]
J. Lehtinen and J. Kautz. Matrix radiance transfer. In Proceedings of
the 2003 Symposium on Interactive 3D Graphics (I3D), pp. 59–64. ACM,
2003. doi: 10. 1145/641480.641495
[30]
G. Li and Y. Yu. Visual saliency based on multiscale deep features.
In Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition, pp. 5455–5463. IEEE, 2015. doi: 10.1109/CVPR.2015.
7299184
[31]
N. Li, J. Ye, Y. Ji, H. Ling, and J. Yu. Saliency detection on light field.
In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 2806–2813. IEEE, 2014. doi: 10.1109/CVPR.2014. 359
[32]
R. Liu, J. Cao, Z. Lin, and S. Shan. Adaptive partial differential equation
learning for visual saliency detection. In Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 3866–3873. IEEE,
2014. doi: 10. 1109/CVPR.2014. 494
[33]
A. Oliva, A. Torralba, M. S. Castelhano, and J. M. Henderson. Top-down
control of visual attention in object detection. In Proceedings on 2003
International Conference on Image Processing, vol. 1, pp. 1–4. IEEE,
2003. doi: 10. 1016/j.conb. 2010.02.003
[34]
T. Oskam, A. Hornung, H. Bowles, K. Mitchell, and M. H. Gross.
OSCAM-Optimized Stereoscopic Camera Control for Interactive 3D.
ACM Transaction on Graphics (ToG), 30(6):189–1, 2011.
[35]
J. Pan, E. Sayrol, X. Giro-i Nieto, K. McGuinness, and N. E. O’Connor.
Shallow and deep convolutional networks for saliency prediction. In
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pp. 598–606. IEEE, 2016. doi: 10.1109/CVPR.2016. 71
[36]
A. Pavel, B. Hartmann, and M. Agrawala. Shot orientation controls for
interactive cinematography with 360 video. In Proceedings of the 30th
Annual ACM Symposium on User Interface Software and Technology, pp.
289–297. ACM, 2017.
[37]
M. A. Poletti. Three-dimensional surround sound systems based on spher-
ical harmonics. Journal of the Audio Engineering Society, 53(11):1004–
1025, 2005. doi: 10. 1.1. 460.6651
[38]
T. Rhee, L. Petikam, B. Allen, and A. Chalmers. MR360: Mixed reality
rendering for 360 panoramic videos. IEEE transactions on visualization
and computer graphics, 23(4):1379–1388, 2017.
[39]
P. Shanmugam and O. Arikan. Hardware accelerated ambient occlusion
techniques on GPUs. In Proceedings of the 2007 Symposium on Interactive
3D Graphics and Games, pp. 73–80. ACM, 2007. doi: 10.1145/1230100.
1230113
[40]
V. Sitzmann, A. Serrano, A. Pavel, M. Agrawala, D. Gutierrez, B. Masia,
and G. Wetzstein. Saliency in VR: How do people explore virtual envi-
ronments? IEEE Transactions on Visualization and Computer Graphics,
24(4):1633–1642, 2018.
[41]
Y.-C. Su and K. Grauman. Making 360 Video Watchable in 2D: Learning
Videography for Click Free Viewing. arXiv preprint, 2017.
[42]
Y.-C. Su, D. Jayaraman, and K. Grauman. Pano2vid: Automatic cine-
matography for watching 360 videos. In Asian Conference on Computer
Vision, pp. 154–171. Springer, 2016.
[43]
J. K. Tsotsos, S. M. Culhane, W. Y. K. Wai, Y. Lai, N. Davis, and F. Nuflo.
Modeling visual attention via selective tuning. Artificial intelligence,
78(1):507–545, 1995. doi: 10. 1016/0004-3702(95)00025-9
[44]
E. Vig, M. Dorr, and D. Cox. Large-scale optimization of hierarchical
features for saliency prediction in natural images. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, pp.
2798–2805. IEEE, 2014. doi: 10. 1109/CVPR.2014. 358
[45]
D. V. Vranic, D. Saupe, and J. Richter. Tools for 3d-Object Retrieval:
Karhunen-Loeve Transform and Spherical Harmonics. In IEEE Fourth
Workshop on Multimedia Signal Processing, pp. 293–298. IEEE, 2001.
doi: 10.1109/MMSP. 2001.962749
[46]
D. Walther and C. Koch. Modeling attention to salient proto-objects.
Neural networks, 19(9):1395–1407, 2006. doi: 10.1016/j.neunet. 2006.10.
001
[47]
L. Wang, H. Lu, X. Ruan, and M.-H. Yang. Deep networks for saliency
detection via local estimation and global search. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, pp. 3183–
3192. IEEE, 2015. doi: 10. 1109/CVPR.2015. 7298938
[48]
M. Wang, J.-B. Liang, S.-H. Zhang, S.-P. Lu, A. Shamir, and S.-M. Hu.
Hyper-lapse from multiple spatially-overlapping videos. IEEE Transac-
tions on Image Processing, 27(4):1735–1747, 2018.
[49]
W. Wang, J. Shen, Y. Yu, and K.-L. Ma. Stereoscopic thumbnail creation
via efficient stereo saliency detection. IEEE transactions on visualization
and computer graphics, 23(8):2014–2027, 2017.
[50]
W. Wang, Y. Wang, Q. Huang, and W. Gao. Measuring visual saliency
by site entropy rate. In 2010 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), pp. 2368–2375. IEEE, 2010.
[51]
Z. Zhang, Y. Xu, J. Yu, and S. Gao. Saliency detection in 360 videos. In
Proceedings of the European Conference on Computer Vision (ECCV), pp.
488–503, 2018.
[52]
Q. Zhao and C. Koch. Learning saliency-based visual attention: A review.
Signal Processing, 93(6):1401–1407, 2013. doi: 10.1016/j. sigpro.2012.
06.014
[53]
R. Zhao, W. Ouyang, H. Li, and X. Wang. Saliency detection by multi-
context deep learning. In Proceedings of the IEEE Conference on Com-
puter Vision and Pattern Recognition, pp. 1265–1274. IEEE, 2015. doi:
10.1109/CVPR. 2015.7298731