Visual Comfort Aware-Reinforcement Learning
for Depth Adjustment of Stereoscopic 3D Images
Hak Gu Kim1,2*, Minho Park1, Sangmin Lee1, Seongyeop Kim1, Yong Man Ro1†
1Image and Video Systems Lab., KAIST, Korea
2School of Computer and Communication Sciences, EPFL, Switzerland
hakgu.kim@epfl.ch, {roger618, sangmin.lee, seongyeop, ymro}@kaist.ac.kr
Abstract
Depth adjustment aims to enhance the visual experience of stereoscopic 3D (S3D) images, improving both visual comfort and depth perception. For a human expert, the depth adjustment procedure is a sequence of iterative decision making: the expert iteratively adjusts the depth until both the level of visual comfort and the perceived depth are satisfactory. In this work, we present a
novel deep reinforcement learning (DRL)-based approach for
depth adjustment named VCA-RL (Visual Comfort Aware Re-
inforcement Learning) to explicitly model human sequential
decision making in depth editing operations. We formulate
the depth adjustment process as a Markov decision process
where actions are defined as camera movement operations to
control the distance between the left and right cameras. Our
agent is trained based on the guidance of an objective visual
comfort assessment metric to learn the optimal sequence of
camera movement actions in terms of perceptual aspects in
stereoscopic viewing. With extensive experiments and user
studies, we show the effectiveness of our VCA-RL model on
three different S3D databases.
Introduction
With concerns about viewing safety on stereoscopic 3D (S3D) displays, depth adjustment has increasingly gained importance for improving the visual experience of stereoscopic images, including visual comfort and depth perception (Meesters, IJsselsteijn, and Seuntiëns 2004; Lambooij et al. 2009; Tam et al. 2011). For a proper viewing experience
of S3D contents, highly skilled professionals (e.g., stereog-
raphers) carefully control the camera parameters such as a
camera baseline using professional depth editing tools. It re-
quires not only expertise in stereoscopy, but also a lot of
time and effort (Tam et al. 2011). Therefore, it is essential to
develop an automatic depth adjustment method.
Previous studies have proposed various depth adjustment
methods to improve visual comfort by shifting the zero dis-
parity plane (ZDP) or scaling the disparity range of a scene.
However, a common shortcoming of these existing works is
that they edited the given disparities in a direct way without
*Work done as a part of the research project in KAIST
†Corresponding author (ymro@kaist.ac.kr)
Copyright © 2021, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.
considering the perceptual effects of the changed depth and
visual comfort levels. In addition, they mainly focused on
the visual comfort improvement rather than the depth per-
ception. There is a trade-off between the visual comfort im-
provement and the perceived depth degradation in the depth
adjustment process. That is why human experts carefully
and iteratively manipulate depths, not in a direct way.
For a human expert, the depth adjustment procedure is a
sequence of iterative decision making for S3D contents. A
human expert iteratively conducts depth editing operations
until the levels of visual comfort and the perceived depth fit
what they have in mind. Adjusting depths to the optimum is a complex operation that must consider perceptual aspects as well as spatial distortions. The perceptual effect should be
taken into account to prevent undesirable perceptual side-
effects such as excessive visual discomfort or unnoticeable
depth in stereoscopic viewing.
Inspired by a human expert's sequential decision making, which benefits depth editing, we propose a novel depth ad-
justment framework by combining the knowledge of hu-
man binocular perception and deep reinforcement learn-
ing named VCA-RL (Visual Comfort Aware Reinforcement
Learning). Despite recent advances in deep learning-based
S3D applications (e.g., visual comfort assessment (Jeong,
Kim, and Ro 2017; Kim et al. 2018) and stereo match-
ing (Yang et al. 2019; Tulyakov, Ivanov, and Fleuret 2018;
Chang and Chen 2018; Poggi et al. 2019)), it is hard to ex-
tend these approaches to the depth adjustment task due to
the complex non-linear operation, subjective nature of hu-
man visual system, and the lack of expensive pairs of input
and processed S3D contents. In this paper, we firstly for-
mulate the depth adjustment process as a Markov decision
process to control the distance between the left and right
cameras (i.e., stereo baseline) in a sequential way. By iter-
atively adjusting the stereo baseline via camera movement
actions, the range of disparity could be carefully controlled
to achieve satisfying visual experiences (see Fig. 1). In par-
ticular, to find a proper 3D visual satisfaction comprising
visual comfort and perceived depth, we design a novel vi-
sual comfort aware reward function based on the guidance
of objective visual comfort assessment metric. Based on the
visual comfort aware reward, our agent can learn the opti-
mal sequence of camera movement actions preserving both
visual comfort and perceived depth in stereoscopic viewing.
[Figure 1 shows two before/after examples with per-step camera movement actions applied to the left camera (the right camera is the reference). Left example: VC score 1.89 (uncomfortable) → 3.09 (comfortable), DP score 4.77 (good) → 3.93 (good). Right example: VC score 3.95 (comfortable) → 3.31 (comfortable), DP score 2.97 (moderate) → 3.94 (good).]
Figure 1: The intuition of the proposed VCA-RL model. We train our agent guided by a visual comfort assessment metric. Similar to human professionals, the agent sequentially determines the camera movement action. By applying the action, we can obtain a visually comfortable stereoscopic image with sufficient depth. Note that the VC score is a visual comfort score and the DP score is a depth perception score.
The contributions of this work are summarized as follows.
• Inspired by a human expert's iterative decision making in depth editing, we are the first to design a depth adjustment agent using reinforcement learning that learns the iterative depth adjustment process. By sequentially adjusting the stereo baseline in the world coordinates, our model can find an optimal trade-off between visual comfort improvement and perceived depth degradation.
• We propose a novel visual comfort aware reward function. By learning the reward based on the predicted visual comfort scores of the stereoscopic image at each step in training, our VCA-RL model can automatically decide for itself whether to ameliorate visual comfort or improve depth perception at each step in testing.
• With extensive experiments and subjective evaluations,
we demonstrate the effectiveness and the superiority of
our VCA-RL model for improving visual experiences of
stereoscopic images on various S3D databases.
Related Work
Depth adjustment mainly aims at improving visual com-
fort of stereoscopic images while preserving the perceived
depth. Visual discomfort is highly related to the dis-
parity/depth characteristics of stereoscopic images (e.g., dis-
parity magnitude (Kim and Sohn 2011; Choi et al. 2010) and
disparity difference (Sohn et al. 2013a; Jung et al. 2013b)).
To deal with this, there are two main approaches for depth
adjustment, which are disparity shifting (Lei et al. 2014;
Shao et al. 2015; Ying et al. 2020) and disparity scaling
(Lang et al. 2010; Sohn et al. 2013b; Jung et al. 2014, 2015;
Oh et al. 2017; Lei et al. 2017; Shao et al. 2016).
Previous works have proposed various disparity shifting
methods to reduce visual discomfort by simply moving the
ZDP of the original scene while maintaining the range of
disparity. Shao et al. (Shao et al. 2015) proposed a dispar-
ity shifting method considering spatial frequency, disparity
response, and visual attention to mitigate visual discomfort
in stereoscopic viewing. Recently, Ying et al. (Ying et al.
2020) proposed a viewing distance-based nonlinear shift-
ing (VDNS) approach to improve visual comfort and per-
ceived depth quality. The disparity shifting methods are ef-
fective in reducing excessive screen disparity. They can mit-
igate the accommodation-vergence (AV) conflict (Hoffman
et al. 2008; Yano, Emoto, and Mitsuhashi 2004) with a low
computational cost. However, they cannot reduce visual fa-
tigue of stereoscopic images whose disparity range exceeds the visual comfort zone (i.e., ±1° angular disparity (Lambooij et al. 2009; Tam et al. 2011)) (Jung et al. 2014). In the proposed method, by explicitly adjusting the distance between the stereo cameras, the overall depth range can be edited to fit the visual comfort zone under the guidance of the objective visual comfort assessment metric.
Disparity scaling methods have been proposed that linearly or nonlinearly adjust the disparity range of stereoscopic images into the visual comfort zone. Lang et
al. (Lang et al. 2010) proposed a nonlinear disparity map-
ping based on visual importance of scene elements. Sohn
et al.(Sohn et al. 2013b) proposed a disparity remapping
method combining global and local disparity range adjust-
ments. Jung et al. (Jung et al. 2015) proposed a visual com-
fort improvement method that adaptively adjusted the depth
range considering saliency information. Shao et al. (Shao
et al. 2016) developed an optimization-based approach that
conducted layer-dependent depth range adjustment consid-
ering both visual comfort and depth sensation. However, disparity scaling can decrease the relative distance between objects because the scene is intentionally compressed. It can also reduce the sense of perceived depth and realism. In addition, by increasing the viewing distance between the eyes and the scene, the sense of presence can be weakened (Ying et al. 2020). In contrast, we do not explicitly change the disparities in the image domain. The proposed method preserves the geometric proportions of objects and the relative distances between objects in 3D space because it progressively adjusts the distance between the left and right cameras in the world coordinates.
Figure 2: The illustration of the proposed VCA-RL framework for depth adjustment. First, the perceptual importance map $S_P$ is estimated from $I_L$ and $D_L$. Based on that, we encode $f_{disp}$. Then, our agent estimates the action value $Q(S(t), A)$ that maximizes the visual comfort aware reward. The selected camera movement action is applied to the input for depth adjustment. This process is iteratively carried out until the visual comfort of the stereoscopic image falls into the comfortable range.
Visual Comfort Aware-Reinforcement
Learning for Depth Adjustment
To imitate the expert's decision making process, we formulate depth adjustment as the problem of finding an optimal sequence of camera movement actions $A$. We adjust the depth of the given left image $I_L$ and right image $I_R$ by iteratively applying the camera movement action $A$. The visual comfort score $s_{VC}(t)$ at step $t$ is estimated from the disparity map $D(t)$ and the perceptual importance map $S_P(t)$. Based on the sequence of predicted comfort scores $\hat{s}_{VC}$ at each step, our agent determines a camera movement action $A(t)$ for depth adjustment under the policy $\Theta$. Therefore, our goal is to find an optimal sequence for depth adjustment, $T = \{A_{optimal}(t)\} \subset \mathcal{A}$, that yields visually comfortable and sufficient depth in stereoscopic viewing.
Fig. 2 shows the overall process of the proposed VCA-RL framework for depth adjustment. First, the disparity map $D(t)$ and the perceptual importance map $S_P(t)$ are estimated from the given stereoscopic image at step $t$, $I_{stereo}(t) = [I_L(t), I_R]$ ($I_R$ is used as a reference in our study). By considering the disparity and human attention information, the perceptually significant disparity feature $f_{disp}(t)$ is encoded. Then, $f_{disp}(t)$ is forwarded to the visual comfort score predictor (VC score predictor) to evaluate the degree of visual comfort, $\hat{s}_{VC}(t)$. $f_{disp}(t)$ is also forwarded to our agent network to estimate the action value $Q(S(t), A)$, which is the expected sum of future rewards $R$. The state $S(t)$ is a combination of $f_{disp}(t)$ and $\hat{s}_{VC}(t)$, i.e., $S(t) = \{f_{disp}(t), \hat{s}_{VC}(t)\}$. The agent then approximates the action value $Q(S(t), A)$ and chooses the best action $A_{optimal}(t)$ maximizing it. Finally, by applying the best action $A_{optimal}(t)$ to the input $I_{stereo}(t)$, the stereoscopic image at the next step, $I_{stereo}(t+1)$, is obtained via depth image based rendering (DIBR) with the updated stereo baseline. The agent repeats this process and stops when all estimated action values are negative, as sketched below.
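The control flow above can be summarized in the following minimal sketch. All component functions are hypothetical placeholders returning dummy values, standing in for the modules of Fig. 2 (disparity/saliency estimation, the DCNN encoder, the VC score predictor, the agent network, and DIBR) rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTIONS_MM = [-7, -3, +3, +7, 0]  # camera movement actions (mm); 0 terminates

def encode_state(I_stereo):
    return rng.normal(size=64)           # placeholder for f_disp(t)

def predict_vc_score(f_disp):
    return float(rng.uniform(1.0, 5.0))  # placeholder for \hat{s}_VC(t)

def agent_q_values(f_disp, s_vc):
    return rng.normal(size=5)            # placeholder for Q(S(t), A), one value per action

def render_dibr(I_stereo, shift_mm):
    return I_stereo                      # placeholder for DIBR with the updated baseline

def adjust_depth(I_stereo, max_steps=20):
    for _ in range(max_steps):
        f_disp = encode_state(I_stereo)
        s_vc = predict_vc_score(f_disp)
        q = agent_q_values(f_disp, s_vc)
        if q.max() < 0:                  # stop: every estimated action value is negative
            break
        action = ACTIONS_MM[int(q.argmax())]
        if action == 0:                  # explicit termination action
            break
        I_stereo = render_dibr(I_stereo, action)  # move the left camera, re-render
    return I_stereo
```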
Camera Movement Action
To adjust the range of depth, the action space $\mathcal{A}$ is composed of camera movements. By explicitly increasing or decreasing the distance between the stereo cameras (i.e., the stereo baseline) in the world coordinates, we can manipulate the depth of a stereoscopic image while preserving the relative distance between objects and their geometric proportions. The camera movement action at step $t$, $A(t)$, is applied only to the left camera; the right camera is fixed (i.e., the reference). In this work, we define 5 camera movement actions, $\mathcal{A} = \{-7, -3, +3, +7, 0\}$, to shift the position of the left camera along the stereo camera baseline. The sign indicates the direction the left camera moves ('−' for the left side and '+' for the right side). The values denote the distance (unit: mm) that the camera moves at each step. They are determined in consideration of the distance between the pupils of the eyes (i.e., interpupillary distance ≈ 63 mm). Zero means the termination of the iterative depth adjustment operations. A toy sketch of this action space is given below.
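As a toy illustration of this action space (under an assumed coordinate convention that is not stated in the paper), each action simply shifts the left camera's position along the baseline axis:

```python
IPD_MM = 63.0                     # approximate interpupillary distance
ACTIONS_MM = (-7, -3, +3, +7, 0)  # '-' moves the left camera left, '+' right, 0 stops

def move_left_camera(x_left_mm: float, action_mm: int) -> float:
    """Shift the left camera along the baseline axis (x grows to the right),
    so a positive action moves it toward the fixed right camera, shrinking
    the baseline (and disparity range); a negative action widens them."""
    assert action_mm in ACTIONS_MM
    return x_left_mm + action_mm

# Right camera at x = 0, left camera starting one IPD to its left:
x_left = move_left_camera(-IPD_MM, +7)
print(abs(x_left))  # 56.0 mm: a shorter baseline after one action
```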
Perceptually Significant Disparity Feature
It is well known that the disparity magnitude, which is re-
lated with the absolute screen disparity, is a critical fac-
tor affecting visual discomfort due to binocular fusion limit
(i.e., Panum’s fusional area (Howard 2002)) (Kim and Sohn
2011; Choi et al. 2010). The disparity gradient, which is the
disparity difference between nearby objects (i.e., differen-
tial disparity), also affects visual discomfort (Sohn et al. 2013a; Jung et al. 2013b). Based on these characteris-
tics, we employ the perception-weighted absolute disparity
map (PAD) and the perception-weighted differential dispar-
ity map (PADD) to encode perceptually significant disparity
feature as in (Jeong, Kim, and Ro 2017; Jung et al. 2013a).
For this purpose, we first generate the perceptual importance map $S_P$ using both the image saliency $S_I$ and the disparity saliency $S_D$. For $S_I$, we employ a recent deep learning-based saliency estimation (Hou et al. 2017). Note that the saliency values range from 0 (least salient) to 1 (most salient). For $S_D$, we assume that foreground objects usually attract more human attention than backgrounds in a scene (Jeong, Kim, and Ro 2017; Jung et al. 2013a). $S_D$ is generated by mapping the minimum and maximum disparity values in $D_L$ to 0 and 1, respectively. In this study, hierarchical deep stereo matching (HSM) (Yang et al. 2019) is used for disparity estimation. Finally, the perceptual importance map $S_P$ is computed (see Fig. S1 in our supplementary file), which can be written as

$S_P = w_I S_I + w_D S_D$ (1)

where we set $w_I = w_D = 0.5$ in our experiments.
Then, we obtain the PAD, $X_{PAD} = S_P \otimes |D|$, and the PADD, $X_{PADD} = S_P \otimes |\Delta D|$, where $\otimes$ indicates element-wise multiplication. We use them as inputs to our perceptual feature extractor. To encode the perceptually significant disparity feature $f_{disp} \in \mathbb{R}^{11 \times 10 \times 1024}$ capturing the visual comfort level of stereoscopic images, we employ a deep convolutional neural network (DCNN) based on VGG-16 (Simonyan and Zisserman 2014; Jeong, Kim, and Ro 2017). The disparity feature is learned by $f(\cdot)$ and regressed to a visual comfort score by $p(\cdot)$. During this training, by minimizing the loss for visual comfort prediction $L_{VC}$, the perceptually significant disparity feature $f_{disp}$ is encoded:
$L_{VC} = \frac{1}{N} \sum_{i=1}^{N} \left( p(f_{disp}^{i}) - s_{VC}^{i} \right)^{2}$ (2)

where $p(f_{disp}^{i})$ is the predicted comfort score for the $i$-th stereo image (i.e., $\hat{s}_{VC}^{i}$), $s_{VC}^{i}$ is the corresponding ground-truth comfort score, and $N$ is the number of training samples.
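The following sketch illustrates Eq. (1) and the construction of the PAD/PADD inputs on dummy arrays. Since the text does not specify the differential-disparity operator, the gradient magnitude of $D$ is used here as an assumption:

```python
import numpy as np

def perceptual_disparity_maps(S_I, D, w_I=0.5, w_D=0.5):
    """S_I: image saliency in [0, 1]; D: disparity map (both H x W arrays)."""
    S_D = (D - D.min()) / (D.max() - D.min() + 1e-8)  # depth saliency: near = salient
    S_P = w_I * S_I + w_D * S_D                       # Eq. (1)
    gy, gx = np.gradient(D)                           # differential disparity (assumed operator)
    delta_D = np.hypot(gx, gy)
    X_PAD = S_P * np.abs(D)         # perception-weighted absolute disparity
    X_PADD = S_P * np.abs(delta_D)  # perception-weighted differential disparity
    return S_P, X_PAD, X_PADD

# usage with dummy data
rng = np.random.default_rng(0)
S_P, X_PAD, X_PADD = perceptual_disparity_maps(
    rng.uniform(size=(64, 64)), rng.normal(size=(64, 64)))
```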
Visual Comfort Aware Reward
To make our agent determine an optimal camera movement
action sequence in terms of viewing experience of S3D con-
tents, we design a novel visual comfort aware reward func-
tion using the objective visual comfort assessment metric
(i.e., visual comfort score) for stereoscopic images.
The visual comfort score is rated on a 5-point scale: 1 (extremely uncomfortable), 2 (uncomfortable), 3 (comfortable), 4 (moderately comfortable), and 5 (very comfortable) (Shao et al. 2016). We reasonably assume that $s_{VC}^{T} = 3$ (comfortable) is the target comfort level for maintaining sufficient depth. This is because the level of visual comfort is inversely related to the level of perceived depth in stereoscopic viewing.
Our goal is to find the optimal sequence of camera movement actions for depth adjustment, $T = \{A_{optimal}(t)\} \subset \mathcal{A}$, that minimizes the difference between the predicted comfort score of a given stereoscopic image and the target comfort score. The process can be regarded as a Markov decision process, in which the state $S$ is a combination of $f_{disp}$ and $\hat{s}_{VC}$ and the action space is the set of our camera movement operations $\mathcal{A}$. Finally, inspired by (Caicedo and Lazebnik 2015), our visual comfort aware reward $R(t)$ can be defined as
reward R(t)can be defined as
R(t) = sign
sT
V C ˆsV C (t+ 1)
+
sT
V C ˆsV C (t)
(3)
where sign(·)is a sign function. In our study, the sign func-
tion is used to limit the variation of the difference values and
make model training stable (Li et al. 2018).
In our VCA-RL model, if the distance from the target comfort score is lower than 0.3, a positive reward is given to our agent; if it is higher than 0.3, our agent receives a negative reward as a penalty for the action (Bellver et al. 2016):

$R(S(t), A(t)) = \begin{cases} +\eta, & \text{if } \left| s_{VC}^{T} - \hat{s}_{VC} \right| < 0.3 \\ -\eta, & \text{otherwise} \end{cases}$ (4)

where $\eta$ is set to 0.3 in our experiments.
Our reward function encourages adjusting the stereo baseline so that the comfort score at $t+1$ is closer to the target score than before; otherwise, the action is penalized. Through the proposed reward function, the agent can learn which action should be chosen as $A_{optimal}(t)$. A sketch of both reward terms is given below.
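A minimal sketch of the two reward terms, assuming the predicted comfort scores are available as plain floats:

```python
import numpy as np

TARGET_VC = 3.0  # s_VC^T, the target comfort score
ETA = 0.3        # reward magnitude in Eq. (4)

def movement_reward(s_vc_t, s_vc_next, target=TARGET_VC):
    """Eq. (3): +1 if the step moved the predicted comfort score closer to
    the target, -1 if it moved away (0 if the distance is unchanged)."""
    return float(np.sign(abs(target - s_vc_t) - abs(target - s_vc_next)))

def termination_reward(s_vc, target=TARGET_VC, eta=ETA, tol=0.3):
    """Eq. (4): +eta when stopping within 0.3 of the target, else -eta."""
    return eta if abs(target - s_vc) < tol else -eta

print(movement_reward(1.9, 2.4))  # 1.0: moved toward the target score of 3
print(termination_reward(3.2))    # 0.3: close enough to the target to stop
```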
Agent for Depth Adjustment
Our agent network consists of 4 fully connected layers for action value estimation. The perceptually significant disparity feature $f_{disp}(t)$ is fed to our agent network. The agent estimates the action value $Q(S(t), A)$ with Q-learning; the action value can be defined as the expected sum of future visual comfort aware rewards (Mnih et al. 2015). Q-learning iteratively updates the action-selection policy $\Theta$ using the Bellman equation, which can be written as

$Q(S(t), A) = E\left[ R(t) + \gamma R(t+1) + \gamma^{2} R(t+2) + \cdots \right] \simeq R(t) + \gamma \max_{A'} Q(S(t+1), A')$ (5)

where $\gamma$ is a discount factor, set to 0.9 (Bellver et al. 2016). We train the agent to estimate the action value $Q(S(t), A(t))$ and choose an optimal action $A_{optimal}(t)$ that maximizes $Q(S(t), A)$.
To train the agent network, we use an ε-greedy algorithm, by which the policy $\Theta$ is determined during training. The ε-greedy algorithm randomly samples actions with probability ε and greedily takes the action with the highest expected reward with probability 1 − ε. In the test stage, the policy is determined with ε = 0, i.e., the action with the highest expected reward is always chosen. The process is repeated until all expected rewards are negative (see the sketch below).
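A sketch of the ε-greedy selection and the one-step Q-learning target of Eq. (5); the decay schedule mirrors the implementation details reported in the experiments section:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """Sample a random action with probability epsilon; otherwise be greedy."""
    if rng.uniform() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def q_target(reward, q_next, gamma=0.9):
    """One-step target from Eq. (5): R(t) + gamma * max_A' Q(S(t+1), A')."""
    return reward + gamma * float(np.max(q_next))

# epsilon decays from 1.0 to 0.1 in steps of 0.1 during training;
# at test time epsilon = 0, i.e., pure greedy selection.
for eps in np.arange(1.0, 0.05, -0.1):
    action = epsilon_greedy(rng.normal(size=5), eps)
```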
After our agent chooses $A_{optimal}(t)$, the action is applied to $I_{stereo}$ to edit its depth range. Using the DIBR process, we can synthesize a new left image $I_L(t+1)$ at the new left camera position given by the selected camera movement action. In this work, the disocclusions in $I_L(t+1)$ are very small because the camera is progressively moved to the optimal position. In our experiments, the disoccluded regions in $I_L(t+1)$ are filled with a hole filling method that considers binocular symmetry (Kim and Ro 2016). As noted, this study focuses on formulating the depth adjustment framework as a sequential decision making process like a human expert, rather than on developing a new image-based rendering method; a toy illustration of such a warp is sketched below.
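The following is a deliberately crude, self-contained sketch of a horizontal-shift DIBR warp with left-neighbor hole filling on a grayscale image; it is neither the authors' renderer nor the (Kim and Ro 2016) hole-filling method:

```python
import numpy as np

def toy_dibr(I_left, D, scale):
    """Forward-warp I_left by scaled disparity (horizontal shift only);
    remaining holes are filled from the left neighbor in each row, a
    crude stand-in for proper disocclusion handling."""
    H, W = D.shape
    out = np.zeros_like(I_left)
    filled = np.zeros((H, W), dtype=bool)
    xs = np.arange(W)
    for y in range(H):
        x_new = np.clip(np.round(xs + scale * D[y]).astype(int), 0, W - 1)
        out[y, x_new] = I_left[y, xs]   # later writes win on collisions (toy behavior)
        filled[y, x_new] = True
        for x in range(1, W):           # fill holes from the left neighbor
            if not filled[y, x]:
                out[y, x] = out[y, x - 1]
    return out

# usage with dummy data
rng = np.random.default_rng(0)
warped = toy_dibr(rng.uniform(size=(4, 8)), rng.uniform(0, 2, size=(4, 8)), scale=0.5)
```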
[Figure 3 plots the predicted VC score versus iteration for six examples. In (a), the predicted scores rise from 1.679, 1.516, and 1.756 to 2.918, 2.724, and 2.905 over 6-8 iterations; in (b), they fall from 4.228, 3.693, and 3.759 to 2.746, 2.785, and 2.804 over 4-5 iterations.]
Figure 3: Visual results of our VCA-RL model. (a) Results for uncomfortable stereoscopic images with excessive screen disparities and (b) results for comfortable stereoscopic images with unnoticeable depths. In case (a), our VCA-RL model progressively improves the visual comfort level while mitigating the excessive disparity magnitude. In case (b), our model enhances depth perception within the comfortable range.
Experiments and Results
Experimental Setting
Datasets In the experiments, IEEE-SA stereo image
database (Park et al. 2014) was used to train our VCA-RL
model. It consists of 800 stereoscopic image pairs with a
resolution of 1920×1080 pixels and the corresponding sub-
jective comfort scores. The database comprises 160 different scenes, each with 5 convergence points. For training and testing, we used 10-fold cross-validation: the database was randomly divided into 10 subsets, of which 9 were used for training and 1 for testing.
In testing, to verify the robustness and generalization,
we conducted the depth adjustment on additional databases,
which are NBU 3D-VCA database (Jiang et al. 2015) and
IVY Lab S3D image database for visual discomfort re-
duction (Jung et al. 2013a). The NBU 3D-VCA database
consists of 200 stereoscopic images (1920×1080) with the
associated mean opinion score (MOS) for visual comfort.
The IVY Lab S3D image database consists of 120 stereoscopic images (1920×1080) captured by a dual-lens 3D digital camera (Fujifilm FinePix 3D W3), along with the corresponding MOS values. Both databases were used for testing only.
Implementation Details In the training stage, the fea-
ture extractor and visual comfort score network were pre-
trained end-to-end with the Adam optimizer, with the learning rate initialized to 1e−5 and β1 and β2 set to 0.9 and 0.999, respectively (Kingma and Ba 2014). Then, we trained our agent network. In training the deep Q-network with reinforcement learning, we adopted an ε-greedy policy: the initial value of ε was 1 and decreased to ε = 0.1 in steps of 0.1. The weights of the deep Q-network were initialized with a normal distribution (Bellver et al. 2016). We used an experience replay of 2,000 experiences and a batch size of 256. A sketch of this setup is given below.
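Combining the stated hyperparameters, a hedged PyTorch sketch of the deep Q-network setup might look as follows. The layer widths are assumptions for illustration (the paper states only that the agent has 4 fully connected layers), with the input taken as the flattened 112,640-d $f_{disp}$ (11×10×1024); the Adam settings reuse the configuration the paper reports for pre-training.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Agent network: 4 fully connected layers (widths assumed, not from the paper),
# mapping the flattened f_disp to Q-values for the 5 camera movement actions.
agent = nn.Sequential(
    nn.Linear(112640, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 64), nn.ReLU(),
    nn.Linear(64, 5),
)
optimizer = torch.optim.Adam(agent.parameters(), lr=1e-5, betas=(0.9, 0.999))

replay = deque(maxlen=2000)  # experience replay of 2,000 experiences
BATCH_SIZE = 256

def sample_batch():
    """Uniformly sample stored (state, action, reward, next_state) transitions."""
    return random.sample(replay, min(BATCH_SIZE, len(replay)))
```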
Figure 4: Performance comparisons of depth adjustment in stereoscopic viewing (columns: input, our results, Lei's method, Jung's method, Shao's method). These are anaglyph images, which can be seen in 3D through red-green glasses. Results of existing methods were taken from each paper.
Visual Results of Iterative Depth Adjustment
Fig. 3a shows examples of uncomfortable stereoscopic images with excessive disparities. In this case, our goal is to reduce the level of visual discomfort and bring the excessive disparities into the comfortable range. Our VCA-RL iteratively improved the visual comfort level of the given stereoscopic images until the comfort scores reached the comfortable range, while avoiding unnoticeable depth information for foreground objects. The agent progressively ameliorated the degree of visual comfort by decreasing the distance between the stereo cameras until reaching the target visual comfort score range.
Fig. 3b shows examples of very comfortable stereoscopic
images with unnoticeable depths. In this case, the proposed
method iteratively increased the stereo baseline to improve
their disparities for sufficient depth perception. Simultane-
ously, our agent carefully checked the level of visual com-
fort to prevent the adjusted disparities from causing extreme
visual fatigue. As a result, we could provide visually com-
fortable stereoscopic images with sufficient depth percep-
tion. In Fig. 3b, the final comfort score was lower than the
comfort score of the original stereoscopic image. However,
it is still within the visually comfortable range. In particular,
the depth of foreground objects was considerably increased by our VCA-RL model.
Qualitative Comparisons
In this section, for performance comparisons of depth adjust-
ment, we visually compared our results with existing depth
adjustment methods: Lei’s method (Lei et al. 2014), Jung’s
method (Jung et al. 2015), Shao’s method (Shao et al. 2016),
and Ying’s method (Ying et al. 2020). Fig. 4 and Fig. 5 show
visual results of our VCA-RL and previous depth adjustment
methods (Lei et al. 2014; Ying et al. 2020; Jung et al. 2015;
Shao et al. 2016) for performance comparisons. Note that
visual results of previous methods were taken from their pa-
pers (Lei et al. 2014; Ying et al. 2020; Jung et al. 2015; Shao
et al. 2016) because their code and other results are not publicly available.
Figure 5: Performance comparisons of depth adjustment in stereoscopic viewing (columns: input, our results, Jung's method, Ying's method). These are anaglyph images, which can be seen in 3D through red-green glasses. Results of existing methods were taken from each paper.
In Fig. 4, the first input stereoscopic image seems to have sufficient disparities, the inputs in the second and third rows have relatively large disparities, and the disparities of the inputs in the fourth and fifth rows seem to be small. Shao's
method (Shao et al. 2016) tried to strike a balance between
visual comfort and depth perception, compared to (Lei et al.
2014) and (Jung et al. 2015). However, it seemed to fail in
the results for the second, the third and the fifth examples.
On the other hand, our VCA-RL maintained the depth for the content in the first row and mitigated the excessive disparities for the contents in the second and third rows. For the fourth and fifth contents, our VCA-RL produced results with increased disparities.
Fig. 5 shows visual results of our VCA-RL, Jung’s
method (Jung et al. 2015), and Ying’s method (Ying et al.
2020). Similar to Fig. 4, the disparities in the result of Jung's method did not seem large enough for depth to be perceived. Ying's method (Ying et al. 2020) provided reliable results for the examples in the second and third rows. However, for the content in the first row, it did not sufficiently reduce the disparities around the menu board, and for the content in the fourth row, it significantly reduced the disparities. On the
other hand, our VCA-RL could stably provide reliable visual
results for various examples.
Quantitative Evaluations
To verify the effectiveness of our VCA-RL, we objectively
measured the levels of visual comfort and depth perception for the most uncomfortable top 10% and most comfortable top 10% of stereoscopic images in (Park et al. 2014). In this
experiment, we employed an objective visual comfort as-
sessment metric (VC score) based on deep visual and dis-
parity feature (Jeong, Kim, and Ro 2017) for visual comfort
assessment. To measure the presence of depth, we employed
an objective assessment metric (DP score) in (Ying et al.
2020). The higher the VC and DP scores, the higher the comfort level and the perceived depth level; both scores range from 0 to 5.
            Most uncomfortable top 10% stimuli        Most comfortable top 10% stimuli
            VC input  VC ours  DP input  DP ours      VC input  VC ours  DP input  DP ours
mean        1.65      3.00     4.85      3.59         4.30      3.01     2.81      3.70
std         0.43      0.54     0.23      0.44         0.15      0.26     0.31      0.45
p-value     p<0.05             p<0.05                 p<0.05             p<0.05
Table 1: Statistical results of objective assessment for visual comfort and depth perception on the IEEE-SA stereo image database.
              For uncomfortable stimuli        For comfortable stimuli
              before proc.   after proc.       before proc.   after proc.
mean of MOS   2.81           3.67              3.70           3.26
std           0.38           0.18              0.17           0.36
p-value       p<0.05                           p<0.05
Table 2: Statistical results of subjective assessment on the IVY S3D database.
The statistical results of the objective assessment on IEEE
SA stereo image database (Park et al. 2014) are presented in
Table 1. For most uncomfortable stimuli, the improvement
of overall visual comfort was statistically significant, com-
pared with original input stereoscopic images. The level of
depth perception was still good (DP score >3). For most
comfortable stereoscopic images with unnoticeable depth,
the improvement of depth perception was statistically sig-
nificant, compared to the original. More importantly, de-
spite the increase of disparity/depth, the overall visual com-
fort still remained within comfortable range. Thus, the re-
sults revealed that our VCA-RL could provide a significantly
meaningful improvement in terms of both visual comfort
and depth perception (see Table S1 and S2 in our supple-
mentary material for statistical results on NBU (Jiang et al.
2015) and IVY databases (Jung et al. 2013a)).
User Study
Furthermore, we conducted a set of subjective assessment
experiments to investigate users’ visual comfort rating and
viewing preference. A half-mirror type stereoscopic 3D
monitor was used to display the stereoscopic images. A total
of 16 subjects participated in the experiment. We randomly
selected 15 uncomfortable stimuli among stereoscopic images with $\hat{s}_{VC}$ less than 3 and 15 comfortable stimuli among stereoscopic images with $\hat{s}_{VC}$ higher than 3 from the IVY Lab S3D image database (Jung et al. 2013a).
To measure the degree of visual comfort, a modified version of the single stimulus (SS) method was used with a five-point grading scale (Series 2012). During the experiment, original
images (‘before processing’) and our results (‘after process-
ing’) were randomly presented to the subjects (i.e., a total of
60 stimuli). For the viewing preference test, subjects were
asked to answer the following question: “Which one do you
prefer to see in considering all quality aspects of the viewing
experience of stereoscopic images?”(Jung et al. 2013b). For
more details, please see the section for subjective assessment
environment and procedure in our supplementary material.
[Figure 6 data: proportions for 'before processing' / 'after processing' / 'no diff.' were 0.2125 / 0.7375 / 0.05 in (a) and 0.325 / 0.6625 / 0.0125 in (b).]
Figure 6: Subjective assessment result for viewing preference. (a) For uncomfortable stereoscopic images. (b) For comfortable stereoscopic images. 'No diff.' means there is no difference between before and after processing.
Table 2 shows the statistical analysis of the subjective assessment results. For uncomfortable stimuli, the mean MOS after processing was statistically significantly higher than the mean MOS before processing; the mean MOS difference was +0.86 on the [1, 5] scale (i.e., a 21.5% improvement). For comfortable stimuli, the mean MOS after processing was statistically significantly lower than the mean MOS before processing; however, after enhancing the depth perception, the mean MOS after processing (i.e., 3.26) remained within the comfortable range. These results demonstrate that our agent carefully adjusted depths to provide better visual comfort for uncomfortable data and to preserve visual comfort for comfortable data, respectively.
Fig. 6 shows the results of the viewing preference. The
viewing preference of ‘after processing’ was much better
than that of ‘before processing’. In summary, the result
demonstrated that the proposed method had a positive effect
on the overall viewing experience of stereoscopic images by
carefully increasing or decreasing their depths.
Conclusion
In this paper, we proposed a novel reinforcement learning-
based approach considering visual comfort of stereoscopic
images for depth adjustment named VCA-RL. With the deep
reinforcement learning strategy, we explicitly modeled a hu-
man professional’s depth adjustment process. In particu-
lar, to take into account perceptual aspects in stereoscopic
viewing, we designed a novel visual comfort aware reward
function to train our agent to learn the perceptual charac-
teristics of stereoscopic viewing. Therefore, our VCA-RL
could sequentially estimate proper depth adjustment steps
via the DIBR process. With extensive qualitative and quantita-
tive experiments on various S3D databases, our VCA-RL
model showed its effectiveness and superiority for depth ad-
justment. In addition, the results of user study showed that
our VCA-RL could be feasible for current S3D displays.
Acknowledgements
This work was partly supported by IITP grant (No. 2017-
0-00780), IITP grant (No. 2017-0-01779), and BK 21 Plus
project. M. Park is now in ETRI, Korea.
References
Bellver, M.; Giró-i-Nieto, X.; Marqués, F.; and Torres, J. 2016. Hierarchical object detection with deep reinforcement learning. arXiv preprint arXiv:1611.03718.
Caicedo, J. C.; and Lazebnik, S. 2015. Active object local-
ization with deep reinforcement learning. In Proceedings
of the IEEE international conference on computer vision,
2488–2496.
Chang, J.-R.; and Chen, Y.-S. 2018. Pyramid stereo match-
ing network. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 5410–5418.
Choi, J.; Kim, D.; Ham, B.; Choi, S.; and Sohn, K. 2010. Vi-
sual fatigue evaluation and enhancement for 2D-plus-depth
video. In 2010 IEEE International Conference on Image
Processing, 2981–2984. IEEE.
Hoffman, D. M.; Girshick, A. R.; Akeley, K.; and Banks,
M. S. 2008. Vergence–accommodation conflicts hinder vi-
sual performance and cause visual fatigue. Journal of vision
8(3): 33–33.
Hou, Q.; Cheng, M.-M.; Hu, X.; Borji, A.; Tu, Z.; and Torr,
P. H. 2017. Deeply supervised salient object detection with
short connections. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 3203–3212.
Howard, I. P. 2002. Seeing in depth, Vol. 1: Basic mecha-
nisms. University of Toronto Press.
Jeong, H.; Kim, H. G.; and Ro, Y. M. 2017. Visual com-
fort assessment of stereoscopic images using deep visual and
disparity features based on human attention. In 2017 IEEE
International Conference on Image Processing (ICIP), 715–
719. IEEE.
Jiang, Q.; Shao, F.; Jiang, G.; Yu, M.; and Peng, Z. 2015.
Three-dimensional visual comfort assessment via prefer-
ence learning. Journal of Electronic Imaging 24(4): 043002.
Jung, C.; Cao, L.; Liu, H.; and Kim, J. 2015. Visual com-
fort enhancement in stereoscopic 3D images using saliency-
adaptive nonlinear disparity mapping. Displays 40: 17–23.
Jung, Y. J.; Sohn, H.; Lee, S.-I.; Park, H. W.; and Ro, Y. M.
2013a. Predicting visual discomfort of stereoscopic images
using human attention model. IEEE transactions on circuits
and systems for video technology 23(12): 2077–2082.
Jung, Y. J.; Sohn, H.; Lee, S.-i.; and Ro, Y. M. 2014. Vi-
sual comfort improvement in stereoscopic 3D displays using
perceptually plausible assessment metric of visual comfort.
IEEE Transactions on Consumer Electronics 60(1): 1–9.
Jung, Y. J.; Sohn, H.; Lee, S.-i.; Speranza, F.; and Ro, Y. M.
2013b. Visual importance-and discomfort region-selective
low-pass filtering for reducing visual discomfort in stereo-
scopic displays. IEEE transactions on circuits and systems
for video technology 23(8): 1408–1421.
Kim, D.; and Sohn, K. 2011. Visual fatigue prediction for
stereoscopic image. IEEE Transactions on Circuits and Sys-
tems for Video Technology 21(2): 231–236.
Kim, H. G.; Jeong, H.; Lim, H.-t.; and Ro, Y. M. 2018.
Binocular fusion net: deep learning visual comfort assess-
ment for stereoscopic 3D. IEEE Transactions on Circuits
and Systems for Video Technology 29(4): 956–967.
Kim, H. G.; and Ro, Y. M. 2016. Multiview stereoscopic
video hole filling considering spatiotemporal consistency
and binocular symmetry for synthesized 3d video. IEEE
Transactions on Circuits and Systems for Video Technology
27(7): 1435–1449.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for
stochastic optimization. arXiv preprint arXiv:1412.6980 .
Lambooij, M.; Fortuin, M.; Heynderickx, I.; and IJsselsteijn,
W. 2009. Visual discomfort and visual fatigue of stereo-
scopic displays: A review. Journal of Imaging Science and
Technology 53(3): 30201–1.
Lang, M.; Hornung, A.; Wang, O.; Poulakos, S.; Smolic, A.;
and Gross, M. 2010. Nonlinear disparity mapping for stereo-
scopic 3D. ACM Transactions on Graphics (TOG) 29(4):
1–10.
Lei, J.; Li, S.; Wang, B.; Fan, K.; and Hou, C. 2014. Stereo-
scopic visual attention guided disparity control for multi-
view images. Journal of Display Technology 10(5): 373–
379.
Lei, J.; Peng, B.; Zhang, C.; Mei, X.; Cao, X.; Fan, X.;
and Li, X. 2017. Shape-preserving object depth control for
stereoscopic images. IEEE Transactions on Circuits and
Systems for Video Technology 28(12): 3333–3344.
Li, D.; Wu, H.; Zhang, J.; and Huang, K. 2018. A2-RL: Aes-
thetics aware reinforcement learning for image cropping. In
Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 8193–8201.
Meesters, L. M.; IJsselsteijn, W. A.; and Seuntiëns, P. J. 2004. A survey of perceptual evaluations and requirements of three-dimensional TV. IEEE Transactions on Circuits and Systems for Video Technology 14(3): 381–391.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Ve-
ness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fid-
jeland, A. K.; Ostrovski, G.; et al. 2015. Human-level con-
trol through deep reinforcement learning. Nature 518(7540):
529–533.
Oh, H.; Kim, J.; Kim, J.; Kim, T.; Lee, S.; and Bovik, A. C.
2017. Enhancement of visual comfort and sense of presence
on stereoscopic 3d images. IEEE Transactions on Image
Processing 26(8): 3789–3801.
Park, J.; Oh, H.; Lee, S.; and Bovik, A. C. 2014. 3D visual
discomfort predictor: Analysis of disparity and neural activ-
ity statistics. IEEE transactions on image processing 24(3):
1101–1114.
Poggi, M.; Pallotti, D.; Tosi, F.; and Mattoccia, S. 2019.
Guided stereo matching. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, 979–
988.
Series, B. 2012. Methodology for the subjective assessment
of the quality of television pictures. Recommendation ITU-R
BT 500–13.
Shao, F.; Li, Z.; Jiang, Q.; Jiang, G.; Yu, M.; and Peng, Z.
2015. Visual discomfort relaxation for stereoscopic 3D im-
ages by adjusting zero-disparity plane for projection. Dis-
plays 39: 125–132.
Shao, F.; Lin, W.; Li, Z.; Jiang, G.; and Dai, Q. 2016. Toward
simultaneous visual comfort and depth sensation optimiza-
tion for stereoscopic 3-D experience. IEEE transactions on
cybernetics 47(12): 4521–4533.
Simonyan, K.; and Zisserman, A. 2014. Very deep convo-
lutional networks for large-scale image recognition. arXiv
preprint arXiv:1409.1556 .
Sohn, H.; Jung, Y. J.; Lee, S.-i.; and Ro, Y. M. 2013a. Pre-
dicting visual discomfort using object size and disparity in-
formation in stereoscopic images. IEEE Transactions on
Broadcasting 59(1): 28–37.
Sohn, H.; Jung, Y. J.; Lee, S.-I.; Speranza, F.; and Ro, Y. M.
2013b. Visual comfort amelioration technique for stereo-
scopic images: Disparity remapping to mitigate global and
local discomfort causes. IEEE transactions on circuits and
systems for video technology 24(5): 745–758.
Tam, W. J.; Speranza, F.; Yano, S.; Shimono, K.; and Ono,
H. 2011. Stereoscopic 3D-TV: visual comfort. IEEE Trans-
actions on Broadcasting 57(2): 335–346.
Tulyakov, S.; Ivanov, A.; and Fleuret, F. 2018. Practical
deep stereo (pds): Toward applications-friendly deep stereo
matching. In Advances in Neural Information Processing
Systems, 5871–5881.
Yang, G.; Manela, J.; Happold, M.; and Ramanan, D. 2019.
Hierarchical deep stereo matching on high-resolution im-
ages. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 5515–5524.
Yano, S.; Emoto, M.; and Mitsuhashi, T. 2004. Two fac-
tors in visual fatigue caused by stereoscopic HDTV images.
Displays 25(4): 141–150.
Ying, H.; Yu, M.; Jiang, G.; Peng, Z.; and Chen, F. 2020.
Perceived depth quality-preserving visual comfort improve-
ment method for stereoscopic 3D images. Signal Processing
169: 107374.