Self-supervised Light Field View Synthesis Using
Cycle Consistency
Yang Chen, Martin Alain, Aljosa Smolic
V-SENSE project, School of Computer Science and Statistics, Trinity College Dublin
{cheny5, alainm, smolica}@scss.tcd.ie
Abstract—High angular resolution is advantageous for practical applications of light fields. In order to enhance the angular resolution of light fields, view synthesis methods can be utilized to generate dense intermediate views from sparse light field input. Most successful view synthesis methods are learning-based approaches which require a large amount of training data paired with ground truth. However, collecting such large datasets for light fields is challenging compared to natural images or videos. To tackle this problem, we propose a self-supervised light field view synthesis framework with cycle consistency. The proposed method aims to transfer prior knowledge learned from high quality natural video datasets to the light field view synthesis task, which reduces the need for labeled light field data. A cycle consistency constraint is used to build a bidirectional mapping enforcing the generated views to be consistent with the input views. Derived from this key concept, two loss functions, cycle loss and reconstruction loss, are used to fine-tune the pre-trained model of a state-of-the-art video interpolation method. The proposed method is evaluated on various datasets to validate its robustness, and results show it not only achieves competitive performance compared to supervised fine-tuning, but also outperforms state-of-the-art light field view synthesis methods, especially when generating multiple intermediate views. Moreover, our generic light field view synthesis framework can be applied to any pre-trained model for advanced video interpolation.
Index Terms—Light Field View Synthesis, Video Interpolation, Cycle Consistency, Self-supervised Fine-tuning
I. INTRODUCTION
Light field imaging was introduced to the computer graphics and computer vision community over 20 years ago in the pioneering work of Levoy et al. [1], and has gained a lot of
attention since. Compared to traditional 2D imaging systems,
4D light field imaging systems aim to collect all light rays
passing through a given 3D volume by capturing not only
two spatial but also two additional angular dimensions. Light
field imaging has many applications ranging from post-capture
photograph refocusing to virtual and augmented reality [2].
Advantages of dense light fields have been demonstrated for
various computer vision and graphics tasks compared to sparse
light fields, including depth estimation, object segmentation
and image-based rendering [3]. However, a trade-off usually has to be made between spatial and angular resolution when implementing light field capture systems. While lenslet plenoptic cameras tend to favor angular resolution over spatial resolution [4], modern camera arrays allowing light field video capture have physically limited angular sampling but can capture high resolution 2D views [5].

This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under the Grant Number 15/RP/2776. We also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research.
Thus, light field resolution enhancement has been a very
active research topic. While enhancing the spatial resolution
can be seen as an extension of 2D image super-resolution
methods [6], [7], enhancing the angular resolution has been
explored from various angles, such as angular super-resolution,
view synthesis or interpolation, light field reconstruction,
or epipolar-plane image (EPI) inpainting [7]–[11]. All these
methods eventually generate novel intermediate views from a
sparse set of input views and thus obtain a denser light field.
The study of these angular resolution enhancement ap-
proaches shows that learning-based methods achieve the best
performance, and motivates the work presented in this paper.
One of the first learning-based approaches for light field view synthesis was introduced in [9], which trains two convolutional neural networks (CNN): the first estimates the disparity, and the second uses this disparity to synthesize the intermediate views. It has been observed that the method proposed by Kalantari et al. [9] can fail when dealing with complicated scenes containing occluded regions, non-Lambertian areas and large displacements. A "blur-detail restoration-deblur" framework was then proposed to enhance light field angular resolution using EPIs in [10]. An end-to-end network combining pseudo 4D convolutions with 3D detail restoration on EPIs was then introduced in [11]. These generative methods are prone to over-smoothing for large displacements, which leads to a loss in perceptual quality.
The limitations of these learning-based methods can partly be
explained by the limited size of the training datasets.
As mentioned above, implementing real-world light field
capture systems is a complex task, which inherently limits the
size of available datasets, and can only be partly compensated
by the use of synthetic datasets. In comparison, a huge amount
of real-world natural images and videos are readily available,
which allows training powerful learning-based methods, and it
has been shown recently that a direct extension of single image
learning-based methods can outperform light field learning-
based approaches for spatial super-resolution [12].
To address the aforementioned issues, we propose a self-
supervised learning-based light field view synthesis framework
based on existing video view synthesis methods which benefit
from large real-world training datasets.
Rather than training a learning-based method from scratch, the proposed framework adapts an existing video view synthesis method to light fields using fine-tuning only, which requires
smaller training datasets. Cycle consistency has proven capable
of modeling invertible mapping when direct supervision is
unavailable [13]–[15]. Thus, a cycle consistency constraint is
introduced in our framework to allow self-supervised training
without the need for paired ground truth, which further reduces the amount of training data required.
In summary, our main contributions in this paper are:
• We introduce a novel light field view synthesis framework utilizing natural priors from existing large image datasets.
• We propose, for the first time, a self-supervised fine-tuning approach for light fields based on the cycle consistency constraint.
• We demonstrate that the proposed framework outperforms state-of-the-art light field view synthesis methods.
This paper is organized as follows. In Section II, we
review existing related work about light field angular super-
resolution, general video frame interpolation and applications
of the cycle consistency. In Section III, the proposed light field
view synthesis framework and related self-supervised training
details are explained. Then, our proposed method is evaluated
on various datasets and compared to state-of-the-art light field
view synthesis methods in Section IV. Finally, we present our
conclusions in Section V.
II. RELATED WORK
A. Light Field View Synthesis
1) Optimization-based methods: Dense light fields are very
sparse in the transform domain, which can be used as a
powerful prior. Therefore, Shi et al. proposed an optimization
framework in the continuous Fourier domain to reconstruct
the dense light field [16]. A more advanced framework using
the shearlet transform is proposed in [8] which performs
inpainting on the EPIs to recover the missing views. While
these methods can achieve competitive performance, they
require specific input configurations, which limits their flexibility and can make them difficult to use in practice.
2) Learning-based methods: Kalantari et al. proposed a learning-based framework for light field view synthesis [9]
which estimates with a first CNN the disparity from 4 corner
input views of a dense light field. A second CNN is then
used to synthesize the target intermediate views using both the
disparity and the 4 input views. Wu et al. re-modeled light field
angular super-resolution as a detail restoration problem in the
2D EPI space [10]. A detail restoration framework is built to
process EPIs of a sparse light field and to recover the angular
details with a CNN. To exploit the inherent consistency of the
light field, Wang et al. introduced an end-to-end network with
pseudo 4D convolution by combining a 2D convolution on
EPIs and a sequential 3D convolution [11]. Yeung et al. [17]
also propose a two-step method, which first generates the
whole set of novel views using a view synthesis network, and
then retrieves texture details using a view refinement network.
Fig. 1: Dense light field reconstruction. We aim to reconstruct a dense light field L^D with angular resolution (N, N) from a sparse light field L^S with angular resolution (n, n). The spatial resolution (h, w) of each view remains unchanged.

Zhou et al. [18] train a deep network to predict the Multi-Plane Image (MPI) representation from a narrow-baseline stereo image pair. The MPI representation can be used to
generate novel views using homographies to reproject each
plane of the MPI to the desired viewport. Mildenhall et al. [19]
extend this method to light fields by promoting an MPI to
each view of the light field. Novel views are then synthesized
by combining intermediate generated views from the closest
MPIs.
Video interpolation has also been applied to light field
view synthesis in [20], using fully supervised fine-tuning with
conventional loss functions on a small light field dataset.
Although various existing methods have been proposed to
enhance the angular resolution with impressive results, the
robustness and generalization of these methods are limited
by the quantity and quality of available real-world light field
datasets, as explained in the previous section.
B. Video Frame Interpolation
Niklaus et al. proposed to perform motion estimation and color interpolation in a single stage using an adaptive 2D kernel estimated by a trained convolutional neural network [21].
However, 2D kernel estimation requires a large amount of memory to store per-pixel information, and this shortcoming is addressed by
replacing the 2D kernel with two separable 1D convolutional
kernels [22].
C. Cycle Consistency
The key element of our proposed method is the introduction of cycle consistency in the light field angular domain. The cycle consistency constraint aims to regularize structured predictions and establishes a bidirectional mapping, instead of the unidirectional mapping built by conventional cost functions.
Our work is inspired by the success of cyclic image gen-
eration for video interpolation [14], [15], which demonstrated
the strength of cycle consistency to adapt a pre-trained model
to a new target domain.
III. LIGHT FIELD VIEW SYNTHESIS USING CYCLE CONSISTENCY
A. Problem Formulation
Fig. 2: An overview of our proposed view interpolation approach. Horizontal and vertical interpolation are cascaded to reconstruct L^D from L^S using view triplets from the corresponding angular dimensions.

In this work, a 4D light field L is parameterized using the two parallel planes representation as depicted in Figure 1, indexed by x, y over the spatial dimensions and s, t over the angular dimensions. We denote by I_{s,t} the view extracted from a light field L at the angular position (s, t). Given a sparsely-sampled light field L^S with resolution (h × w × n × n), the goal is to reconstruct a more densely-sampled light field L^D with the same spatial resolution and a higher angular resolution (h × w × N × N), where N = α(n − 1) + 1 and α is the up-sampling factor in the angular domain. Unless mentioned specifically, α = 2 is used as the default to explain our method. By fixing one angular dimension, a set of views can be extracted along the remaining angular dimension of the light field. Such a view set can be considered as a consecutive frame sequence, which could be captured by a virtual camera moving along the corresponding angular direction. Thus, the dense light field reconstruction problem can be treated as a video interpolation process along the fixed angular dimension.
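To make the angular relation concrete, the following minimal NumPy sketch (our illustration, not the authors' code; the (s, t, h, w, 3) layout and helper names are assumptions) shows how N = α(n − 1) + 1 is obtained and how fixing one angular dimension turns a row of views into a frame sequence:

```python
import numpy as np

def dense_angular_resolution(n: int, alpha: int) -> int:
    """Angular resolution after up-sampling: N = alpha * (n - 1) + 1."""
    return alpha * (n - 1) + 1

def extract_row_sequence(lf: np.ndarray, t: int) -> np.ndarray:
    """Fix the angular dimension t and return the row of views along s,
    i.e. the frames seen by a virtual camera moving horizontally."""
    return lf[:, t]  # shape (n, h, w, 3): n consecutive "frames"

# Toy example: a 3x3 sparse light field of 64x64 RGB views.
sparse_lf = np.zeros((3, 3, 64, 64, 3), dtype=np.float32)
frames = extract_row_sequence(sparse_lf, t=1)
print(frames.shape, dense_angular_resolution(n=3, alpha=2))  # (3, 64, 64, 3) 5
```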
As discussed in Section II, many CNN-based methods have been shown to be successful for video interpolation tasks. However, directly applying a pre-trained network from an existing video interpolation method to the light field domain may fail, since the distributions of these two kinds of data may differ. On the other hand, retraining a CNN from scratch can be laborious, and the limited size of light field datasets may not suffice to reach competitive performance. Thus, to maximally leverage the advantages of cutting-edge video interpolation methods and to avoid their troublesome retraining, we introduce a self-supervised fine-tuning approach using cycle consistency to apply the pre-trained model of a video interpolation method to the light field domain.
B. Proposed Framework with Self-Supervised Learning
Given a sparse light field L^S with angular resolution (n, n), our proposed approach aims to build a learning-based model that takes this light field as input and accurately reconstructs a high quality dense light field L^D with angular resolution (N, N) without the support of paired ground truth. As shown in Figure 2, we consider triplets of views extracted from L^S along a fixed angular dimension, either horizontally {I^S_{s-2,t}, I^S_{s,t}, I^S_{s+2,t}} or vertically {I^S_{s,t-2}, I^S_{s,t}, I^S_{s,t+2}}. Note that triplets have to be used due to our proposed cycle loss described below. A dense light field L^D is obtained by first performing horizontal interpolation on all rows, and then performing vertical interpolation on all columns. The view interpolation is achieved by two CNNs which share the same architecture but are trained separately along the horizontal and vertical dimensions.

Fig. 3: Illustration of the cycle loss (a) and reconstruction loss (b) on a vertical input triplet. Both losses do not require any knowledge of the ground truth intermediate views (represented by dashed squares) and can therefore be used for self-supervised training.
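The cascade can be sketched as follows in Python (a minimal illustration under our own assumptions: a single pairwise interpolator `interp` stands in for the pre-trained model M introduced below, whereas the paper trains separate horizontal and vertical CNNs, and the light field is held as nested lists of view tensors):

```python
import torch

def interpolate_sequence(views, interp):
    """Insert one synthesized view between each pair of consecutive views.

    `views` is a list of (3, H, W) view tensors along one angular dimension;
    `interp(a, b)` is a pairwise view interpolator.
    """
    dense = [views[0]]
    for a, b in zip(views[:-1], views[1:]):
        dense.append(interp(a, b))  # synthesized intermediate view
        dense.append(b)             # original view
    return dense

def reconstruct_dense(sparse_lf, interp):
    """Cascade horizontal then vertical interpolation on an (n, n) angular grid."""
    # Horizontal pass: densify every row of the angular grid.
    rows = [interpolate_sequence(row, interp) for row in sparse_lf]
    # Vertical pass: densify every column of the horizontally densified grid.
    cols = [interpolate_sequence([row[j] for row in rows], interp)
            for j in range(len(rows[0]))]
    # Re-index the column-major results back into a row-major (N, N) grid.
    return [[cols[j][i] for j in range(len(cols))] for i in range(len(cols[0]))]

# Usage with a dummy averaging interpolator on a 3x3 grid of random views:
toy = [[torch.rand(3, 64, 64) for _ in range(3)] for _ in range(3)]
dense = reconstruct_dense(toy, interp=lambda a, b: 0.5 * (a + b))
print(len(dense), len(dense[0]))  # 5 5
```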
Let us consider the horizontal interpolation case in order to explain the framework in more detail. Given an input triplet {I^S_{s-2,t}, I^S_{s,t}, I^S_{s+2,t}}, two intermediate views can be generated from pairwise adjacent views:

\hat{I}^D_{s-1,t} = M(I^S_{s-2,t}, I^S_{s,t})
\hat{I}^D_{s+1,t} = M(I^S_{s,t}, I^S_{s+2,t})     (1)

where M is a pre-trained video interpolation method.
Inspired by the recent success of the application of cycle consistency to video interpolation [14], [15], we propose to fine-tune our baseline interpolator M in a self-supervised manner by applying the cycle consistency constraint to the light field angular domain, as shown in Figure 3a. By applying the interpolator M to the two intermediate views generated from the input triplet as defined in equation (1), we can obtain an estimate of the center view of the input triplet I^S_{s,t}, which we denote as the cycle-reconstructed view \tilde{I}^S_{s,t}:

\tilde{I}^S_{s,t} = M(\hat{I}^D_{s-1,t}, \hat{I}^D_{s+1,t}) = M(M(I^S_{s-2,t}, I^S_{s,t}), M(I^S_{s,t}, I^S_{s+2,t}))     (2)

We can thus define the cycle loss as the ℓ1-norm distance between the cycle-reconstructed view \tilde{I}^S_{s,t} and the input view I^S_{s,t}:

L_c = || \tilde{I}^S_{s,t} - I^S_{s,t} ||_1     (3)
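A minimal PyTorch sketch of this loss on one horizontal triplet is given below; `interp(a, b)` is a stand-in for the pre-trained interpolator M, and the tensor shapes are our assumption:

```python
import torch.nn.functional as F

def cycle_loss(i_left, i_center, i_right, interp):
    """Cycle loss of Eqs. (2)-(3) on views taken two angular positions apart.

    The tensors are (B, 3, H, W); no ground truth intermediate view is needed.
    """
    mid_left = interp(i_left, i_center)          # estimate of I^D_{s-1,t}
    mid_right = interp(i_center, i_right)        # estimate of I^D_{s+1,t}
    cycle_center = interp(mid_left, mid_right)   # should map back onto I^S_{s,t}
    return F.l1_loss(cycle_center, i_center)
```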
While ℓ1-norm based losses are able to minimize the overall error between the estimated images and the corresponding original images, they are known to generate over-smooth results. To tackle this problem, we also introduce in our framework a perceptual loss L_p, defined as the ℓ2-norm between high-level convolutional features extracted from the cycle-reconstructed view and the input view:

L_p = || \Psi(\tilde{I}^S_{s,t}) - \Psi(I^S_{s,t}) ||_2     (4)

where Ψ extracts convolutional features from images using a VGG-16 network [23]. This perceptual loss is then used to train our base CNN network (SepConv [22], see below).
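A hedged sketch of Ψ and the perceptual loss using torchvision's VGG-16 is shown below; the exact layer cut (here up to relu3_3), the use of an MSE over feature maps for the ℓ2 term, and the omission of ImageNet input normalization are our assumptions, and the weights-enum API requires a reasonably recent torchvision:

```python
import torch
import torch.nn.functional as F
import torchvision

class VGGFeatures(torch.nn.Module):
    """Frozen VGG-16 feature extractor playing the role of Ψ in Eq. (4)."""
    def __init__(self, last_layer: int = 16):  # up to relu3_3 (assumed cut-off)
        super().__init__()
        vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT)
        self.features = vgg.features[:last_layer].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        return self.features(x)

def perceptual_loss(psi, pred, target):
    """Distance between VGG features of the cycle-reconstructed and input views."""
    return F.mse_loss(psi(pred), psi(target))
```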
Fig. 4: Demonstration of the two-step strategy to generate multiple intermediate views from a set of sparse views, denoted as green squares, when α = 4. The first step is to synthesize middle views, denoted as red and blue circles, between each pair of input views. The second step uses original and synthetic views to reconstruct the remaining missing views, denoted as yellow and grey circles, along one angular dimension.
Furthermore, to stabilize the training process, we introduce a reconstruction loss L_r, as shown in Figure 3b. In this case, the two non-adjacent views of the input triplet, I^S_{s-2,t} and I^S_{s+2,t}, are used to generate the center view of the input triplet:

\hat{I}^S_{s,t} = M(I^S_{s-2,t}, I^S_{s+2,t})     (5)

This reconstructed view can be used to define the reconstruction loss L_r as its ℓ1-norm distance to the input view:

L_r = || \hat{I}^S_{s,t} - I^S_{s,t} ||_1     (6)
Note that all losses introduced in our framework, as defined in equations (3), (4), and (6), do not rely on any knowledge of the ground truth dense light field L^D but only on the given sparse input light field L^S, thus allowing self-supervised training or fine-tuning of the learning-based interpolator M.
Multi-step Light Field Generation. While our framework naturally performs angular up-sampling with a factor α = 2, denser light fields can be obtained by iteratively applying the proposed approach, as illustrated in Figure 4 for α = 4. Any up-sampling factor which is a power of two is in fact supported, i.e. α = 2^x with x ∈ Z, x ≥ 1.
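The iteration can be expressed as in the sketch below (our illustration; `densify` is any single α = 2 pass, for instance the `reconstruct_dense` sketch given earlier):

```python
def upsample_angular(sparse_lf, densify, alpha: int):
    """Reach α = 2^x (x ≥ 1) by repeatedly applying one α = 2 densification pass."""
    assert alpha >= 2 and (alpha & (alpha - 1)) == 0, "alpha must be a power of two"
    dense = sparse_lf
    while alpha > 1:
        dense = densify(dense)
        alpha //= 2
    return dense

# Example: two α = 2 passes take a 3x3 angular grid to 9x9 (α = 4).
# dense = upsample_angular(sparse_3x3, densify=lambda lf: reconstruct_dense(lf, model), alpha=4)
```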
C. Implementation Details
In this work, we select the adaptive separable convolution (SepConv) [22] as our baseline interpolator M due to its balance between ease of use and accuracy, but note that any learning-based video interpolation method [24]–[26] can be used within our framework. The SepConv network employs an encoder-decoder architecture, in which each part contains convolution blocks and skip connections to extract features, and then performs four individual 1D kernel estimations to obtain the final result. We use the implementations available online based on PyTorch¹,² and use the default configurations from the original SepConv paper. We fine-tune the pre-trained model by minimizing the objective function:

\arg\min_M ( \lambda_c L_c + \lambda_r L_r + \lambda_p L_p )     (7)

where L_c, L_r and L_p are defined in equations (3), (6) and (4). For all experiments, we set the parameters as λ_c = 1, λ_r = 1, and λ_p = 0.06. The Adam optimizer is applied for optimization with a batch size of 8. We start with a learning rate of 0.001 and a scheduler is applied to decay the rate according to the learning progress. As in the original SepConv work, we first crop training data to 150 × 150 patches, then randomly crop to 128 × 128. In addition, we perform pre-processing to eliminate patches whose disparity is too small. An Intel Core i7-6700K 4.0 GHz CPU was used for all our experiments, and the neural network training was run on a single Nvidia Titan Xp GPU with 12 GB memory.

1 github.com/sniklaus/sepconv-slomo
2 github.com/HyeongminLEE/pytorch-sepconv
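The full self-supervised fine-tuning step of equation (7) can be sketched as follows. This is our illustration with the hyper-parameters stated above, not the authors' training code: `model` stands for the pre-trained SepConv interpolator, `psi` for a frozen feature extractor such as the VGGFeatures sketch above, and the scheduler choice is an assumption (the paper only states that the learning rate is decayed).

```python
import torch
import torch.nn.functional as F

def finetune_step(model, psi, triplet, optimizer, lam_c=1.0, lam_r=1.0, lam_p=0.06):
    """One self-supervised update of Eq. (7) on a batch of view triplets."""
    i_left, i_center, i_right = triplet  # (B, 3, H, W) views, two angular positions apart

    # Cycle-reconstructed centre view, Eq. (2).
    mid_left = model(i_left, i_center)
    mid_right = model(i_center, i_right)
    cycle_center = model(mid_left, mid_right)

    loss_c = F.l1_loss(cycle_center, i_center)              # cycle loss, Eq. (3)
    loss_r = F.l1_loss(model(i_left, i_right), i_center)    # reconstruction loss, Eq. (6)
    loss_p = F.mse_loss(psi(cycle_center), psi(i_center))   # perceptual loss, Eq. (4)

    loss = lam_c * loss_c + lam_r * loss_r + lam_p * loss_p  # objective of Eq. (7)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)           # batch size 8 in the loader
# scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)   # assumed decay strategy
```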
IV. EXPERIMENTS
In this section, we first conduct an ablation study, in particular evaluating the effectiveness of the proposed framework compared to supervised fine-tuning. For this purpose, we use a variety of
real-world and synthetic dense light field datasets which we
sub-sample to create our test sparse datasets with sampling
ratios α = 2 and α = 4.
We then compare the proposed framework to two top-
performing state-of-the-art light field view synthesis methods,
a shearlet-based method [8] and a learning-based method
(LFEPICNN) [10].
For all our evaluations, the peak signal-to-noise ratio
(PSNR) and the structural similarity (SSIM) are computed
over RGB images to evaluate the numerical performance of
the different methods. For each light field, unless emphasized
specifically, the average numerical results are computed over
all synthesized views. All evaluations are performed on the
same machine to ensure fairness of the comparison. A more detailed experimental summary can be found on our website (https://v-sense.scss.tcd.ie/?p=5163).
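For reference, a per-view evaluation in this spirit could be computed with scikit-image as sketched below (our illustration; the `channel_axis` argument assumes a recent scikit-image version, and views are assumed to be floats in [0, 1]):

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_view(synthesized: np.ndarray, reference: np.ndarray):
    """PSNR and SSIM over one RGB view of shape (H, W, 3); per-light-field
    scores are then averaged over all synthesized views."""
    psnr = peak_signal_noise_ratio(reference, synthesized, data_range=1.0)
    ssim = structural_similarity(reference, synthesized, channel_axis=-1, data_range=1.0)
    return psnr, ssim
```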
A. Ablation Study
For this study we used dense light fields from real-world
and synthetic datasets. For the real-world dataset, we selected
27 real-world Lytro light fields captured by EPFL [27] and
INRIA [28] using Lytro Illum cameras, and 11 light fields from
the Stanford dataset taken by a camera gantry [29]. The Lytro
Illum light fields are processed with the pipeline of Matysiak
et al. [30]. For the synthetic light field dataset, all 28 light
fields from the HCI benchmark [31] were used, as well as
160 light fields from the dataset of [32].
For testing, 10 light fields are used: 2 from EPFL, 2 from
INRIA, 2 from Stanford, and 4 from HCI. All remaining light
fields are used for training.
Test sparse light fields are sub-sampled from the original light fields with ratios α = 2 and α = 4. More precisely, 9 × 9 views are extracted from the input light fields and considered as dense ground truth, and 5 × 5 and 3 × 3 views are then sub-sampled to create sparse light fields.
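Sub-sampling the angular grid reduces to strided indexing, as in this small NumPy sketch (our illustration, assuming an (N, N, H, W, 3) layout):

```python
import numpy as np

def angular_subsample(dense_lf: np.ndarray, alpha: int) -> np.ndarray:
    """Keep every alpha-th view in both angular dimensions,
    e.g. 9x9 -> 5x5 for alpha = 2 and 9x9 -> 3x3 for alpha = 4."""
    return dense_lf[::alpha, ::alpha]

dense = np.zeros((9, 9, 64, 64, 3), dtype=np.float32)
print(angular_subsample(dense, 2).shape[:2], angular_subsample(dense, 4).shape[:2])  # (5, 5) (3, 3)
```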
We conduct the ablation experiments by comparing to
several variants of the proposed framework. First, we use
the pre-trained model of SepConv as the baseline. Since the
dense light field ground truth is available, we fine-tuned the
SepConv model using supervised training. We also evaluate
the influence of the cycle loss by training our framework using
only the reconstruction loss. We also assess the performance
of our framework when vertical interpolation is performed
before horizontal interpolation, as opposed to applying horizontal
interpolation first as shown in Figure 2.
The numerical results are computed and averaged over all
test light fields, and the comparison is presented in Table I.
As we can observe, our proposed method outperforms the pre-trained model even without the support of the ground truth, and achieves competitive performance compared to fully supervised fine-tuning. It is also clear that the use of the cycle loss improves the performance of our framework. In addition, we can see that the cascading order of horizontal/vertical or vertical/horizontal interpolation has a non-negligible impact on the final performance.
TABLE I: Quantitative results of the ablation study.

                                      α = 2               α = 4
                                  PSNR(dB)   SSIM     PSNR(dB)   SSIM
SepConv Pretrained                  37.23   0.9880      34.66   0.9793
SepConv Supervised Fine-tuning      38.40   0.9921      35.81   0.9831
Ours without Cycle Loss             38.01   0.9883      35.25   0.9801
Ours with V-H CNN                   38.14   0.9889      35.67   0.9817
Ours Full Model                     38.30   0.9902      35.72   0.9830
B. Comparison to Light Field View Synthesis Methods
We use here the same test datasets as for the ablation
study to compare the proposed framework against the pre-
trained SepConv model, shearlet-based reconstruction [8], and
LFEPICNN [10]. We used the implementations provided by the authors, and carefully selected their parameters to maximize
their performance.
We show the quantitative results for each dataset separately
in Tables II, III, and IV, as each dataset corresponds to a
different disparity range. Note that the results are averaged
per dataset.
The shearlet-based reconstruction is almost always outper-
formed by all other methods, and while shearlet-based recon-
struction and LFEPICNN are designed specifically for light
fields, they are only competitive on the Lytro dataset which
has a very narrow disparity range. Our proposed framework
consistently outperforms all other methods including SepConv.
In addition, our method is more robust when using sparser
input datasets, such as with a sub-sampling ratio α = 4.
We present visual comparisons for the ChezEdgar and
LegoKnights light fields in Figure 5. LegoKnights is a challeng-
ing case, as it has a wider disparity than the other test light fields and
large texture-less regions. Shearlet [8] and LFEPICNN [10]
both fail to produce plausible results and significant artifacts
TABLE II: Numerical results on the real-world Lytro datasets [27], [28]

                 α = 2               α = 4
             PSNR(dB)   SSIM     PSNR(dB)   SSIM
Shearlet       33.10   0.9667      29.99   0.9361
LFEPICNN       35.35   0.9864      32.06   0.9640
SepConv        35.30   0.9836      32.46   0.9712
Ours           36.76   0.9876      33.62   0.9767
TABLE III: Numerical results on the synthetic HCI dataset [31]

                 α = 2               α = 4
             PSNR(dB)   SSIM     PSNR(dB)   SSIM
Shearlet       34.81   0.9734      29.88   0.8911
LFEPICNN       34.25   0.9692      30.42   0.9172
SepConv        38.88   0.9943      36.23   0.9888
Ours           39.87   0.9953      37.44   0.9913
TABLE IV: Numerical results on the real-world Stanford Gantry datasets [29]

                 α = 2               α = 4
             PSNR(dB)   SSIM     PSNR(dB)   SSIM
Shearlet       31.44   0.8977      29.03   0.8484
LFEPICNN       34.68   0.9407      30.46   0.8762
SepConv        37.80   0.9843      35.19   0.9788
Ours           38.23   0.9853      36.47   0.9791
Fig. 5: Visual comparison on the INRIA ChezEdgar and Stanford
Lego Knights light fields. (a) Ground-truth. (b) Shearlet [8]. (c)
LFEPICNN [10]. (d) Ours.
can be observed on challenging areas, such as the tip of the
sword and bricks on the background wall. In comparison, our
proposed approach generates results closer to the ground-truth.
This demonstrates that our method is more robust to different
real-world scenes and is able to produce more photo-realistic
results for large disparity view synthesis.
A visual comparison of synthetic scenes is presented in
Figure 6 using Herbs and Bicycle from the HCI dataset [31].
As we can observe, Shearlet [8] fails to reconstruct sharp
details in texture-less regions, such as the door in Bicycle. The
results of LFEPICNN [10] are blurry in occluded regions, such
as the leaves in Herbs and the metal bin in Bicycle. Our method
achieves the best quantitative and qualitative performance on the
synthetic HCI dataset, and shows robustness to occlusions and
texture-less surfaces.
Fig. 6: Visual comparison on the synthetic HCI dataset [31]. (a)
Ground-truth. (b) Shearlet [8]. (c) LFEPICNN [10]. (d) Ours.
V. CONCLUSIONS
In this work, we proposed a novel self-supervised framework to reconstruct dense light fields by synthesizing novel intermediate light field views. To cope with the small size of available light field datasets, we introduced a cycle consistency mechanism to fine-tune a pre-trained video interpolation method in a self-supervised fashion. In this context, the method does not require paired ground truth and can be applied to any low angular resolution light field input. The proposed method outperforms other state-of-the-art approaches on various light fields, in particular when handling wide-disparity inputs. In addition, our framework can be combined with any video interpolation approach, allowing 2D video interpolation methods to be applied to light field data. For future work, we may focus on adapting the proposed method to more challenging scenarios, such as very sparse light fields captured by camera arrays, which may require additional priors to handle the sparsity.
REFERENCES
[1] M. Levoy and P. Hanrahan, “Light field rendering,” in Proc. SIGGRAPH,
1996, pp. 31–42.
[2] J. Yu, “A light-field journey to virtual reality,” IEEE MultiMedia, vol. 24,
no. 2, pp. 104–112, 2017.
[3] G. Wu, B. Masia, A. Jarabo, Y. Zhang, L. Wang, Q. Dai, T. Chai, and
Y. Liu, “Light field image processing: An overview,” IEEE Journal of
Selected Topics in Signal Processing, vol. 11, no. 7, pp. 926–954, 2017.
[4] R. Ng, M. Levoy, M. Brédif, G. Duval, M. Horowitz, and P. Hanrahan, “Light Field Photography with a Hand-Held Plenoptic Camera,”
Stanford University CSTR, Tech. Rep., Apr. 2005.
[5] T. Herfet, T. Lange, and K. Chelli, “5D light field video capture,” in
Proceedings of the 16th ACM SIGGRAPH European Conference on
Visual Media Production, 2019.
[6] M. Alain and A. Smolic, “Light field super-resolution via LFBM5D
sparse coding,” in 2018 25th IEEE International Conference on Image
Processing (ICIP). IEEE, 2018, pp. 2501–2505.
[7] Y. Yoon, H.-G. Jeon, D. Yoo, J.-Y. Lee, and I. So Kweon, “Learning
a deep convolutional network for light-field image super-resolution,” in
Proceedings of the IEEE International Conference on Computer Vision
Workshops, 2015, pp. 24–32.
[8] S. Vagharshakyan, R. Bregovic, and A. Gotchev, “Light field reconstruc-
tion using shearlet transform,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 40, no. 1, pp. 133–147, 2017.
[9] N. K. Kalantari, T.-C. Wang, and R. Ramamoorthi, “Learning-based
view synthesis for light field cameras,” ACM Transactions on Graphics
(TOG), vol. 35, no. 6, pp. 1–10, 2016.
[10] G. Wu, Y. Liu, L. Fang, Q. Dai, and T. Chai, “Light field reconstruction
using convolutional network on EPI and extended applications,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 41,
no. 7, pp. 1681–1694, 2018.
[11] Y. Wang, F. Liu, Z. Wang, G. Hou, Z. Sun, and T. Tan, “End-to-end view
synthesis for light field imaging with pseudo 4DCNN,” in Proceedings
of the European Conference on Computer Vision (ECCV), 2018, pp.
333–348.
[12] Z. Cheng, Z. Xiong, C. Chen, and D. Liu, “Light field super-resolution:
A benchmark,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops, 2019, pp. 0–0.
[13] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image
translation using cycle-consistent adversarial networks,” in Proceedings
of the IEEE International Conference on Computer Vision, 2017, pp.
2223–2232.
[14] Y.-L. Liu, Y.-T. Liao, Y.-Y. Lin, and Y.-Y. Chuang, “Deep video frame
interpolation using cyclic frame generation,” in Proceedings of the AAAI
Conference on Artificial Intelligence, vol. 33, 2019, pp. 8794–8802.
[15] F. A. Reda, D. Sun, A. Dundar, M. Shoeybi, G. Liu, K. J. Shih, A. Tao,
J. Kautz, and B. Catanzaro, “Unsupervised video interpolation using
cycle consistency,” in Proceedings of the IEEE International Conference
on Computer Vision, 2019, pp. 892–900.
[16] L. Shi, H. Hassanieh, A. Davis, D. Katabi, and F. Durand, “Light field
reconstruction using sparsity in the continuous fourier domain,” ACM
Transactions on Graphics (TOG), vol. 34, no. 1, pp. 1–13, 2014.
[17] H. Wing Fung Yeung, J. Hou, J. Chen, Y. Ying Chung, and X. Chen,
“Fast light field reconstruction with deep coarse-to-fine modeling of
spatial-angular clues,” in Proceedings of the European Conference on
Computer Vision (ECCV), 2018, pp. 137–152.
[18] T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely, “Stereo
magnification: Learning view synthesis using multiplane images,” arXiv
preprint arXiv:1805.09817, 2018.
[19] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K. Kalantari, R. Ra-
mamoorthi, R. Ng, and A. Kar, “Local light field fusion: Practical view
synthesis with prescriptive sampling guidelines,” ACM Transactions on
Graphics (TOG), vol. 38, no. 4, pp. 1–14, 2019.
[20] Y. Gao and R. Koch, “Parallax view generation for static scenes using
parallax-interpolation adaptive separable convolution,” in 2018 IEEE
International Conference on Multimedia & Expo Workshops (ICMEW).
IEEE, 2018, pp. 1–4.
[21] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via adaptive
convolution,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, 2017, pp. 670–679.
[22] S. Niklaus, L. Mai, and F. Liu, “Video frame interpolation via adap-
tive separable convolution,” in Proceedings of the IEEE International
Conference on Computer Vision, 2017, pp. 261–270.
[23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[24] S. Niklaus and F. Liu, “Context-aware synthesis for video frame inter-
polation,” 2018 IEEE/CVF Conference on Computer Vision and Pattern
Recognition, pp. 1701–1710, 2018.
[25] H. Jiang, D. Sun, V. Jampani, M.-H. Yang, E. Learned-Miller, and
J. Kautz, “Super slomo: High quality estimation of multiple intermediate
frames for video interpolation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, 2018, pp. 9000–9008.
[26] W. Bao, W.-S. Lai, C. Ma, X. Zhang, Z. Gao, and M.-H. Yang, “Depth-
aware video frame interpolation,” in IEEE Conference on Computer
Vision and Pattern Recognition, 2019.
[27] M. Rerabek and T. Ebrahimi, “New light field image dataset,” in
Proceedings of the International Conference on Quality of Multimedia
Experience, 2016.
“Inria Lytro Illum dataset,” http://www.irisa.fr/temics/demos/lightField/CLIM/DataSoftware.html, accessed: 26-01-2018.
[29] “The stanford light field archive,” http://lightfield.stanford.edu/lfs.html,
accessed: 05-03-2019.
[30] P. Matysiak, M. Grogan, M. Le Pendu, M. Alain, and A. Smolic, “A
pipeline for lenslet light field quality enhancement,” in 2018 25th IEEE
International Conference on Image Processing (ICIP). IEEE, 2018, pp.
639–643.
[31] K. Honauer, O. Johannsen, D. Kondermann, and B. Goldluecke, “A
dataset and evaluation methodology for depth estimation on 4D light
fields,” in Asian Conference on Computer Vision. Springer, 2016.
[32] A. Alperovich, O. Johannsen, M. Strecke, and B. Goldluecke, “Light
field intrinsics with a deep encoder-decoder network,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
2018, pp. 9145–9154.