New visual coding exploration in MPEG: Super-MultiView and
Free Navigation in Free viewpoint TV
Gauthier Lafruit, Université Libre de Bruxelles (Belgium); Marek Domański, Krzysztof Wegner and Tomasz Grajek, Poznań University
of Technology (Poland); Takanori Senoh, National Institute of Information and Communications Technology (Japan); Joël Jung,
Orange Labs (France); Péter Tamás Kovács, Holografika (Hungary); Patrik Goorts and Lode Jorissen, Hasselt University; Adrian
Munteanu and Beerend Ceulemans, Vrije Universiteit Brussel (Belgium); Pablo Carballeira and Sergio García, Universidad Politécnica
de Madrid (Spain); and Masayuki Tanimoto, Nagoya Industrial Science Research Institute (Japan)
Abstract
ISO/IEC MPEG and ITU-T VCEG have recently jointly issued
a new multiview video compression standard, called 3D-HEVC,
which reaches unprecedented compression performance for linear,
dense camera arrangements. In view of supporting future high-
quality, auto-stereoscopic 3D displays and Free Navigation
virtual/augmented reality applications with sparse, arbitrarily
arranged camera setups, innovative depth estimation and virtual
view synthesis techniques with global optimizations over all camera
views should be developed. Preliminary studies in response to the
MPEG-FTV (Free viewpoint TV) Call for Evidence suggest these
targets are within reach, with at least 6% bitrate gains over 3D-
HEVC technology.
Introduction
For 25 years, MPEG has been steadily involved in the development of video coding technologies. Today, the most advanced single-view video coding standard, called HEVC (High Efficiency Video Coding), offers a data rate reduction of two orders of magnitude compared to uncompressed video. This makes it possible to transmit Full-HD TV (High Definition) and soon UHD TV (Ultra High Definition) over communication channels with bitrates of around 15 Mbit/s, ensuring wide acceptance by the general public in the near future.
Over the last decade, ISO/IEC MPEG and ITU-T VCEG have
also jointly developed multiview video coding standards (MV-
AVC, MV-HEVC) focusing on the compression of multiple camera
feeds “as is”, i.e. without means to facilitate the generation of
additional views that are not transmitted to the receiver. Depth-based 3D formats, in particular 3D-HEVC (standardized in February 2015), have been developed to address this shortcoming:
with the use of Depth Image Based Rendering (DIBR) techniques,
the generation of additional views from a small number of
transmitted views was enabled, supporting glasses-free/auto-
stereoscopic 3D display applications with dozens of output views
from only a handful of input camera feeds. For example, 3-input/9-output and 5-input/28-output Horizontal Parallax Only (HPO) glasses-free 3D displays have reached the prosumer market, while
Super-Multi-View (SMV) light field displays with hundreds of
ultra-dense output views and smooth motion parallax are prototyped
in R&D labs, e.g. Figure 1.
Unfortunately, very high quality viewing over a large field of view would require a high number of densely arranged input cameras, pushing 3D-HEVC bitrates to the order of hundreds of Mbit/s for SMV at home cinema quality levels, which might eventually hamper consumer market penetration.
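As a rough, purely illustrative back-of-envelope estimate (the per-view rate is an assumption, not a figure from the CfE): with on the order of 80 transmitted views, as in the SMV test material discussed later, and a few Mbit/s per compressed view, one lands at 80 views x 3 Mbit/s ≈ 240 Mbit/s, i.e. indeed in the range of hundreds of Mbit/s.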
Figure 1. Light Field display (Courtesy of Holografika)
Similarly, in a Virtual Reality (VR) context using Head
Mounted Displays (HMD), literally surrounding the scene to be visualized with an ultra-dense arrangement of several hundred cameras would indeed offer correct motion parallax and Free Navigation (FN) functionality around the scene (cf. Figure 2), similar to the Matrix bullet-time effect. Additionally, zoom-in/out
functionalities (cf. arrow 4 in Figure 3) would extend the walk-
around feeling to a truly immersive “fly through the scene” VR
experience on authentic looking content.
Figure 2. Motion parallax in Virtual Reality (Courtesy of Nozon)
However, to fully enable the take-up of such VR technology in every living room, drastic cost reductions in multi-camera content acquisition and transmission must be achieved, which inevitably
calls for a reduction in the number of acquisition cameras and the
development of high-performance DIBR virtual view synthesis
techniques with sparse camera arrangements.
Since 3D-HEVC was primarily developed for consumer autostereoscopic 3D displays with linear camera arrangements and small inter-camera distances (narrow baseline), new compression and view synthesis challenges have to be explored for
the aforementioned Super-MultiView (SMV) and Virtual Reality
Free Navigation (VR-FN) application scenarios with moderately
dense or sparse, arbitrarily arranged multi-camera setups. MPEG
therefore recently issued a Call for Evidence (CfE), inviting companies and organizations to demonstrate technology that they believe performs better than 3D-HEVC and its accompanying pre/post-processing. The present paper briefly summarizes the process, the challenges and the expected outcomes for this future standard, which, in the absence of an agreed name in the standardization committee at the time of writing, will be referred to in the present paper as 3D-HEVC++ (a naming convention borrowed from C++, which reaches one step further than the well-established C programming language).
Free Navigation technology by 2020
For the MPEG CfE, the submission deadline has been set to 17 February 2016, with an evaluation of the
proponents’ responses by the MPEG Free viewpoint TV (MPEG-
FTV) Ad-hoc Group during the 114th MPEG meeting in San Diego,
20-26 February 2016.
If any of the proposed technologies significantly outperforms
currently available MPEG technology, MPEG plans to issue a Call
for Proposals (CfP), subsequent to this CfE, to develop standards
that offer increased compression performance and viewing
experiences beyond 3D-HEVC in SMV and FN application
scenarios.
During this development, it is expected that the Olympic
Games of Rio de Janeiro in 2016 will bootstrap Multiview coding
technologies with discrete multi-viewpoint rendering experiences in
many sports events. However, the current view synthesis techniques
proposed in MV-HEVC and 3D-HEVC are only competitive in
narrow baseline camera setups. It is therefore expected that Free
viewpoint TV, allowing the user to navigate freely in the space
surrounded by a sparse set of fixed cameras, will need an additional 3-4 year development cycle before reaching the necessary quality level at the Olympic Games of Tokyo in 2020. This
timeline is well synchronized with the MPEG-FTV CfE and
expected CfP schedules.
Moreover, [1] forecasts that VR with multi-camera captured
content will represent a $30 billion market by 2020, with 20% VR
films and 45% covering VR games. Already 170 million VR gamers
are expected worldwide by 2018 with an annual VR gaming revenue
of $8.6 billion, equally divided over hardware and software. The
study also pinpoints the need to develop new image capture and
processing technologies (aka Computational Imaging) to overcome
the limitation of the user looking around (360 degrees video) from
the perspective of the camera’s position only, without any capability
to navigate freely within the scene. The technology to allow such
Free Navigation (FN) is believed to be based on light field capture
[2], which is in line with the multi-camera approach proposed in
MPEG-FTV (MPEG Free viewpoint TV), further studied in a newly
established Light Field Ad-hoc Group in MPEG [3], as well as in
other standardization committees like JPEG-PLENO [4].
3D-HEVC extensions for SMV and FN
Figure 3 shows a generic multi-camera setup for real-life
application scenarios, with extensions to the current 3D-HEVC
codec architecture to support the newly proposed non-linear SMV
and/or sparse FN camera arrangements. This should lead to an agile
Multiview+Depth transmission scheme, referred to as 3D-
HEVC++. The solid line cameras correspond to physical cameras
that are set up around the scene, typically in a non-linear
arrangement. The eye icons correspond to user requested virtual
viewpoints for which no physical camera views exist. Depth range
cameras might also be present to deliver meta-data to the DIBR
processing pipeline for virtual view generation, performed in the
VSRS (View Synthesis Reference Software) module [5]. The depth
meta-data might also be obtained directly from the color cameras
through DERS (Depth Estimation Reference Software) [6]. DERS
and VSRS are non-normative modules, but nevertheless play an
important role in the codec quality-bitrate performance figures,
calling for their in-depth study and improvement in the development
of the future 3D-HEVC++ standard.
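To make the role of the depth meta-data more concrete, the minimal Python sketch below illustrates the basic DIBR principle underlying VSRS-like view synthesis, under the simplifying assumption of rectified, parallel cameras (so that warping reduces to a horizontal disparity shift). The function name, inverse-depth mapping and z-buffer handling are illustrative only and are not taken from the reference software.

```python
import numpy as np

def dibr_forward_warp(color, depth, f, baseline, znear, zfar):
    """Forward-warp a reference view towards a virtual camera shifted by `baseline`.

    Assumes rectified, parallel cameras so that warping reduces to a horizontal
    disparity shift d = f * baseline / Z. The 8-bit depth map is mapped back to
    metric depth Z with the usual inverse-depth quantization convention.
    """
    h, w = depth.shape
    warped = np.zeros_like(color)
    zbuf = np.full((h, w), np.inf)           # keep only the nearest contribution per pixel

    # 8-bit depth value -> metric depth Z (inverse-depth quantization)
    z = 1.0 / (depth / 255.0 * (1.0 / znear - 1.0 / zfar) + 1.0 / zfar)

    for y in range(h):
        for x in range(w):
            d = f * baseline / z[y, x]        # disparity in pixels
            xv = int(round(x - d))            # target column in the virtual view
            if 0 <= xv < w and z[y, x] < zbuf[y, xv]:
                zbuf[y, xv] = z[y, x]
                warped[y, xv] = color[y, x]
    return warped                             # disoccluded pixels remain empty (zero)
```

Disoccluded pixels remain empty and have to be filled by blending with a second reference view and/or by inpainting, which is exactly where the VSRS improvements discussed later come into play.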
Figure 3. Multiview plus depth video pipeline for 3D-HEVC (top-left) and 3D-
HEVC++ (bottom-right) showing the input cameras and user requested views
(eyes) that are synthesized along linear (1, 2) and non-linear/curved pathways
(3), as well as zoom-in/out functionalities (4) to obtain viewpoints within the
enclosed camera volume.
Indeed, the (optional) depth maps are compressed together with
the color images, and view synthesis is also used during
compression in order to provide a prediction of a physical camera view from its direct neighbors, so that only a low-entropy difference image needs to be transmitted to the receiver. This View Synthesis Prediction
(VSP) is a codec-in-the-loop method, hence will not impact the
decoded view quality in case of an imperfect view synthesis (though
it will then increase the bitrate). However, an additional view
synthesis (VSRS) step will be applied from the decoded views to
generate additional virtual viewpoints that are not transmitted to the
receiver. Since this view synthesis works in an open-loop mode, any
artefact in the generated views will have a dramatic impact on the
perceived output quality. This is an important reason to explore new
view synthesis techniques that can work properly in large baseline
conditions.
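The difference between closed-loop VSP and open-loop receiver-side synthesis can be summarized in a few lines of illustrative Python; the `quantize` argument stands in for the whole transform/quantization/entropy-coding chain and is an assumption of this sketch, not 3D-HEVC syntax.

```python
import numpy as np

def vsp_closed_loop(original_view, synthesized_prediction, quantize):
    """Closed-loop View Synthesis Prediction: only the residual is transmitted.

    An imperfect prediction inflates the residual (and hence the bitrate), but
    since the decoder adds the same residual back to the same prediction,
    synthesis artefacts do not show up in the decoded view.
    """
    residual = original_view.astype(np.int16) - synthesized_prediction.astype(np.int16)
    coded_residual = quantize(residual)        # stand-in for transform + entropy coding
    decoded = synthesized_prediction.astype(np.int16) + coded_residual
    return np.clip(decoded, 0, 255).astype(np.uint8)

def open_loop_synthesis(synthesize, decoded_neighbor_views, depths, virtual_pose):
    """Open-loop synthesis at the receiver: no residual exists for a virtual
    viewpoint, so every artefact produced by `synthesize` is directly visible."""
    return synthesize(decoded_neighbor_views, depths, virtual_pose)
```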
Finally, it is worth noting that in an SMV display, all the
physical and virtual camera viewpoints have to be rendered
simultaneously. In a VR-FN application scenario, however, only
two adjacent viewpoints (physically existing and/or virtual
viewpoints) have to be rendered at any given moment in time in the
stereo HMD, based on the user’s current position. Since VR does
not tolerate high response latencies, the complexity of the employed
techniques should remain acceptable.
SMV and FN test sequences
The MPEG-FTV group recommends specific SMV and FN test
sequences in the MPEG-FTV CfE, in order to conduct comparative
studies between the submitted technologies [7]. The SMV
sequences contain 80 narrow-baseline views, while the FN
sequences contain only 7 views, each view being complemented by
a depth map that has been estimated offline, either by DERS, or by
a proponent’s in-house technique.
The Big Buck Bunny SMV sequences are generated from 3D
graphics files donated by the Blender Foundation. Eighty adjacent
viewing directions were synthetically rendered by Holografika to
obtain the Big Buck Bunny color and depth map videos used in the
CfE evaluation. Seven of these views are also used as sparse FN
sequences (Flowers, Butterfly). The Big Buck Bunny Flowers and
Butterfly depth maps do not contain any artefacts, since they are
synthetically rendered from a 3D model by conversion from the z-
buffer during rendering. However, the depth maps of all other
sequences (Champagne Tower, Pantomime, Soccer-Arc1, Soccer-
Linear2 and Poznan Blocks) have been estimated algorithmically
(DERS or proprietary software) and show some artefacts, possibly
impeding the subsequent view synthesis quality. For instance, since
DERS uses a Graph Cut stereo matching technique [8] applied
pairwise on adjacent physical camera views, some spatial
inconsistencies might appear during view synthesis (VSRS) of
virtual views; these are even more apparent in large baseline setups,
as will be discussed later in Figures 7 and 8.
3D-HEVC in non-linear, large-baseline conditions
The 3D-HEVC technology standardized in February 2015 was originally developed and tested for linear, narrow baseline camera arrangements. In contrast, convergent cameras in the typical 3D-HEVC++ coding pipeline of Figure 3 will create both positive and negative disparities, cf. Figure 4, requiring minor format and syntax changes to the codec specifications.
Figure 4. Positive and negative disparities (d) in convergent camera setup
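As a reminder of why this happens (a textbook small-angle approximation, not a formula from the 3D-HEVC specification): for rectified parallel cameras with focal length f and baseline B, the disparity of a point at depth Z is always of one sign, whereas for a convergent pair with convergence distance Z_c the disparity is measured relative to the convergence plane:

\[
d_{\parallel} = \frac{fB}{Z}, \qquad
d_{\text{conv}} \approx fB\left(\frac{1}{Z} - \frac{1}{Z_c}\right),
\]

so points in front of the convergence plane (Z < Z_c) yield positive disparity and points behind it yield negative disparity.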
More essential codec modifications will also be required in the development of 3D-HEVC++. For instance, an increase in the distance between cameras reduces the inter-view correlation. This degrades the 3D-HEVC compression performance, which converges towards the drastically lower compression efficiency of HEVC simulcast. Moreover, for viewpoints located off the line connecting the camera views, the inter-view prediction model has to be more complex than the simple horizontal disparity compensation currently implemented in the 3D-HEVC reference software [9].
It is hence expected that new coding developments will be
needed, including even non-normative DERS and VSRS
developments, which will eventually ripple into the normative 3D-
HEVC++ codec specifications. The boundaries between normative
and non-normative extensions of 3D-HEVC are consequently
gradually blurring away, considerably adding complexity to the 3D-
HEVC++ developments.
For instance, View Synthesis Prediction (VSP) is the process
of predicting a physical camera view from its two adjacent
neighbors, and transmitting the entropy-coded difference image.
[10] reports an average bitrate gain of 6.25% over 3D-HEVC by
back-and-forth projection between the respective 2D views and 3D
space, in non-linear, large baseline camera arrangements. More
generally, the implementation of 3D-HEVC++ should hence exploit a modified disparity vector derivation in tools such as View Synthesis Prediction (VSP), Disparity Compensated Prediction (DCP), Neighboring Block Disparity Vector (NBDV), Depth-oriented NBDV (DoNBDV), Inter-view Motion Prediction (IvMP) and Illumination Compensation (IC).
Figure 5. Homography inpainting for VSP in soccer sequence
In extended scenes like soccer fields, even more elaborate techniques are required, cf. Figure 5. For instance, [11] proposes a
homography reprojection and inpainting technique for VSP,
correcting mainly the outside borders of the camera views. This
extension towards novel 3D-HEVC++ technology for arbitrary
camera positions is still under development.
DERS and VSRS in non-linear, large-baseline
conditions
As already mentioned, though DERS and VSRS are non-
normative in the codec processing pipeline of Figure 3, their
performance has an important impact on the quality-bitrate
performance figures (e.g. the VSP tool discussed in the previous section) and hence also on future developments and updates of the 3D-HEVC codec towards 3D-HEVC++. We therefore give an overview of some improvements that have been studied over the past year in the MPEG-FTV group to support the new SMV and
FN application scenarios for 3D-HEVC++.
Figure 8. VSRS (top) vs. Epipolar Plane Imaging (bottom) view synthesis
Large Baseline View Synthesis
In order to serve autostereoscopic and light field displays with real-life video, advanced view synthesis technologies are needed, as it is often impractical to record the high number of camera views required by such displays. The main idea for increasing the compression performance consists in not transmitting some physical camera views at all (in contrast to VSP, which transmits a difference image) and generating the missing views with VSRS. For example, in Figure 6, skipping some views during transmission effectively decreases the bitrate by a factor of 2 in the successive skipping tests (horizontal arrows), but unfortunately the corresponding VSRS-generated views also show a large PSNR drop (4-6 dB in the example of the Champagne Tower test sequence), yielding suboptimal PSNR-bitrate curves.
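The PSNR-bitrate trade-off of such skip experiments can be reproduced with a few lines of Python. This is a generic evaluation sketch; the function and variable names are ours and do not come from the MPEG common test condition software.

```python
import numpy as np

def psnr(reference, synthesized, peak=255.0):
    """PSNR of a synthesized view against the original (uncoded) camera view."""
    mse = np.mean((reference.astype(np.float64) - synthesized.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def skip_experiment(original_views, output_views, per_view_bitrate, skipped):
    """Average PSNR over all output views, and total bitrate when the views listed
    in `skipped` are not transmitted (their rate is zero, but they must be
    synthesized and typically show a PSNR drop)."""
    total_rate = sum(r for i, r in enumerate(per_view_bitrate) if i not in skipped)
    avg_psnr = np.mean([psnr(o, s) for o, s in zip(original_views, output_views)])
    return avg_psnr, total_rate
```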
Figure 7 shows a more detailed view of the PSNR quality degradation when performing an open-loop VSRS view synthesis to recover all output views from a dyadically decreasing number of transmitted views. Of the sixty middle views under test, an increasing number of views are not transmitted to the receiver (skip1, skip3, skip5, etc.) but rather generated through VSRS. One clearly observes a huge quality degradation of up to a dozen dB in PSNR for large baselines (high skip numbers). Figure 8 shows the typical horizontal stripe artefacts caused by the VSRS reference software at increasing baselines.
Clearly, more in-depth studies are required to evaluate the
potential of skipping some views in large baseline scenarios, not
only in SMV applications, but foremost in FN applications where
VSRS will remain an open-loop tool without error correction post-
processing capabilities.
Figure 6. PSNR vs. bitrate results for different coding configurations of the
Champagne sequence, by skipping views before the coding (Skip<n>: n
consecutive views are skipped and need to be synthesized, between 2
transmitted views)
Figure 7. PSNR variation of synthesized views vs. transmitted views
Figure 9. View synthesis with Depth-based view blending (left) vs. VSRS view
blending (right)
Recently, some modifications in the VSRS software have been
proposed to exclude object depth contributions that are not visible
in all camera views, largely improving the view synthesis as shown
in Figure 9.
Figure 10. View synthesis without ghosting (left) vs. VSRS (right)
Figure 11. Globally optimized view synthesis (left) vs. VSRS (right)
Figure 12. Objective comparison against VSRS on the Big Buck Bunny
sequence. Results are reported for single-pixel (1p) versus quarter-pixel (4p) precision in the warping, with the view blending option turned on or off (nb).
Moreover, [12] demonstrates additional improvements to
VSRS. Firstly, the algorithm used to perform 3D warping between
camera views has been modified in order to avoid ghosting artefacts,
cf. Figure 10. Secondly, a new inpainting algorithm is proposed in
order to fill disoccluded regions in the image by optimizing a
Markov random field using a form of priority-belief propagation
[13]. The inpainting algorithm analyzes the depth map in the
synthesized view and is designed to reconstruct the disoccluded area
using image patches from background regions. Figure 11 clearly
shows visual improvements with respect to the current VSRS result.
In terms of objective quality expressed by PSNR, average gains of 0.64 dB have been measured for the Big Buck Bunny Flowers
sequence. Average PSNR values over time are shown in Figure 12
for each camera in the array.
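As a rough illustration of the depth-aware inpainting principle (a drastic simplification of the MRF/priority-BP approach of [12][13], for intuition only): disoccluded pixels are filled from nearby non-hole pixels, preferring the candidate that lies deeper in the scene, i.e. the background.

```python
import numpy as np

def fill_disocclusions_background(color, depth, hole_mask):
    """Very simplified disocclusion filling: copy the horizontally nearest non-hole
    pixel, preferring the one with larger depth (farther away, i.e. background).

    This only illustrates the principle behind depth-aware inpainting; the method
    of [12] instead optimizes a Markov random field with priority belief
    propagation and copies whole background patches.
    """
    filled = color.copy()
    h, w = hole_mask.shape
    for y in range(h):
        for x in range(w):
            if not hole_mask[y, x]:
                continue
            # nearest valid pixels to the left and right of the hole pixel
            left = next((x - k for k in range(1, w) if x - k >= 0 and not hole_mask[y, x - k]), None)
            right = next((x + k for k in range(1, w) if x + k < w and not hole_mask[y, x + k]), None)
            candidates = [c for c in (left, right) if c is not None]
            if candidates:
                src = max(candidates, key=lambda c: depth[y, c])   # background = larger depth
                filled[y, x] = color[y, src]
    return filled
```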
Multi-Camera Depth Estimation
Thanks to the techniques described in the previous section and the perfect depth maps of the Big Buck Bunny Flowers sequence, we have observed that, with this test sequence, when skipping a limited
number of views (skip1, skip3), the PSNR-bitrate curves remain
roughly Pareto optimal with large bitrate gains, as shown in Figure
13.
Figure 13. PSNR vs. bitrate results for different coding configurations of the
Bunny sequence, by skipping views before coding
For this particular case, the view skipping method as described
in the previous section remains interesting, in contrast to the severe
PSNR drops observed for large baselines in Figure 7. This is,
however, believed to be an exceptional case made possible by the
use of perfect depth maps, synthetically computed for the Big Buck Bunny sequence, hence avoiding further VSRS artefacts induced by depth errors.
Figure 14. Depth Estimation (top) and View Synthesis results (bottom) with Segmentation-guided Plane Sweeping (left) and DERS/VSRS (right)
Recent studies show indeed that there is an intricate
relationship between the depth and view synthesis distortion in the
current DERS and VSRS tools. In particular, [14] provides an
exhaustive analysis of the correlation between depth distortion and
synthesis distortion at different coding levels, concluding that depth
coding distortion reflects well the synthesis distortion at the frame
level and MB-row level, while lower correlation values are achieved
at the MB level. This analysis also reveals that the distortion on a
depth block is aggregated better with a lower-degree norm, Sum of
Absolute Error (SAE), than the commonly used Sum of Squared
Error (SSE).
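For clarity, the two aggregation norms compared in [14] over a depth block are simply the following; this is a generic sketch, not code from [14].

```python
import numpy as np

def depth_block_distortion(orig_depth_block, coded_depth_block):
    """Aggregate the coding distortion of a depth block with SAE and SSE.

    [14] observes that the lower-degree norm (SAE) tracks the resulting view
    synthesis distortion better than the commonly used SSE.
    """
    err = orig_depth_block.astype(np.int32) - coded_depth_block.astype(np.int32)
    sae = np.sum(np.abs(err))
    sse = np.sum(err ** 2)
    return sae, sse
```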
In [15], the authors propose a synthesis distortion metric to
optimize the coding of depth in coding schemes such as 3D-AVC,
3D-HEVC and 3D-HEVC++. This metric enhances the overall
coding efficiency at the cost of a computational complexity
overhead introduced by the new metric itself, and the fact that it
requires joint processing of depth and texture in a single encoder.
Designing better depth estimation techniques than the current
DERS hence provides interesting perspectives to improve view
synthesis. In particular, depth estimation based on all available
camera views instead of only a subset of them will intuitively be
beneficial. In the example of Figure 8, an Epipolar Plane Image
(EPI) depth estimation technique [16] using all available camera
views, inspired by [17], indeed provides better depth maps and view
synthesis results at large baselines, with a 5 dB PSNR gain [16],
compensating the typical 4-6 dB losses observed when skipping
views in Figure 6. Also [18] reports valuable gains using similar EPI
techniques. Finally, [19] has shown large view synthesis subjective
quality gains using a segmentation-guided plane-sweep depth
estimation method on the Soccer-Arc1 test sequence, cf. Figure 14.
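The following minimal Python sketch shows the core of a plane-sweep estimator that uses all available cameras at once. The `warp_to_reference` callback (a homography or DIBR warp of a view onto the reference camera at a hypothesized depth) is assumed to be supplied by the calibration pipeline, the views are assumed to be HxWx3 color arrays, and the photo-consistency cost is a plain sum of absolute color differences; this is an intuition-level sketch of the approach, not the actual implementation of [16] or [19].

```python
import numpy as np

def plane_sweep_depth(views, warp_to_reference, depth_hypotheses):
    """Minimal plane-sweep depth estimation over all available cameras.

    For every depth hypothesis, all views are warped onto the reference view;
    the hypothesis minimizing the photo-consistency error wins per pixel.
    """
    ref = views[0]
    h, w = ref.shape[:2]
    best_cost = np.full((h, w), np.inf)
    best_depth = np.zeros((h, w))

    for z in depth_hypotheses:
        # accumulate absolute color differences of every other view warped at depth z
        cost = np.zeros((h, w))
        for view in views[1:]:
            warped = warp_to_reference(view, z)
            cost += np.mean(np.abs(warped.astype(np.float64) - ref.astype(np.float64)), axis=-1)
        better = cost < best_cost
        best_cost[better] = cost[better]
        best_depth[better] = z
    return best_depth
```

Segmentation guidance, as in [19], would additionally constrain the winning depth to be consistent within each segment rather than per pixel.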
Improving DERS to include all available input cameras in the depth estimation, together with improving VSRS using the techniques described in the previous subsection, are clearly interesting directions to be further investigated in the 3D-HEVC++ CfE and subsequent CfP.
Subjective evaluation of SMV and FN
Though objective quality evaluations based on PSNR give
good indications on the most promising candidate compression
tools, measuring the Quality of Experience (QoE) plays a crucial
role in the determination of the technologies that are adopted in the
final standard [20]. For 2D images and video, the well-known ITU-
R BT.500-11 recommendation [21] describes the methodology that
should be used when performing subjective quality studies
involving human participants. In [22], an extension of these
guidelines is proposed for the evaluation of 3D content on
stereoscopic and multiview autostereoscopic displays.
It is important to note that SMV and FN content, and their visualization on 3D displays, pose new challenges for the subjective evaluation of MPEG-FTV coding technology. Some works [23, 24]
have helped to provide a parametrization that describes the relations
between content, display mode and user experience. Such a
parametrization is a very valuable tool to guide the subjective
evaluation or even content creation, giving guidelines to configure
scene parameters such as depth or density of cameras for an
acceptable viewing experience. Particularly, [24] proposes an approach to this parametrization which captures new elements that are relevant to the subjective evaluation of SMV and that do not apply to the evaluation of 2D or fixed-viewpoint stereoscopic video. The main advantage of this novel parametrization is that it is based on the disparity between adjacent views, instead of angle or camera distance, and thus:
- it aggregates the contribution of different parameters that influence the MPEG-FTV subjective experience, better representing the perception of visual comfort;
- it is common to different camera arrangements, such as linear, non-linear convergent or arc.
In particular, such parametrization has been very useful in
defining the minimum comfortable camera density in a view path
for the FN scenarios, setting the number of intermediate virtual
viewpoint positions between physical cameras [24].
Figure 15. Viewsweeping scheme for the stereoscopic evaluation of SMV
content in the CfE on FTV.
CfE stereoscopic viewing
In the CfE process, submissions will be evaluated on a
stereoscopic monitor and spatial back-and-forth view sweeps
between the left- and right-most views will be generated from the
decoded views and the generated virtual views, cf. Figure 15. Test participants will then provide a Mean Opinion Score (MOS) comparing the different technology submissions.
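For reference, the MOS over a panel of test participants is simply the sample mean of their ratings, usually reported together with a 95% confidence interval in the spirit of ITU-R BT.500; a minimal sketch (the normal-approximation interval is an assumption of this sketch):

```python
import numpy as np

def mean_opinion_score(scores):
    """Mean Opinion Score and 95% confidence interval over subject ratings
    (e.g. on a 1-5 scale)."""
    scores = np.asarray(scores, dtype=np.float64)
    mos = scores.mean()
    ci95 = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))
    return mos, ci95
```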
Light Field SMV viewing
Since subjective viewing on stereoscopic, auto-stereoscopic
and light field displays [25] might be very different, rendering
quality evaluations should be conducted on a multitude of displays
in order to evaluate the best compression technology amongst the
CfE proponents.
[26] has shown a linear quality relationship between
stereoscopic and auto-stereoscopic displays, but no clear studies are
available between the latter and SMV light field displays.
Furthermore, to accurately evaluate visual quality in 3D video, it is
of paramount importance to avoid any possible visual artifacts
introduced by the display’s internal light field transmission system,
which has to use Gbps communication lines to transmit raw data. To
make this possible, Holografika has built a custom light field display
of 73 MPixel, with a 2D equivalent resolution of 1280x720 pixels,
24 bit RGB, 70 degrees field of view and an angular resolution of
0.96 degrees, using cluster nodes over a 40 Gb/s Ethernet switch
[27]. This system is located at the Electronics and Informatics
Department (ETRO) of the Vrije Universiteit Brussel (VUB) in
Brussels, Belgium. Raw light field data transport provided by this
system offers the possibility to carry out visual tests in MPEG-FTV
CfE and subsequent CfP. To this end, the testing environment at
VUB-ETRO’s 3DLab has also been equipped with appropriate
lighting conditions (non-flickering lights with controllable
temperature, specific environmental color), as requested by the ITU-
R BT.500-11 methodology for subjective assessment of picture
quality [21].
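To give a feeling for why Gb/s-class links are unavoidable for raw light field transport (the frame rate below is an assumption for illustration; the pixel count and bit depth are those quoted above): 73 x 10^6 pixels x 24 bit x 30 frames/s ≈ 53 Gbit/s of raw data.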
Conclusion
In order to support Super-MultiView and Free Navigation
application scenarios with mostly sparse and/or arbitrarily arranged
multi-camera setups, innovative 3D-HEVC extensions should be
developed. Preliminary experiments show that the severe quality
degradation under large baseline conditions of the MPEG-FTV
VSRS view synthesis can be compensated with global optimization
and view synthesis techniques involving all camera views with
epipolar plane imaging or plane sweeping techniques. Moreover,
better exploiting the non-horizontal-only modified disparity vector
derivation in the different coding tools is expected to bring at least
6% bitrate coding gains over 3D-HEVC. Such improvements will make applications that generate additional virtual views from a cost-effective multi-camera system viable in the future.
References
[1] Philip Lelyveld, “Virtual Reality Primer with an Emphasis on
Camera-Captured VR,” Entertainment Technology Center, July
2015, http://www.etcenter.org/wp-content/uploads/2015/07/ETC-VR-
Primer-July-2015o.pdf
[2] Mike Seymour, “Light fields – the future of VR-AR-MR,” fxguide,
26 May 2015, https://www.fxguide.com/featured/light-fields-the-
future-of-vr-ar-mr/
[3] _, “List of AHGs Established at the 113th Meeting in Geneva,”
ISO/IEC JTC1/SC29/WG11 MPEG2015/N15622, Geneva,
Switzerland, October 2015.
[4] _, “JPEG PLENO Abstract and Executive Summary,” 20 March
2015, https://jpeg.org/items/20150320_pleno_summary.html
[5] Krzysztof Wegner, Olgierd Stankiewicz, Masayuki Tanimoto, Marek
Domanski, “Enhanced View Synthesis Reference Software (VSRS)
for Free-viewpoint Television,” ISO/IEC JTC1/SC29/WG11
MPEG2013/M31520, Geneva, Switzerland, October 2013.
[6] Krzysztof Wegner, Olgierd Stankiewicz, Masayuki Tanimoto, Marek
Domanski, “Enhanced Depth Estimation Reference Software (DERS)
for Free-viewpoint Television,” ISO/IEC JTC1/SC29/WG11
MPEG2013/M31518, Geneva, Switzerland, October 2013.
[7] _, “Call for Evidence on Free-Viewpoint Television: Super-
Multiview and Free Navigation,” MPEG 113th meeting, contribution
M37296, Geneva, Switzerland, October 2015.
[8] Yuri Boykov and Vladimir Kolmogorov, “An Experimental
Comparison of Min-Cut/Max-Flow Algorithms for Energy
Minimization in Vision,” IEEE Transactions on Pattern Analysis and
Machine Intelligence (PAMI), pp. 1124-1137, September 2004.
[9] M. Domański, A. Dziembowski, D. Mieloch, A. Łuczak, O.
Stankiewicz, K. Wegner, “A Practical Approach to Acquisition and
Processing of Free Viewpoint Video”, 31st Picture Coding
Symposium PCS 2015, Cairns, Australia, pp. 10-14, 2015.
[10] Jakub Stankowski, Łukasz Kowalski, Jarosław Samelak, Marek
Domański, Tomasz Grajek, Krzysztof Wegner, “3D-HEVC Extension
for Circular Camera Arrangements,” 3DTV Conference: The True
Vision-Capture, Transmission and Display of 3D Video, 3DTV- Con
2015, Lisbon, Portugal, 8-10 July 2015.
[11] T. Senoh, A. Ishikawa, M. Okui, K. Yamamoto, N. Inoue, “FTV
AHG: Soccer Arc1 Homography Prediction Results”, MPEG 113th
meeting, contribution M37296, Geneva, Switzerland, October 2015.
[12] Beerend Ceulemans, et al., “Efficient MRF-based disocclusion
inpainting in multiview video,” submitted to ICME 2016.
[13] Komodakis, Nikos, and Georgios Tziritas. "Image completion using
efficient belief propagation via priority scheduling and dynamic
pruning." IEEE Transactions on Image Processing, vol. 16, no. 11 pp.
2649-2661, 2007.
[14] P. Carballeira, J. Cabrera, F. Jaureguizar, N. García, “Analysis of the
depth-shift distortion as an estimator for view synthesis distortion",
Signal Processing: Image Communication, (accepted on Dec. 2015),
http://dx.doi.org/10.1016/j.image.2015.12.007
[15] B. Oh, J. Lee and D. Park, “Depth Map Coding Based on Synthesized
View Distortion Function,” IEEE Journal of Selected Topics in Signal
Processing, vol.5, no.7, pp.1344-1352, Nov. 2011.
[16] Lode Jorissen, Patrik Goorts, Sammy Rogmans, Gauthier Lafruit,
Philippe Bekaert, “Multi-Camera Epipolar Plane Image Feature
Detection for Robust View Synthesis,” Proceedings of the 3DTV-
Conference: The True Vision - Capture, Transmission and Display of
3D Video (3DTV-CON), pp. 1-4, 2015.
[17] Changil Kim, Henning Zimmer, Yael Pritch, Alexander Sorkine-
Hornung, Markus Gross, “Scene Reconstruction from High Spatio-
Angular Resolution Light Fields,” ACM Siggraph, vol. 32, no. 4,
2013.
[18] Catarina Brites, João Ascenso, Fernando Pereira, “Epipolar plane
image based rendering for 3D video coding,” IEEE 17th International
Workshop on Multimedia Signal Processing (MMSP), pp. 1-6,
October 2015.
[19] Patrik Goorts, Philippe Bekaert, Gauthier Lafruit, “Real-time,
Adaptive Plane Sweeping for Free Viewpoint Navigation in Soccer
Scenes,” PhD thesis, Hasselt University, 2014.
[20] Dricot, Jung, Cagnazzo, Pesquet-Popescu, Dufaux, Kovacs, Kiran
Adhikarla, “Subjective Evaluation of Super Multi-View Compressed
Content on High End Light Field 3D Display”, Signal Processing:
Image Communication, Elsevier, June 2015.
[21] ITU-R BT.500-13, "Methodology for the subjective assessment of the
quality of television pictures," January 2012.
[22] Lewandowski, Filip, et al., “Methodology for 3D Video Subjective
Quality Evaluation,” International Journal of Electronics and
Telecommunications, vol. 59, no. 1, pp. 25-32, 2013.
[23] P. Carballeira, J. Gutiérrez, F. Morán, J. Cabrera, N. García,
“Subjective Evaluation of Super Multiview Video in Consumer 3D
Displays”, Seventh International Workshop on Quality of Multimedia
Experience, QoMEX 2015, Costa Navarino, Greece, pp. 1-6, 26-29
May 2015.
[24] P. Carballeira, J. Gutiérrez, F. Morán, N. García, "New view-sweep
parametrization and subjective evaluation of SMV content", ISO/IEC
JTC1/SC29/WG11 MPEG2015/M36448, Warsaw, Poland, June
2015.
[25] Kovács, Péter Tamás, et al., “Quality measurements of 3D light-field
displays,” Proc. Eighth International Workshop on Video Processing
and Quality Metrics for Consumer Electronics. 2014.
[26] Krzysztof Wegner, Tomasz Grajek, Marek Domański, “Comparison
of 3D video subjective quality evaluated using polarisation and
autostereoscopic displays,” Electronics Letters, Vol. 50, No. 18, pp.
1283-1285, August 2014.
[27] Kovacs, Peter Tamas, et al., “Analysis and optimization of pixel
usage of light-field conversion from multi-camera setups to 3D light-
field displays,” IEEE International Conference on Image Processing
(ICIP), pp. 86-90, October 2014.
Author Biography
Gauthier Lafruit is Professor at the Université Libre de Bruxelles, Brussels,
Belgium, in the Laboratory for Image, Signal and Audio processing (LISA).
He received his Ph.D. degree in Electrical Engineering from the Vrije
Universiteit Brussel, Brussels, Belgium, in 1995. His current research
includes Virtual Reality from camera captured content, Light Fields,
Computational Imaging and GPU acceleration. He is currently co-chair of
the MPEG-FTV group.
Marek Domański is a Professor with the Poznań University of Technology,
where he leads the Chair (Department) of Multimedia Telecommunications
and Microelectronics. He is the author or co-author of six books and over
300 research papers in journals and conference proceedings. His
contributions were mostly on image, video and audio compression, image
processing, multimedia systems, 3-D video and color image technology,
digital filters, and multidimensional signal processing.
Krzysztof Wegner received the M.Sc. degree from the Poznań University of
Technology, Poznań, Poland, in 2008, where he is currently pursuing the
Ph.D. degree. He is the co-author of several papers on free view television,
depth estimation, and view synthesis. He is involved in ISO standardization
activities where he contributes to the development of future 3-D video coding
standards.
Tomasz Grajek received the M.Sc. and Ph.D. degrees from the Poznań
University of Technology, Poznań, Poland, in 2004 and 2010, respectively.
He is the author or co-author of several papers on digital video compression,
entropy coding, and modeling of advanced video encoders. He has been
taking part in several projects for industrial research and development.
Takanori Senoh received the Ph.D. degree in Engineering from the
University of Tokyo, Japan, in 2007. He is currently with National Institute
of Information and Communications Technology, Japan and his current
research interests include 3D image processing and electronic holography.
He is a member of IEEE, ITE, IIEEJ, and JSAP.
Joël Jung received the Ph.D. degree in Electrical Engineering from the
University of Nice-Sophia Antipolis, Nice, France, in 2000. He is currently
with Orange Labs Paris and B<>Com Institute of Research and Technology,
and his current research interests include next generation image and video
coding, 3D super multi-view and depth coding. He is an active contributor
to the HEVC standard (JCT-VC) and the 3D-HEVC annex (JCT-3V).
Péter Tamás Kovács has been working at Holografika since 2006,
contributing to the development of the real 3D light-field display product
line HoloVizio and related technologies (glasses-free 3D cinema, real-time
light field capture and rendering system, full-angle 180 degree light-field
display).
Patrik Goorts is a postdoctoral researcher at Hasselt University, Belgium,
specialized in free viewpoint interpolation and depth estimation.
Lode Jorissen is a Ph.D. candidate in the Expertise Centre for Digital Media
(EDM) at Hasselt University, Belgium. He previously worked on 360 degree
video and currently focuses his work on view interpolation using light fields.
Adrian Munteanu is professor at Vrije Universiteit Brussel, Belgium. His
research interests include image, video and 3D graphics compression, error-
resilient coding and multimedia transmission over networks. He is the author
of more than 250 journal and conference publications, book chapters and
contributions to standards, and received several awards for his work. Adrian
Munteanu currently serves as Associate Editor for IEEE Transactions on
Multimedia.
Beerend Ceulemans is a Ph.D. candidate in the Department of Electronics and Informatics (ETRO) at the Vrije Universiteit Brussel (VUB). His research interests are centered on virtual viewpoint synthesis for autostereoscopic 3D screens and free viewpoint video.
Pablo Carballeira received the Ph.D. degree in Telecommunication
Engineering from the Universidad Politécnica de Madrid (UPM) in 2014.
He has been with the Grupo de Tratamiento de Imágenes at UPM since 2007, and
his current research interests include coding and subjective evaluation of
Super Multiview and Free Navigation Video.
Sergio García is a Ph.D. candidate in the Grupo de Tratamiento de Imágenes
(GTI) at the Universidad Politécnica de Madrid (UPM), where he has been
working since 2013. His research interests include adaptive streaming
techniques and algorithms, as well as 3D graphics compression and
rendering, especially in the field of point-cloud-based models.
Masayuki Tanimoto received the B.E., M.E., and Dr.E. degrees from the
University of Tokyo. He was Professor at Nagoya University and developed
FTV (Free-viewpoint Television). Currently, he is Emeritus Professor at
Nagoya University and Senior Research Fellow at Nagoya Industrial
Science Research Institute. He is Honorary Member of the ITE, Fellow of
the IEICE and IEEE Life Fellow. He is chair of the MPEG-FTV group.
... However, the correlations between views and the correlations between frames are not fully utilized in these methods. On the other hand, multi-view video coding (MVC) and multi-view HEVC (MV-HEVC) as the extensions of current encoders can be used to encode LF videos [22,40]. MVC/MV-HEVC for LF video coding exploits both the temporal inter-frame correlations and the spatial inter-view correlations in an LF video sequence, which can achieve a much better compression ratio compared to the aforementioned coding methods. ...
Article
Full-text available
The massive amount of data usage for light field (LF) information poses grand challenges for efficient compression designs. There have been several LF video compression methods focusing on exploring efficient prediction structures reported in the literature. However, the number of possible prediction structures is infinite, and these methods fail to fully exploit the intrinsic geometry between views of an LF video. In this paper, we propose a deep learning-based high-efficiency LF video compression framework by exploiting the inherent geometrical structure of LF videos. The proposed framework is composed of several crucial components, namely sparse coding based on a universal view sampling method (UVSM) and a CNN-based LF view synthesis algorithm (LF-CNN), a high-efficiency adaptive prediction structure (APS), and a synthesized candidate reference (SCR)-based inter-frame prediction strategy. Specifically, instead of encoding all the views in an LF video, only parts of views are compressed while the remaining views are reconstructed from the encoded views with LF-CNN. The prediction structure of the selected views is able to adapt itself to the similarity between views. Inspired by the effectiveness of view synthesis algorithms, synthesized results are served as additional candidate references to further reduce inter-frame redundancies. Experimental results show that the proposed LF video compression framework can achieve an average of over 34% bitrate savings against state-of-the-art LF video compression methods over multiple LF video datasets.
... The methods of creating natural three-dimensional content are recently under constant development, as the emergence of new applications of immersive media systems such as free-viewpoint television [Tan12] and virtual reality systems can be easily seen [Dom17] [Laf16]. ...
... This method does not take into consideration the difference between LF videos and conventional monocular videos, which do not acknowledge the strong correlation between views. Different scan orders (e.g., rotary [5], zigzag [6]) can be used to convert the two-dimensional LF video sequence into a one-dimensional video sequence, and the classic multi-view video coding standards [24,25] can be used for LF video compression [11]. However, the correlations between views and frames are not fully utilized in these methods. ...
Article
Full-text available
The sheer size and complex structure of light field (LF) videos bring new challenges to their compression and transmission. There have been numerous LF video compression algorithms reported in the literature to date. All of these algorithms compress and transmit all the views of an LF video. However, in some interactive or selective applications where users can choose the area of interest to be displayed, these algorithms generate a significant computational load and enormous data redundancies. In this paper, we propose an interaUser-dependent Interactive light field video streaming system eaming system based on a user-dependent view selection scheme and an LF video coding method, which streams only the required data. Specifically, by predicting trajectories and using projection models, the viewing area of users in a limited consecutive number of time slots is firstly calculated, and then a user-dependent view selection method is proposed to determine the selected views of users for streaming. Finally, with the novel LF video sequence formed by only the selected sets of views, an adaptive coding method is presented for different LF video sequences based on users’ gestures. Experimental results illustrate that the proposed interactive LF video streaming system can achieve the best performance compared with other comparison methods.
... The methods of creating natural three-dimensional content are recently under constant development, as the emergence of new applications of immersive media systems such as free-viewpoint television [Tan12] and virtual reality systems can be easily seen [Dom17] [Laf16]. ...
Conference Paper
Full-text available
In this paper, the novel correspondence search method called point-to-block matching was proposed. Recently, in many proposed multimedia systems, depth estimation is performed on compressed input views. To address this problem and increase the quality of such depth maps, in the proposal, a point in a view is not compared simply with a point in another view, but to the most similar point in a small block surrounding it. The introduction of this method in the depth estimation process is beneficial to the quality of depth maps, as it decreases the influence of small shifts in images, caused e.g., by encoding-related errors introduced to input views. The method was implemented in one of the state-of-the-art depth estimation methods and tested in series of experiments. Based on the comparison of synthesized virtual views with the input views, the proposal increases the quality of estimated depth maps in most of the tested configurations.
... The scene may be of very different types: a play court of a sports event, a theater stage, a wilderness scene, street environment, etc. The practical limitations (cost, portability, video processing time, system calibration complexity, etc.) imply a limited number of cameras, i.e., the cameras are often located quite distant from each other [53], [17]. ...
Article
Full-text available
In this paper, the color correction method developed for immersive video systems is presented. The proposed method significantly increases the consistency of color characteristics of multiview sequences, understood both as the temporal and the inter-view consistency, what highly improves the subjective quality of the synthesized virtual views presented to the final user of the immersive video system. Moreover, the proposal allows to significantly increase the quality of the depth maps calculated for natural sequences, e.g., for views with colors inconsistent due to different lighting conditions. It enables more efficient compression of natural multiview video, as the newest encoding standards highly depend on the quality of depth maps. In order to evaluate the performance of the proposal, three experiments were conducted. In the first one, the proposal was compared to state of the art in the typical immersive video application – color correction of a natural multiview sequence. In the second experiment, the performance of the proposal was tested on the Middlebury stereo dataset. In both experiments, the quality of synthesized virtual views was assessed subjectively by a group of 70 naove viewers. The third experiment assessed the influence of color correction on the quality of estimated depth maps. All the experiments showed that the proposal significantly increases the color consistency of the multiview content. Due to the high usefulness and robustness, the proposed color correction method became the MPEG reference software for color correction. The implementation of the method is available for other researchers on the public repository.
... Telewizja swobodnego punktu widzenia (ang. free-viewpoint television -FTV) [5], [11], [15] umożliwia swobodną nawigację poprzez naturalną trójwymiarową scenę. Użytkownik systemu FTV może oglądać nie tylko widoki zarejestrowane przez rzeczywiste kameryidea swobodnej nawigacji zakłada możliwość oglądania sceny z dowolnego miejsca i kierunku obserwacji. ...
Article
Stworzenie nowej metody estymacji map głębi przeznaczo-nej dla systemów telewizji swobodnego widzenia jest głów-nym celem przedstawionych badań. W telewizji swobodne-go punktu widzenia możliwości widza są rozszerzone po-przez możliwość kontroli aktualnie oglądanego przez niego punktu widzenia sceny. Nowa metoda estymacji map głębi zaproponowana przez autora składa się z trzech części: przestrzennie spójnej estymacji map głębi opartej na seg-mentacji widoków, metodzie zwiększenia spójności czaso-wej map głębi zmniejszającej złożoność obliczeniową esty-macji oraz nową metodę zrównoleglania procesu optymali-zacji opartego na wykorzystaniu grafów.
... For a video to be fully random accessed in time for example, all video frames must be able to be independently decoded, and consequently independently accessed. Besides random access in time, VR and free viewpoint television (FTV) applications also have view random access as a crucial requirement for delivering high quality content at any possible instance both in time and in space (free navigation) [12,14]. For this work, we define view random access as the ability of a decoder to switch to a different view immediately at any point in time. ...
Article
Full-text available
Computational imaging and light field technology promise to deliver the required six-degrees-of-freedom for natural scenes in virtual reality. Already existing extensions of standardized video coding formats, such as multi-view coding and multi-view plus depth, are the most conventional light field video coding solutions at the moment. The latest multi-view coding format, which is a direct extension of the high efficiency video coding (HEVC) standard, is called multi-view HEVC (or MV-HEVC). MV-HEVC treats each light field view as a separate video sequence, and uses syntax elements similar to standard HEVC for exploiting redundancies between neighboring views. To achieve this, inter-view and temporal prediction schemes are deployed with the aim to find the most optimal trade-off between coding performance and reconstruction quality. The number of possible prediction structures is unlimited and many of them are proposed in the literature. Although some of them are efficient in terms of compression ratio, they complicate random access due to the dependencies on previously decoded pixels or frames. Random access is an important feature in video delivery, and a crucial requirement in multi-view video coding. In this work, we propose and compare different prediction structures for coding light field video using MV-HEVC with a focus on both compression efficiency and random accessibility. Experiments on three different short-baseline light field video sequences show the trade-off between bit-rate and distortion, as well as the average number of decoded views/frames, necessary for displaying any random frame at any time instance. The findings of this work indicate the most appropriate prediction structure depending on the available bandwidth and the required degree of random access.
... Nowadays, the most commonly used spatial representation of 3D scenes are depth maps [39], which are widely used not only in the context of free-viewpoint television and virtual navigation [1], [29], [40], but also in 3D scene modeling [36], and machine vision applications [37], [51]. In FTV and VN systems, the fidelity and quality of depth maps deeply influence the quality of the synthesized video, thus the quality of experience in the navigation through a 3D scene. ...
Article
Full-text available
The paper presents a new method of depth estimation, dedicated for free-viewpoint television (FTV) and virtual navigation (VN). In this method, multiple arbitrarily positioned input views are simultaneously used to produce depth maps characterized by high inter-view and temporal consistencies. The estimation is performed for segments and their size is used to control the trade-off between the quality of depth maps and the processing time of depth estimation. Additionally, an original technique is proposed for the improvement of temporal consistency of depth maps. This technique uses the temporal prediction of depth, thus depth is estimated for P-type depth frames. For such depth frames, temporal consistency is high, whereas estimation complexity is relatively low. Similarly, as for video coding, I-type depth frames with no temporal depth prediction are used in order to achieve robustness. Moreover, we propose a novel parallelization technique that significantly reduces the estimation time. The method is implemented in C++ software that is provided together with this paper, so other researchers may use it as a new reference for their future works. In performed experiments, MPEG methodology was used whenever possible. The provided results demonstrate the advantages over the Depth Estimation Reference Software (DERS) developed by MPEG. The fidelity of a depth map, measured by the quality of synthesized views, is higher on average by 2.6 dB. This significant quality improvement is obtained despite a significant reduction of the estimation time, on average 4.5 times. The application of the proposed temporal consistency enhancement method increases this reduction to 29 times. Moreover, the proposed parallelization results in the reduction of the estimation time up to 130 times (using 6 threads). As there is no commonly accepted measure of the consistency of depth maps, the application of compression efficiency of depth is proposed as a measure of depth consistency.
... Views presented to the user are synthesized, i.e., rendered from a compact representation of a 3D scene [38]. One of the spatial representations of a 3D scene are depth maps [39], which are widely used not only in the context of free-viewpoint television systems [1], [29], [40], but also in 3D scene modeling [36], and machine vision applications [37], [51]. In FTV systems, the quality of depth maps is crucial for the quality of the synthesized video, and thus the quality of experience in the navigation through a 3D scene. ...
Preprint
The paper presents a new method of depth estimation dedicated for free-viewpoint television (FTV). The estimation is performed for segments and thus their size can be used to control a trade-off between the quality of depth maps and the processing time of their estimation. The proposed algorithm can take as its input multiple arbitrarily positioned views which are simultaneously used to produce multiple inter view consistent output depth maps. The presented depth estimation method uses novel parallelization and temporal consistency enhancement methods that significantly reduce the processing time of depth estimation. An experimental assessment of the proposals has been performed, based on the analysis of virtual view quality in FTV. The results show that the proposed method provides an improvement of the depth map quality over the state of-the-art method, simultaneously reducing the complexity of depth estimation. The consistency of depth maps, which is crucial for the quality of the synthesized video and thus the quality of experience of navigating through a 3D scene, is also vastly improved.
Article
Full-text available
Synthetic aperture imaging (SAI) technology gets the light field information of the scene through the camera array. With the large virtual aperture, it can effectively acquire the information of the partially occluded object in the scene, and then we can focus on the arbitrary target plane corresponding to the reference perspective through the refocus algorithm. Meanwhile, objects that deviate from the plane will be blurred to varying degrees. However, when the object to be reconstructed in the scene is occluded by the complex foreground, the optical field information of the target cannot be effectively detected due to the limitation of the linear array. In order to deal with these problems, this paper proposes a nonlinear SAI method. This method can obtain the occluded object’s light field information reliably by using the nonlinear array. Experiments are designed for the nonlinear SAI, and refocusing is performed for the occluded objects with different camera arrays, different depths, and different distribution intervals. The results demonstrate that the method proposed in this paper is advanced than the traditional SAI method based on linear array.
Conference Paper
Full-text available
In this paper, we propose a novel, fully automatic method to obtain accurate view synthesis for soccer games. Existing methods often make assumptions about the scene. This usually requires manual input and introduces artifacts in situations not handled by those assumptions. Our method does not make assumptions about the scene; it solely relies on feature detection and utilizes the structures visible in a 3D light field to limit the search range of traditional view synthesis methods. A visual comparison between a standard plane sweep, a depth-aware plane sweep and our method is provided, showing that our method provides more accurate results in most cases.
Conference Paper
Full-text available
We deal with the processing of multiview video acquired by the use of practical thus relatively simple acquisition systems that have a limited number of cameras located around a scene on independent tripods. The real-camera locations are nearly arbitrary as it would be required in the real-world Free-Viewpoint Television systems. The appropriate test video sequences are also reported. We describe a family of original extensions and adaptations of the multiview video processing algorithms adapted to arbitrary camera positions around a scene. The techniques constitute the video processing chain for Free-Viewpoint Television as they are aimed at estimating the parameters of such a multi-camera system, video correction, depth estimation and virtual view synthesis. Moreover, we demonstrate the need for new compression technology capable of efficient compression of sparse convergent views. The experimental results for processing the proposed test sequences are reported.
Article
Aiming for 3D Video encoders with reduced computational complexity, we analyze the performance of depth-shift distortion in depth-image based rendering algorithms, incurred when coding depth maps in 3D Video, as an estimator of the distortion of synthesized views. We propose several distortion models that capture (i) the geometric distortion caused by the depth coding error, (ii) the pixel-mapping precision in view synthesis and (iii) the method to aggregate depth-shift distortion caused by the coding error in a depth block. Our analysis starts with the evaluation of the correlation between the depth-shift distortion values obtained with these models, and the actual distortion on synthesized views, with the aim of identifying the most accurate one. The correlation results show that one of the models can be used as a reasonable estimator of the synthesis distortion in low complexity depth encoders. These results also show that the Sum of Absolute Error (SAE) captures better the distortion on a depth block than the Sum of Squared Error (SSE).
Article
In this dissertation, we present a system to generate a novel viewpoint using a virtual camera, specifically for soccer scenes. We demonstrate the applicability for following players, freezing the scene, generating 3D images, et cetera. The method is demonstrated and investigated for two camera arrangements, i.e. a curved and a linear setup, where the distance between the cameras can be up to 10 meters. The virtual camera should be located at a position between the real camera positions. The method is designed to be automatic and produces high-quality results using high-performance rendering. We present an image-based method to generate the novel viewpoints based on the well-known plane sweep approach, consisting of a preparation phase and a rendering phase.

In the preparation phase, geometric calibration is performed. We present a calibration system for large setups that uses the images of the recordings themselves; no specific objects must be placed in the scene, although this remains possible. We apply feature detection to the input streams and match features between pairs of cameras. We present a graph-based method that selects multicamera feature matches using a voting mechanism. Furthermore, the matches are filtered based on the general direction in which the features appear to move across the different cameras, which provides robust outlier detection. These filtered multicamera feature matches are then used to generate the calibration data. The results demonstrate that the quality of the calibration is sufficiently high for our method. Due to the automatic nature of the calibration method, we achieve a convenient and practical solution for multicamera calibration in large scenes.

Once the calibration is known, rendering can start. We demonstrate that normal plane sweeping is not sufficient for soccer scenes due to the high number of artifacts, such as ghost legs, ghost players, and halo effects. Therefore, we propose a depth-aware plane sweep approach. We show that the depth values of the artifacts differ from the depth values of the players, which can be exploited to filter out the artifacts. We determine the initial depth using a plane sweep approach. Next, we filter the depth map using a median-based or histogram-based approach, where each group of pixels is processed independently. The depth is furthermore compared to the depth of the background, eliminating ghost-player artifacts. The results show that the artifacts are effectively eliminated in most cases.

We employ modern and traditional GPGPU technologies for the complete processing pipeline to develop a scalable and fast solution. The performance exceeds a few frames per second for a single GPU at HD resolution, which makes it practical and affordable to scale up to a real-time solution. The results are visually compared to existing systems, demonstrating that our method eliminates many artifacts visible in other systems. Furthermore, a novel plane distribution method is developed to assign more processing power to the depths where objects actually are and to reduce processing power wasted on empty space. The quality is checked qualitatively, and it is demonstrated that the difference between a high number of planes and a redistributed low number of planes is negligible, whereas the difference between a uniformly distributed low number of planes and a redistributed low number of planes is significant. This shows the usefulness of the optimization, which reduces the required processing power while keeping quality levels comparable.
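A minimal sketch of the occupancy-driven plane redistribution idea mentioned above (the histogram-based weighting and the coarse-sweep input are illustrative assumptions, not the dissertation's exact algorithm):

```python
# Illustrative sketch: place sweep planes where a coarse uniform sweep found
# scene content, instead of spreading them uniformly over [z_near, z_far].
import numpy as np

def redistribute_planes(coarse_depth_map, n_planes, z_near, z_far, bins=64):
    hist, edges = np.histogram(coarse_depth_map, bins=bins, range=(z_near, z_far))
    weights = hist.astype(np.float64) + 1e-6      # avoid stalls on empty bins
    cdf = np.cumsum(weights) / weights.sum()      # occupancy-driven CDF over depth
    # Place planes at depths whose cumulative occupancy is evenly spaced,
    # concentrating them around depths that actually contain objects.
    targets = (np.arange(n_planes) + 0.5) / n_planes
    idx = np.clip(np.searchsorted(cdf, targets), 0, bins - 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[idx]
```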
Conference Paper
We present preliminary experiments on the subjective evaluation of Super Multiview Video (SMV) on stereoscopic and auto-stereoscopic displays. SMV displays require a large number of views (typically 80 or more), but are not yet widely available. Subjective evaluation on legacy displays, though not optimal, will therefore be necessary for the development of SMV video technologies. This has led us to perform a standardized subjective evaluation of uncompressed SMV test sequences, simulating SMV displays through a view sweep controlled by three parameters: View-Sweep Speed (VSS), Viewing Range, and View Density (VD). In our analysis we have identified the ranges of VSS and VD values that provide a comfortable view sweep with smooth transitions between views.
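A minimal sketch of how such a view sweep could be generated for a legacy display, assuming VSS is expressed in views per second and VD as the index step between displayed views (these parameter interpretations are assumptions, not the paper's definitions):

```python
# Illustrative sketch: produce the per-frame view index for a back-and-forth
# sweep over a given viewing range, at a given View-Sweep Speed and View Density.
def view_sweep_indices(first_view, last_view, vss, vd, duration_s, fps=30):
    views = list(range(first_view, last_view + 1, vd))   # viewing range, thinned by VD
    path = views + views[-2:0:-1]                         # forward, then backward sweep
    frames_per_view = max(1, round(fps / vss))            # dwell time per view from VSS
    total_frames = int(duration_s * fps)
    return [path[(f // frames_per_view) % len(path)] for f in range(total_frames)]
```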
Article
Super Multi-View (SMV) video content is composed of tens or hundreds of views that provide a light-field representation of a scene. This representation allows glasses-free visualization and eliminates many causes of discomfort present in currently available 3D video technologies. Efficient video compression of SMV content is a key factor for enabling future 3D video services. This paper first compares several coding configurations for SMV content, and several inter-view prediction structures are also tested and compared. The experiments mainly suggest that large differences in coding efficiency can be observed from one configuration to another. Several ratios between the number of coded and synthesized views are compared, both objectively and subjectively. It is reported that view synthesis significantly affects the coding scheme; the number of views to skip depends strongly on the sequence and on the quality of the associated depth maps. The reported ranges of bitrates required to obtain good quality for the tested SMV content are realistic and coherent with future 4K/8K needs. The reliability of the PSNR metric for SMV content is also studied. Objective and subjective results show that PSNR is able to reflect increases or decreases in subjective quality even in the presence of synthesized views. However, depending on the ratio of coded and synthesized views, the order of magnitude of the effective quality variation is biased by PSNR; results indicate that PSNR is less tolerant to view-synthesis artifacts than human viewers. Finally, preliminary observations are reported. First, the light-field conversion step does not seem to alter the objective compression results. Second, the motion parallax does not seem to be impacted by specific compression artifacts; the perception of motion parallax is only altered by variations of the typical compression artifacts along the viewing angle, in cases where the subjective image quality is already low. To the best of our knowledge, this paper is the first to carry out subjective experiments and to report results of SMV compression for light-field 3D displays. It provides first results showing that improvements in compression efficiency are required, as well as improvements in depth estimation and view synthesis algorithms, but that the use of SMV appears realistic with respect to next-generation compression technology requirements.
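A minimal sketch of how per-view PSNR could be aggregated separately over coded and synthesized views, so that the bias discussed above can be inspected (illustrative only, not the paper's evaluation code):

```python
# Illustrative sketch: PSNR per view, averaged separately over coded and
# synthesized views of an SMV sequence.
import numpy as np

def psnr(ref, test, peak=255.0):
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def psnr_by_view_type(refs, tests, synthesized_flags):
    """refs/tests: lists of same-size images; synthesized_flags[i] is True when
    view i was synthesized (DIBR) rather than coded and transmitted."""
    coded = [psnr(r, t) for r, t, s in zip(refs, tests, synthesized_flags) if not s]
    synth = [psnr(r, t) for r, t, s in zip(refs, tests, synthesized_flags) if s]
    return (np.mean(coded) if coded else None,
            np.mean(synth) if synth else None)
```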