Rectification-Based View Interpolation and Extrapolation for Multiview Video Coding.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 21, NO. 6, JUNE 2011 693
Xiaoyu Xiu, Student Member, IEEE, Derek Pang, Student Member, IEEE, and Jie Liang, Member, IEEE
Abstract—In this paper, we first develop improved projective
rectification-based view interpolation and extrapolation methods,
and apply them to view synthesis prediction-based multiview
video coding (MVC). A geometric model for these view synthesis
methods is then developed. We also propose an improved model to
study the rate-distortion (R-D) performances of various practical
MVC schemes, including the current joint multiview video coding
standard. Experimental results show that our schemes achieve
superior view synthesis results, and can lead to better R-D
performance in MVC. Simulation results with the theoretical
models help explain the experimental results.
Index Terms—Multiview video coding, rate-distortion theory,
view extrapolation, view interpolation.
I. Introduction
RECENT advances in computer, display, camera, and signal
processing make it possible to deploy next generation
visual communication services such as 3-D TV and free
viewpoint video . The former offers a 3-D depth impression of
the observed scenery, while the latter further allows interactive
selection of viewpoints and generation of new views from any
viewpoints. Since multiple cameras are used to capture the
scenes, efficient compression of the multiview video data is
crucial to these services.
Many methods have been developed for multiview video
coding (MVC), ranging from disparity compensated prediction
to view synthesis prediction (VSP). In addition, the theoretical
performance analyses of some approaches have also been
studied. In this section, we give a brief review of the practical
and theoretical MVC works, and point out the contributions
of this paper in the two aspects.
A. Review of MVC Algorithms
A straightforward way to exploit the statistical dependencies
among different viewpoints is to use disparity-compensated
prediction. Similar to the motion-compensated prediction in
single-view video coding, for each block in the current view,
the disparity compensation finds the best matched block in
its neighboring views and encodes the prediction residual. In
, the motion compensation and disparity compensation are
combined to encode stereo sequences. The concept of group of
group of pictures for inter-view prediction is introduced in ,
which allows a picture to refer to decoded pictures of other
views even at different time instants. In  and , various
modified hierarchical B structures are developed for inter-view
prediction. One of them is implemented in the H.264-based
joint multiview video coding (JMVC) software , which uses
the hierarchical B structure in the temporal direction and the
I-B-P disparity prediction structure in the inter-view direction.
To reduce the complexity of finding the best matching, the
multiview geometry is employed in  to predict the disparity
values, but only multiview image coding is considered.
Manuscript received October 27, 2009; revised June 26, 2010; accepted
November 7, 2010. Date of publication March 17, 2011; date of current
version June 3, 2011. This work was supported in part by the Natural Sciences
and Engineering Research Council of Canada, under Grants RGPIN312262,
EQPEQ330976-2006, STPGP350740-07, and STPGP380875-09. This paper
was recommended by Associate Editor Y.-S. Ho.
X. Xiu and J. Liang are with the School of Engineering Science, Simon
Fraser University, Burnaby, BC V5A 1S6, Canada (e-mail: email@example.com;
D. Pang was with Simon Fraser University, Burnaby, BC V5A 1S6, Canada.
He is now with Stanford University, Stanford, CA 94305 USA (e-mail:
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2011.2129230
However, the translational inter-view motion assumed by the
disparity compensation method cannot accurately represent
the geometric relationships between different cameras;
therefore, this method is not always efficient. For example,
disparities larger than the search window size can frequently
occur, due to different depths of an object in different views . In
addition, effects such as rotation and zooming are difficult to
model as pure translational motion.
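As an illustration of the block-based disparity search discussed above, the following is a minimal sketch, not the JMVC implementation. It assumes a purely horizontal search, a sum-of-absolute-differences (SAD) matching cost, and images stored as lists of rows; the function names `sad` and `best_disparity` are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_disparity(cur, ref, x0, y0, size, search_range):
    """Return (disparity, cost) of the best horizontal match in `ref`
    for the size x size block of `cur` whose top-left corner is (x0, y0).
    Only shifts within +/- search_range pixels are tested."""
    cur_block = [row[x0:x0 + size] for row in cur[y0:y0 + size]]
    best = None
    for d in range(-search_range, search_range + 1):
        xr = x0 + d
        if xr < 0 or xr + size > len(ref[0]):
            continue  # candidate block would leave the reference image
        ref_block = [row[xr:xr + size] for row in ref[y0:y0 + size]]
        cost = sad(cur_block, ref_block)
        if best is None or cost < best[1]:
            best = (d, cost)
    return best
```

When the true shift exceeds `search_range`, the search can only return the closest candidate inside the window, with a nonzero residual. This is the search-window limitation noted in the text.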
An alternative to disparity-compensated prediction is VSP,
where a synthesized view for a target view is created, us-
ing the geometry relationship between different views. The
synthesized view is then used as an additional reference to
predictively encode the target view.
Some VSP methods are based on depth estimation –
. In this paper, VSP schemes without involving depth
information are investigated. In particular, we focus on VSP
schemes that do not need camera parameters, which are not
always available. In this case, the disparity estimation (or
stereo matching) is usually used to calculate the disparity
map between two neighboring views, and the virtual view
is then synthesized using the disparity information. Disparity
estimation has been extensively studied in computer vision.
In , the cost function for disparity estimation considers
the smoothness of disparity transition. This method is used
in  for view interpolation-based MVC. In this paper, the
disparity estimation method in  is adopted, which achieves
better performances in terms of the accuracy and disparity
smoothness, as well as robustness to occlusions.
Most view synthesis methods are designed for stereo vision
and assume aligned cameras, i.e., the two cameras are parallel
1051-8215/$26.00 © 2011 IEEE
and only differ from each other by a small horizontal shift. To
deal with more general camera setups, a rectification-based
view interpolation (RVI) method is proposed in . It first
rectifies the two views using the projective rectification method
in  and . This involves the calculation of the fun-
damental matrix between two views and resampling of them
such that they have horizontal and matched epipolar lines. A
modified version of the disparity estimation method in  is
then used to create the interpolated view before mapping back
to the original domain. The algorithm does not require camera
parameters, and places few requirements on the camera setup, as long
as the cameras are not too far apart. Therefore, it
is suitable for multiview video systems with unaligned cameras
and unknown camera parameters.
In this paper, by modifying the method in  and ,
we first develop an improved RVI method and apply it to MVC.
Most view synthesis methods deal with view interpolation
from a left view and a right view. If these methods are used
in MVC, the VSP can only benefit half of the views. To
overcome this limitation, in this paper we also develop a
rectification-based view extrapolation (RVE) algorithm using
two left views or two right views; hence VSP can be applied
to the coding of all views after the first two views. Our results
show that although the average quality of the extrapolated
views is lower than that of the interpolated views, the overall
R-D performance of all views of the entire MVC system can
outperform that of the view interpolation-based approach as
the number of views in the system increases.
B. Review of Theoretical Analyses of View Synthesis and MVC
Another important topic in MVC is the theoretical R-D
analysis of various MVC algorithms. Such an analysis can
provide important guidelines for the design of practical MVC
systems. The R-D analysis can be achieved by generalizing
that of the traditional single view video coding. The key
problem is how to model the inter-view correlations and the
underlying inter-view prediction algorithm.
The theory of the R-D analysis of motion compensation-
based single view video coding was established by Girod
–. It was generalized to wavelet based video coding
in  and light field coding in , where the impacts of
the statistical properties of multiple light field images, the
accuracy of the disparity and the transform coding on the
compression efficiency are studied. In , the R-D analyses
of multiview image coding with texture-based and model-
aided methods are presented, using the same model as in
. Recently, these theories were generalized to multiview
video coding in , where the R-D efficiency of motion and
disparity estimation (MDE)-based MVC is investigated.
However, the R-D analysis of VSP-based MVC has not been
reported in the literature. In fact, even the mathematical models
of view synthesis algorithms have not been well established.
Some important progress in this direction has been made
recently. In , a model is proposed to describe
the relationship between the accuracy of disparity and the
quality of the interpolated view, based on the framework in
, , and . A prefilter method is also proposed to
improve the view interpolation quality. However, the model for
the disparity error is oversimplified, and only parallel cameras
are considered. In , a similar model for view interpolation
is used to analyze the theoretical R-D performance of a view
subsampling-based multiview image coding scheme. However,
MVC is not considered in either  or .
Fig. 1. Block diagram of the proposed RVI algorithm.
In this paper, we develop a more accurate geometric model
than that in . Our model enables the study of the impact
of projective rectification on the quality of the interpolated or
extrapolated view when unaligned cameras are used. To the
best of our knowledge, this is the first attempt to quantify the
improvement of the projective rectification in view synthesis.
Another contribution of this paper is that we develop an
improved R-D model to study the performances of different
practical MVC schemes, e.g., the MDE-based JMVC and our
VSP-based schemes. Compared to the models in  and ,
our model characterizes the practical MVC schemes more
accurately. Simulation results of this model agree well with
the experimental results of various MVC schemes.
This paper is organized as follows. In Section II, we
present the proposed RVI method and its application in MVC.
Section III extends the result to view extrapolation and applies
it to MVC. In Section IV, a geometric model is developed to
analyze the performance of rectification-based view synthesis.
An improved R-D model for practical MVC schemes is
developed in Section V. Experimental and simulation results of
the proposed methods and models are presented in Sections VI
and VII, respectively, followed by the concluding remarks in
II. Projective Rectification-Based View
Interpolation and Application in MVC
In this section, we propose an improved version of the RVI
algorithms in  and , and apply it to MVC. In particular,
a more robust method is used to rectify the two reference
views to reduce their vertical mismatches. A sub-pixel view
interpolation is also developed to improve the accuracy of the
integer-pixel interpolation in .
A. The Proposed RVI Algorithm
Fig. 1 shows the main steps in the proposed RVI algorithm,
which are explained below.
1) Projective View Rectification: To rectify two non-
parallel input views, we first estimate the fundamental matrix,
which characterizes the epipolar geometry between the two
views . The matrix can be obtained without using any
camera parameters.
Suppose a point X in the 3-D space is projected to point $x_l$
in one view. Its projection point $x_r$ in the other view lies on
the line $F x_l$, where F is the 3 × 3 rank-2 fundamental matrix
with seven degrees of freedom . In addition, $x_l$ and $x_r$
satisfy
$$x_r^T F x_l = 0$$
where $x_l$ and $x_r$ are 3 × 1 homogeneous
coordinates. This equation is a linear function of the entries
of F. If enough point correspondences between two views are
known, various algorithms can be used to calculate F, such as
the 7-point, the 8-point, or the least-squares algorithm .
In this paper, the point correspondences are selected using
corner detection and the random sample consensus (RANSAC)
algorithms . The implementation in  is modified to
calculate F from the selected point correspondences. Note that
other correspondence matching algorithms such as the scale
invariant feature transform  can also be used to find the
point correspondences.
Given F, the epipoles of the two views (the intersections
between the line joining the two camera centers and the two
image planes) can be obtained from the left and right null
spaces of F. After this, the rectification matrix of each view
can be obtained as follows , . First, the coordinate
origin is translated to the image center via a transform
where $c = (c_x, c_y)$ is the image center. Suppose the epipole of
a view is at $e = (e_x, e_y, 1)^T$ after the translation. The next step
is to rotate the image such that the epipole moves to the x-axis,
i.e., its homogeneous coordinate has the format $(v, 0, 1)^T$. The
required rotation R is thus
where α = 1 if $e_x \ge 0$ and α = −1 otherwise.
Given the new epipole position $(v, 0, 1)^T$, the following
transformation is applied to map the epipole to infinity:
As a result, the rectification matrix for a view is
$$H = GRT. \quad (4)$$
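The three factors of H can be sketched in code. The snippet below is only an illustration and assumes the usual Hartley-style construction: T shifts the origin to the image center, R is a planar rotation that carries the translated epipole onto the x-axis with the sign factor α defined above, and G has third row (−1/v, 0, 1). These matrix forms, the helper names `matmul3` and `apply3`, and the function `rectifying_homography` are assumptions of this sketch rather than a transcription of the paper's equations. The epipole must not coincide with the image center.

```python
import math


def matmul3(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply3(h, p):
    """Apply a 3x3 matrix to a homogeneous 3-vector."""
    return [sum(h[i][k] * p[k] for k in range(3)) for i in range(3)]


def rectifying_homography(epipole, center):
    """Compose H = G R T for an epipole (x, y) given in pixel coordinates."""
    cx, cy = center
    # T: translate the coordinate origin to the image center.
    T = [[1.0, 0.0, -cx], [0.0, 1.0, -cy], [0.0, 0.0, 1.0]]
    ex, ey = epipole[0] - cx, epipole[1] - cy
    norm = math.hypot(ex, ey)
    alpha = 1.0 if ex >= 0 else -1.0
    c, s = alpha * ex / norm, alpha * ey / norm
    # R: rotate so that the epipole lands on the x-axis at (v, 0, 1).
    R = [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]
    v = alpha * norm
    # G: send (v, 0, 1) to the point at infinity (v, 0, 0).
    G = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0 / v, 0.0, 1.0]]
    return matmul3(G, matmul3(R, T))
```

Applying the returned matrix to the homogeneous epipole should give a vector whose second and third components vanish up to rounding, which is a quick sanity check that the epipole has been moved to infinity along the x-axis.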
In , the scheme in (4) is used to obtain the rectification
matrices Hl and Hr for the left and right view, respectively, in
order to create two parallel views. However, its performance
relies mainly on the accuracy of the calculated epipoles. In
, a more robust and accurate matching transform method
is used, where the transformation Hl for the left view is still
obtained by (4), but Hr for the right view is obtained by
finding a matching transform that minimizes the mismatch of
the two rectified views. However, this method needs to solve
the camera matrices, which are not always available.
In this paper, we optimize the rectification matrix Hr for
the right view by minimizing the distances between a group
of rectified corresponding points in the two views, that is
where xli and xri are some of the most accurate point cor-
respondences in the two images, selected by the RANSAC
algorithm. The Levenberg–Marquardt algorithm  is used
to find the optimal solution of Hr, with the initial value given
by the method in (4). Our experimental results show that using
(5) can reduce the average vertical mismatch of the two views
by as much as 80% compared to the method in .
After the rectification, the resolutions of some regions in
the rectified views are down-scaled, which can decrease the
quality of the interpolated view. The down-scaling factor at a
pixel position $(\tilde{x}, \tilde{y})$ in the rectified view is given by 
$m(\tilde{x}, \tilde{y}) =$
To compensate for the loss of resolution, the pixel value at
position $(\tilde{x}, \tilde{y})$ is extended to the unfilled pixels in a square
region around $(\tilde{x}, \tilde{y})$ with a side length of $\sqrt{m(\tilde{x}, \tilde{y})}$.
2) Disparity Estimation: Since two parallel views are
created after rectification, disparity estimation can be
performed in 1-D, which has been studied extensively in computer
vision. A 1-D dynamic programming method is used in 
to estimate the disparity. However, independent processing of
different scan lines leads to horizontal stripes in the disparity
map. Several graph cut algorithms have been proposed ,
which achieve more accurate disparity estimation, but they
cannot handle occlusions well, because they assume that each
pixel in the left view can be mapped into multiple pixels in
the right view, but in reality some pixels in the left view
can be occluded and do not correspond to any pixel in the
right view. In , a smoothness term is introduced into the
cost function to favor solutions with small changes between
neighbors, while preserving the advantages of graph cut. The
energy cost function for a pixel at (x,y) is defined as
$$E(x,y) = E_{\text{data}}(x,y) + E_{\text{occ}}(x,y) + E_{\text{smooth}}(x,y) \quad (7)$$
where $E_{\text{data}}$ results from the intensity differences between
corresponding pixels, $E_{\text{occ}}$ imposes a penalty for marking
a pixel as occluded, and the smoothness term $E_{\text{smooth}}$ ensures
that neighboring pixels have similar disparities. Moreover,
a uniqueness constraint is imposed in  to deal with
occlusions, in which a pixel can correspond to at most one
pixel in the other view, i.e., a pixel can only be labeled as
either a matching point that corresponds to one pixel, or an
occluded point that corresponds to no pixel in the other view.
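To make the three energy terms and the uniqueness constraint concrete, here is a toy evaluation for a single scan line. It is a sketch under simplifying assumptions and not the cited graph cut method: the data term is an absolute intensity difference, the occlusion term is a constant penalty, and the smoothness term is a Potts penalty on neighboring disparities. The constants `occ_penalty` and `smooth_weight` and both function names are hypothetical. The code only evaluates a given labeling; the minimization itself is not shown.

```python
def scanline_energy(left, right, disparity, occ_penalty=20, smooth_weight=5):
    """Total energy of a labeling for one scan line.

    disparity[x] is an integer d (left[x] matches right[x - d]) or None
    when the pixel is labeled as occluded. Matches must stay in range.
    """
    e_data = e_occ = e_smooth = 0
    for x, d in enumerate(disparity):
        if d is None:
            e_occ += occ_penalty                      # occlusion term
        else:
            e_data += abs(left[x] - right[x - d])     # data term
    for d0, d1 in zip(disparity, disparity[1:]):
        if d0 is not None and d1 is not None and d0 != d1:
            e_smooth += smooth_weight                 # smoothness term
    return e_data + e_occ + e_smooth


def satisfies_uniqueness(disparity):
    """True if no two left pixels are matched to the same right pixel."""
    targets = [x - d for x, d in enumerate(disparity) if d is not None]
    return len(targets) == len(set(targets))
```

For example, labeling a border pixel as occluded and giving the remaining pixels their true shift costs only one occlusion penalty, whereas forcing a zero disparity everywhere pays a large data cost.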
The disparity estimation in  is based on the method
in , by adding an extra term in the cost function to
improve the smoothness of the disparity map. However, our
experimental results show that the improvement is not always
satisfactory. In this paper, we use the more accurate method
in  for disparity estimation.
3) Sub-Pixel View Interpolation: View interpolation can be
performed after disparity estimation. Although two neighbor-
ing views are available, there is no guarantee that every pixel
in one view has its corresponding pixel in the other view, due
to occlusion. Therefore different cases need to be considered.
In addition, in  and , the interpolated coordinates of the
pixels in the middle view are directly rounded to integer, which
reduces the quality of the interpolated view and creates more
occlusion regions. Although an occlusion padding algorithm
is performed in  to improve the view interpolation, the
quality of the synthesized images is still not satisfactory.
Fig. 2. View interpolation. (a) Pixels from both $v'_{i-1}(t_j)$ and $v'_{i+1}(t_j)$ are visible. (b) Pixels whose correspondences are out of the boundary of $v'_{i+1}(t_j)$. (c) Occlusion pixels in $v'_{i-1}(t_j)$. (d) Occlusion pixels in $v'_{i+1}(t_j)$.
In this paper, we propose a sub-pixel interpolation method,
by distributing the contribution of each interpolated pixel
with floating-point coordinates to the two nearest horizontal
neighbors with integer coordinates.
Let $v_m(t_n)$ be the image of View m at time $t_n$, $v'_m(t_n)$ the
rectified $v_m(t_n)$, and $w'_i(t_j)$ the generated virtual image for
View i at time $t_j$. Also let $v'_m(x,y,t_n)$ and $w'_i(x,y,t_j)$ be the
pixel values of $v'_m(t_n)$ and $w'_i(t_j)$ at position (x,y), respectively,
and $d^{m}_{n}(x,y,t_j)$ the disparity of View m relative to View n at
position (x,y) and time $t_j$. As in  and , we interpolate
the middle view by considering three cases.
If a pixel is visible in both views, as shown in Fig. 2(a),
the corresponding pixel position in the intermediate view can
be easily obtained by scaling the disparity value, and the
pixel value of the intermediate pixel is interpolated from the
correspondences in the left and right views by
$$w'_i\big(x - \alpha d^{i+1}_{i-1}(x,y,t_j),\, y,\, t_j\big) = (1-\alpha)\, v'_{i-1}(x,y,t_j) + \alpha\, v'_{i+1}\big(x - d^{i+1}_{i-1}(x,y,t_j),\, y,\, t_j\big) \quad (8)$$
where α is the ratio between the distance from View i − 1
to View i and that from View i − 1 to View i + 1. Note that
$x - \alpha d^{i+1}_{i-1}(x,y,t_j)$ generally has a floating-point value.
For pixels whose corresponding pixels are out of the valid
image area in the other view [Fig. 2(b)], we extend the
disparity of the border pixel, and the pixel color is copied
accordingly. That is, if the correspondence of $v'_{i-1}(x,y,t_j)$ is
invalid in $v'_{i+1}(t_j)$, the interpolated pixel is taken as
$$w'_i\big(x - \alpha \cdot d^{i+1}_{i-1}(x_r,y,t_j),\, y,\, t_j\big) = v'_{i-1}(x,y,t_j). \quad (9)$$
Similarly, if the correspondence of $v'_{i+1}(x,y,t_j)$ is invalid in
$v'_{i-1}(t_j)$, the interpolated pixel is
$$w'_i\big(x + (1-\alpha) \cdot d^{i+1}_{i-1}(x_l,y,t_j),\, y,\, t_j\big) = v'_{i+1}(x,y,t_j). \quad (10)$$
In (9) and (10), $x_r$ and $x_l$ are the horizontal coordinates of the first
neighbors of $v'_{i-1}(x,y,t_j)$ and $v'_{i+1}(x,y,t_j)$, respectively, with valid point
correspondence in the other view, as shown in Fig. 2(b).
Due to occlusions, some pixels are only seen in one view.
Their disparity values are therefore unavailable. In our system,
these pixels are detected by the disparity estimation method
in . As shown in Fig. 2(c) and (d) (see also ), the
occlusion areas in the left view (View i − 1) are occluded
by the objects at their right side, and the occlusion pixels in
the right view (View i + 1) are occluded by objects at their
left side. Therefore, view interpolation can use the disparities
of the neighboring background pixels. For view interpolation
involving occlusion pixels in $v'_{i-1}(t_j)$, the disparity of the first
available pixel to the left is used
$$w'_i\big(x - \alpha \cdot d^{i+1}_{i-1}(x_l,y,t_j),\, y,\, t_j\big) = v'_{i-1}(x,y,t_j). \quad (11)$$
For view interpolation involving occlusion pixels in $v'_{i+1}(t_j)$,
the disparity of the first available pixel to the right is used
$$w'_i\big(x + (1-\alpha) \cdot d^{i+1}_{i-1}(x_r,y,t_j),\, y,\, t_j\big) = v'_{i+1}(x,y,t_j). \quad (12)$$
In (11) and (12), $x_l$ and $x_r$ are shown in Fig. 2(c) and (d).
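The first case and the sub-pixel accumulation of (13) can be illustrated on one scan line. This is a minimal sketch and not the authors' implementation: it handles only pixels with a valid match, skips the border and occlusion cases, and represents rows as Python lists. The function name `interpolate_scanline` and its argument layout are hypothetical.

```python
import math


def interpolate_scanline(left, right, disparity, alpha=0.5, c0=0.1):
    """Synthesize one row of the middle view from matched pixels only.

    disparity[x] is d for left pixel x (its match is right[x - d]),
    or None when no match exists. Unfilled outputs are returned as None.
    """
    width = len(left)
    num = [0.0] * width  # weighted sums of sample values
    den = [0.0] * width  # sums of weights (the normalizer C)
    for x, d in enumerate(disparity):
        if d is None or not 0 <= x - d < width:
            continue  # border and occlusion cases are not handled here
        pos = x - alpha * d                     # floating-point position
        val = (1 - alpha) * left[x] + alpha * right[x - d]
        base = math.floor(pos)
        for x0 in (base, base + 1):             # two nearest integer columns
            if 0 <= x0 < width and abs(pos - x0) < 1:
                g = 1.0 / (abs(pos - x0) + c0)  # inverse-distance weight
                num[x0] += g * val
                den[x0] += g
    return [round(n / w) if w > 0 else None for n, w in zip(num, den)]
```

With zero disparity and identical rows the output reproduces the input. With α = 0.5 every sample lands on an integer or half-integer position, so each output pixel receives at most a few equally weighted contributions.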
Finally, to obtain the interpolated pixels at an integer
location $(x_0, y_0)$, we use the weighted combination of all pixels
within unit distance from $(x_0, y_0)$, that is
$$w'_i(x_0,y_0,t_j) = \mathrm{round}\Big(\frac{1}{C(x,x_0)} \sum_{|x-x_0|<1} \gamma(x,x_0) \cdot w'_i(x,y_0,t_j)\Big) \quad (13)$$
where $C(x,x_0) = \sum_{|x-x_0|<1} \gamma(x,x_0)$, $\gamma(x,x_0) = 1/(|x-x_0|+c_0)$,
and $c_0$ is a constant to prevent overflow, and is set to be 0.1
in our implementation.
Note that if the distance between the left/right view and the
target view is equal, the factor α in (8) to (12) will be 0.5,
and the interpolated coordinates will be either integer or half-
integer. In this case, the complexity of (13) can be simplified.
4) Projective Un-Rectification: Similar to , the
rectification algorithm above could generate non-rectangular
interpolated images. Therefore, the last step of the RVI method is to
back-project the intermediate view to the original coordinates
at the same position. To do so, we first locate the positions of
the four corners from the interpolated image $w'_i(t_j)$, denoted by
$x'_i$, i = 1,...,4. Our goal is to find a 3 × 3 un-rectification
matrix B that minimizes the mapping error from these points
to the four corners of the unrectified image $w_i(t_j)$, that is
where $x_i$ are homogeneous coordinates of the four corners
in $w_i(t_j)$. The direct linear transform method in  can
be applied to convert (14) into a constrained least-squares
problem
$$\min_{b} \|Ab\| \quad \text{subject to} \quad \|b\| = 1 \quad (15)$$
where $b = [b_1\ b_2\ b_3]^T$ ($b_i$ is the ith row of B), i.e., the
vectorized version of B. Matrix A is an 8 × 9 matrix, and
each pair of corner correspondences contributes to two rows
of A. The optimal solution to (15) is the unit singular vector
that corresponds to the smallest singular value of A.
Fig. 3. Proposed MVC schemes using (a) view interpolation and (b) view extrapolation.
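For illustration, a four-corner homography can also be recovered without a singular value decomposition. The sketch below uses a different normalization from (15): instead of constraining b to unit norm, it fixes the last entry of B to 1 and solves the resulting 8 × 8 linear system by Gaussian elimination. This is a swapped-in simplification that needs only the standard library. It fails if the true last entry is zero or if three corners are collinear. The names `solve_linear` and `homography_from_corners` are hypothetical.

```python
def solve_linear(a, y):
    """Solve a z = y by Gauss-Jordan elimination with partial pivoting."""
    n = len(y)
    m = [[float(v) for v in row] + [float(y[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_corners(src, dst):
    """Return a 3x3 matrix B with B[2][2] = 1 mapping src[i] to dst[i].

    src and dst are lists of four (x, y) corner positions. Each pair
    contributes two rows of the linear system, as in the DLT method.
    """
    a, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    b = solve_linear(a, rhs) + [1.0]
    return [b[0:3], b[3:6], b[6:9]]
```

In this setting `src` would hold the four corners of the interpolated rectified image and `dst` the corners of the unrectified target image.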
B. RVI-Based MVC
In this section, we apply our RVI method to H.264-based
MVC, by modifying the MDE-based JMVC software ,
which uses hierarchical B structure in the temporal direction
and I-B-P prediction structure in the inter-view direction.
The coding structure of our RVI-based MVC is illustrated in
Fig. 3(a) for a system with five views and a group of pictures
(GOP) size of 8. The coding of the even-indexed views is
identical to that in the JMVC. That is, v0 is
coded using hierarchical B structure in the temporal direction.
Other even-indexed views are coded by hierarchical B struc-
ture in the temporal direction, as well as disparity-compensated
inter-view prediction using the previously reconstructed even-
indexed view as reference.
For the odd-indexed views v2k+1, in addition to temporal B
references, two inter-view reference pictures are used in our
method. The first is a synthesized frame w2k+1(tj) generated
by the proposed RVI method. The second is the left view. The
encoder then uses R-D optimization to find the best coding
mode for each block, by treating the synthesized view as
an additional reference picture. The synthesized views can
be generated at the decoder using the reconstructed reference
views, thus no additional bits need to be sent to the decoder.
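The reference selection step can be pictured with a toy cost comparison. The sketch assumes the Lagrangian cost J = D + λR that is common in H.264-style encoders; the source states only that R-D optimization is used, so the cost form, the sum-of-squared-differences distortion, the rate estimates, and the names `ssd` and `best_reference` are all assumptions. A real encoder also evaluates partition sizes and coding modes, which are omitted here.

```python
def ssd(a, b):
    """Sum of squared differences between two flattened blocks."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def best_reference(block, predictions, rate_bits, lam):
    """Pick the reference whose prediction minimizes J = D + lam * R.

    predictions maps a reference name to its predicted block, and
    rate_bits maps the same name to an estimated bit cost.
    """
    costs = {name: ssd(block, pred) + lam * rate_bits[name]
             for name, pred in predictions.items()}
    return min(costs, key=costs.get), costs
```

Adding the synthesized frame to `predictions` alongside the temporal and left-view references lets it be chosen only for the blocks where it gives the lowest cost.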
It should be mentioned that the frames of v2k+1 are coded
as B pictures in the inter-view direction in the JMVC, using
the left view and the right view as references. Therefore our
scheme has the same number of inter-view references as the
JMVC. However, since the quality of our view interpolation-
based prediction is usually better than that of the disparity
compensation, the proposed MVC scheme can achieve a better
coding efficiency than JMVC, as shown in Section VI.
III. Projective Rectification-Based View
Extrapolation and Application in MVC
View interpolation requires a left view and a right view.
When it is applied to MVC, VSP can therefore be applied to
only half of the views in order to get satisfactory performance.
In this section, we generalize the RVI method to get an RVE
algorithm using two left views or two right views. We then apply
the RVE method to MVC to encode all views after the first two views.
A. The Proposed RVE Algorithm
In this paper, we assume that the view extrapolation algo-
rithm uses two left views to synthesize a right view. Similar to
the view interpolation algorithm in Section II, the extrapolation
algorithm first performs projective rectification and disparity
estimation to the two left views. After that, instead of inter-
polating the disparity to find the corresponding pixel locations
in the middle view, the algorithm extrapolates the disparity
and estimates the pixel locations in the right view. The final
step of un-rectification is still similar to the view interpolation
method. The disparity extrapolation is described below, since
it is the only different step.
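The idea of extrapolating rather than interpolating the disparity can be sketched for pixels that are visible in both reference views. The code is a simplified illustration: it assumes that the disparity toward the target view is c times the disparity between the two reference views, rounds target positions to the nearest integer instead of using sub-pixel accumulation, and ignores the border and occlusion cases. The function name `extrapolate_scanline` is hypothetical.

```python
def extrapolate_scanline(far_row, near_row, disparity, c=1.0):
    """Extrapolate one row of the target view from two rectified rows.

    far_row comes from View i-2 and near_row from View i-1.
    disparity[x] is d such that far_row[x] matches near_row[x - d],
    or None when no match exists. Unfilled outputs stay None.
    """
    width = len(far_row)
    out = [None] * width
    for x, d in enumerate(disparity):
        if d is None or not 0 <= x - d < width:
            continue  # only pixels visible in both views are handled
        pos = round(x - (1 + c) * d)  # extrapolated position, rounded
        if 0 <= pos < width:
            out[pos] = (far_row[x] + near_row[x - d]) / 2
    return out
```

With c = 1 a feature that moves by d pixels between the two reference views is placed a further d pixels along in the synthesized row.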
Using the same notations as in Section II-A3, two frames
from the two previous views, $v_{i-2}(t_j)$ and $v_{i-1}(t_j)$, are used to
extrapolate a frame for View i. Let $v'_{i-2}(t_j)$, $v'_{i-1}(t_j)$ and $u'_i(t_j)$
be the rectified frames of $v_{i-2}(t_j)$, $v_{i-1}(t_j)$ and the synthesized
View i at $t_j$, respectively.
If the horizontal camera distance between $u'_i(t_j)$ and $v'_{i-1}(t_j)$
is c times that between $v'_{i-2}(t_j)$ and $v'_{i-1}(t_j)$, we assume their
disparities have the same scaling factor, that is
$$d^{i}_{i-1}(x,y,t_j) = c \cdot d^{i-1}_{i-2}(x,y,t_j). \quad (16)$$
The following three cases need to be handled.
If a pixel is visible in both $v'_{i-2}(t_j)$ and $v'_{i-1}(t_j)$, as shown
in Fig. 4(a), we extrapolate their disparity, and the synthesized
pixel in $u'_i(t_j)$ is the average of the pixel pair. That is
$$u'_i\big(x - (1+c)\, d^{i-1}_{i-2}(x,y,t_j),\, y,\, t_j\big) = \tfrac{1}{2}\Big[v'_{i-2}(x,y,t_j) + v'_{i-1}\big(x - d^{i-1}_{i-2}(x,y,t_j),\, y,\, t_j\big)\Big]. \quad (17)$$
For pixels whose correspondences are out of the valid region
in $v'_{i-2}(t_j)$, as shown in Fig. 4(b), we scale the disparity of the
first left pixel $(x_l, y)$ with valid point correspondence
?x − c · di−2
If a pixel at (x,y) is only visible in $v'_{i-1}(t_j)$, as shown in
Fig. 4(c), it is also assumed to be visible in the extrapolated
view, and the first available disparity to the right of this pixel,