3124IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 11, NOVEMBER 2011
Human Object Inpainting Using Manifold Learning-Based Posture Sequence Estimation
Chih-Hung Ling, Yu-Ming Liang, Chia-Wen Lin, Senior Member, IEEE, Yong-Sheng Chen, Member, IEEE, and
Hong-Yuan Mark Liao, Senior Member, IEEE
Abstract—We propose a human object inpainting scheme that divides the process into three steps: 1) human posture synthesis; 2) graphical model construction; and 3) posture sequence estimation. Human posture synthesis is used to enrich the number of postures in the database, after which all the postures are used to build a graphical model that can estimate the motion tendency of an object. We also introduce two constraints to confine the motion continuity property. The first constraint limits the maximum search distance if a trajectory in the graphical model is discontinuous, and the second confines the search direction in order to maintain the tendency of an object's motion. We perform both forward and backward predictions to derive local optimal solutions. Then, to compute an overall best solution, we apply the Markov random field model and take the potential trajectory with the maximum total probability as the final result. The proposed posture sequence estimation model can help identify a set of suitable postures from the posture database to restore damaged/missing postures. It can also make a reconstructed motion sequence look continuous.

Index Terms—Dimensionality reduction, Isomap, manifold learning, object completion, video inpainting.
Manuscript received October 27, 2010; revised February 14, 2011 and May 2011; date of current version October 19, 2011. This work was supported in part by the National Science Council of Taiwan under Grant NSC98-2221-E-007-080-MY3 and in part by the Taiwan E-learning and Digital Archives Program, sponsored by Grant NSC100-2631-H-001-013. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. James E. Fowler.
C.-H. Ling and Y.-S. Chen are with the Department of Computer Science, National Chiao Tung University, Hsinchu 300, Taiwan.
Y.-M. Liang is with the Department of Computer Science and Information Engineering, Aletheia University, Taipei 251, Taiwan.
C.-W. Lin is with the Department of Electrical Engineering, National Tsing Hua University, Hsinchu 300, Taiwan (e-mail: email@example.com).
H.-Y. M. Liao is with the Institute of Information Science, Academia Sinica, Taipei 115, Taiwan, and also with the Department of Computer Science, National Chiao Tung University, Hsinchu 300, Taiwan.
Color versions of one or more of the figures in this paper are available online.
Digital Object Identifier 10.1109/TIP.2011.2158228

Video inpainting has become a popular research field in recent years, owing to its powerful capability in video editing and recovering. A number of algorithms for automatic video inpainting have been proposed in the past few years.

Conventional video inpainting methods can be roughly classified into two types: the first type is patch based, and the other type is template based. Patwardhan et al. proposed a video inpainting technique that uses motion information and an image inpainting technique together. Motion information is adopted to help find the most suitable patch. In a related approach, the space–time volume is sliced up into motion manifolds to perform video completion. The proposed manifolds are composed of 2-D patches (one for the spatial dimension and the other for the temporal dimension). These patches cover the entire trajectory of pixels, and the method applies the approach of Sun et al. to inpaint those missing regions. However, these approaches would cause spatial or temporal structure inconsistency artifacts. Wexler et al. adopted a fixed-size 3-D patch as the unit for video inpainting. The value of a missing pixel is estimated by a set of constituent patches, and a multiscale solution is used to speed up the process. Cheung et al. introduced a probabilistic patch model for video inpainting. They use a video epitome method to compress an original video by learning; after that, the epitome is used to synthesize data for the damaged areas of a video.
In the template-based video inpainting category, Cheung et al. proposed a technique to deal with the problem of missing objects in videos captured by a stationary camera. All available object templates are used to inpaint the foreground. Then, for each missing object, a fixed-size sliding window that covers the missing object and its neighboring templates is used to find the most similar object template. Although the sliding window can help find similar object templates, the inpainting result may be unsatisfactory if the number of postures is insufficient. Furthermore, a good filling position is crucial for an object inpainting process because an inappropriate position may cause visually annoying artifacts. Jia et al. proposed a user-assisted video layer segmentation technique that decomposes an input video into color and illumination videos. A tensor voting technique is then used to address the pertinent spatio–temporal issues in background and foreground. Image repairing is used for background inpainting, and occluded objects are reconstructed
by synthesizing other available objects. However, a synthesized object created under this approach does not have a real trajectory.
Although an object can perform a broad variety of movements, the set of typically performed movements is usually located on a latent space that is low dimensional, particularly when the period of object occlusion is not long, in which case the missing part usually only contains a simple class of movements.
Therefore, motion priors can aid in relaxing the ill-posedness
of video inpainting by projecting the high-dimensional video
data to a low-dimensional manifold learned from training data
and then recovering the missing information in the low-dimensional manifold. Ding et al. proposed a nonlinear dimension-reduction-based video inpainting technique that utilizes locally linear embedding to transform data observed in frames
into embedded features in a low-dimensional manifold. Then,
the embedded features are organized to form a Hankel matrix,
and missing data can be determined by minimizing the rank of
the matrix. Finally, the radial basis function (RBF) is used for
inverse mapping. Again, the drawback of this method is that it
causes blurring and ghost image artifacts if the object’s motion
is not periodic.
Motion prior models derived from training data have also been successfully applied in applications of marker-free human motion capture and analysis. Generally, two main classes of motion priors can be identified. The first class utilizes an explicit motion model to guide motion analysis and tracking of body parts. For example, one such method utilizes variable-length Markov models (VLMMs) to characterize both the short-term dynamics and long-term history of video data. Similar to the approach taken in this paper, the second class learns a low-dimensional posture manifold and performs analysis and tracking in the low-dimensional manifold. The inverse mapping from the low-dimensional manifold to the high-dimensional full-body configuration can be accomplished via RBF or locally linear coordination. Although the basic components for dimensionality reduction and inverse mapping are similar, because motion analysis is aimed at tracking human motion, the key component of object inpainting, i.e., recovering missing trajectories in the learned low-dimensional manifold, was usually not addressed in these motion analysis works.
Our literature survey shows that most video inpainting algorithms generate artifacts if the object to be inpainted is completely occluded or its motion is not periodic. To avoid generating such artifacts, a posture sequence estimation process of good accuracy is required for object inpainting. To this end, Xu et al. proposed a method for animating animal motions. The model rearranges available animal templates to form a new animal motion sequence by minimizing a predefined energy function. In this paper, rather than using an optimization approach, which is time consuming, we propose a posture sequence estimation method that maintains the continuity of the local motion of an object. The proposed framework consists of three steps: 1) human posture synthesis; 2) graphical model construction; and 3) posture sequence estimation. Human posture synthesis is used to enrich the number of postures in the database, after which all the postures are used to build a graphical model that can predict motion tendency. We also propose two constraints to confine the motion continuity property. The first constraint limits the maximum search distance if a trajectory in a graphical model is discontinuous, and the second confines the search direction in order to maintain the tendency of an object's motion. We perform both forward and backward prediction to derive local optimal solutions. Finally, we apply the Markov random field (MRF) model to compute an overall best solution, and the potential trajectory with the maximum total probability is taken as the final result. The proposed posture sequence estimation model can help identify a set of suitable postures. It can also make a reconstructed motion look continuous.
The advantage of this posture sequence estimation strategy is that it can handle cases such as nonperiodic motion or complete occlusion, whereas model-based motion prediction methods must use a training process to achieve the same goal.
Fig. 1. Projecting posture differences onto the y-axis.
The remainder of this paper is organized as follows: In Section II, we explain how to perform object inpainting based on the proposed posture sequence estimation method. In Section III, we discuss the results of experiments conducted to evaluate the method. Section IV contains some concluding remarks.

II. HUMAN OBJECT INPAINTING USING POSTURE SEQUENCE ESTIMATION

Here, we explain how to perform human object inpainting based on the proposed posture sequence estimation method. As mentioned earlier, the method includes three steps: 1) human posture synthesis; 2) graphical model construction; and 3) posture sequence estimation. We discuss the steps in detail in the following subsections.
A. Human Posture Synthesis
The problem of an insufficient number of postures will affect the visual quality of any video sequence generated by a posture-prediction-based approach. To solve the shortage-of-posture problem, we utilize our previous posture synthesis method, which was mainly designed for generating synthetic human postures to increase the number of postures. The human posture creation process combines the constituent parts of different available postures to enrich the contents of a posture database. Specifically, the first step performs appropriate segmentation of the postures in the database so that intermediate postures can be generated to interpolate the posture database. We use a bounding rectangle to bind each posture; then, we align the two postures, as indicated by the middle part in Fig. 1. Finally, we take the difference between the two postures and project the difference onto the y-axis, as shown on the right-hand side of Fig. 1.
To detect which parts of a human body significantly move, it is necessary to calculate the differences between a posture and all the other postures in the database. All the posture differences are projected onto the y-axis such that the accumulated y-axis component will be like the distribution shown on the right-hand
Fig. 2. Projecting all the differences between any two postures onto the y-axis and calculating the cumulative amount.
Fig. 3. Constituent components of a posture are partitioned based on the local variance. The dashed line that separates the postures into constituent components can be determined based on the distribution of the local variance shown on the left-hand side of the figure.
Fig. 4. A new posture can be synthesized by combining different components (e.g., the torso and the legs).
side of Fig. 2. Then, from the peaks and valleys of the projected distribution, it is possible to properly segment a posture, as shown by the posture sequence in Fig. 3. From the segmented parts derived from many postures, new postures can be synthesized by combining constituent parts, as shown in Fig. 4.
Note that, for the sake of simplicity, in Fig. 1, we assume that the object moves along the direction parallel to the image plane (i.e., the horizontal direction). If the object moves along another direction, the posture difference should be projected onto the axis that is orthogonal to the direction of object movement (e.g., the x-axis for vertical movement). The proposed synthesis method is of low complexity and can only synthesize object postures that can be explicitly decomposed into two or more constituent parts; for coping with sophisticated cases of body part localization, one can refer to more advanced methods. Moreover, the proposed posture synthesis step is meant to offer more postures with a limited set of configurations of body parts in the posture database to increase the spatio–temporal continuity of a reconstructed trajectory in the low-dimensional manifold, rather than to synthesize arbitrary new postures.
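The projection-and-segmentation procedure above can be sketched as follows. This is a minimal illustration with binary silhouette arrays; the function names, the valley-picking rule, and the single-cut synthesis are our own simplifications, not the paper's exact algorithm:

```python
import numpy as np

def row_projection_variance(silhouettes):
    """Accumulate per-row (y-axis) absolute differences over all posture
    pairs; peaks mark body parts that move a lot, while valleys mark
    stable rows where a posture can be cut into constituent parts."""
    acc = np.zeros(silhouettes[0].shape[0])
    n = len(silhouettes)
    for i in range(n):
        for j in range(i + 1, n):
            diff = np.abs(silhouettes[i].astype(int) - silhouettes[j].astype(int))
            acc += diff.sum(axis=1)  # project the difference onto the y-axis
    return acc

def cut_rows(acc, n_cuts=1):
    """Pick the rows with the smallest accumulated difference (valleys)
    as segmentation boundaries, ignoring the image border rows."""
    interior_order = np.argsort(acc[1:-1]) + 1
    return sorted(interior_order[:n_cuts].tolist())

def synthesize(posture_a, posture_b, cut):
    """Combine the upper part of one posture with the lower part of
    another to create a new synthetic posture."""
    return np.vstack([posture_a[:cut], posture_b[cut:]])
```

For two silhouettes that share a torso but differ in the legs, the accumulated projection is zero over the torso rows and positive over the leg rows, so the cut lands between them and `synthesize` can swap leg configurations between postures.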
B. Graphical Model Construction
After creating synthetic postures, the posture database will
contain a lot of postures that can be used to build a graphical
model of an object’s motion, as shown in Fig. 5. The model
Fig. 5. Graphical model of an object’s motion in a low-dimensional manifold.
The blue points represent the feature points of the postures, and the red lines
connect two feature points whose corresponding postures appear in adjacent
frames. In this example, occlusion occurs between frames t and t + k + 1; hence, we try to find a motion path with k internal points that can be used to link points p_t and p_{t+k+1}.
Fig. 6. Extracting the local context of a posture: (a) the object's original posture; (b) the object's silhouette described by a set of feature points; (c) using the convex hull to extract critical reference points; and (d) a shape context mask on a feature point.
provides a simple representation of an object's motion. To obtain such a model, all postures (both synthesized and existing postures) must be projected onto a feature space. Then, we link the postures that appear in adjacent frames in the constructed feature space. After applying the above procedure, we can obtain a graphical representation of the object's motion. To model the distribution of the postures in the feature space, we need to know the distances between distinct postures. We use a shape context descriptor that we developed in a previous work, which is a modified version of a previously proposed descriptor, to compile a detailed description of each posture. The value of the shape context is calculated along the silhouette of the posture. In the posture sequence estimation stage, the values of the shape contexts will be used to compare the degree of similarity between two distinct postures.
To calculate the value of a shape context, the silhouette of a posture must be represented as a set of sampled points, as shown in Fig. 6(b). A convex hull is used to select some critical reference points among the sampled points [see Fig. 6(c)]. Then, for each critical reference point, a corresponding local histogram of the feature points that fall within a circle of a given radius is computed in a log-polar space to represent the local shape context of that point [see Fig. 6(d)]. The cost of
matching two sampled points p_i and q_j that belong to different postures is defined as follows:

C(p_i, q_j) = (1/2) \sum_{k=1}^{K} [h_i(k) - h_j(k)]^2 / [h_i(k) + h_j(k)]

where h_i(k) and h_j(k) denote the kth bin of the local histograms of the two sampled points p_i and q_j, respectively. The number of bins K is empirically set to be 60 for all sequences, and the value of the radius is determined by a previously described algorithm. The best match between two different postures can be accomplished by minimizing the following total matching cost:

H(\pi) = \sum_{i} C(p_i, q_{\pi(i)})

where \pi is a permutation of the sample point indexes. Because of the one-to-one matching requirement, shape matching can be considered as an assignment problem that can be solved by a bipartite graph matching method. Therefore, the shape context distance between shapes P and Q can be computed as follows:

D(P, Q) = (1/n) \sum_{p \in P} C(p, \pi(p)) + (1/m) \sum_{q \in Q} C(\pi^{-1}(q), q)

where n and m are the numbers of sample points on shapes P and Q, respectively.

By using this context descriptor, we can calculate the degree of similarity between two distinct postures. Then, based on the similarity scores of the postures, we cluster all the postures in the database by using a nonlinear dimension reduction method called isometric feature mapping (Isomap). In our application, existing and synthesized postures are regarded as input data points for Isomap, and the distance between two data points is equivalent to the degree of similarity between two corresponding postures. We modify the Isomap algorithm to fit our requirements as follows.

Step 1) Construct a neighborhood graph: If data point x_j is one of the K-nearest neighbors (K-NN) of x_i, define a graph G that connects data points x_i and x_j. The length of the edge between x_i and x_j is used to measure the degree of similarity between postures i and j.

Step 2) Compute the shortest paths: Find the shortest path between each pair of feature points in G; the resulting matrix D_G contains all the shortest paths between all pairs of data points in G.

Step 3) Construct a d-dimensional embedding: To derive the final result, we apply classical multidimensional scaling to the matrix of graph distances D_G and take the top d eigenvectors of the double-centered matrix as the embedding.
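The matching pipeline described above (log-polar histograms, chi-square matching costs, and a one-to-one bipartite assignment) can be sketched as follows. The bin counts, the radius normalization, and all helper names are illustrative assumptions rather than the paper's exact parameters:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def log_polar_histogram(points, center, n_r=3, n_theta=4):
    """Histogram of the remaining points around `center` in log-polar
    bins: the local shape context of one silhouette point."""
    rel = points - center
    r = np.hypot(rel[:, 0], rel[:, 1])
    theta = np.arctan2(rel[:, 1], rel[:, 0])
    mask = r > 0                      # drop the center point itself
    r, theta = r[mask], theta[mask]
    r_bin = np.minimum((np.log1p(r) / np.log1p(r.max()) * n_r).astype(int), n_r - 1)
    t_bin = ((theta + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
    hist = np.zeros((n_r, n_theta))
    np.add.at(hist, (r_bin, t_bin), 1)
    return hist.ravel()

def chi2_cost(h1, h2):
    """Chi-square matching cost between two shape-context histograms."""
    denom = np.where(h1 + h2 == 0, 1.0, h1 + h2)
    return 0.5 * np.sum((h1 - h2) ** 2 / denom)

def shape_context_distance(pts_p, pts_q):
    """Total matching cost under the best one-to-one correspondence,
    solved as a bipartite (linear sum) assignment problem."""
    hp = [log_polar_histogram(pts_p, c) for c in pts_p]
    hq = [log_polar_histogram(pts_q, c) for c in pts_q]
    cost = np.array([[chi2_cost(a, b) for b in hq] for a in hp])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].sum()
```

Because each histogram is built from offsets relative to its center point, the resulting distance is invariant to translation, which is what allows postures extracted at different image positions to be compared directly.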
A special feature of Isomap is that it can preserve the distances between data points in each local region during dimensionality reduction. Hence, we can retain the similarity information between postures in each local region of a graphical model and utilize the information to check the motion continuity property between adjacent postures.
Fig. 7. Neighborhood constraint.
Fig. 8. Motion tendency constraint.
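The three Isomap steps can be sketched generically as follows, taking a precomputed posture-dissimilarity matrix as input. Here k and d are free parameters, and the eigendecomposition-based classical MDS is the standard Isomap formulation, not anything specific to this paper:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(dist, k=2, d=2):
    """Isomap on a precomputed dissimilarity matrix:
    1) build a k-NN neighborhood graph,
    2) compute all-pairs shortest (geodesic) paths,
    3) apply classical MDS to the geodesic distance matrix."""
    n = dist.shape[0]
    graph = np.full((n, n), np.inf)   # inf marks a non-edge for csgraph
    for i in range(n):
        nn = np.argsort(dist[i])[1:k + 1]   # skip the point itself
        graph[i, nn] = dist[i, nn]
        graph[nn, i] = dist[nn, i]          # keep the graph symmetric
    geo = shortest_path(graph, method='D', directed=False)
    # classical MDS: double-center the squared geodesic distances
    j_mat = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * j_mat @ (geo ** 2) @ j_mat
    w, v = np.linalg.eigh(b)
    idx = np.argsort(w)[::-1][:d]           # top-d eigenpairs
    return v[:, idx] * np.sqrt(np.maximum(w[idx], 0))
```

On data that already lie on a line, the geodesic distances equal the input distances and the 1-D embedding reproduces them exactly, which is the "distance-preserving" property the section relies on.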
C. Posture Sequence Estimation
Based on the graphical model of an object's motion shown in Fig. 5, we obtain suitable postures to replace damaged/missing postures by finding an appropriate path that links the data points at the two ends of the occlusion in a low-dimensional manifold. Intuitively, a motion
path can be reconstructed by taking the shortest path between two nodes or by an optimization process, but these two approaches cannot guarantee the smoothness of a recovered motion. To resolve the problem, we propose using two constraints to regulate the motion continuity property in the local region of a graphical model. Specifically, we need a strategy to select a certain number of data points that satisfy the continuous motion constraint. The first constraint limits the search range to within a reasonable neighborhood, as shown in Fig. 7. Therefore, we need to define the search range of the complete trajectory of an object's motion. In the manifold domain, such trajectories are comprised of a number of linked data points (see Fig. 5). To determine the distance between any two consecutive data points on a trajectory, we calculate the shape context difference between their corresponding postures. Then, the maximum distance among all the measured distances is taken as the search
range to satisfy the first constraint. Since the search range is circular, we calculate the radius as follows:

r = \max_i d(p_i, p_{i+1})

where d(p_i, p_{i+1}) represents the distance between p_i and p_{i+1} on an object's motion trajectory.

The second constraint is introduced to maintain the tendency of an object's motion in each local region. It can be realized by checking the tendency of an object's motion trajectory in a graphical model. In a low-dimensional manifold, a motion trajectory does not significantly change direction in a neighborhood region. Based on this observation, a variance constraint of motion tendency is designed to ensure that the variance of motion tendency stays within a reasonable range (see Fig. 8). In the manifold domain, the complete trajectory of an object's motion is comprised of a number of linked segments, as shown by the red lines in Fig. 5. For the segments indicated by the lines,
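A minimal sketch of the two constraints, assuming that trajectory points are low-dimensional coordinates and using Euclidean distance in place of the shape-context distance:

```python
import numpy as np

def search_radius(trajectory):
    """First constraint: the search range is a circle whose radius is the
    largest distance between consecutive points on a known trajectory."""
    diffs = np.diff(trajectory, axis=0)
    return np.linalg.norm(diffs, axis=1).max()

def max_turn_angle(trajectory):
    """Second constraint: the maximum allowable direction change is the
    largest angle between consecutive segment vectors, computed from
    their inner product."""
    segs = np.diff(trajectory, axis=0)
    angles = []
    for a, b in zip(segs[:-1], segs[1:]):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return max(angles)

def admissible(prev_pt, prev_seg, candidate, radius, angle_max):
    """A candidate next point must lie inside the search circle and must
    not bend the trajectory more than the allowed angle."""
    step = candidate - prev_pt
    if np.linalg.norm(step) > radius:
        return False
    cos = np.dot(prev_seg, step) / (np.linalg.norm(prev_seg) * np.linalg.norm(step))
    return np.arccos(np.clip(cos, -1.0, 1.0)) <= angle_max
```

A candidate point in the manifold is thus kept only if it is both close enough to the previous point and roughly aligned with the incoming motion direction.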
Fig. 9. Some snapshots extracted from test sequence 1.
Fig. 10. (a)–(b) Some forward prediction steps, (c)–(d) some backward prediction steps, and (e) the combined results of a two-way prediction at time t.
we compute the change in direction between any two consecutive segments based on the inner product of their corresponding vectors. Among all the computed direction changes, the largest direction change is taken as the maximum allowable angle for direction change. This angle, which is the basis for executing the second constraint, is calculated by

\theta_{\max} = \max_i \theta_i

where \theta_i represents the angle between the vectors of two consecutive segments on an object's motion trajectory.
Since the above constraints are designed to maintain only the local continuity, we propose a two-way (forward–backward) prediction mechanism. We use three time instants, i.e., t − 1, t, and t + 1. In the forward operation, we make a forward prediction on each data point at time t − 1. The motion tendency constraint and the search range constraint are applied to determine the probable data points at the next time instant t. The selected data points will be used to predict the candidate data points at time t + 1. We apply the same strategy in the reverse direction and collect related information from t + 1 to t. Then, we combine the results from the bidirectional processing to obtain the final results for time t. To illustrate the two-way prediction process further, we use a test sequence containing 245 frames. Some snapshots extracted from test sequence 1 are shown in Fig. 9. The candidate points chosen at time instant 19 are indicated by the blue dots in Fig. 10(a), and their corresponding postures are shown on the left-hand side of the figure. Those candidate points are used to perform forward prediction. The predicted candidate points at time instant 20 are shown in Fig. 10(b). We apply the same procedure in the reverse direction and generate the backward-predicted candidate points from time instant 21 to 20 [shown in Fig. 10(c) and (d)]. The two sets of results are then combined to form the final results, as shown in Fig. 10(e). Table I provides detailed information about the aforementioned processes, including the distance and angle information calculated during the forward and backward predictions.
TABLE I
DETAILED INFORMATION DERIVED DURING THE FORWARD–BACKWARD PREDICTION PROCESS
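The combination of the forward and backward candidate sets at a missing frame can be sketched as follows. The intersection-then-union merging rule is our assumption; the text only states that the two results are combined:

```python
def combine_two_way(forward, backward):
    """Merge candidate sets for a missing frame.  Candidates proposed by
    both the forward and the backward pass are the most trustworthy; if
    the two passes disagree completely, fall back to the union so that
    the frame still receives candidates.  (Hypothetical rule: the paper
    does not spell out the exact merging operation.)"""
    both = forward & backward
    return both if both else forward | backward
```

Either way, the merged set is what the subsequent MRF stage scores to pick a globally consistent trajectory.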
Fig. 11. Example of the MRF process.
Since the motion continuity constraint is only effective on local regions, we use the MRF model to derive global motion continuity. MRF provides a convenient and accurate way to model context-dependent entities, such as image pixels and correlated features. To model an object's motion, instead of following the Markov assumption, we assign one node of the Markov network to each time state. Then, the constructed network can reflect statistical dependences. Given a set of data points located at the intervening nodes, every node of a Markov network is statistically independent of other nodes in the network. Since our Markov network does not contain loops, the aforementioned Markov assumption allows belief propagation to derive the maximum probability during inference. The data point estimated at node t is determined by

\hat{x}_t = \arg\max_{x_t^k} p(x_t^k) \, m_{t-1,t}(x_t^k) \, m_{t+1,t}(x_t^k)     (6)

where x_t^k denotes the kth candidate point associated with node t, p(x_t^k) is the self-probability of candidate point x_t^k, and m_{t-1,t} is the message derived from node t − 1 to node t. m_{t-1,t} can be calculated as follows:

m_{t-1,t} = f(M_{t-1})     (7)

where M_{t-1} includes the probability information of all the candidate data points of node t − 1; i.e., M is the previous message, which is used to generate the new message by executing (7). The initial message is set as a column vector with the initial probability of all the elements associated with node t. Function f(\cdot) is defined as follows:

f(\theta) = \exp(-(\theta - \mu)^2 / (2\sigma^2))     (8)
Fig. 12. Experiments on test sequence 1: (a) partial sequence of test sequence 1 in which the red rectangle indicates missing frames; (b) frames reconstructed by the first compared approach; (c) frames reconstructed by the second compared approach; (d) frames reconstructed by the proposed approach; and (e) the corresponding trajectory information of predicted object motion generated by the three approaches.
where \theta is the angle between the vectors formed by consecutive candidate points, and \mu and \sigma are the mean and standard deviation, respectively, of all angles in a complete trajectory of an object's motion.

To better explain how (6)–(8) find an optimal solution, we use the three nodes shown in Fig. 11 as an example. Node t receives two messages in the form of a column vector with the initial probabilities of the elements associated with nodes t − 1 and t + 1, respectively. The messages contain the probability information
COMPARISON OF THE GROUND-TRUTH POSTURES AND THE RECONSTRUCTED MISSING POSTURES (THE PARTS IN BLACK, RED, AND GRAY REPRESENT THE
GROUND-TRUTH POSTURES, THE RECONSTRUCTED POSTURES, AND THE PERFECTLY MATCHED PORTIONS, RESPECTIVELY)
of all the candidate data points associated with node t. Node t then sends the two updated messages, m_{t,t-1} and m_{t,t+1}, to nodes t − 1 and t + 1, respectively. Before the information is sent, it is reordered to form a column vector. On receipt of the information, nodes t − 1 and t + 1 respond by sending messages m_{t-1,t} and m_{t+1,t}, respectively, to node t. When each candidate point of node t finds a matching point in node t − 1 or t + 1, its probability is updated as follows:

p'(x_t^k) = p(x_t^k) \, m_{t-1,t}(x_t^k) \, m_{t+1,t}(x_t^k)     (9)

where p'(x_t^k) is the new self-probability of candidate point x_t^k, p(x_t^k) is the previous self-probability of candidate point x_t^k, and m_{t-1,t}(x_t^k) and m_{t+1,t}(x_t^k) are the probabilities propagated by nodes t − 1 and t + 1, respectively. After normalizing the probability value of each candidate point calculated by (9), we obtain a new probability value for each candidate point. Then, node t sends the updated message with the new probability to node t + 1. Similarly, if node t receives an updated message from node t + 1, the probability values of all the candidate points of node t are recomputed and sent to node t − 1. Freeman et al. showed that after, at most, one global iteration of (7) on each node of the network, (6) can derive the desired optimal estimate of \hat{x}_t at node t.
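The message passing described above can be illustrated with a generic max-product belief propagation routine on a loop-free chain. The uniform initial messages and the normalization step are standard choices for this family of algorithms, not details taken from the paper:

```python
import numpy as np

def chain_map(unary, pairwise):
    """Max-product belief propagation on a loop-free chain.
    `unary[t]` holds the self-probabilities of node t's candidates, and
    `pairwise[t]` is the compatibility matrix between nodes t and t+1.
    One forward pass, one backward pass, then a per-node belief
    maximization (a single global iteration suffices on a tree)."""
    n = len(unary)
    fwd = [np.ones_like(unary[0])]          # message into node 0 from the left
    for t in range(1, n):
        m = np.max(pairwise[t - 1] * (unary[t - 1] * fwd[t - 1])[:, None], axis=0)
        fwd.append(m / m.sum())             # normalize for numerical stability
    bwd = [None] * n
    bwd[n - 1] = np.ones_like(unary[-1])    # message into the last node
    for t in range(n - 2, -1, -1):
        m = np.max(pairwise[t] * (unary[t + 1] * bwd[t + 1])[None, :], axis=1)
        bwd[t] = m / m.sum()
    # belief at each node = self-probability times incoming messages
    return [int(np.argmax(unary[t] * fwd[t] * bwd[t])) for t in range(n)]
```

With unary terms pulling the first node toward state 0 and the last toward state 1, and a pairwise term favoring staying in the same state, the routine places the state switch where the unary evidence is weakest. (On chains with exactly tied beliefs, per-node argmax may need a tie-breaking rule to yield one consistent joint configuration.)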
III. EXPERIMENTAL RESULTS
To test the effectiveness of the proposed posture sequence
estimation method, we performed experiments on eight sequences, some of which were captured with a camcorder, whereas the others were taken from the Weizmann database  and the Internet. In addition to test sequence 1 shown
in Fig. 9, we used sequences 2 and 3 to evaluate the pro-
posed method. In the experiments, we first removed several
consecutive frames to simulate a real-world situation where
objects in a number of consecutive frames were damaged due
to packet loss. Then, we applied the proposed posture sequence
estimation method to reconstruct the motion of each object. We
also compared the performance of our approach with that of
the approaches in  and . For all the test sequences, the proposed method maintained the motion continuity of the reconstructed motion and yielded better results than the compared approaches. For a subjective performance comparison, readers can find more test sequences and the complete set of test results, including the original videos, the videos after object removal, and the inpainted videos, from our project website .
Fig. 13. Experiments on test sequence 2: (a) some snapshots of the occluded object in the test sequence; (b) frames reconstructed by the approach in ; (c) frames reconstructed by the approach in ; (d) frames reconstructed by the proposed approach; and (e) the inpainting result derived by our approach.
3132 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 20, NO. 11, NOVEMBER 2011
Fig. 14. Experiments on test sequence 3: (a) partial sequence of the test sequence in which the red rectangle indicates the seven missing frames; (b) the frames reconstructed by the approach in ; (c) the frames reconstructed by the approach in ; (d) the frames reconstructed by the proposed approach; and (e) the corresponding trajectory information of predicted object motion generated by the compared approaches.
TABLE III
COMPARISON OF THE GROUND-TRUTH POSTURES AND RECONSTRUCTED MISSING POSTURES (THE PARTS IN BLACK, RED, AND GRAY REPRESENT THE GROUND-TRUTH POSTURES, THE RECONSTRUCTED POSTURES, AND THE PERFECTLY MATCHED PORTIONS, RESPECTIVELY)
In the first experiment, we removed ten of the 245 frames in
test sequence 1. Part of the sequence (28 frames) is shown in
Fig. 12(a). In the figure, the ten frames that we removed are
bounded by the red rectangle. Fig. 12(b)–(d) show the missing
sequence that was reconstructed by applying the approaches
in  and  and ours, respectively, and Fig. 12(e) shows
the corresponding trajectories reconstructed by the three ap-
proaches in the manifold space. Among the trajectories, the red,
blue, yellow, and green colors represent the ground-truth tra-
jectory, and the trajectories reconstructed by the approaches in
 and  and the proposed approach, respectively. We ob-
serve that the trajectory reconstructed by our approach main-
tains the best motion continuity, and it is also the smoothest of
the three trajectories. Because the proposed posture sequence
estimation method is more effective in recovering an object’s
motion and maintaining motion continuity simultaneously, we
conclude that it is more suitable for object inpainting than the compared methods.
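The smoothness comparison above can be quantified. As an illustration only (not the paper's measure), the sketch below scores a 2-D trajectory in the manifold space by its mean turning angle between consecutive displacement vectors; a smoother, more continuous trajectory yields a smaller score.

```python
import math

def smoothness(traj):
    """Mean turning angle (radians) between consecutive displacement
    vectors of a 2-D trajectory; smaller values indicate a smoother,
    more continuous motion. Illustrative metric, not the paper's."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(traj, traj[1:], traj[2:]):
        v1 = (x1 - x0, y1 - y0)
        v2 = (x2 - x1, y2 - y1)
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        n1, n2 = math.hypot(*v1), math.hypot(*v2)
        # Clamp to [-1, 1] to guard against floating-point drift.
        angles.append(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))
    return sum(angles) / len(angles)

straight = [(i, 0.0) for i in range(5)]             # perfectly smooth path
zigzag   = [(0, 0), (1, 1), (2, 0), (3, 1), (4, 0)]  # oscillating path
print(smoothness(straight))                          # 0.0
print(smoothness(zigzag) > smoothness(straight))     # True
```

Under a metric like this, the trajectory with the best motion continuity is simply the one with the lowest mean turning angle.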
Table II details the results of the ground truth and the three
compared methods. The top row shows the sequence of missing
ground-truth postures, and the second, third, and fourth rows
show the missing frames reconstructed by the methods in 
and  and our method, respectively. The black parts of
the figures are the ground-truth postures, the gray parts are
perfectly matched portions, and the red parts belong to recon-
structed postures. We observe that the frames reconstructed by
our method are consistently better than those derived by the other two methods.
In the second experiment, we used test sequence 2, which
contained 100 frames. In the sequence, two people are walking
toward each other, and one person occludes the other in about
20 frames [some of the frames are shown in Fig. 13(a)].
Fig. 13(b)–(d) show the parts of the frames reconstructed by the
methods in  and  and our approach, respectively. From
the reconstructed frames, it is apparent that our approach was
the most effective in recovering the occluded frames. Using the
recovered sequence, our approach yielded the best
inpainting results among the three compared approaches, as
shown in Fig. 13(e).
In the third experiment, we used a video sequence (test se-
quence 3) from the Weizmann database  to evaluate our
method. We removed seven of the 55 frames in the sequence.
Fig. 14(a) shows part of the sequence (21 frames). The seven
frames bounded by the red rectangle were the ones removed be-
fore the experiment. Fig. 14(b)–(d) show the missing frames reconstructed by the approaches in  and  and ours, respectively, and Fig. 14(e) shows the trajectories reconstructed by the three approaches in the manifold space.
Table III details the results of the ground truth and the three compared methods. The top row shows the sequence of missing ground-truth postures, and the second, third, and fourth rows show the missing frames reconstructed by the methods in  and  and our method, respectively. The black parts
of the figures are the ground-truth postures, the gray parts are
perfectly matched portions, and the red portions belong to the
reconstructed postures. Note that the first frame reconstructed by the method in  covers a broad area (the red area above the head); only this method produces such results. In terms of the accuracy of the reconstructed frames, our method reconstructed the most accurate postures overall. However, the method in  reconstructed the most accurate posture in the last of the seven missing frames, with a match rate of 94.3%. The match rates of the postures reconstructed by the method in  and by our method were 67.7% and 77.2%, respectively, compared to the ground truth.
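A match rate of the kind reported above can be computed by comparing binary silhouette masks. The sketch below uses one plausible definition, the perfectly matched (gray) area divided by the union of the two silhouettes; the paper's exact definition may differ, and the masks here are toy data.

```python
import numpy as np

def match_rate(gt_mask, rec_mask):
    """Fraction of the union of the two silhouettes that overlaps
    perfectly (the gray portions in Tables II and III). This is an
    illustrative definition, not necessarily the paper's."""
    union = np.logical_or(gt_mask, rec_mask).sum()
    matched = np.logical_and(gt_mask, rec_mask).sum()
    return matched / union if union else 1.0

gt  = np.array([[0, 1, 1],
                [0, 1, 1],
                [0, 1, 0]], dtype=bool)   # ground-truth posture mask
rec = np.array([[0, 1, 1],
                [1, 1, 0],
                [0, 1, 0]], dtype=bool)   # reconstructed posture mask
print(match_rate(gt, rec))  # 4 matched pixels over a 6-pixel union
```

Averaging such per-frame rates over the missing frames gives a single accuracy number per method, which is how figures like 67.7% and 77.2% can be compared directly.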
As can be observed from the results shown in our demo page
, since the proposed method uses nonoccluded postures
taken from the same video to completely replace the occluded
postures, rather than completing the missing parts of occluded
postures, it can avoid the blurring and deformation artifacts,
which may be produced by patch-based inpainting approaches.
In addition, since in our method the nonoccluded posture se-
quences for training the MRF models are taken from the same
video containing the to-be-inpainted posture sequence, they all
have the same frame rate. Therefore, no additional temporal
scaling or time warping is required for matching different
temporal scales. One shortcoming of our method is that, since it is
an object-based approach, inaccurate object segmentation may
lead to visually unpleasant artifacts.
IV. CONCLUSION
We have proposed a human object inpainting scheme that divides the process into three steps: 1) human posture synthesis; 2) graphical model construction; and 3) posture sequence estimation. In addition, we have defined two constraints on the
motion continuity property. The first constraint sets a threshold
to limit the maximum search distance, and the second confines
the range of the search direction. With the two constraints, the
number of possible candidates between any two consecutive
postures can be significantly reduced. We then apply the MRF
model to perform global matching. The experimental results demonstrate that the proposed approach outperforms two existing state-of-the-art approaches.
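The two constraints can be sketched as a simple candidate-pruning step. The thresholds, coordinates, and function name below are illustrative assumptions, not values from the paper: a candidate survives only if it lies within a maximum search distance of the previous point and within an angular cone around the current motion direction.

```python
import math

def prune_candidates(prev_pt, direction, candidates,
                     max_dist=1.0, max_angle=math.pi / 4):
    """Keep only candidates satisfying the two continuity constraints:
    (1) within a maximum search distance of the previous point, and
    (2) within a cone around the current motion direction.
    Threshold values are illustrative, not the paper's."""
    kept = []
    for pt in candidates:
        dx, dy = pt[0] - prev_pt[0], pt[1] - prev_pt[1]
        dist = math.hypot(dx, dy)
        if dist == 0 or dist > max_dist:        # constraint 1
            continue
        cos_a = (dx * direction[0] + dy * direction[1]) / (
            dist * math.hypot(*direction))
        if math.acos(max(-1.0, min(1.0, cos_a))) <= max_angle:  # constraint 2
            kept.append(pt)
    return kept

cands = [(0.5, 0.0),   # close and straight ahead -> kept
         (0.5, 0.6),   # close but off-direction  -> pruned
         (2.0, 0.0),   # right direction, too far -> pruned
         (-0.5, 0.0)]  # behind the motion        -> pruned
print(prune_candidates((0.0, 0.0), (1.0, 0.0), cands))
```

Shrinking the candidate set this way before running the MRF matching is what makes the global search tractable, since the message vectors exchanged between nodes stay short.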
 K. A. Patwardhan, G. Sapiro, and M. Bertalmío, “Video inpainting
under constrained camera motion,” IEEE Trans. Image Process., vol.
16, no. 2, pp. 545–553, Feb. 2007.
 Y. Shen, F. Lu, X. Cao, and H. Foroosh, “Video completion for per-
spective camera under constrained motion,” in Proc. IEEE Conf. Pattern Recognit., Hong Kong, Aug. 2006, pp. 63–66.
 Y. Wexler, E. Shechtman, and M. Irani, “Space-time completion of
video,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 3, pp.
463–476, Mar. 2007.
 V. Cheung, B. J. Frey, and N. Jojic, “Video epitomes,” in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit., San Diego, CA, Jun. 2005, pp.
 S.-C. S. Cheung, J. Zhao, and M. V. Venkatesh, “Efficient object-based
video inpainting,” in Proc. IEEE Conf. Image Process., Atlanta, GA,
Oct. 2006, pp. 705–708.
 J. Jia, Y.-W. Tai, T.-P. Wu, and C.-K. Tang, “Video repairing under
variable illumination using cyclic motions,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 28, no. 5, pp. 832–839, May 2006.
 C.-H. Ling, C.-W. Lin, C.-W. Su, Y.-S. Chen, and H.-Y. M. Liao, “Virtual contour-guided video object inpainting using posture mapping and retrieval,” IEEE Trans. Multimedia, vol. 13, no. 2, pp. 292–302, Apr. 2011.
 T. Ding, M. Sznaier, and O. I. Camps, “A rank minimization approach
to video inpainting,” in Proc. IEEE Conf. Comput. Vis., Rio de Janeiro,
Brazil, Oct. 2007, pp. 1–8.
 S. T. Roweis and L. K. Saul, “Nonlinear dimensionality reduction by locally linear embedding,” Science, vol. 290, no. 5500, pp. 2323–2326, Dec. 2000.
 L. Wang, W. Hu, and T. Tan, “Recent developments in human motion
analysis,” Pattern Recognit., vol. 36, no. 3, pp. 585–601, Mar. 2003.
 T. B. Moeslund, A. Hilton, and V. Krüger, “A survey of advances in
vision-based human motion capture and analysis,” Comput. Vis. Image
Understand., vol. 104, no. 2/3, pp. 90–126, Nov./Dec. 2006.
 R. Poppe, “Video-based human motion analysis: An overview,” Comput. Vis. Image Understand., vol. 108, no. 1/2, pp. 4–18, 2007.
 F. Caillette, A. Galata, and T. Howard, “Real-time 3-D human body
tracking using variable length Markov models,” in Proc. Brit. Mach.
Vis. Conf., Oxford, U.K., Sep. 2005, pp. 469–478.
 A. M. Elgammal and C.-S. Lee, “Inferring 3D body pose from silhou-
ettes using activity manifold learning,” in Proc. IEEE Conf. Comput.
Vis. Pattern Recognit., Washington, DC, Jun. 2004, pp. 681–688.
inverse kinematics,” ACM Trans. Graph., vol. 23, no. 3, pp. 522–531,
 Y. W. Teh and S. T. Roweis, “Automatic alignment of local representations,” in Proc. Adv. Neural Inf. Process. Syst., Vancouver, BC, Canada, Dec. 2002, vol. 15, pp. 841–848.
 X. Xu, L. Wan, X. Liu, T.-T. Wong, L. Wang, and C.-S. Leung, “Ani-
mating animal motion from still,” ACM Trans. Graph., vol. 27, no. 5,
pp. 1–8, Dec. 2008.
 J. K. Aggarwal and Q. Cai, “Human motion analysis: A review,” in
Proc. Nonrigid Articulated Motion Workshop, Jun. 1997, pp. 90–102.
 D. M. Gavrila, “The visual analysis of human movement: A survey,”
Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82–98, Jan. 1999.
 G. Mori, X. Ren, A. A. Efros, and J. Malik, “Recovering human body
configuration: Combining segmentation and recognition,” in Proc.
IEEE Conf. Comput. Vis. Pattern Recognit., New York, Jun. 2006, pp.
 D. Ramanan and C. Sminchisescu, “Training deformable models for
localizations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Washington, DC, Jun. 2004, pp. 326–333.
 Y.-M. Liang, S.-W. Shih, C.-C. A. Shih, H.-Y. M. Liao, and C.-C.
Lin, “Learning atomic human actions using variable-length Markov
models,” IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 1,
pp. 268–280, Jan. 2009.
 S. Belongie, J. Malik, and J. Puzicha, “Shape matching and object
recognition using shape contexts,” IEEE Trans. Pattern Anal. Mach.
Intell., vol. 24, no. 4, pp. 509–522, Apr. 2002.
 J. B. Tenenbaum, V. de Silva, and J. C. Langford, “A global geometric
framework for nonlinear dimensionality reduction,” Science, vol. 290,
no. 5500, pp. 2319–2323, Dec. 2000.
 Applications, 2nd ed. New York: Springer-Verlag, 2005.
 A. Criminisi, P. Perez, and K. Toyama, “Region filling and object
removal by exemplar-based image inpainting,” IEEE Trans. Image
Process., vol. 13, no. 9, pp. 1200–1212, Sep. 2004.
 J. Sun, L. Yuan, J. Jia, and H.-Y. Shum, “Image completion with struc-
ture propagation,” in Proc. SIGGRAPH, Los Angeles, CA, 2005, pp.
 W. T. Freeman, E. C. Pasztor, and O. T. Carmichael, “Learning low-
level vision,” Int. J. Comput. Vis., vol. 40, no. 1, pp. 25–47, Oct. 2000.
 L. Gorelick, M. Blank, E. Shechtman, M. Irani, and R. Basri, “Actions as space-time shapes,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 12, pp. 2247–2253, Dec. 2007.
 C.-H. Ling, Y.-M. Liang, C.-W. Lin, Y.-S. Chen, and H.-Y. M. Liao,
“Video object inpainting using manifold-based action prediction,” in
 “NTHU Human Object Inpainting Project,” NTHU, Hsinchu,
Taiwan. [Online]. Available: http://www.ee.nthu.edu.tw/cwlin/in-
Chih-Hung Ling received the B.S. and M.S. degrees
in computer science and information engineering
from National Chung Cheng University, Chiayi,
Taiwan, in 2003 and 2005, respectively. He is
currently working toward the Ph.D. degree in the
Department of Computer Science, National Chiao
Tung University, Hsinchu, Taiwan.
His research interests include computer vision, pattern recognition, and multimedia signal processing.
Yu-Ming Liang received the B.S. and M.S. degrees
Taiwan Normal University, Taipei, Taiwan, in 1999
and 2002, respectively, and the Ph.D. degree from
National Chiao Tung University, Hsinchu, Taiwan, in
doctoral Fellow in the Institute of Information Sci-
ence, Academia Sinica, Taipei, Taiwan. Since Feb-
puter Science and Information Engineering, Aletheia
University, Taipei, Taiwan, as an Assistant Professor. His research interests in-
clude computer vision, pattern recognition, and multimedia signal processing.
Chia-Wen Lin (S’94–M’00–SM’04) received the
Ph.D. degree in electrical engineering from National
Tsing Hua University (NTHU), Hsinchu, Taiwan, in
partment of Electrical Engineering, NTHU. He was
with the Department of Computer Science and Infor-
mation Engineering, National Chung Cheng Univer-
to joining academia, from 1992 to 2000, he worked
with the Information and Communications Research
post was as a Section Manager. From April 2000 to August 2000, he was a
Visiting Scholar in the Information Processing Laboratory, Department of Elec-
trical Engineering, University of Washington, Seattle. He has authored or coau-
thored over 90 technical papers. He is the holder of more than 20 patents. His
research interests include video content analysis and video networking.
Dr. Lin served as the Technical Program Cochair of the IEEE International
Conference on Multimedia and Expo (ICME) in 2010 and the Special Session
Cochair of the IEEE ICME in 2009. He is an Associate Editor of the IEEE
TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and the
Journal of Visual Communication and Image Representation. He has served
as a Guest Coeditor of four special issues for the IEEE TRANSACTIONS ON
MULTIMEDIA, the EURASIP Journal on Advances in Signal Processing, and the
Journal of Visual Communication and Image Representation. He was a recip-
ient of the 2001 Ph.D. Thesis Award presented by the Ministry of Education,
Taiwan, the Young Faculty Award presented by CCU in 2005, and the Young
InvestigatorAwardpresentedbythe NationalScienceCouncil,Taiwan,in 2006.
His paper won the Young Investigator Award presented by SPIE VCIP 2005.
Yong-Sheng Chen (M’03) received the B.S. degree
in computer and information science from National
Chiao Tung University, Hsinchu, Taiwan, in 1993
and the M.S. and Ph.D. degrees in computer science
and information engineering from National Taiwan
University, Taipei, Taiwan, in 1995 and 2001,
He is currently an Associate Professor with the
Department of Computer Science, National Chiao
Tung University. His research interests include
biomedical signal processing, medical image pro-
cessing, and computer vision.
Dr. Chen was the recipient of the Best Paper Award in the 2008 Robot Vision
Workshop and the Best Annual Paper Award of the 2008 Journal of Medical
and Biological Engineering.
Hong-Yuan Mark Liao (SM’01) received the
B.S. degree in physics from National Tsing Hua
University, Hsinchu, Taiwan, in 1981, and the M.S.
and Ph.D. degrees in electrical engineering from
Northwestern University, Evanston, IL, in 1985 and
In July 1991, he joined the Institute of Informa-
tion Science, Academia Sinica, Taipei, Taiwan, as an
Assistant Research Fellow. He was promoted to As-
sociate Research Fellow and then Research Fellow in
has been jointly appointed as the Multimedia Information Chair Professor with
National Chung Hsing University, Taichung, Taiwan. In August 2010, he was
appointed asan Adjunct Chair Professor with Chung Yuan Christian University,
also jointly appointed as a Professor with the Department of Computer Science,
NationalChiao Tung University, Hsinchu. Hiscurrent researchinterests include
multimedia signal processing, video-based surveillance systems, content-based
multimedia retrieval, and multimedia protection.
Dr. Liao started to serve as a member of the Information Forensics and Se-
curity Technical Committee of the IEEE Signal Processing Society in January
2010. From 2006 to 2008, he served as the President of the Image Processing
and Pattern Recognition Society of Taiwan. From 2004 to 2007, he served as a
member of the Multimedia Signal Processing Technical Committee of the IEEE
Signal Processing Society. In June 2004, he served as the Conference Cochair
for the 5th International Conference on Multimedia and Exposition (ICME) and
the Technical Cochair for the 8th ICME held in Beijing, China. In 2011, he will
serve as the General Cochair for the 17th International Conference on Multi-
media Modeling. He is on the editorial boards of the IEEE SIGNAL PROCESSING
MAGAZINE, the IEEE TRANSACTIONS ON IMAGE PROCESSING, and the IEEE
TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY. He served as a
Guest Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, Special Issue on Video Surveillance (September 2008). He was
an Associate Editor of the IEEE TRANSACTIONS ON MULTIMEDIA from 1998
to 2001. He was a recipient of the Young Investigators’ Award from Academia
Sinica in 1998, the Distinguished Research Award from the National Science
Council of Taiwan in 2003, the National Invention Award of Taiwan in 2004,
the Distinguished Scholar Research Project Award from the National Science
Council of Taiwan, and the Academia Sinica Investigator Award in 2010.