TEMPORAL STABILIZATION OF VIDEO OBJECT SEGMENTATION FOR 3D-TV APPLICATIONS

Çiğdem Eroğlu Erdem¹, Fabian Ernst², Andre Redert² and Emile Hendriks³

¹ Momentum A.Ş., İstanbul, Turkey
² Philips Research Laboratories, Eindhoven, The Netherlands
³ Faculty of Electrical Engineering, Delft University of Technology, The Netherlands
E-mail: cigdem@ieee.org, {fabian.ernst,andre.redert}@philips.com, E.A.Hendriks@ewi.tudelft.nl
ABSTRACT
We present a method for improving the temporal stability of video object segmentation algorithms for 3D-TV applications. First, two quantitative measures to evaluate temporal stability without ground truth are presented. Then, a pseudo-3D curve evolution method, which spatio-temporally stabilizes the estimated object segments, is introduced. Temporal stability is achieved by re-distributing existing object segmentation errors such that they will be less disturbing when the scene is rendered and viewed in 3D. Our starting point is the hypothesis that if segmentation errors are inevitable, they should be made in a temporally consistent way for 3D-TV applications. This hypothesis is supported by the experiments, which show a significant improvement in segmentation quality, both in terms of the objective quantitative measures and in terms of viewing comfort in subjective perceptual tests. This shows that it is possible to increase the perceived object segmentation quality without increasing the actual segmentation accuracy.
1. INTRODUCTION
Building 3D models of a time-varying scene from the 2D views recorded by uncalibrated cameras is an important but unsolved problem in providing content for the newly emerging 3D TV [1].
One approach to this problem is to segment the objects in the scene
and order their video object planes (VOPs) with respect to their in-
ferred relative depths. This approach gives a satisfactory sense of
three dimensions when the scene is viewed in stereo. However, one
of the most important requirements is the temporal stability of the
video object planes. The changes in video due to occlusions, cam-
era motion, changing background and noise should not cause sud-
den changes (temporal instabilities) in the shape and color compo-
sition of the video object planes (see Fig.1(c)), as they cause very
disturbing flickering effects when the scene is viewed in stereo in
3D TV applications.
Many object segmentation and tracking algorithms exist in the
literature [2]. These algorithms may lose temporal stability under difficult conditions, e.g. when the colors of the object and the background are similar, causing missing object boundaries, or when the motion cannot be estimated with sufficient accuracy. In this paper we try to answer the question: “If object segmentation errors are inevitable, how can we conceal them in our application?” Our approach is based on the hypothesis that if segmentation errors are inevitable, they should be made in a temporally consistent way to increase the viewing comfort in 3D-TV applications. To this effect, we propose a pseudo-3D curve evolution technique, which distributes the existing segmentation errors
such that they will be less visible when the scene is rendered and
viewed in stereo. The input to the proposed algorithm is a set of
temporally unstable object segmentation maps, which can be estimated by any algorithm in the literature, for example by [3].
Fig. 1. (a), (b) First and last frames of “Flikken” sequence. (c)
The given temporally unstable video object planes for the “lady”
object (frames 8, 9, 10, 80, 81) from left to right. (d) Ground-truth
VOPs for frames 8, 80 and 145.
2. MEASURES FOR TEMPORAL STABILITY
Assuming that the color histogram of the object does not change
drastically from frame to frame, we can expect that a temporally
stable object segmentation exhibits small differences between the
color histograms of the estimated video object planes (VOPs) [4].
One shortcoming of the histogram measure is that it cannot distin-
guish if a portion of the object is removed and replaced by another
block of the same color belonging to the background. Therefore,
we can also require that the shape of two successive video object
planes should not differ drastically. Hence, histogram and shape
differences between successive video object planes are two candi-
dates for evaluating the temporal stability of object segmentation.
Histogram Measure: The difference between two histograms can be calculated using the chi-square measure as follows [4]:

\[
d_{\chi^2}(H_{t-1}, H_t) = \sum_{j=1}^{B} \frac{\left[ H_{t-1}(j) - H_t(j) \right]^2}{H_{t-1}(j) + H_t(j)}, \tag{1}
\]
0-7803-8554-3/04/$20.00 ©2004 IEEE.
where H_t and H_{t-1} denote the RGB color histograms of the video object planes at frames t and t−1, and B is the number of bins in the histogram. A prior normalization of the histograms may be necessary (see [4] for details).
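As a concrete illustration, the histogram measure of eq. (1) can be sketched in a few lines of NumPy. The function names, the per-channel bin count, and the joint-RGB binning scheme below are illustrative assumptions, not prescribed by the paper; eq. (1) only fixes the chi-square form, and bins that are empty in both histograms are skipped to avoid division by zero.

```python
import numpy as np

def chi_square_distance(h_prev, h_curr):
    """Chi-square distance between two normalized histograms, eq. (1)."""
    h_prev = np.asarray(h_prev, dtype=float)
    h_curr = np.asarray(h_curr, dtype=float)
    denom = h_prev + h_curr
    valid = denom > 0                       # skip bins empty in both histograms
    return np.sum((h_prev[valid] - h_curr[valid]) ** 2 / denom[valid])

def vop_histogram(frame_rgb, mask, bins_per_channel=8):
    """Normalized joint-RGB color histogram of the pixels inside a VOP mask."""
    pixels = frame_rgb[mask]                # (N, 3) array of object pixels
    hist, _ = np.histogramdd(pixels, bins=bins_per_channel,
                             range=[(0, 256)] * 3)
    return hist.ravel() / max(hist.sum(), 1)
```

A temporally stable segmentation would then yield small values of `chi_square_distance(vop_histogram(frame[t-1], mask[t-1]), vop_histogram(frame[t], mask[t]))` from frame to frame.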
Shape Measure: One way to represent the “shape” of a video
object is to use the turning angle function of the boundary pixels
[5]. The turning angle function (TAF) plots the counter clockwise
angle from the x-axis as a function of the boundary length [5]. Af-
ter obtaining the TAFs belonging to the video objects in successive
frames, which are one-dimensional vectors describing the shapes (denoted by θ_t and θ_{t−1}), the distance between them is calculated as follows:

\[
d(\theta_{t-1}, \theta_t) = \frac{\sum_{j=1}^{K} \left\| \theta_{t-1}(j) - \theta_t(j) \right\|}{2 \pi K}, \tag{2}
\]
where K is the total number of points on the boundary. In order
for this function to be independent of rotation and of the choice of
the starting point, the difference calculation (2) should be repeated
after shifting one of the turning angle functions horizontally and
vertically by increasing amounts, and then the minimum of the
differences should be taken.
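The shape measure can likewise be sketched. The code below assumes both boundaries have already been resampled to the same number K of ordered points; the per-edge angles stand in for the continuous turning angle function of [5], and subtracting the median difference is a simplified stand-in for the vertical-shift search described above. Function names are illustrative assumptions.

```python
import numpy as np

def turning_angles(boundary):
    """Counter-clockwise edge angles from the x-axis along a closed polygon.

    `boundary` is a (K, 2) array of boundary points in order; the result is
    unwrapped so the angles grow continuously along the contour.
    """
    edges = np.roll(boundary, -1, axis=0) - boundary
    return np.unwrap(np.arctan2(edges[:, 1], edges[:, 0]))

def taf_distance(theta_prev, theta_curr):
    """Shape distance of eq. (2), minimized over circular (starting-point)
    shifts of one TAF; the median difference serves as the constant
    (rotation) offset for each shift."""
    K = len(theta_prev)
    best = np.inf
    for s in range(K):                     # try every starting point
        diff = theta_prev - np.roll(theta_curr, s)
        diff = diff - np.median(diff)      # approximate best vertical offset
        best = min(best, np.abs(diff).sum() / (2 * np.pi * K))
    return best
```

Identical boundaries give a distance of zero regardless of where the traversal starts.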
3. TEMPORAL STABILIZATION OF OBJECT
SEGMENTATION MAPS
3.1. Background Theory
Region-based curve evolution techniques have been used for im-
age segmentation in the literature [6, 7], where the region to be
segmented can be characterized by a predetermined set of distinct
features such as mean, variance, and texture, which may be in-
ferred from the data.
A simple image segmentation problem is the case where there are just two types of regions in the image. Starting with an arbitrary initialization, the curve C is evolved in such a way that it will eventually snap to the desired object boundary R. The reader should refer to [6] for details. Let us parameterize the curve as C(s, t) = [x(s, t), y(s, t)]. The aim is to minimize the following energy function:

\[
E = -\frac{1}{2}(u - v)^2 + \alpha \oint_{C} ds, \tag{3}
\]
where the parameters u and v denote the mean gray-level intensities inside and outside the curve C, and the second term is the length of the curve weighted by a constant α. Our aim is to move every point on the curve in the negative direction of the energy gradient. After some manipulations (see [6]), the equation describing the motion of the curve is obtained as follows:

\[
\frac{d C(s, t)}{dt} = f(x, y)\,\vec{N} - \alpha \kappa\,\vec{N}, \tag{4}
\]
\[
f(x, y) = (u - v) \left( \frac{I(x, y) - u}{A_u} + \frac{I(x, y) - v}{A_v} \right), \tag{5}
\]

which tells us to move each boundary point in a direction parallel to the normal vector to the boundary at that point, with a speed derived from the image statistics and the curvature κ of the boundary at that point. In the above equations, I(x, y) denotes a pixel intensity, and A_u, A_v denote the areas inside and outside the curve. In [7], a polygonal implementation of the above curve evolution equation has been presented, which makes the implementation easier and faster; it has been adopted and generalized to pseudo-3D in this paper.
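As a minimal illustration of the speed function (5), the NumPy sketch below evaluates f(x, y) over a gray-level image given a current binary region estimate; the polygonal vertex updates of [7] are not reproduced here, and the function name is an illustrative assumption.

```python
import numpy as np

def region_speed(image, mask):
    """Speed function f(x, y) of eq. (5) for a two-region model.

    `mask` is the current estimate of the object region; u, v are the mean
    gray levels inside and outside the curve, and A_u, A_v the areas.
    Positive speed pushes the boundary outward at pixels that look more
    like the object; negative speed pulls it inward at background-like pixels.
    """
    image = np.asarray(image, dtype=float)
    inside = np.asarray(mask, dtype=bool)
    a_u, a_v = inside.sum(), (~inside).sum()
    u, v = image[inside].mean(), image[~inside].mean()
    return (u - v) * ((image - u) / a_u + (image - v) / a_v)
```

In the polygonal scheme, this speed is interpolated along the edges adjacent to each vertex and applied along the outward normals, together with the curvature term of eq. (4).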
3.2. Pseudo-3D Generalization of Curve Evolution
Given a set of temporally unstable video object segmentation maps,
we first stack them together so that a three-dimensional “object
blob” in x-y-t space is formed (see Fig. 2 (a)). We propose to
improve the temporal stability of this “object blob” by smoothing
its surface using a surface evolution approach. If a polygonal sur-
face is initialized so that it includes this “object blob”, and if it is
allowed to evolve so as to minimize its energy (3), it will eventu-
ally converge to a smoothed version of the 3D object volume. The
smoothing effect is expected both due to the curvature term, which
tries to make the surface as smooth as possible, and also due to the
fact that the evolving surface is represented by polygonal patches
which leaves out high curvature segments.
This 3D smoothing approach can be converted into a combina-
tion of simpler 2D smoothing steps by considering different cross
sections (slices) of the “object blob” in x-y-t space. If we apply
the curve evolution equation (4) to the segmentation maps in the
x-y domain (at each t value), we can achieve spatial smoothness.
In order to achieve temporal stability, we apply the curve evolution
technique for each x-t and y-t cross section (slice) of the “object
blob” iteratively as follows:
\[
O^{n+1} = P_{yt}\big( P_{xt}\big( P_{xy}(O^{n}) \big) \big), \tag{6}
\]

where O^n denotes the “object blob” at iteration n and P_{yt} denotes the processing of each y-t cross-section of the “object blob” using a polygonal representation for C_{yt}(s, t):

\[
\frac{\partial V_{yt}(k)}{\partial t} = \tilde{f}_{k,k-1}\,\vec{N}_{k,k-1} + \tilde{f}_{k+1,k}\,\vec{N}_{k+1,k} - \alpha \kappa\,\vec{N}_b, \tag{7}
\]

where V_{yt}(k) denotes a vertex on the polygonal boundary in the y-t cross-section of the “object blob”, and \tilde{f}_{k,k-1} and \vec{N}_{k,k-1} denote the interpolated speed function (4) and the outward normal vector along the line connecting the vertices V(k) and V(k−1), respectively. The functions P_{xt} and P_{xy} are defined similarly.
This idea is illustrated in Fig. 2 (a), where the horizontal rect-
angle shows the x-t cross section and the vertical rectangle shows
the y-t cross section. By using this pseudo-3D approach, we can
obtain spatio-temporally stable object segmentation by processing
the x-y, x-t and y-t slices of the “object volume” iteratively, until the shape converges. The order of processing of the x-y, x-t and y-t slices does not produce significant changes in the experimental results.
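The iteration of eq. (6) can be sketched directly on a stacked boolean volume. The 3×3 majority filter below is only a stand-in for the 2-D polygonal curve evolution of eq. (7), chosen so the example stays self-contained and still removes thin, high-curvature protrusions; the slicing and the P_yt(P_xt(P_xy(·))) ordering follow the text, and all names are illustrative assumptions.

```python
import numpy as np

def smooth_slice_2d(sl):
    """Stand-in for one 2-D curve-evolution pass on a binary slice:
    a 3x3 majority filter that shaves off high-curvature protrusions.
    (The paper uses the polygonal curve evolution of eq. (7) here.)"""
    p = np.pad(sl.astype(int), 1, mode='edge')
    votes = sum(p[1 + dy:p.shape[0] - 1 + dy, 1 + dx:p.shape[1] - 1 + dx]
                for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return votes >= 5          # keep a pixel only if most of its 3x3 patch is set

def pseudo_3d_smooth(blob, n_iter=5):
    """Iterate eq. (6), O^{n+1} = P_yt(P_xt(P_xy(O^n))), on a (t, y, x)
    boolean "object blob" until the shape no longer changes."""
    O = blob.astype(bool)
    for _ in range(n_iter):
        prev = O
        O = np.stack([smooth_slice_2d(O[t]) for t in range(O.shape[0])])              # P_xy
        O = np.stack([smooth_slice_2d(O[:, y]) for y in range(O.shape[1])], axis=1)   # P_xt
        O = np.stack([smooth_slice_2d(O[:, :, x]) for x in range(O.shape[2])], axis=2)  # P_yt
        if np.array_equal(O, prev):
            break
    return O
```

A one-frame flicker pixel is removed by the x-t and y-t passes (or, here, already by the spatial pass), while the temporally persistent body of the object survives.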
Sometimes the y-t or x-t cross sections of the “object blob” do
not consist of a single connected group of black pixels as can be
observed in Fig. 5 (a), both due to the oscillatory (direction chang-
ing) motion of the object, and the natural topology of the object.
The effect of the motion can be eliminated by motion compensat-
ing the binary object segmentation maps to align them with respect
to the first frame. This transforms the 3D “object volume” into
a more uniform block, thus minimizing the number of separate
black regions in any y-t or x-t cross-section. If multiple discon-
nected black blobs still exist after motion compensation because
of the natural topology of the object, the curve evolution has to be
applied for each disconnected region of significant size. The over-
all flowchart of the proposed pseudo-3D smoothing algorithm is
given in Fig. 2 (b).
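The motion-compensation step can be sketched as follows. The paper only requires aligning the binary maps with respect to the first frame; the purely translational, centroid-based model below is an illustrative assumption (a crude stand-in, not the paper's motion estimator), but it shows how alignment merges the disconnected cross-section regions caused by object motion.

```python
import numpy as np

def align_masks_by_centroid(masks):
    """Translational motion compensation of binary VOP masks: shift each
    mask so its centroid coincides with that of the first frame.
    Returns the aligned masks and the integer (dy, dx) shifts applied."""
    masks = [np.asarray(m, dtype=bool) for m in masks]
    ref_cy, ref_cx = np.argwhere(masks[0]).mean(axis=0)
    aligned, shifts = [], []
    for m in masks:
        cy, cx = np.argwhere(m).mean(axis=0)
        dy, dx = int(round(ref_cy - cy)), int(round(ref_cx - cx))
        aligned.append(np.roll(np.roll(m, dy, axis=0), dx, axis=1))  # wrap-around shift
        shifts.append((dy, dx))
    return aligned, shifts
```

After smoothing the aligned volume, the stored shifts would be inverted to map the stabilized masks back to their original positions.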
4. EXPERIMENTAL RESULTS
The proposed pseudo-3D temporal stabilization algorithm is tested
on the “Flikken” sequence (see Fig. 1), which is an extract from a
Fig. 2. (a) The illustration of spatio-temporal smoothing. The
gray-shaded regions stacked one after another represent the object
segmentation maps in each frame. (b) The flowchart of the spatio-
temporal smoothing using curve evolution.
TV movie. The segmentation of the objects in this realistic sequence is particularly difficult since the object and background colors are quite similar. The initial segmentation of the car, the walking lady, and the man objects is carried out using the algorithms of [8, 3]. The results on 168 frames of the “walking lady” object are presented here. In Fig. 3, the smoothing results for several frames in the x-y domain are provided. The top row shows the given temporally unstable object segmentation maps and the bottom row shows the object segmentation maps after convergence of the curve. The weight of the curvature term in (3) is selected as α = 0.4 (determined experimentally). We can observe that unwanted high-curvature parts and missegmented background regions are eliminated easily. In Fig. 4 (a), an x-t cross-section of the “lady” object for a fixed y value is shown (after processing in the x-y domain). The bottom figure shows the result of x-t curve evolution. We can see in Fig. 4 (a) that the elimination of the high-curvature part in the x-t domain corresponds to the elimination of the missegmented background pixels in the x-y domain, which are marked by the horizontal line in Fig. 4 (b). Fig. 4 (b) shows the segmentation map of frame 111 in the spatial (x-y) domain before and after x-t processing. In Fig. 5 (a), a y-t cross-section is given for a fixed x value. Two disconnected groups of black regions can be seen due to the motion of the lady, who first walks towards the left and then towards the right. Motion compensation is utilized to make the cross-sections more aligned, as seen in Fig. 5 (b). Fig. 5 (c) shows the y-t smoothing results. We can see that some high-curvature lines, corresponding to the legs of the lady, are eliminated, which actually introduces a loss of segmentation accuracy. However, this loss is not noticeable when the scene is viewed in 3D. The effect of y-t smoothing in the spatial domain is shown in Fig. 5 (d), where the temporal instability caused by the legs is eliminated.

In Fig. 6, several frames of the Flikken sequence are shown after applying the complete spatio-temporal smoothing algorithm. We can see from the bottom row that the smoothed results do not display sudden changes as compared to the top row, which implies better temporal stability. Although the accuracy of segmentation decreases in several frames after temporal stabilization, the overall decrease in segmentation accuracy over the 168 frames was marginal (a 3% increase in the average number of missegmented pixels). Hence, it is possible to increase the quality of object segmentation without decreasing the segmentation errors, as explained below.
Fig. 3. Processing in the x-y domain: (Top Row) The original
segmentation maps for frames 5, 9, 20, 85, 102, and 110. (Bottom
Row) The results after processing in the x-y domain.
Fig. 4. (a) Processing in the x-t domain: The x-t cross-section
across 168 frames of the segmentation maps of the “lady” object
before and after x-t smoothing. (b) Effects of x-t processing as
observed in the x-y domain.
Objective Evaluation of the Results: In order to quantify
the improvement in the temporal stability of the smoothed video
object planes, they are evaluated using the histogram and shape
measures, which were discussed in Section 2.
In Fig. 7(a), the plot of the histogram measure versus the frame
number is given for the “lady” object, where large peaks at frame
numbers such as 9, 10, 47, 48, . . . correctly signal the frames where
a large portion of the object has been removed from or added to
the video object plane (see Fig. 1). Therefore, the histogram dif-
ference measure is a good indicator of the instants where we loose
temporal stability.
In Fig. 7 (b), the plot of the histogram difference measure is
given for the temporally stable video object planes, after process-
ing with the proposed algorithm. If we compare the two plots (a)
and (b), we can see that most of the peaks have been eliminated.
Table 1 summarizes the ratios of the means and variances of the two plots,
as well as the scores for the shape measure. We can observe that
the mean and the variance of the histogram and shape measures
are considerably smaller after spatio-temporal smoothing, indicat-
ing that the segmentation maps are more temporally stable.
Subjective (Perceptual) Evaluation of the Results: In order
to see whether the proposed temporal stabilization algorithm im-
proved the quality of 3D viewing in 3D-TV applications, we also
carried out a set of perceptual evaluation tests. The depth infor-
mation is added to a given 2D video sequence by segmenting the
objects in the scene and then placing each object at a different inferred depth [3]. Then, the left and right views are rendered, using a simple first-order extrapolation method for the disoccluded areas. Finally, the left and right sequences are displayed to the viewer using a set-up with glasses.
The objects in the Flikken sequence were also hand-segmented
to obtain a reference (R) segmentation, with which the scenes obtained by the unstable (U) and stable (S) object segmentation results are compared.

Fig. 5. Processing in the y-t domain: (a) A y-t cross-section of the “lady” object for a fixed x value. (b) Two y-t cross-sections after motion compensation. (c) The y-t cross-sections after y-t processing. (d) Effects of y-t domain smoothing as observed in the x-y domain for frames 49, 50 and 51.

Fig. 6. Top Row: Original video object planes for frames 0, 50, 100 and 150. Bottom Row: The same frames after temporal stabilization.

During the perceptual tests, an observer was
shown two stereo sequences A and B one after another. The se-
quences A and B can be one of the three cases R, U and S, giving
us a total of nine combinations, named as Test1 - Test9. The ob-
server was asked to select one of the choices: “B is significantly
worse / slightly worse / the same as / slightly better / significantly
better than sequence A”. The five options are assigned the scores -2 to 2 from left to right, respectively.
The perceptual evaluation results for fourteen observers are summarized in Table 2. The tests where the two compared sequences A and B are exactly the same (such as UU, RR, SS) are used for checking the reliability of the tests, since they should have an average score of zero. The average score of the tests that compare S and U is 0.52, which indicates that the stabilized results (S) are perceived as better than the unstable results (U) when viewed in 3D. The average scores in Table 2 also indicate a quality ordering of the three cases as g(R) > g(S) > g(U), where g(·) denotes the perceived quality of the rendered sequence.

Fig. 7. The histogram difference measure between successive VOPs of the “lady” object before (a) and after (b) temporal stabilization, versus frame number.

                        Histogram Measure       Shape Measure
                        Mean       Var          Mean      Var
Before smoothing        11.52      696.76       38.87     158.69
After smoothing          1.64        3.83        9.90      36.22
Ratio (Before/After)     7          182          3.9        4.4

Table 1. The objective evaluation scores for the “lady” object before and after temporal stabilization. Histogram means and variances have been scaled by 10³ and 10⁶, respectively.

             Tests 1-2     Tests 3-4    Tests 5-7     Tests 8-9
AB pairs     −RU, UR       SR, −RS      UU, RR, SS    −SU, US
Av. Score    1.05          0.59         0.08          0.52

Table 2. Subjective evaluation scores for the Flikken sequence.
5. CONCLUSIONS AND FUTURE WORK
Obtaining temporally stable video object segmentation maps is im-
portant for comfortable viewing in 3D TV applications. In this pa-
per, a pseudo-3D region-based curve evolution technique for tem-
porally stabilizing a set of estimated video object planes has been
introduced. It has been shown by experiments that the proposed
algorithm significantly improves the temporal stability in terms of
two quantitative objective measures based on histogram and shape
differences. Subjective evaluation tests indicate that there is an
improvement in the perceived quality of the scene when viewed in
3D, which also validates the effectiveness of the proposed quan-
titative measures. The experiments support our initial hypothesis
that if there are inevitable object segmentation errors, they should
be re-distributed in a temporally stable way. Hence, we conclude
that it is possible to increase the object segmentation quality with-
out increasing the segmentation accuracy. An object segmentation
algorithm which optimizes the temporal stability measures directly
is under development.
6. REFERENCES
[1] M. Op de Beeck and A. Redert, “Three dimensional video for the home,” in Proc. Int. Conf. on Augmented Virtual Environments and Three-Dimensional Imaging, 2001, pp. 188–191.
[2] D. Zhang and G. Lu, “Segmentation of moving objects in image sequences: A review,” Circuits, Systems and Signal Processing, vol. 20, no. 2, pp. 143–183, 2001.
[3] F. Ernst, “2D-to-3D video conversion based on time-consistent segmentation,” in Proc. ICOB’03 Workshop, 2003.
[4] C. E. Erdem, B. Sankur, and A. M. Tekalp, “Performance measures for video object segmentation and tracking,” IEEE Transactions on Image Processing, vol. 13, no. 7, 2004.
[5] E. M. Arkin, L. P. Chew, D. P. Huttenlocher, K. Kedem, and J. S. B. Mitchell, “An efficiently computable metric for comparing polygonal shapes,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, pp. 209–215, 1991.
[6] A. Yezzi, A. Tsai, and A. Willsky, “A fully global approach to image segmentation via coupled curve evolution equations,” Journal of Visual Communication and Image Representation, vol. 13, pp. 195–216, 2002.
[7] G. Unal, H. Krim, and A. Yezzi, “A vertex-based representation of objects in an image,” in Proc. IEEE International Conference on Image Processing (ICIP), 2002, vol. 1, pp. 896–899.
[8] F. Ernst, P. Wilinski, and K. van Overveld, “Dense structure-from-motion: An approach based on segment matching,” in Proc. European Conference on Computer Vision, 2002.
... To the best of our knowledge, the literature on identifying segmentation errors in a track is relatively limited . For instance, Erdem et al. [1] have tried to identify and overcome segmentation errors for a 3D television application to improve the temporal stability of object segmentation , rather than identify and remove errors. They achieved their aim by minimising changes in the global colour histogram and turning angle function of the boundary pixels of the segmented object in each frame to maximise temporal stability. ...
... MCR are essentially colour histograms in the joint R, G, B space, built with sparse bins whose position and number is adjusted to fit the pixel distribution . Instead of just using a global colour feature, as in [1] and [5], we propose to add two extra colour features relating to the upper and lower clothing colours of a person. These features are chosen to represent the often different colours of the clothing on the upper torso, and those on the Extracting the MCR for each of these three colour features utilises the same process, but analyses different spatial components of the appearance of the segmented object. ...
Conference Paper
This paper presents a method to identify frames with significant segmentation errors in an individual's track by analysing the changes in appearance and size features along the frame sequence. The features used and compared include global colour histograms, local histograms and the bounding box' size. Experiments were carried out on 26 tracks from 4 different people across two cameras with differing illumination conditions. By fusing two local colour features with a global colour feature, probabilities of segmentation error detection as high as 83 percent of human expert-identified major segmentation errors are achieved with false alarm rates of only 3 percent. This indicates that the analysis of such features along a track can be useful in the automatic detection of significant segmentation errors. This can improve the final results of many applications that wish to use robust segmentation results from a tracked person.
... Unlike the traditional segmentation algorithm that aims to extract some uniform and homogeneous regions (like texture or color), recent segmentation algorithms can be defined as a process to separate an image into meaningful objects according to some specified semantics. To satisfy the coming content-based multimedia services [1], segmentation of meaningful objects in unsupervised manner is urgently required in the real-world scenes. But for an arbitrary scene (e.g., dynamic background ), fully automatic object segmentation is still a monumental challenge to the state-of-the-art techniques [2]-[4] due to a wide variety of possible objects' combination. ...
Article
In this paper, we propose an automatic human body segmentation system which mainly consists of human body detection and object segmentation. Firstly, an automatic human body detector is designed to provide hard constraints on the object and background for segmentation. And a coarse-to-fine segmentation strategy is employed to deal with the situation of partly detected object. Secondly, background contrast removal (BCR) and self-adaptive initialization level set (SAILS) are proposed to solve the tough segmentation problems of the high contrast at object boundary and/or similar colors existing in the object and background. Finally, an object updating scheme is proposed to detect and segment new object when it appears in the scene. Experimental results demonstrate that our body segmentation system works very well in the live video and standard sequences with complex background.
... The Major Colour Representation (MCR) used in this paper to define the colour features extends the method previously developed in [2]. We propose to add two extra colour features relating to the upper and lower clothing colours of an individual to the global colours used in [2, 3, 5]. These features are chosen to represent the often different colours of the clothing on the upper torso, and those on the legs. ...
Conference Paper
This paper presents a framework based on robust shape and appearance features for matching the various tracks generated by a single individual moving within a surveillance system. Each track is first automatically analysed in order to detect and remove the frames affected by large segmentation errors and drastic changes in illumination. The object's features computed over the remaining frames prove more robust and capable of supporting correct matching of tracks even in the case of significantly disjointed camera views. The shape and appearance features used include a height estimate as well as illumination-tolerant colour representation of the individual's global colours and the colours of the upper and lower portions of clothing. The results of a test from a real surveillance system show that the combination of these four features can provide a probability of matching as high as 91 percent with 5 percent probability of false alarms under views which have significantly differing illumination levels and suffer from significant segmentation errors in as many as 1 in 4 frames.
Article
Digital video content analysis is an important item for multimedia content-based indexing (MCBI), content-based video retrieval (CBVR) and visual surveillance systems. There are some frequently-used generic object detection and/or tracking (D&T) algorithms in the literature, such as Background Subtraction (BS), Continuously Adaptive Mean Shift (CMS), Optical Flow (OF) and etc. An important problem for performance evaluation is the absence of stable and flexible software for comparison of different algorithms. This software is able to compare them with the same metrics in real-time and at the same platform. In this paper, we have designed and implemented the software for the performance comparison and the evaluation of well-known video object D&T algorithms (for people D&T) at the same platform. The software works as an automatic and/or semi-automatic test environment in real-time, which uses the image and video processing essentials, e.g. morphological operations and filters, and ground-truth (GT) XML data files, charting/plotting capabilities and etc.
Chapter
Tracking the movements of people within large video surveillance systems is becoming increasingly important in the current security conscious environment. Such system-wide tracking is based on algorithms for tracking a person within a single camera, which typically operate by extracting features that describe the shape, appearance and motion of that person as they are observed in each video frame. These features can be extracted then matched across different cameras to obtain global tracks that span multiple cameras within the surveillance area. In this chapter, we combine a number of such features within a statistical framework to determine the probability of any two tracks being made by the same individual. Techniques are presented to improve the accuracy of the features. These include the application of spatial or temporal smoothing, the identification and removal of significant feature errors, as well as the mitigation of other potential error sources, such as illumination. The results of tracking using individual features and the combined system-wide tracks are presented based upon an analysis of people observed in real surveillance footage. These show that software operating on current camera technology can provide significant assistance to security operators in the system-wide tracking of individual people.
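The temporal smoothing and feature-error removal described above can be sketched for a single per-frame feature such as a height estimate. This is a minimal illustration only: the median/MAD outlier test, the window size, and the function name are assumptions, not the chapter's actual implementation.

```python
import statistics

def smooth_height_track(heights, window=5, outlier_sigma=2.0):
    """Suppress gross per-frame errors in a height-estimate track,
    then apply a sliding-median temporal smoother.
    heights: one estimated height per frame."""
    med = statistics.median(heights)
    # robust spread estimate via the median absolute deviation (MAD)
    mad = statistics.median(abs(h - med) for h in heights) or 1e-9
    # frames whose estimate deviates too far are treated as
    # segmentation errors and replaced by the track median
    kept = [h if abs(h - med) / (1.4826 * mad) < outlier_sigma else med
            for h in heights]
    half = window // 2
    return [statistics.median(kept[max(0, i - half):i + half + 1])
            for i in range(len(kept))]
```

A track that is constant apart from one corrupted frame comes out constant after smoothing, which is the behaviour the chapter relies on before matching tracks across cameras.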
Article
Two new region-based methods for video object tracking using active contours are presented. The first method is based on the assumption that the color histogram of the tracked object is nearly stationary from frame to frame. The proposed method is based on minimizing the color histogram difference between the estimated objects at a reference frame and the current frame using a dynamic programming framework. The second method is defined for scenes where there is an out-of-focus blur difference between the object of interest and the background. In such scenes, the proposed “defocus energy” can be utilized for automatic segmentation of the object boundary, and it can be combined with the histogram method to track the object more efficiently. Experiments demonstrate that the proposed methods are successful in difficult scenes with significant background clutter.
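The first method's histogram criterion can be illustrated with a minimal sketch of the energy being minimized. The L1 bin metric, the 8-bins-per-channel quantization, and the function names are assumptions; the dynamic-programming contour optimization that actually drives the tracker is not shown.

```python
def color_histogram(pixels, bins=8):
    """Normalized RGB histogram; pixels is an iterable of
    (r, g, b) tuples with channel values in [0, 255]."""
    hist = [0] * (bins ** 3)
    n = 0
    for r, g, b in pixels:
        i = ((r * bins // 256) * bins * bins
             + (g * bins // 256) * bins
             + (b * bins // 256))
        hist[i] += 1
        n += 1
    return [h / n for h in hist] if n else hist

def histogram_difference(ref_pixels, cur_pixels, bins=8):
    """L1 distance between the reference-frame and current-frame
    object histograms; a histogram-based tracker seeks the
    contour that minimizes this value."""
    h1 = color_histogram(ref_pixels, bins)
    h2 = color_histogram(cur_pixels, bins)
    return sum(abs(a - b) for a, b in zip(h1, h2))
```

Identical pixel sets yield a difference of 0, while objects with fully disjoint colors yield the maximum value of 2 for normalized histograms.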
Conference Paper
Selecting a general feature for object representation in unconstrained videos is a challenge. An object detection method that is robust to target rotation and scale is proposed, based on the histogram feature and particle swarm optimization. First, the characteristics of the histogram are presented and its merits are analysed. To overcome the computational cost of pixel-by-pixel searching, particle swarm optimization (PSO) is employed. The flowchart of the target detection algorithm using the histogram and PSO is then described. The experimental results show that the histogram feature offers robustness and efficiency for target detection, and that the computation can be reduced thanks to the performance of PSO.
Conference Paper
This paper introduces a multiple-target detection method that combines the histogram feature with the Imperialist Competitive Algorithm (ICA). We use the histogram feature because it is robust to target rotation and scale. To overcome the computational cost of pixel-by-pixel searching, ICA is employed. A further advantage of ICA is that if several targets exist in an image or frame, all of them can be detected simultaneously. A threshold is applied to remove weak empires belonging to objects that are merely similar to the targets; the remaining empires are then clustered by distance, and the most powerful empire of each cluster is selected as one of the targets, so that all targets present in the frame are detected. Finally, we compare ICA with PSO (Particle Swarm Optimization) and show that ICA is faster and more accurate than PSO for target detection.
Article
Full-text available
Indexing deals with the automatic extraction of information with the objective of automatically describing and organizing the content. In a video stream, different types of information can be considered semantically important. Since the most relevant information is usually linked to the presence of moving foreground objects, their number, shape and appearance can constitute a good means for content description. For this reason, we propose to combine motion information with region-based color segmentation to extract moving objects from an MPEG-2 compressed video stream, initially considering only low-resolution data. This approach, which we refer to as "rough indexing," consists in processing P-frame motion information first, and then performing I-frame color segmentation. Since many details can be lost due to the low-resolution data, a novel spatiotemporal filter, built on a quadric surface modeling the object trace over time, has been developed to improve the object detection results. This method effectively corrects possible earlier detection errors without heavily increasing the computational effort.
Conference Paper
Full-text available
For 3-D video applications, dense depth maps are required. We present a segment-based structure-from-motion technique. After image segmentation, we estimate the motion of each segment. With knowledge of the camera motion, this can be translated into depth. The optimal depth is found by minimizing a suitable error norm, which can handle occlusions as well. This method combines the advantages of motion estimation on the one hand, and structure-from-motion algorithms on the other hand. The resulting depth maps are pixel-accurate due to the segmentation, and have a high accuracy: depth differences corresponding to motion differences of 1/8th of a pixel can be recovered.
Article
Full-text available
Model-based recognition is concerned with comparing a shape A, which is stored as a model for some particular object, with a shape B, which is found to exist in an image. If A and B are close to being the same shape, then a vision system should report a match and return a measure of how good that match is. To be useful this measure should satisfy a number of properties, including: (1) it should be a metric, (2) it should be invariant under translation, rotation, and change-of-scale, (3) it should be reasonably easy to compute, and (4) it should match our intuition (i.e., answers should be similar to those that a person might give). We develop a method for comparing polygons that has these properties. The method works for both convex and nonconvex polygons and runs in time O(mn logmn) where m is the number of vertices in one polygon and n is the number of vertices in the other. We also present some examples to show that the method produces answers that are intuitively reasonable.
Conference Paper
Full-text available
We describe the goals of the ATTEST project, which started in March 2002 as part of the Information Society Technologies (IST) programme, sponsored by the European Commission. In the 2-year project, several industrial and academic partners cooperate towards a flexible, 2D-compatible and commercially feasible 3D-TV system for broadcast environments. An entire 3D-video chain will be developed. We discuss the goals for content creation, coding, transmission, display and the central role that human 3D perception research will play in optimizing the entire chain. The goals include the development of a new 3D camera, algorithms to convert existing 2D-video material into 3D, a 2D-compatible coding and transmission scheme for 3D video using MPEG-2/4/7, and two new autostereoscopic displays. With the combination of industrial and academic partners and the technological progress obtained from earlier 3D projects, we expect to achieve the ATTEST goal of developing the first commercially feasible European 3D-TV broadcast system.
Article
Full-text available
We propose measures to evaluate quantitatively the performance of video object segmentation and tracking methods without ground-truth (GT) segmentation maps. The proposed measures are based on spatial differences of color and motion along the boundary of the estimated video object plane and temporal differences between the color histogram of the current object plane and its predecessors. They can be used to localize (spatially and/or temporally) regions where segmentation results are good or bad; and/or they can be combined to yield a single numerical measure to indicate the goodness of the boundary segmentation and tracking results over a sequence. The validity of the proposed performance measures without GT have been demonstrated by canonical correlation analysis with another set of measures with GT on a set of sequences (where GT information is available). Experimental results are presented to evaluate the segmentation maps obtained from various sequences using different segmentation approaches.
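The temporal part of such ground-truth-free evaluation can be sketched as a per-frame stability score on the estimated object plane. The histogram-intersection metric, the gray-level quantization, and the function name are illustrative assumptions rather than the paper's exact measures.

```python
def temporal_stability_scores(frames, bins=16):
    """Per-frame goodness score without ground truth: histogram
    intersection between the current object plane's gray-level
    histogram and its predecessor's (1.0 = perfectly stable).
    frames: one list of object-pixel gray values per frame."""
    def hist(vals):
        h = [0] * bins
        for v in vals:
            h[v * bins // 256] += 1
        n = len(vals) or 1
        return [x / n for x in h]
    scores = []
    for prev, cur in zip(frames, frames[1:]):
        h1, h2 = hist(prev), hist(cur)
        scores.append(sum(min(a, b) for a, b in zip(h1, h2)))
    return scores
```

A sudden drop in the score localizes frames where the estimated object plane changed abruptly, which is exactly where temporal segmentation errors are suspected.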
Article
Segmentation of objects in image sequences is very important in many aspects of multimedia applications. In second-generation image/video coding, images are segmented into objects to achieve efficient compression by coding the contour and texture separately. As the purpose is to achieve high compression performance, the objects segmented may not be semantically meaningful to human observers. The more recent applications, such as content-based image/video retrieval and image/video composition, require that the segmented objects be semantically meaningful. Indeed, the recent multimedia standard MPEG-4 specifies that a video is composed of meaningful video objects. Although many segmentation techniques have been proposed in the literature, fully automatic segmentation tools for general applications are currently not achievable. This paper provides a review of this important and challenging area of segmentation of moving objects. We describe common approaches including temporal segmentation, spatial segmentation, and the combination of temporal-spatial segmentation. As an example, a complete segmentation scheme, which is an informative part of MPEG-4, is summarized.
Conference Paper
Novel polygon evolution models are introduced in this paper for capturing polygonal object boundaries in images containing one or more objects whose intensity values follow statistically different distributions. The key idea in our approach is to design evolution equations for the vertices of a polygon that integrate both local and global image characteristics. Our method naturally provides an efficient representation of an object through a small number of vertices, which also leads to a significant amount of compression of image content. This methodology can effectively be used in the context of MPEG-7. We also propose the use of the Jensen-Shannon criterion as an information measure between the densities of image regions to capture more general statistical characteristics of the data.
Article
In this paper, we develop a novel region-based approach to snakes designed to optimally separate the values of certain image statistics over a known number of region types. Multiple sets of contours deform according to a coupled set of curve evolution equations derived from a single global cost functional. The resulting active contour model, in contrast to many other edge and region based models, is fully global in that the evolution of each curve depends at all times upon every pixel in the image and is directly coupled to the evolution of every other curve regardless of their mutual proximity. As such evolving contours enjoy a very wide “field of view,” endowing the algorithm with a robustness to initial contour placement above and beyond the significant improvement exhibited by other region based snakes over earlier edge based snakes.
Article
A method for comparing polygons that is a metric, invariant under translation, rotation, and change of scale, reasonably easy to compute, and intuitive is presented. The method is based on the L 2 distance between the turning functions of the two polygons. It works for both convex and nonconvex polygons and runs in time O ( mn log mn ), where m is the number of vertices in one polygon and n is the number of vertices in the other. Some examples showing that the method produces answers that are intuitively reasonable are presented
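A minimal sketch of the turning-function comparison follows. It fixes the starting vertex and reference orientation, so the paper's O(mn log mn) optimization over rotations and reference-point shifts is omitted, and the sampling resolution is an assumption.

```python
import math

def turning_function(poly, samples=256):
    """Sample the cumulative turning angle of a closed polygon on a
    uniform grid over its normalized arc length. poly is a list of
    (x, y) vertices; translation and scale do not affect the result."""
    n = len(poly)
    edges = [(poly[(i + 1) % n][0] - poly[i][0],
              poly[(i + 1) % n][1] - poly[i][1]) for i in range(n)]
    lengths = [math.hypot(dx, dy) for dx, dy in edges]
    # cumulative turning angle at the start of each edge
    angles, theta = [], 0.0
    for i in range(n):
        angles.append(theta)
        a0 = math.atan2(edges[i][1], edges[i][0])
        a1 = math.atan2(edges[(i + 1) % n][1], edges[(i + 1) % n][0])
        turn = a1 - a0
        if turn <= -math.pi:      # wrap exterior angle to (-pi, pi]
            turn += 2 * math.pi
        elif turn > math.pi:
            turn -= 2 * math.pi
        theta += turn
    cum = [0.0]
    for length in lengths:
        cum.append(cum[-1] + length)
    total = cum[-1]
    grid, k = [], 0
    for s in range(samples):
        t = s / samples * total
        while k + 1 < n and cum[k + 1] <= t:
            k += 1
        grid.append(angles[k])
    return grid

def l2_turning_distance(p1, p2, samples=256):
    """L2 distance between the two turning functions, for a fixed
    starting vertex and reference orientation."""
    f = turning_function(p1, samples)
    g = turning_function(p2, samples)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)) / samples)
```

Because the turning function depends only on edge directions relative to the first edge and on normalized arc length, translated, rotated, or uniformly scaled copies of a polygon compare as identical under this sketch.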
Article
In this paper, we discuss the challenge of 3DTV and present our requirements for a novel three-dimensional video format. We show that the ad-hoc stereoscopic video format does not provide the necessary flexibility for user interactivity and non-stereo-based 3D displays.