Towards Interactive Generation of "Ground-truth" in Background
Subtraction from Partially Labeled Examples
Etienne Grossmann, Amit Kale and Christopher Jaynes
Department of Computer Science and Center for Visualization and Virtual Environments
University of Kentucky, Lexington KY 40507
{etienne,amit,jaynes}@cs.uky.edu
This work was funded by NSF CAREER Award IIS-0092874.
Abstract
Ground truth segmentation of foreground and background is
important for performance evaluation of existing techniques
and can guide principled development of video analysis al-
gorithms. Unfortunately, generating ground truth data is
cumbersome and incurs a high cost in human labor. In this
paper, we propose an interactive method to produce fore-
ground/background segmentation of video sequences cap-
tured by a stationary camera that requires comparatively
little human labor, while still producing high quality results.
Given a sequence, the user indicates, with a few clicks in a
GUI, a few rectangular regions that contain only foreground
or background pixels. Adaboost then builds a classifier that
combines the output of a set of weak classifiers. The result-
ing classifier is run on the remainder of the sequence. Based
on the results and the accuracy requirements, the user can
then select more example regions for training. This cycle
of hand-labeling, training and automatic classification steps
leads to a high-quality segmentation with little effort. Our
experiments show promising results, raise new issues and
provide some insight on possible improvements.
1 Introduction
Ground truth is very important in computer vision because
of the central role it can play in the empirical analysis
and development of algorithms. Labeled image databases,
for example, are used to train image classifiers and detec-
tors [19] or face recognition methods [13]. Ground truth
also provides “perfect” data to test existing methods that
were built on the premise that this data is available. For
example, having exact silhouettes of walkers helps in the
evaluation of gait recognition algorithms [15]. Perhaps
the most common usage of ground truth is in quantitative
performance evaluation, where it sets the gold standard,
whether for ego-motion estimation [1], three-dimensional
reconstruction [18], background subtraction [6, 16], or other
tasks.
Unfortunately, ground truth is usually hard to obtain. La-
beling image databases or videos requires a person to view
and annotate hundreds or thousands of images [7]. Gener-
ating the ground truth for background subtraction requires
that the user hand-segments each image, a cumbersome task
which may require two to thirty minutes per image [6].
We propose an off-line method to generate near-perfect
foreground segmentation of video sequences requiring sig-
nificantly less manpower than extensive hand-labeling. The
approach utilizes the power of supervised learning [17] and
off-the-shelf weak classifiers to lift the burden of labeling
from the user. Given a sequence, the user indicates, with
a few clicks in a GUI, a few rectangular regions that con-
tain only foreground or background pixels. These regions
typically consist of large contiguous background regions as
well as certain identifiable regions from the foreground e.g.
head/torso/leg regions of people. Adaboost then builds a
classifier using the output from a set of weak classifiers ap-
plied to this data. The resulting classifier is run on the re-
mainder of the sequence. Based on the results and the ac-
curacy requirements of the ground truth data, the user can
then select more example regions for training.
This cycle of hand-labeling, learning and automatic
classification rounds will eventually lead to a satisfactory
level of precision, as long as the automatic classifier is capa-
ble of achieving an arbitrary precision on the training data.
Since we use Adaboost [17] to train the classifier, this con-
dition is equivalent to Adaboost’s “weak learner” condition.
In our case, the weak classifiers are ordinary image and
video filters, as shown in Figure 1 (b-d) as well as post-
processing operators applied to the boosted classifier. Our
method requires a single parameter to be set, viz. the desired
classification error of Adaboost on the training dataset or the
total number of training steps. In short, we use supervised
learning and image processing tools to learn the distinction
between foreground and background.
Several attempts have been made towards achieving
semi-automatic generation of ground truth for background
subtraction. Black et al. [3] used consistency in shape and
appearance between successive frames to build approximate
ground truth. Liu and Sarkar [12] used a hidden Markov
model coupled with an eigen-stance model to gener-
ate silhouettes for evaluation of gait recognition algorithms.
It should be noted, however, that the quality of the fore-
grounds obtained with such techniques, which rely on
"tracking" between key frames, depends on how well the
assumptions of spatio-temporal continuity hold, and may
lead to increasing (and unknown) errors in the result. Furthermore
refinement of the computed foregrounds using these meth-
ods is not straightforward since it requires the user to manu-
ally label the foreground for frames where tracking fails. In
contrast, we make minimal assumptions about the dynamic
scene to obtain high quality foreground segmentation. Fur-
thermore iterative refinement of the quality of segmentation
is relatively straightforward to achieve. By utilizing an iter-
ative boosting method, applied to each frame, the approach
is free from tracking failures and robustness issues.
Recently there has been an upsurge in the use of super-
vised learning methods in computer vision. The main rea-
son is that supervised learning methods, e.g. Adaboost, de-
cision trees, and neural networks, combine simple decision
rules (weak classifiers, stumps, neurons) to obtain classi-
fiers that outperform ad-hoc methods [21]. Moreover, such
ad-hoc methods require more field-specific knowledge from the de-
signer. Examples of recent uses of supervised learning ap-
proaches in computer vision include novel view genera-
tion [9, 8], and face or pedestrian detection [21, 22].
The reader should note that, while these approaches
solve existing computer vision problems, our method serves
to alleviate the human effort required for accurate fore-
ground segmentation. In this respect our work is related
to that of Agarwala et al. [2] where human interaction is
combined with energy minimization based curve tracking
to produce rotoscope sequences, saving a huge amount of
labor with respect to prior methods. Our contribution, then,
will be apparent in the ease of high quality “ground truth”
production for domains that typically require significant ef-
fort and will, hopefully, lead to more complete and com-
monplace use of ground-truth both in the development and
analysis of vision algorithms. Finally, our work gives some
insight on the background segmentation problem, as the
composition of the boosted classifiers suggests an answer
to the question: “what are the useful features in background
subtraction in a particular sequence?”
2 Methodology
Having presented the principle of labeling, learning and val-
idation rounds, we show how Adaboost is used to learn clas-
sifiers.
2.1 Supervised learning with Adaboost
We use Adaboost [10, 17] for many reasons, one of them
being that its theoretical properties have been studied exten-
sively [17, 11] and it has been observed to generalize well.
Moreover, the algorithm itself requires a single parameter
to be set, the number of training steps T.
The training process in Adaboost results in a classifier
$$H(X) = \mathrm{Sign}\Big(\sum_{t=1}^{T} \alpha_t\, h_t(X)\Big) \in \{-1, 1\}, \qquad (1)$$
where $X \in \mathcal{X}$ is the observed data used in classification, $h_t : \mathcal{X} \rightarrow [-1, 1]$ is a "weak classifier" belonging to a class of functions $\mathcal{H}$, and $\alpha_t \in \mathbb{R}$ is a weighting coefficient. $H$ returns $+1$ to indicate foreground and $-1$ to indicate background.
Adaboost requires that the weak classifiers perform "better than guessing." Mathematically, this means that there exists a positive $\gamma$ (which does not need to be known) such that, given a sample of data $(X_1, y_1), \ldots, (X_N, y_N)$, where $y_n \in \{-1, 1\}$ represents the class (background or foreground, in our case) of input $X_n$, and given positive weights $D(1), \ldots, D(N)$ that sum to one, there exists a classifier $h \in \mathcal{H}$ such that its error
$$\sum_{n=1}^{N} D(n)\, [[\, h(X_n)\, y_n < 0\, ]]$$
is less than $1/2 - \gamma$, where $[[\, h(X_n)\, y_n < 0\, ]]$ is 1 (resp. 0) if $h(X_n)\, y_n$ is negative or not, i.e. if $h$ wrongly (resp. correctly) predicts the class of $X_n$. If this assumption holds, then the classifier (1) built by Adaboost will have an error on the training data that decreases exponentially with $T$.
The input X may e.g. be the RGB values and location of
a pixel, and may also include information on its spatial and
temporal neighborhood. At most, X could include, beyond
the pixel location (x, y) and RGB value, the whole image
and perhaps the whole sequence too. What exactly the set $\mathcal{X}$
is does not matter here, because Adaboost only "sees" the
values $h(X)$, for $h \in \mathcal{H}$, and not $X$ itself.
The weak classifiers $h_t$ and weights $\alpha_t$ in Equation (1) are determined by the Adaboost rule at the $t$-th training step. The training data consists of examples $(X_1, y_1), \ldots, (X_N, y_N)$. At each training step, Adaboost chooses the classifier $h_t \in \mathcal{H}$ and weight $\alpha_t \in \mathbb{R}$ that minimize a criterion related to the error. In the present work, we use the criterion of [17, Sec. 4].
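To make the training step concrete, the sketch below runs the boosting loop over a matrix of precomputed filter responses. It is a minimal, self-contained illustration, not the authors' code: it uses the classical discrete-AdaBoost update rather than the confidence-rated criterion of [17, Sec. 4] used in the paper, and it searches thresholds naively (a linear-time search is described in Section 2.2 and sketched there). The (M, N) response-matrix layout and the function names are illustrative assumptions.

```python
import numpy as np

def adaboost_train(F, y, T=20):
    """Train the boosted classifier of Eq. (1) on precomputed filter responses.
    F: (M, N) array, F[m, n] = f_m(X_n); y: (N,) labels in {-1, +1}.
    Returns a list of tuples (m, tau, polarity, alpha)."""
    M, N = F.shape
    D = np.full(N, 1.0 / N)                       # example weights D(n)
    ensemble = []
    for _ in range(T):
        best = None
        for m in range(M):                        # naive search over filters,
            for tau in np.unique(F[m]):           # thresholds and polarities
                for pol in (+1, -1):
                    h = pol * np.where(F[m] >= tau, 1, -1)
                    err = D[h != y].sum()
                    if best is None or err < best[0]:
                        best = (err, m, tau, pol, h)
        err, m, tau, pol, h = best
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)     # weight alpha_t of the weak classifier
        D = D * np.exp(-alpha * y * h)            # re-weight the examples
        D = D / D.sum()
        ensemble.append((m, tau, pol, alpha))
    return ensemble

def adaboost_predict(ensemble, F):
    """Strong classifier H(X) = Sign(sum_t alpha_t h_t(X)) of Eq. (1)."""
    score = np.zeros(F.shape[1])
    for m, tau, pol, alpha in ensemble:
        score += alpha * pol * np.where(F[m] >= tau, 1, -1)
    return np.where(score >= 0, 1, -1)
```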
2.2 Image filters as weak classifiers
We now detail how image filters, i.e. image processing op-
erations, can be used as weak classifiers suitable for Ad-
aboost. Assume we have image filters $f_1, \ldots, f_M$ (listed below) that produce, for a given pixel location and value $X \in \mathcal{X}$, a value $f_m(X) \in \mathbb{R}$ (or in a finite set such as $\{1, \ldots, 255\}$). Then, for every $m \in \{1, \ldots, M\}$ and every threshold $\tau \in \mathbb{R}$, we define the weak classifier
$$h_{m,\tau}(X) = \mathrm{Sigmoid}\left(f_m(X) - \tau\right). \qquad (2)$$
The set of weak classifiers is $\mathcal{H} = \{\, h_{m,\tau} \mid 1 \le m \le M,\ \tau \in \mathbb{R} \,\}$.
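The text does not specify which sigmoid is used in Equation (2); the following minimal sketch uses tanh as one possible squashing function that maps the thresholded filter response into [-1, 1].

```python
import numpy as np

def weak_classifier(filter_values, tau, scale=1.0):
    """Weak classifier h_{m,tau}(X) of Eq. (2): a sigmoid of the filter response
    minus the threshold. Using tanh (and the optional scale) is an assumption;
    any monotone squashing into [-1, 1] fits the definition in the text."""
    return np.tanh(scale * (np.asarray(filter_values, dtype=float) - tau))
```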
Optimal threshold determination: it is of practical importance to note that the optimal threshold $\tau$ that minimizes the classification error can be determined with complexity proportional to the number of examples. Surprisingly, we did not find a prior reference to this fact, so we explain here how it is done. It results from the image filters taking only a finite number of values, e.g. $0, 1, \ldots, 255$, so that only the thresholds $\tau_1 = -0.5, \tau_2 = 0.5, \ldots, \tau_{257} = 255.5$ need to be considered. By noting that the sum
$$S_i = \sum_{n \;\mathrm{s.t.}\; f(X_n) = i} D_t(n)\, y_n$$
is the difference between the cost of threshold $\tau_i$ and that of threshold $\tau_{i+1}$, one easily shows that the best threshold $\tau_i$ is the one corresponding to the minimum of $\sum_{j<i} S_j$. The absolute error is then $\sum_{n \;\mathrm{s.t.}\; y_n=1} D_t(n) + \sum_{j<i} S_j$. We may thus determine in linear time the optimal threshold for both the filters $f(X_n)$ and $-f(X_n)$. This is obviously an improvement over the naive procedure, which has complexity $O(N \log N)$ at each boosting step.

More generally, when the filter may take $R$ different values, the optimal threshold can always be found in time $O(NR)$ after performing a preprocessing step of complexity $O(RN \log N)$ in which the values the filter takes on the training examples are sorted.
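Sketched below is one way to implement this linear-time threshold search for an integer-valued filter, using per-value bucket sums in the spirit of the quantities $S_i$ above. The bucketing with `np.bincount`, the handling of both polarities, and the assumption that the weights sum to one are implementation choices, not details taken from the paper.

```python
import numpy as np

def best_threshold(f_values, y, D, num_levels=256):
    """Linear-time search (after O(N) bucket counts) for the threshold tau
    minimising the weighted error of h(X) = sign(f(X) - tau), for an
    integer-valued filter with values in {0, ..., num_levels - 1}.
    Both polarities (the filter f and its negation -f) are evaluated.
    f_values: (N,) non-negative ints; y: (N,) labels in {-1, +1};
    D: (N,) positive weights summing to one."""
    f_values = np.asarray(f_values)
    y = np.asarray(y)
    D = np.asarray(D, dtype=float)

    pos_w = np.bincount(f_values, weights=D * (y == 1), minlength=num_levels)
    neg_w = np.bincount(f_values, weights=D * (y == -1), minlength=num_levels)

    # For tau_i = i - 0.5 (i = 0 .. num_levels), predicting +1 on {f >= i}:
    # error(i) = (positive weight below tau_i) + (negative weight at or above tau_i)
    pos_below = np.concatenate(([0.0], np.cumsum(pos_w)))
    neg_at_or_above = neg_w.sum() - np.concatenate(([0.0], np.cumsum(neg_w)))
    err_plus = pos_below + neg_at_or_above        # polarity +1
    err_minus = 1.0 - err_plus                    # polarity -1 (negated filter)

    i_p, i_m = int(np.argmin(err_plus)), int(np.argmin(err_minus))
    if err_plus[i_p] <= err_minus[i_m]:
        return i_p - 0.5, +1, float(err_plus[i_p])
    return i_m - 0.5, -1, float(err_minus[i_m])
```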
Having presented some general aspects of the boosting
procedure, we now specify the filters that are used in our
experiments. The output of these filters, on image 120 of
the MIT sequence used in Section 3, is shown in Figure 1.
2.2.1 Spatial Correlation filter
Classification with these filters is based on spatial correla-
tion of different sized neighborhoods of the input images
with a mean background-only image obtained from the be-
ginning of the sequence. The correlation for each pixel in
the output image is computed as:
$$f^{\mathrm{Corr}}_m(X) = \mathrm{Corr}\left(B(N_m, X),\, T(N_m, X)\right), \qquad (3)$$
where $B(N_m, X)$ (resp. $T(N_m, X)$) is a neighborhood of width $N_m$ around $X$ in the input (resp. mean background) image. The correlation will be low in foreground regions and high in background regions. We used nine neighborhood sizes $N_m \in \{3, 5, 7, 9, 11, 15, 21, 27, 33\}$, leading to varying levels of smoothing in the outputs. Figure 1 (b) shows the filter values with $N_m = 3$.
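A minimal per-pixel sketch of the correlation filter of Equation (3) follows, for a single grayscale frame. The paper does not state which correlation measure is used; normalized (zero-mean) cross-correlation is assumed here, and the plain double loop is kept for clarity rather than speed.

```python
import numpy as np

def correlation_filter(frame, mean_background, half):
    """Spatial-correlation filter of Eq. (3), sketched for grayscale images:
    for every pixel, the normalised cross-correlation between the
    (2*half + 1)-wide neighbourhood of the current frame and the same
    neighbourhood of the mean background-only image (N_m = 2*half + 1)."""
    frame = frame.astype(float)
    bg = mean_background.astype(float)
    H, W = frame.shape
    out = np.zeros((H, W))
    for yy in range(H):
        for xx in range(W):
            y0, y1 = max(0, yy - half), min(H, yy + half + 1)
            x0, x1 = max(0, xx - half), min(W, xx + half + 1)
            a = frame[y0:y1, x0:x1] - frame[y0:y1, x0:x1].mean()
            b = bg[y0:y1, x0:x1] - bg[y0:y1, x0:x1].mean()
            denom = np.sqrt((a * a).sum() * (b * b).sum())
            out[yy, xx] = (a * b).sum() / denom if denom > 0 else 0.0
    return out
```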
2.2.2 Spatio-temporal Filters
In these filters, image pixels are classified based on spatio-temporal consistency in successive images. The output image is generated as follows: the current frame $t$ is spatially smoothed by a binomial filter of width $\sigma$, resulting in values $\hat{X}^{\sigma}_t$. These values are time-smoothed using a first-order AR filter: $\hat{X}^{\sigma,\lambda}_t = \lambda \hat{X}^{\sigma}_t + (1 - \lambda)\, \hat{X}^{\sigma,\lambda}_{t-1}$. Finally, the filter is
$$f^{\mathrm{ST}}_m(X_t) = \left|\hat{X}^{\sigma,\lambda}_t - \hat{X}^{\sigma,\lambda}_{t-1}\right|. \qquad (4)$$
In Section 3, we use 12 ("hall-monitor" sequence) or 48 ("MIT" sequence) such classifiers, corresponding respectively to $\sigma \in \{2, 16\}$, $\lambda \in \{1/2, 16/17\}$ and to $\sigma \in \{0, 2, 4, 16\}$, $\lambda \in \{0, 1/2, 4/5, 16/17\}$; each of these filters was applied in three color spaces: RGB, HS and V. Figure 1 (c) shows the output of these filters, for the V component, with $\sigma = 2$ and $\lambda = 1/2$.
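The following sketch realizes one spatio-temporal filter of Equation (4) as a small stateful object. The exact parameterization of the binomial smoothing (repeated convolution with [0.5, 0.5]) and the recursive form of the first-order AR smoothing are assumptions; the paper only names the filter types and the parameters sigma and lambda.

```python
import numpy as np
from scipy.ndimage import convolve1d

def binomial_kernel(width):
    """1-D binomial kernel obtained by convolving [0.5, 0.5] with itself `width` times."""
    k = np.array([1.0])
    for _ in range(max(int(width), 0)):
        k = np.convolve(k, [0.5, 0.5])
    return k

class SpatioTemporalFilter:
    """Spatio-temporal filter of Eq. (4): separable binomial spatial smoothing of
    width sigma, first-order AR temporal smoothing with parameter lam, and the
    absolute difference between consecutive smoothed frames."""
    def __init__(self, sigma, lam):
        self.kernel = binomial_kernel(sigma)
        self.lam = lam
        self.smoothed_prev = None             # \hat{X}^{sigma,lambda}_{t-1}

    def __call__(self, frame):
        x = frame.astype(float)
        x = convolve1d(x, self.kernel, axis=0, mode='nearest')
        x = convolve1d(x, self.kernel, axis=1, mode='nearest')   # \hat{X}^sigma_t
        if self.smoothed_prev is None:
            self.smoothed_prev = x
        smoothed = self.lam * x + (1.0 - self.lam) * self.smoothed_prev
        out = np.abs(smoothed - self.smoothed_prev)               # Eq. (4)
        self.smoothed_prev = smoothed
        return out
```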
2.2.3 Pixel-wise probabilistic filters
Classification with these filters is based on the probability that the current RGB (or HSV, or Laplacian-of-Gaussian (LOG)) value $X$ at a given pixel belongs to the background, assuming a kernel probability model:
$$f^{\mathrm{Kernel}}_m(X) = \sum_{i=1}^{P} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_{m,j}}\, e^{-(X_j - Z_{i,j})^2 / 2\sigma^2_{m,j}}, \qquad (5)$$
where $Z_1, \ldots, Z_P$ are $d = 3$-dimensional vectors of RGB (or HSV or LOG) background values observed in the first $P$ frames of the sequence, which are assumed to be background-only. The parameter $\sigma_{m,j}$ for each pixel is allowed to take 10 different values around the value suggested in Elgammal et al. [5]. We thus have $10 \times 3 = 30$ different pixel-wise probabilistic filters at our disposal. Figure 1 (d) shows a possible output of the filter (negated), computed on the RGB representation of the image.
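A direct sketch of the kernel score of Equation (5) for a single pixel is shown below; whether the sum over the P background samples is additionally normalized by P is not stated in the equation, so the sketch follows it as written.

```python
import numpy as np

def kernel_background_probability(pixel_value, background_samples, sigma):
    """Pixel-wise probabilistic filter of Eq. (5): kernel density score of the
    current d-dimensional pixel value (e.g. RGB) under the P background-only
    samples Z_1..Z_P observed at that pixel in the first frames.
    background_samples: (P, d) array; sigma: (d,) per-channel bandwidths."""
    X = np.asarray(pixel_value, dtype=float)            # shape (d,)
    Z = np.asarray(background_samples, dtype=float)     # shape (P, d)
    sigma = np.asarray(sigma, dtype=float)              # shape (d,)
    diff2 = (X - Z) ** 2                                 # (P, d)
    per_dim = np.exp(-diff2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.sum(np.prod(per_dim, axis=1)))       # sum over samples of product over channels
```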
2.2.4 Morphological operators
Morphological operators are applied at each training step $1 < t \le T$ to the current value of the unthresholded classifier,
$$H_t(X_n) = \sum_{s=1}^{t-1} \alpha_s\, h_s(X_n), \qquad n \in \{1, \ldots, N\}.$$
Figure 1: From left to right: (a) image 120 of the MIT Indoor sequence; (b) output of a correlation filter (Sec. 2.2.1); (c) of a spatio-temporal filter (Sec. 2.2.2), negated to be more visible; (d) of a probabilistic filter (Sec. 2.2.3), negated to be more visible; (e) region segmentation (Sec. 2.2.5).
We use grey-level operators of erosion, dilation, opening
and closing, with radii of 1, 2, 3 and 4 pixels, which re-
sult in grey-level images. There are thus 16 morphological
operators at our disposal.
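A possible realization of these post-processing weak classifiers, using SciPy's grey-level morphology on the current unthresholded score map H_t, is sketched below. The square structuring element of side 2r + 1 is an assumption; the paper only specifies the radii.

```python
from scipy import ndimage

def morphological_weak_outputs(score_map, radii=(1, 2, 3, 4)):
    """Morphological weak classifiers of Sec. 2.2.4: grey-level erosion, dilation,
    opening and closing of the current unthresholded score map H_t, for each radius.
    Returns 4 radii x 4 operators = 16 grey-level images."""
    outputs = []
    for r in radii:
        size = (2 * r + 1, 2 * r + 1)          # assumed square structuring element
        outputs.append(ndimage.grey_erosion(score_map, size=size))
        outputs.append(ndimage.grey_dilation(score_map, size=size))
        outputs.append(ndimage.grey_opening(score_map, size=size))
        outputs.append(ndimage.grey_closing(score_map, size=size))
    return outputs
```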
2.2.5 Region-voting
The region-voting classifier is based on a segmentation of
the original image and on the classification at the current
training step. For any pixel $X$ belonging to a region consisting of $R$ pixels $\{X_{r_1}, X_{r_2}, \ldots, X_{r_R}\}$, the region-voting classifier is defined by
$$f^{\mathrm{Reg}}_m(X) = \frac{1}{R} \sum_{i=1}^{R} H_t(X_{r_i}). \qquad (6)$$
A different region-voting weak classifier family is defined
for each possible region segmentation. In the present work,
we use the segmentation method of Comaniciu and Meer [4]
with three different values of the color histogram distance
threshold, resulting in three families of weak classifiers,
corresponding to different levels of granularity of segmen-
tation. Note that only the regions matter in this filter, not
the color given to each region by the segmentation method.
Figure 1 (e) shows the most detailed region segmentation.
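The region vote of Equation (6) can be computed for all pixels at once from an integer label image, as in the following sketch; any segmentation producing such a label image, e.g. mean shift [4], can be plugged in.

```python
import numpy as np

def region_voting_filter(score_map, region_labels):
    """Region-voting filter of Eq. (6): every pixel receives the mean of the
    current unthresholded classifier H_t over the segmentation region it
    belongs to.  `region_labels` is a non-negative integer label image;
    only the region partition matters, not the region colors."""
    labels = np.asarray(region_labels)
    flat_labels = labels.ravel()
    flat_scores = np.asarray(score_map, dtype=float).ravel()
    sums = np.bincount(flat_labels, weights=flat_scores)
    counts = np.bincount(flat_labels)
    means = sums / np.maximum(counts, 1)
    return means[flat_labels].reshape(labels.shape)
```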
Note that using morphological operators and region-
voting theoretically changes the boosting framework, since
the set of weak classifiers varies during boosting. Our study
tends to show that this change has no adverse consequence
in theory and that it may be useful in practice.
Altogether, in the following section, we use
(9+48+28+16+3=) 104 filters on the MIT sequence,
and (9+12+28+16+3=) 68 in the “hall-monitor” sequence.
Even if, as explained above, each one yields a family of
weak classifiers indexed by the threshold τ in Equation (2),
this is a relatively small number of weak classifiers.
3 Experiments
Having detailed our methodology, we show how it performs
in practice and analyze the relative importance of each type
of weak classifier in the classifier built by Adaboost. We
also compare the results of the presented method with those
obtained when one uses high quality hand-segmented im-
ages for training.
To avoid confusion, we use the term “learning round” to
indicate the rounds of the labeling-training-validation cy-
cle, while “training steps” denotes the successive steps of
Adaboost.
Hall-monitor We first use the well-known "hall-monitor"
sequence, consisting of 300 images of size 352 × 288. For valida-
tion, we use a hand-made segmentation: Prof. Erdem [6] provided
hand-labeled ground truth for the left sides of images 32-240, and
the authors hand-labeled the right sides of images 70, 100, 130,
160, 190 and 220 (Figure 2, top).
For training, in the first round, we used the rectangular
regions in images 121, 126, 131 and 136 shown in Figure 3,
top. The lighter (and greener) boxes are regions of fore-
ground, while darkened (reddened) regions are background.
Adaboost reduced the training error to zero in five steps, as
shown by the circle curve in Fig. 4, left, which plots the training
error vs. the number of training steps. The validation error
remained approximately constant, with a slight increase, af-
ter the fifth Adaboost step (Fig. 4, left, full curve). This
indicates that Adaboost slightly overfits the data. After ex-
amining the output of the classifier on the training images
(this output is not shown), we added some boxes, resulting
in the training set shown in Figure 3, bottom.
This data was used in the second round of training. This
time, 17 training steps were needed to zero the training error.
The same slight overfitting trend as in the first round was ob-
served. Surprisingly, the validation error increased with re-
spect to the first round (Fig. 4, left: the dashed curve is
above the full curve), despite the fact that the output ap-
pears visually improved (Fig. 2, third row vs. second). This
is better understood by looking at the false positive and false
negative rates: in the first round, these were 0.88% (FP) and
3.51% (FN), while in the second round, they were 1.15%
(FP) and 2.36% (FN). Indeed, in the second round of label-
ing, we focused on reducing the false negatives, by adding
labeled boxes mainly in foreground regions. While this goal
Figure 2: Row 1: Hand-labeled ground truth for images 70, 100, 190 and 220 of the "Hall-monitor" sequence. Row 2:
Output of the boosted classifier trained with the first set of boxes, superposed on the hall-monitor images. Row 3: Output of
the boosted classifier trained with the second set of boxes. Row 4: Output of the method of Elgammal et al. [5].
was achieved, the false positive rate increased slightly. Since
the data is overwhelmingly background, this results in a
global increase of the misclassification rate.
These results were also compared (though this is not a fair
comparison) with the unsupervised kernel-based approach
of Elgammal et al. [5] (shown in row 4). The false posi-
tive rate of their method was found to be 0.37% (compare
with 0.88% and 1.15% for rounds one and two), the false
negative rate was 15.7% (compare with 3.51% and 2.36%
for rounds one and two), and the overall error was 1.76%
(we reach 1.01% and 1.23% in the first and second rounds).
Although the difference in error is not huge, we achieve a
much lower false negative rate and our false positives form a
visually acceptable “fat border” around the true foreground
region.
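For reference, the error rates quoted in this section can be computed from binary masks as sketched below; normalizing false positives by the number of background pixels and false negatives by the number of foreground pixels is an assumed convention, consistent with the figures reported here.

```python
import numpy as np

def segmentation_error_rates(predicted, ground_truth):
    """Error rates of a binary foreground mask against ground truth
    (boolean arrays, True = foreground).  The false-positive rate is taken
    relative to the background pixels, the false-negative rate relative to the
    foreground pixels, and the overall error relative to all pixels -- an
    assumed convention, not stated explicitly in the paper."""
    pred = np.asarray(predicted, dtype=bool)
    gt = np.asarray(ground_truth, dtype=bool)
    fp = np.count_nonzero(pred & ~gt) / max(np.count_nonzero(~gt), 1)
    fn = np.count_nonzero(~pred & gt) / max(np.count_nonzero(gt), 1)
    overall = np.count_nonzero(pred != gt) / gt.size
    return fp, fn, overall
```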
A fairer comparison is with our previous work, in which
carefully hand-labeled images, like those in Fig-
ure 2, top, are used for training. In this case, using images
70, 100, 130, 160 and 190 for training and image 220 for
validation, and performing 20 boosting steps, we obtained a
training error of 0.51%, a validation error of 0.96%, a false
positive rate of 0.89% and a false negative rate of 2.8%.
Again, while these error rates appear good, the output is in
fact visually less appealing than that obtained when training
with boxes and shown in Fig. 2.
Beyond the classification error, it is also worth studying the
relative importance of the various weak classifiers: in the
second round, the final classifier consisted of the sum of 10
image-processing filters (6 probabilistic, 3 spatio-temporal
and 1 correlation filter), 6 morphological operators and
2 region-voting filters. The contributions of these classes
of filters in Eq. (1) were respectively 36%, 56% and 7.5%.
The light dashed curve in Fig. 4, left, is the validation error
of a classifier trained on the
Figure 3: Training data for the "Hall-monitor" sequence, with superposed boxes labeling foreground and background
regions. Foreground regions are indicated by lightened green boxes; background is indicated by the darkened red interior
or exterior of boxes. Top: data used in the first training round. Bottom: data used in the second round. Fig. 2 shows the
output of the boosted classifiers on another set of images.
second round data, but without the morphological opera-
tors. While this suggests that morphological operators are
detrimental in terms of validation error, it should be noted
that these operators improve the visual aspect of the segmen-
tation by removing artifacts of other weak classifiers. A
more detailed analysis is done on the next dataset.
These experiments provide insight on how to further de-
velop interactive methods for offline background subtrac-
tion. These ideas are discussed in Section 4.
MIT indoor data This image sequence was taken at the
MIT AI Laboratory and provided by Joshua Migdal [14],
together with some hand-labeled frames. It consists of a
person entering the field of view of the camera on the right,
walking to the left out of the camera’s view. The top row
of Figure 5 shows the hand-labeled segmentation of frames
115, 125, 130 and 135, which we use for validation.
Five to 12 other images were used in training. At each
round, some new boxes were labeled and new images added
to the training set. Fig. 5, rows 2-4 show the output of the
classifier on the validation images at the first, second and
fourth learning rounds.
At the second learning round, we studied the importance
of each type of weak classifier. This was done by comparing
the performance obtained when weak classifiers were lim-
ited to (1) Image filters only (those of Secs. 2.2.1, 2.2.2 and
2.2.3); (2) Image filters and morphological operators; (3)
Image filters and region voting; (4) Image filters, morpho-
logical operators and region voting. For each combination
of weak classifier families, eight Adaboost classifiers were
trained, one for each subset of eight training images chosen
in {113, 116, 120, 123, 128, 133, 136, 138, 139}. The aver-
age performance of these eight classifiers is recorded and
plotted in Fig. 4, right. These curves show clearly that mor-
phological operators are detrimental in terms of validation
error, whether region voting is used or not. In this dataset,
contrary to the "hall-monitor" dataset, the global error was
decreased from the first to the second learning round. With-
out the morphological operators, the validation error, false
positive and false negative rates are, after the first learning
round, 1.00%, 0.81% and 4.03%. After the second round,
these figures are: 0.96%, 0.86% and 2.50%. After the fourth
round, these figures are: 1.29%, 1.47% and 1.27%. These
last figures indicate a tendency to overfit when more com-
plete training data is used, even when morphological op-
erators are not used. This effect is comparable to that ob-
tained when training the image segmentation algorithm for
the “hall-monitor” sequence, using hand-labeled data for
training.
4 Conclusions
In this paper we presented an iterative approach based on
supervised learning to generate high quality foreground
segmentation for video sequences, which involves the user
marking a few rectangular regions, as opposed to perform-
ing laborious hand-segmentation. In addition, by training
multiple classifiers on slightly different datasets, as we did
with the MIT sequence, it is possible to get a confidence
measure for the method.
Our objective, in the future, is to further improve the quality of the segmentation.
[Figure 4 plots error vs. number of boosting steps. Left panel legend: Validation Error w/ Morph. (Round 1), Validation Error w/ Morph. (Round 2), Validation Error w/o Morph. (Round 2), Training Error w/ Morph. (Round 1), Training Error w/ Morph. (Round 2). Right panel legend: Filters only, Filters+Morph, Filters+Regions, All, with validation and training curves for each.]
Figure 4: Left: Validation and training error as a function of number of boosting steps, on the “Hall-Monitor” sequence. From bottom
to top: training error in first learning round, in second learning round, validation error in first learning round, in the second learning
round (without morphological classifiers) and in the second learning round with all classifiers. See the text for a complete explanation.
Right: Average validation and training error obtained with the training data of the second learning round on the MIT sequence. Each curve
corresponds to a combination of weak classifiers.
Figure 5: Row 1: Hand-labeled ground-truth for the MIT Indoor sequence, images 115, 125, 130 and 135. Rows 2-4: Output
of the boosted classifier after 1, 2 and 4 learning rounds.
Achieving this objective would bring the following benefits:
- Quantitative evaluation of unsupervised background subtraction methods, as done in [6], would be possible with very little extra labor.
- The availability of large amounts of segmentation data enables new approaches to background subtraction, for example approaches based on supervised learning.
The results obtained thus far when training from boxes
are quantitatively comparable to those obtained when train-
ing from hand-segmented images. Of course, the results are
better than those obtained with unsupervised learning. We
have seen that early in the interactive learning process, re-
sults are very appealing visually, owing to the low false neg-
ative rates and to the absence of patches of false positives.
We observed that false positives form a “fat border” around
the true positives.
However, trying to reduce this border does not give good
results with our current approach. Adding labeled boxes
to the training set quickly yields results comparable to those
obtained when training from hand-segmented images, i.e.
the limit of infinitely many labeled boxes.
We also showed experimentally that not all image filters and
post-processing operators are equally useful. This sug-
gests using different -more powerful- image operators, e.g.
coarse “tracking” filters similar to those in Black et al [3].
We also encountered a discrepancy between classification
error and the visual appeal of a segmentation. This suggests us-
ing quality measures closer to human perception [20], for
performance evaluation, and also during training. This last
point would require changes to the boosting framework, be-
yond that of using post-processors as weak classifiers. In
future work, we also consider using semi-supervised learn-
ing methods to automatically identify unlabeled regions that
would be particularly informative for the learner.
In summary, we have already obtained relatively promis-
ing results and our experiments suggest many directions for
improvements.
References
[1] H. Adams, S. Singh, and D. Strelow. An empirical compari-
son of methods for image-based motion estimation. In IROS,
Lausanne, Switzerland, 2002.
[2] A. Agarwala, A. Hertzmann, D. H. Salesin, and S. M.
Seitz. Keyframe-based tracking for rotoscoping and anima-
tion. ACM Trans. Graph., 23(3):584–591, 2004.
[3] P. Black, T. Ellis, and P. Rosin. A novel method for video
tracking performance evaluation. IEEE Int. Workshop on Vi-
sual Surveillance and Performance Evaluation of Tracking
and Surveillance (VS-PETS), pages 125–132, 2003.
[4] D. Comaniciu and P. Meer. Mean shift: A robust approach
toward feature space analysis. IEEE Trans. Pattern Analysis
Machine Intell., 24(5):603–619, 2002.
[5] A. Elgammal, R. Duraiswami, D. Harwood, and L. S. Davis.
Background and foreground modeling using nonparametric
kernel density estimation for visual surveillance. Proceedings of the
IEEE, 90(7):1151–1163, 2002.
[6] C. E. Erdem, A. M. Tekalp, and B. Sankur. Metrics for per-
formance evaluation of video object segmentation and track-
ing without ground-truth. In IEEE Intl. Conf. on Image Pro-
cessing (ICIP), 2001.
[7] R. Fisher, J. Santos-Victor, and J. Crowley. CAVIAR:
Context aware vision using image-based active recognition.
http://homepages.inf.ed.ac.uk/rbf/CAVIAR/, 2003. EC IST
project IST 2001 37540.
[8] A. Fitzgibbon, Y. Wexler, and A. Zisserman. Image-based
rendering using image-based priors. Proceedings of the In-
ternational Conference on Computer Vision, October 2003.
[9] W. T. Freeman and E. C. Pasztor. Learning low-level vision.
In International Conference on Computer Vision, volume 2,
pages 1182–1189, 1999.
[10] Y. Freund and R. E. Schapire. A decision-theoretic gener-
alization of on-line learning and an application to boosting.
Journal of Computer and System Sciences, 55(1):119–139,
1997.
[11] J. Friedman, T. Hastie, and R.J. Tibshirani. Additive logistic
regression: a statistical view of boosting. Annals of Statistics,
28(2):337–374, 2000.
[12] Z. Liu and S. Sarkar. Effect of silhouette quality on
hard problems in gait recognition. IEEE Trans. on SMC,
35(2):170–178, April 2005.
[13] A.M. Martinez and R. Benavente. The AR face database.
Technical Report 24, Computer Vision Center of the Universitat Autònoma de
Barcelona, 1998.
[14] J. Migdal and W.E.L Grimson. Background subtraction us-
ing Markov thresholds. Proceedings of IEEE Workshop on
Motion and Video Computing, January 2005.
[15] P. J. Phillips, S. Sarkar, I. Robledo, P. Grother, and K. W.
Bowyer. The gait identification challenge problem: Data sets
and baseline algorithm. Proc of the International Conference
on Pattern Recognition, 2002.
[16] P.L. Rosin and E. Ioannidis. Evaluation of global image
thresholding for change detection. Pattern Recognition Let-
ters, 24(14):2345–2356, 2003.
[17] R. E. Schapire and Y. Singer. Improved boosting algo-
rithms using confidence-rated predictions. Machine Learn-
ing, 37(3):297–336, 1999.
[18] D. Scharstein, R. Szeliski, and R. Zabih. A taxonomy and
evaluation of dense two-frame stereo correspondence algo-
rithms, 2001.
[19] A. Torralba, K. P. Murphy, and W. T. Freeman.
The MIT-CSAIL database of objects and scenes.
web.mit.edu/torralba/www/database.html.
[20] Ranjith Unnikrishnan, Caroline Pantofaru, and Martial
Hebert. A measure for objective evaluation of image seg-
mentation algorithms. In Proc. CVPR Workshop on Empiri-
cal Evaluation Methods in Computer Vision, 2005.
[21] P. Viola and M. Jones. Robust real-time object detection. In
Proc. ICCV workshop on statistical and computational the-
ories of vision, 2001.
[22] P. A. Viola, M. J. Jones, and D. Snow. Detecting pedestrians
using patterns of motion and appearance. In ICCV, pages
734–741, 2003.