Research Article
Object Tracking with Adaptive Multicue Incremental
Visual Tracker
Jiang-tao Wang,¹ De-bao Chen,¹ Jing-ai Zhang,¹ Su-wen Li,¹ and Xing-jun Wang²
¹School of Physical and Electronic Information, Huaibei Normal University, Huaibei 235000, China
²Shandong Huisheng Group, Weifang 261201, China
Correspondence should be addressed to Jing-ai Zhang; ellazhangja@.com
Received  February ; Revised  August ; Accepted  August ; Published  September 
Academic Editor: Constantine Kotropoulos
Copyright ©  Jiang-tao Wang et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Generally, subspace learning based methods such as the Incremental Visual Tracker (IVT) have been shown to be quite effective for the visual tracking problem. However, the IVT may fail to follow the target when it undergoes drastic pose or illumination changes. In this work, we present a novel tracker to enhance the IVT algorithm by employing a multicue based adaptive appearance model. First, we carry out the integration of cues both in feature space and in geometric space. Second, the integration directly depends on the dynamically-changing reliabilities of visual cues. These two aspects of our method allow the tracker to easily adapt itself to changes in the context and accordingly improve the tracking accuracy by resolving ambiguities. Experimental results demonstrate that subspace-based tracking is strongly improved by exploiting the multiple cues through the proposed algorithm.
1. Introduction
Due to its wide applications in video surveillance, intelligent user interfaces, human motion understanding, content-based video retrieval, and object-based video compression [], visual tracking has become one of the essential and fundamental tasks in computer vision. During the past decades, numerous approaches have been proposed to improve its performance, and there is a fruitful literature on tracking algorithms that reports promising results under various scenarios. However, visual tracking still remains a challenging problem when the nonstationary appearance of an object undergoes significant pose and illumination variations, occlusions, and, for nonrigid objects, shape deformation.
When we design an object tracking system, two essential issues usually should be considered: which search algorithm should be applied to locate the target, and what type of cue should be used to represent the object. For the first issue, there are two well-known search algorithms which have been widely studied in the last decade, often referred to as particle filtering and mean shift. The particle filter performs a random search guided by a stochastic motion model to obtain an estimate of the posterior distribution describing the object's configuration []. On the other hand, mean shift, a typical and popular variational algorithm, is a robust nonparametric method for climbing density gradients to find the peak of a probability distribution [, ]. The search paradigms differ in those two methods, as one is stochastic and model-driven while the other is deterministic and data-driven.
Modeling target appearance in videos is a feature extraction problem and is known to be a more critical factor than the search strategy. Developing a robust appearance model which can adapt to target appearance changes has been the matter of primary interest in recent visual tracking research. The Incremental Visual Tracker (IVT) [] has proved to be a successful tracking method by incorporating an adaptive appearance model. In particular, the IVT models the target appearance as a low-dimensional subspace based on probabilistic principal component analysis (PPCA), where the subspace is updated adaptively based on the image patches tracked in the previous frames. In this model, the intensity differences between the target reference and the candidates are computed to measure the observation weight. The IVT alleviates the burden of constructing a target model prior to tracking with a large amount of expensive offline data and tends to yield higher tracking accuracies.
F : Two cases for the IVT tracker failure.
However, since only the intensity feature is employed to select the optimal candidate as the target, the IVT may run into trouble when the target is moving into shadow or undergoing large pose changes (as shown in Figure ).
In this work, a multicue based incremental visual tracker (MIVT) is proposed to confront the aforementioned difficulties. In a sense, our work can be seen as an extension of []. Compared to the classical IVT method, the main contributions of our algorithm are as follows. First, with color (or gray) and edge properties, our representation model describes the target with more information. Second, an adaptive multicue integration framework is designed that considers both the target and the background changes; when one cue becomes insufficiently discriminative due to target or background changes, the other will compensate. Third, the proposed multicue framework can be effectively incorporated in the particle filter tracking system, so as to make the tracking process more robust.
The rest of the paper is organized as follows. Section 2 reviews the related multicue fusion works. Section 3 gives an overview of the IVT tracking algorithm. In Section 4, we first propose our multicue appearance modeling scheme and then implement the presented MIVT tracking framework. In Section 5, a number of comparative experiments are performed. Section 6 concludes the paper.
2. Related Work
There is a rich literature on visual tracking, and a thorough discussion of this topic is beyond the scope of this paper. In this section, we review only the most relevant visual tracking works, focusing on algorithms that operate on multiple cues. Up to now, a number of papers have been published about the fusion of multiple cues. In general, there are two key issues that should be solved in a multicue based tracking algorithm: (1) what cues are used to represent the target's features and (2) how the cues are integrated. Here, we focus on the second issue.
The simplest case is to assume that the different cues are independent, so that all cues are used in parallel and treated as equivalent channels; this approach has been reported in [, ]. Based on this idea, in [], two features, intensity gradients and the color histogram, were fused directly with fixed equal weights. A limitation of this method is that it does not take account of each single cue's discriminative ability.
To avoid the limitation of the above methods, Du and Piater [] proposed a Hidden Markov Model (HMM) based multicue fusion approach. In this approach, the target was tracked in each cue by a particle filter, and the particle filters in different cues interacted via a message-passing scheme based on the HMM; four visual cues, including color, edges, motion, and contours, were selectively integrated. Jia et al. [] presented a dynamic multicue tracking scheme that integrates color and local features; the cue weights were supervised and updated based on a Histograms of Oriented Gradients (HOG) detection response. Yang et al. [] introduced a new adaptive way to integrate multiple cues for tracking multiple humans driven by human detections; they defined a dissimilarity function for each cue according to its discriminative power and applied a regression process to adapt the integration of multiple cues. In [], a consistent histogram-based framework was developed for the analysis of color, edge, and texture features.
In [], Erdem et al. carried out the integration of the
multiple cues in both the prediction step and the measure-
ment step, and they dened the overall likelihood function
so that the measurements from each cue contributed the
overall likelihood according to its reliability. Yin et al. []
designed an algorithm that combined CamShi with particle
lterusingmultiplecuesandanadaptiveintegrationmethod
was adopted to combine color information with motion
information. Democratic integration was an architecture that
allows the tracking of objects through the fusion of multiple
adaptive cues in a self-organized fashion. is method was
given by Triesch and von der Malsburg in []. It was explored
more deeply in [,]. In this framework, each cue created
a -dimensional cue report, or saliency map; the cues fusion
was carried out by resulting fused saliency map which was
computed as a weighted sum of all the cue reports. P´
erez
et al. [] utilized a particle lter based visual tracker that
fused three cues: color, motion, and sound. In their work,
color cues were served as the main visual cue, and according
to the scenario under consideration, color cues were fused
with either sound localization cues or motion activity cues.
A partitioned sampling technique was applied to combine
dierent cues; the particle resampling was not implemented
on the whole feature space but in each single feature space
separately. is technique increased the eciency of the
pa rticle lter. However, in their c ase, onl y two c ues could
be used simultaneously, this restricted the exible selection
of cues and the extension of the method. Wu and Huang
[] formulated the problem of integrating multiple cues for
robust tracking as the probabilistic inference problem of a
factorized graphical model. To analyze this complex graphical
model, a variational method was taken to approximate
the Bayesian inference. Interestingly, the analysis revealed
a coinference phenomenon of multiple modalities, which
illustrated the interactions among dierent cues; that is, one
cue could be inferred iteratively by the other cues. An ecient
sequential Monte Carlo tracking algorithm was employed to
integrate multiple visual cues, in which the coinference of
dierent modalities was approximated.
Although subspace representation models have been successfully applied to handling small appearance variations and illumination changes, they still usually fail in handling rapid appearance, shape, and scale changes. To overcome this problem for the classical IVT tracker, in this paper, we design a novel multicue based dynamic appearance model for the IVT tracking system, and this model can adapt to both the target and the background changes. We implement this model by fusing multiple cues in an adaptive observation model. In each frame, the tracking reliability is utilized to measure the weight of each cue, and an observation model is constructed from the subspace models and their corresponding weights. The appearance changes of the target are taken into account when we update the appearance models with the tracking results. Therefore, online appearance modeling and weight updating for each cue adapt our tracking approach to both the target and background changes, thereby yielding good performance.
3. Review of the IVT
The IVT models the target appearance as a low-dimensional subspace based on probabilistic principal component analysis (PPCA) and uses particle-filter dynamics to track the target.
Let the state of the target object at time $t$ be represented as
$$X_t = (x_t, y_t, \theta_t, s_t, \alpha_t, \phi_t), \quad (1)$$
where $x_t$, $y_t$, $\theta_t$, $s_t$, $\alpha_t$, and $\phi_t$ denote the $x$, $y$ translation, rotation angle, scale, aspect ratio, and skew direction at time $t$. The state dynamic model between time $t$ and time $t-1$ can be treated as a Gaussian distribution around the state at $t-1$; then, we have
$$p(X_t \mid X_{t-1}) = \mathcal{N}(X_t; X_{t-1}, \Psi), \quad (2)$$
where $\Psi$ is a diagonal covariance matrix whose elements are the corresponding variances of the state parameters.
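For concreteness, the propagation step implied by (2) can be sketched in MATLAB as follows; the function and variable names are illustrative and are not taken from the released IVT code.

```matlab
% Sketch: propagate particles with the Gaussian dynamic model of (2).
% x_prev : 1 x 6 previous target state [x, y, theta, s, alpha, phi]
% psi    : 1 x 6 variances of the state parameters (diagonal of Psi)
% M      : number of particles
% Returns an M x 6 matrix of candidate states at time t.
function X = propagate_particles(x_prev, psi, M)
    noise = bsxfun(@times, randn(M, 6), sqrt(psi));  % zero-mean Gaussian noise
    X     = bsxfun(@plus, noise, x_prev);            % centred on the previous state
end
```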
Based on (1) and (2), the particle filter can be carried out to locate the target. In this stage, first the particles are drawn from the particle filter according to the dynamical model. Then, for each particle, the corresponding window is extracted from the current frame, and its reconstruction error on the selected eigenbasis and its weight are calculated through (3) and (4), which give its likelihood under the observation model:
$$e = \left\| (I_t - \mu) - U U^{T} (I_t - \mu) \right\|, \quad (3)$$
$$w_t^i \propto \exp\left( -\frac{e^2}{2\sigma_w^2} \right), \quad i = 1, \ldots, M, \quad (4)$$
where $I_t$ is the image patch predicted by $X_t$, generated from a subspace spanned by $U$ and centered at $\mu$. In (4), $\sigma_w^2$ is the variance of the reconstruction error and $\|\cdot\|$ denotes the L2-norm.
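A minimal MATLAB sketch of the observation step in (3) and (4) follows, assuming a current eigenbasis U, subspace mean mu, and noise standard deviation sigma_w; the names are illustrative rather than taken from the released IVT code.

```matlab
% Sketch: particle weights from the subspace reconstruction error, (3)-(4).
% patches : d x M matrix, one vectorised candidate patch per particle
% U       : d x k eigenbasis, mu : d x 1 subspace mean, sigma_w : noise std
function w = observation_weights(patches, U, mu, sigma_w)
    centred = bsxfun(@minus, patches, mu);        % I_t - mu for each candidate
    proj    = U * (U' * centred);                 % projection onto the subspace
    e       = sqrt(sum((centred - proj).^2, 1));  % L2 reconstruction error, (3)
    w       = exp(-e.^2 ./ (2 * sigma_w^2));      % unnormalised likelihood, (4)
    w       = w ./ sum(w);                        % normalise over the particles
end
```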
Finally, the image window corresponding to the most likely particle is stored as the real target window. When the desired number of new images has been accumulated, an incremental update (with a forgetting factor) of the eigenbasis, mean, and effective number of observations is performed. The key contribution of the IVT is that an efficient incremental method was proposed to learn the eigenbases online as new observations arrive. This method extends the Sequential Karhunen-Loeve (SKL) algorithm to a new incremental PCA algorithm that correctly updates the eigenbasis as well as the mean, given one or more additional training data. A detailed description of this method can be found in [].
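The exact incremental SKL update is detailed in the reference above; the sketch below is only a simplified stand-in that recomputes the subspace from a buffer of recently tracked patches and blends the mean with a forgetting factor. It is meant to illustrate the role of the update, not to reproduce the authors' implementation.

```matlab
% Simplified sketch of the appearance-model update (NOT the exact SKL update):
% the subspace is recomputed from a buffer of recently tracked patches.
% buffer : d x B matrix of the last B tracked target patches
% k      : number of eigenvectors to keep
% ff     : forgetting factor in (0, 1], down-weighting the old mean
% mu_old : previous subspace mean (d x 1)
function [U, mu] = update_subspace(buffer, k, ff, mu_old)
    mu_new  = mean(buffer, 2);                 % mean of the new observations
    mu      = ff * mu_old + (1 - ff) * mu_new; % blended (forgetting) mean
    centred = bsxfun(@minus, buffer, mu);
    [U, ~, ~] = svd(centred, 'econ');          % principal directions
    U       = U(:, 1:min(k, size(U, 2)));      % keep the top-k eigenbasis
end
```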
Figure: Different features show different discriminative abilities. (a) The target window within a red bounding box in the reference image. (b) Current target window after object motion. (c) Image error between the reference image and candidate images around the current target window with the edge cue. (d) Image error between the reference image and candidate images around the current target window with the gray cue.
4. Multicue Fusion
For a robust tracking system with multiple cues, each cue's significance should be consistent with its tracking reliability. This significance should also be self-adapting in dynamic environments, where the object undergoes significant pose and illumination variations, occlusions, and shape deformation, so as to ensure that the most reliable cue at the current time is always used to track the target. In this work, we aim to develop a multicue integration framework which is flexible enough to exploit any valid image feature, such as gray, texture, edge direction, and motion information; the framework does not restrict the target's feature type. However, it is not practical to apply all types of features simultaneously; for simplicity, only two types of features, gray and edge, are used in the rest of the paper.
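To make the two cues concrete, the following MATLAB sketch builds the gray and edge representations of a candidate window. The Sobel operator is our assumption, since the paper does not specify which edge detector is used, and the names are illustrative.

```matlab
% Sketch: build the gray and edge representations of a candidate window.
% win : h x w gray-level image patch (double, range [0, 1])
% Returns two column vectors, one per cue, for use in the subspace models.
function [gray_vec, edge_vec] = cue_features(win)
    sx = [-1 0 1; -2 0 2; -1 0 1];       % Sobel kernels (our choice; the paper
    sy = sx';                            %  does not fix a particular detector)
    gx = conv2(win, sx, 'same');
    gy = conv2(win, sy, 'same');
    edge_mag = sqrt(gx.^2 + gy.^2);      % edge-magnitude cue
    gray_vec = win(:);                   % gray cue, vectorised
    edge_vec = edge_mag(:);              % edge cue, vectorised
end
```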
Figure  shows that different cues may have different discriminative abilities. In Figure (a), the image within the red rectangle is set as the target reference. After some time, the target moves to the current position shown in Figure (b). To evaluate the discriminative ability of the various features, we generate target candidates uniformly around the current target position with the same scale as the reference image. Then, the summed pixel error between the candidate and the reference is calculated under two feature spaces: edge and gray. As shown in Figures (c) and (d), different cues may yield different discriminative abilities. From the figure we can also find that two distances exist: (1) the Euclidean distance between the position of the candidate and the position of the real target in the image plane, and (2) the distance between the reference model and the candidate model in feature space (the reconstruction error). When a single cue is used, a small feature distance means that the candidate approximates the real object more closely. However, this does not hold when multiple visual cues are adopted, since different cues may have different sensitivities to changes of the object appearance and the environment.
In this section, we introduce a method to evaluate the reliabilities of the cues based on the above analysis; this method can be effectively embedded in the particle filter tracking framework.
Initialization
Locate the target manually in the first frame, and use a single particle to indicate this location. Set the initial relative sharpness factors as $r_0^f = 1/N$ for the $N$ cues. Initialize the eigenbasis to be empty, and the mean to be the appearance of the target in the first frame.
for t = 1 to T
(1) Spread the target states at time $t-1$ to time $t$ using the state dynamic model.
(2) For each new state $X_t^i$ corresponding to particle $i$ at time $t$, find its corresponding weight $w_t^{(i,f)}$ in feature space $f$ based on its likelihood under the observation models.
(3) Based on each cue's relative sharpness factor $r_{t-1}^f$, $f = 1, \ldots, N$, combine multiple cues by calculating the new weight for each particle as $w_t^i = \sum_{f=1}^{N} w_t^{(i,f)} r_{t-1}^f$.
(4) Store the image window corresponding to the most likely particle. When the desired number of new images has been accumulated, perform an incremental update (with a forgetting factor) of the eigenbasis, mean, and effective number of observations.
(5) Update the relative sharpness factor $r_t^f$ for each cue at time $t$ based on the estimated target state and the particle distribution.
end for
Algorithm 1: Multicue based IVT algorithm.
In our approach, we treat each particle as a target candidate, and the image reconstruction error of this candidate serves as the particle's weight. Thereby, the particle distribution and its weights can be viewed as a 3D map (Figures (c) and (d)). For a point $(x, y, z)$ in this map, the $x$ and $y$ coordinates give the point's projected position on the image plane, and the $z$ coordinate describes the weight of the particle at this point. Particles with the same position on the image plane may have different weights under various feature spaces, because different cues may lead to different reconstruction errors. In other words, particles with the same distribution may show different maps under different cues. For cues with enough discriminability between the target and the background, obvious height differences exist between the point at the target position and points at other positions in the 3D map. These 3D maps are then analyzed to obtain the sharpness factor of the terrain. The sharpness factor is used to evaluate the significance of the cue.
We denote the distance created by the reconstruction error as $d_m^n$, $m = 1, \ldots, M$ and $n = 1, \ldots, N$. Here, $M$ is the number of particles and $n$ is the index of the cue to which the distance belongs. This distance can be obtained from (3) and (4), where $d_m^n \propto e_m^n$, the reconstruction error of particle $m$ under cue $n$. Then we calculate the Euclidean distance $l_m$, $m = 1, \ldots, M$, for every particle as
$$l_m = \sqrt{(x_m - x_0)^2 + (y_m - y_0)^2}, \quad (5)$$
where $(x_m, y_m)$ and $(x_0, y_0)$ are the coordinates of the particle and the target in the image plane. The sharpness factor for particle $m$ under feature space $n$ can be defined as
$$s_m^n = \frac{d_m^n}{l_m}. \quad (6)$$
So the mean sharpness factor over the entire particle set under feature space $n$ is
$$\bar{s}^n = \frac{1}{M} \sum_{m=1}^{M} s_m^n. \quad (7)$$
Here, $\bar{s}^n$ gives the tracking ability of the $n$th feature space, because a larger value of $\bar{s}^n$ indicates that the current reconstruction error map is steeper, so the target can be distinguished from the other candidates more clearly. Otherwise, the current reconstruction error map is flatter, and thus the target and the other candidates may be confused. To compare the discriminative ability among the various feature spaces, the relative sharpness factor (RSF) among the different features is defined as
$$r^n = \frac{\bar{s}^n}{\sum_{n=1}^{N} \bar{s}^n}. \quad (8)$$
This RSF gives the significance of the $n$th cue.
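A compact MATLAB sketch of the RSF update in (5)-(8) is given below, under the assumption that the feature-space distances $d_m^n$ have already been computed from the reconstruction errors; all names are illustrative.

```matlab
% Sketch: relative sharpness factors (RSF) from (5)-(8).
% d      : M x N feature-space distances, one column per cue
% pos    : M x 2 particle positions (x, y) in the image plane
% target : 1 x 2 estimated target position (x0, y0)
function rsf = relative_sharpness(d, pos, target)
    l    = sqrt(sum(bsxfun(@minus, pos, target).^2, 2));  % Euclidean distance, (5)
    l    = max(l, eps);                                    % guard against l = 0
    s    = bsxfun(@rdivide, d, l);                         % per-particle sharpness, (6)
    sbar = mean(s, 1);                                     % mean sharpness per cue, (7)
    rsf  = sbar / sum(sbar);                               % relative sharpness, (8)
end
```

The guard against a zero Euclidean distance is our own addition to keep the sketch numerically safe for particles that land exactly on the estimated target position.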
The general algorithm is given in Algorithm 1.
5. Experimental Results and Analysis
We implemented the proposed approach in MATLAB based on the code of the classical IVT from http://www.cs.toronto.edu/dross/ivt/. The proposed method is tested on several video sequences, which include difficult tracking conditions such as complex backgrounds, occlusions, and nonrigid objects' appearance changes. In order to test the effectiveness of the proposed adaptive appearance model, we compare the tracking results of our presented method with other approaches. For the multicue method, the multicue appearance models with intensity and edge cues are used; for the single-cue tracker, the single feature model with the intensity cue is applied. The number of particles used for our method is the same as for the other trackers:  particles are adopted for all experiments except for the two long sequences, where it is . In all cases, the initial position of the target is selected manually.
e rst test sequence is an infrared (IR) image sequence;
it shows a tank moving on the ground from le to right.
Some samples of the tracking results are shown in Figure .
Figure: Tracking results for seq. . The first row: results for IVT. The second row: results for our method.
Figure: Position error (in pixels) versus frame number for seq. , for our method and IVT.
Here, the rst row gives the results of classical IVT and the
second row shows the results of our proposed method. e
frame indices are , , , and  from le to right. e
target-to-background contrast is very low and the noise level
is high for these IR frames. In Figure ,thetrackingerrors
for both the two methods are given, we can see that our
tracker is capable of tracking the object all the time with
smallerror.eRSFforthetwocuesaredemonstratedin
Figure , it shows that the edge weight is higher than the
intensity weight in general, and this also gives low target-to-
background contrast.
The second test sequence shows a moving person and presents challenging appearance changes caused by the shadows of the trees. Figure  shows the tracking results using both methods, where the first row gives the results of the classical IVT and the second row shows the results of our proposed method. The person is small in the image and undergoes sharp appearance changes when he walks into the shadow. From Figure , we see that a large position error arises for the classical IVT; in comparison, our method keeps a low error even after the person walks out of the shadow.
Figure: RSF of each individual cue (intensity and edge) versus frame number throughout seq. .
In Figure , the RSF of the edge and intensity cues are given.
The third test sequence is from http://www.cs.toronto.edu/dross/ivt/; it shows a moving animal toy undergoing drastic view changes, as the toy frequently changes its view as well as its scale. The tracking results are illustrated in Figure , where rows  and  correspond to IVT and rows  and  correspond to our tracker; eight representative frames are shown. We can see that our proposed tracker performs well throughout the sequence. In contrast to our method, IVT fails when the target changes its pose drastically.
The fourth test sequence is an infrared (IR) image sequence from the VIVID benchmark dataset []. In this sequence, cars run through large shadows cast by the trees on the roadside; the target-to-background contrast is low and the noise level is high. Some samples of the final tracking results are demonstrated in Figure .
Figure: Tracking results for seq. . The first row: results for IVT. The second row: results for our method.
Figure: Position error (in pixels) versus frame number for seq. , for our method and IVT.
Four representative frames of the video sequence are shown, with indices , , , and ; they correspond to frames , , , and  in the dataset. From Figure , we see that our tracker is capable of tracking the object all the time, even when the car runs out of the shadows. In comparison, the IVT tracker fails when the car runs out of the shadows and is unable to recover.
The fifth test sequence is obtained from the PETS  benchmark dataset (http://www.cvg.reading.ac.uk/datasets/index.html). It shows a walking passenger in a subway station who undergoes large appearance changes. Some samples of the tracking results are shown in Figure . The frame indices are , , , and  from left to right; their indices in the dataset are , , , and , respectively. From the results, we can see that the tracking process of IVT cannot distinguish the actual person of interest from the background because of the large appearance changes. On the other hand, our framework provides good tracking results; it can overcome the effect of the appearance changes and track the target successfully.
Figure: RSF of each individual cue (intensity and edge) versus frame number throughout seq. .
Figure  gives some representative tracking results for
three sequences which have been tested in []. e rst
row shows results for sequence “trellis. e indices of
them in the dataset are , , , and  from le to
right. e second row provides representative results for the
sequence “car, and the frame indices are , , , and .
Tracking results for the sequence davidin” are depicted
in the last row, frame , , , and  of the dataset are
given.Ascanbeseeninthegure,ourmethodperforms
well under challenging conditions such as variations of
views, scale, and illumination changes. To straightforwardly
make comparisons among other tracker, we quantitatively
evaluate our tracking algorithm on the sequence duedk” and
sequence “car” which can be found in []. In Table ,center
location RMS errors of three tracker: the proposed method,
IVT tracker, and a multicue tracker which described in []
(here, we call it CSR) are provided.
Finally, we investigate the runtime of the IVT algorithm and the proposed method. As can be seen from Table , the IVT tracker can track the target at real-time processing speed.
Figure: Tracking results for seq. . The first two rows: results for IVT. The last two rows: results for our method.
Figure: Tracking results for seq. . The first row: results for IVT. The second row: results for our method.
Figure: Tracking results for seq. . The first row: results for IVT. The second row: results for our method.
Figure: Tracking results for some sequences which had been tested in []. The first row: results for the sequence "trellis." The second row: results for the sequence "car." The last row: results for the sequence "davidin."
T : Center location RMS errors (in pixel) and running speed (in frame per second) for three trackers.
Video sequence IVT CSR Our method
RMS error Running Speed RMS error Running Speed RMS error Running Speed
dudek . 21.32 . . 15.32 .
Car . 28.04 . . 2.51 .
 Advances in Multimedia
In contrast, our method and CSR are slower than the IVT. With the same number of particles, the IVT algorithm using the single intensity cue runs at an average speed of . fps; in comparison, our method using both intensity and edge cues runs at an average speed of . fps. This means a loss in runtime performance as the number of cues increases.
As illustrated by the above experimental results, the presented approach outperforms the IVT and the CSR algorithms in terms of tracking accuracy. This mainly stems from our adaptive cue integration scheme: at each frame, the target is determined by using the particles under all the cues, additionally weighted by their discriminative reliabilities, rather than by just using particles under a single cue, which may itself provide poor or inaccurate measurements. The advantage of our formulation is its adaptive nature, which lets us easily combine different target views, but generally with a loss of computational efficiency. It would be interesting to focus on developing more efficient solutions to this problem in future work.
6. Conclusion
In this work, we presented a novel tracker to enhance the IVT algorithm by employing a multicue based adaptive appearance model. First, we carry out the integration of cues both in feature space and in image geometric space. Second, considering both the target and background changes, the integration of cues directly depends on the dynamically-changing reliabilities of the visual cues. These two aspects of our method allow the tracker to easily adapt itself to changes in the context and accordingly improve the tracking accuracy by resolving ambiguities. In this way, our adaptive appearance model ensures that when one cue becomes insufficiently discriminative due to target or background changes, the other will compensate. Finally, the proposed multicue framework effectively utilizes the merits of the particle filter, so as to achieve robust tracking at an acceptable computational cost. Experimental results demonstrate that subspace tracking is strongly improved by exploiting multiple cues through the proposed algorithm.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work is jointly supported by the National Natural Science Foundation of China (nos. , , and ) and the Natural Science Foundation of Anhui Province (MF).
References
[] R. Wang and J. Popovic, "Real-time hand-tracking with a color glove," ACM Transactions on Graphics, vol. , no. , article .
[] V. Thomas and A. K. Ray, "Fuzzy particle filter for video surveillance," IEEE Transactions on Fuzzy Systems, vol. , no. , pp. –.
[] Z. Li, S. Qin, and L. Itti, "Visual attention guided bit allocation in video compression," Image and Vision Computing, vol. , no. , pp. –.
[] M. Isard and A. Blake, "CONDENSATION: conditional density propagation for visual tracking," International Journal of Computer Vision, vol. , no. , pp. –.
[] D. Comaniciu and P. Meer, "Mean shift: a robust approach toward feature space analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. , no. , pp. –.
[] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. , no. , pp. –.
[] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, "Incremental learning for robust visual tracking," International Journal of Computer Vision, vol. , no. , pp. –.
[] H. Wang and D. Suter, "Efficient visual tracking by probabilistic fusion of multiple cues," in Proceedings of the 18th International Conference on Pattern Recognition, pp. –, August .
[] P. Li and C. Francois, "Image cues fusion for object tracking based on particle filter," in Articulated Motion and Deformable Objects, vol.  of Lecture Notes in Computer Science, pp. –.
[] S. Birchfield, "Elliptical head tracking using intensity gradients and color histograms," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. –, June .
[] W. Du and J. Piater, "A probabilistic approach to integrating multiple cues in visual tracking," in Proceedings of the 10th European Conference on Computer Vision, vol. , pp. –.
[] G. Jia, Y. Tian, Y. Wang, T. Huang, and M. Wang, "Dynamic multi-cue tracking with detection responses association," in Proceedings of the 18th ACM International Conference on Multimedia (MM '10), pp. –, October .
[] M. Yang, F. Lv, W. Xu, and Y. Gong, "Detection driven adaptive multi-cue integration for multiple human tracking," in Proceedings of the International Conference on Computer Vision, pp. –.
[] P. Brasnett, L. Mihaylova, D. Bull, and N. Canagarajah, "Sequential Monte Carlo tracking by fusing multiple cues in video sequences," Image and Vision Computing, vol. , no. , pp. –.
[] E. Erdem, S. Dubuisson, and I. Bloch, "Visual tracking by fusing multiple cues with context-sensitive reliabilities," Pattern Recognition, vol. , no. , pp. –.
[] M. Yin, J. Zhang, H. Sun, and W. Gu, "Multi-cue-based CamShift guided particle filter tracking," Expert Systems with Applications, vol. , no. , pp. –.
[] J. Triesch and C. von der Malsburg, "Democratic integration: self-organized integration of adaptive cues," Neural Computation, vol. , no. , pp. –.
[] M. Spengler and B. Schiele, "Towards robust multi-cue integration for visual tracking," Machine Vision and Applications, vol. , no. , pp. –.
[] C. Shen, A. van den Hengel, and A. Dick, "Probabilistic multiple cue integration for particle filter based tracking," in Proceedings of the 7th Digital Image Computing: Techniques and Applications, pp. –.
[] P. Pérez, J. Vermaak, and A. Blake, "Data fusion for visual tracking with particles," Proceedings of the IEEE, vol. , no. , pp. –.
[] Y. Wu and T. S. Huang, "Robust visual tracking by integrating multiple cues based on co-inference learning," International Journal of Computer Vision, vol. , no. , pp. –.
[] VIVID database, http://vision.cse.psu.edu/data/vividEval/datasets/datasets.html.
Article
This paper presents visual cues for object tracking in video sequences using particle filtering. A consistent histogram-based framework is developed for the analysis of colour, edge and texture cues. The visual models for the cues are learnt from the first frame and the tracking can be carried out using one or more of the cues. A method for online estimation of the noise parameters of the visual models is presented along with a method for adaptively weighting the cues when multiple models are used. A particle filter (PF) is designed for object tracking based on multiple cues with adaptive parameters. Its performance is investigated and evaluated with synthetic and natural sequences and compared with the mean-shift tracker. We show that tracking with multiple weighted cues provides more reliable performance than single cue tracking.