
Visual Tracking via Exemplar Regression Model

Xiao Ma^a, Qiao Liu^b, Xiaohuan Lu^a, Zhenyu He^a,*
^a School of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, China
^b School of Mathematic and Computer Science, Guizhou Normal University, China
Abstract
Visual tracking remains a challenging problem in computer vision due to the intricate variation of target appearances. Progress made in recent years has revealed that correlation filters, which formulate the tracking process by creating a regressor in the frequency domain, achieve remarkable experimental results on a large number of video tracking sequences. In contrast, building the regressor directly in the spatial domain has been considered a limited approach, since the number of training samples is restricted, and without sufficient training samples the regressor has less discriminability. In this paper, we demonstrate that, by giving very simple positive-negative prior knowledge to the training samples, the performance of the ridge regression model can be improved by a large margin, even surpassing its frequency domain competitors, the correlation filters, on most challenging sequences. In particular, we build a regressor (or a score function) by learning a linear combination of some selected training samples. The selected samples consist of a large number of negative samples and a few positive ones. We constrain the combination such that only the coefficients of positive samples are positive, while all coefficients of negative samples are negative. The coefficients are learnt in a regression setting that makes the outputs fit the overlap ratios of the bounding boxes, where the overlap ratios are measured between the inputs and the estimated position in the last frame. We call this regression exemplar regression because of the novel positive-negative arrangement of the linear combination. In addition, we provide an off-the-shelf method, a non-negative least square approach, to solve this regression model more efficiently. We evaluate our approach on both the standard CVPR2013 benchmark and 50 selected challenging sequences, which include dozens of state-of-the-art trackers and more than 70 datasets in total. In both experiments, our algorithm achieves promising performance and outperforms the state-of-the-art approaches.
Keywords: visual tracking, kernelized ridge regression, exemplar regression model.
1. Introduction
Given an initial bounding box of a certain target in the first frame, a visual tracker estimates this
target’s state, e.g. location and scale, in each frame of the image sequences. Visual tracking is a key
component in numerous applications, such as vision-based control, visual surveillance, human-
computer interfaces, intelligent transportation, and augmented reality. Although some significant
progress has been made in several decades of visual tracking research, most trackers are still prone
to failure in challenging scenarios such as partial occlusion, deformation, motion blur, fast motion,
illumination variation, background clutter and scale variations.
Among the achievements in visual tracking research, the discriminative approaches [1,2,3,4,5,6,7,8,9] provide an online mechanism that adapts to appearance variations of the target, and gain
better results than their generative rivals [10,11,12,13] on some hard sequences [14]. Traditional
discriminative approaches [5,6,7] maintain a classifier trained online to distinguish the target
object from its surrounding background. This process can be divided into two stages: searching
and updating. During the searching stage, the classifier is utilized to estimate the target’s location
in a certain search region around the estimated position from the previous frame, typically using
a sliding-window [5,7] or particle filtering approach [10]. In the updating stage, the traditional
discriminative approaches generate a set of binary labelled training samples which are used to
update the classifier online.
Although the traditional discriminative approaches gain convincing results on some hard sequences, it is difficult to assign binary labels to the training samples in the updating stage, because it is difficult to determine a pre-defined threshold and rule (e.g. the Euclidean distance of a sample from the target location estimated in the last frame) to decide whether a sample should be positive or negative.
* Corresponding author
Email address: zyhe@hitsz.edu.cn (Zhenyu He)
Preprint submitted to Knowledge-based System December 30, 2016
To avoid the ambiguity of label assignment, some discriminative approaches [9,8,1,2,4,3] use a regression model instead of the traditional binary classification model during the updating stage. The regression models output a real-valued score for each training sample to fit some pre-defined distribution, such as the bounding box overlap ratios [9,8] or a Euclidean distance [1,2,4,3]. However, the small sample size training problem makes it hard for traditional regression models, such as ridge regression, to create a robust regressor.
Recent progress [1,2,4,3] has revealed that solving the ridge regression in the frequency domain makes dense sampling possible, which avoids the small sample size training problem. These approaches are called correlation filters. In this paper, different from correlation filters, we propose a simple approach to handle the small sample size training problem directly in the spatial domain. Our approach gives positive-negative prior knowledge to the training samples. We demonstrate that, by adding such prior knowledge, the performance of the conventional spatial domain ridge regression can be improved by a large margin, even surpassing its frequency domain competitors, the correlation filters, on most challenging sequences.
Our ridge regression model computes its score as a linear combination of weighted similarities between the test sample and some selected samples, called support samples in this paper. Among the weights of those similarities, we constrain only the weights of the support samples at the estimated positions in the past frames to be positive; the rest of the weights are all negative. This constraint gives us a large number of negative weights in the linear combination, but only a few positive ones. We call the modified ridge regression exemplar regression. We show that the support samples and the weights can be solved by a non-negative least square method.
The main contributions of this paper are summarized as follows:
1) We provide a simple positive-negative constraint method for a common kernelized ridge regression model to construct a robust visual tracker, which we call exemplar regression tracking (ERT). The experiments show that the proposed ERT approach achieves state-of-the-art results on the standard CVPR2013 benchmark [14] (see footnote 1) and other challenging sequences.
2) We provide an easy-to-implement approach to solve the ERT based on the off-the-shelf non-negative least square method.
Footnote 1: This benchmark was first published at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2013: https://sites.google.com/site/trackerbenchmark/benchmarks/v10
The rest of the paper is organized as follows: Section 3 describes the proposed ERT approach. The implementation details of the proposed approach are introduced in Section 4. In Section 5 we discuss the method. In Section 6 we perform an extensive experimental comparison with the state-of-the-art visual trackers, and we draw a conclusion in Section 7.
2. Related Works
We refer the reader to [15,16,17] for detailed surveys of visual tracking. In this section, we briefly review the most closely related work on online single object tracking, especially the regression based approaches. Visual tracking approaches can generally be categorized as either generative [10,11,18,13,19,20,21,22,23,24] or discriminative [5,6,7,8,9,1,2,3,4,25,26], based on their appearance models.
Generative approaches build an appearance model and then use this model to find, within a certain region of the image frame, the candidate sample with the minimum reconstruction error. Black et al. [20] learn an off-line subspace model to represent the object of interest for tracking. In [22], every image sample is fragmented into several patches, each of which is represented by an intensity histogram and compared to the corresponding patch in the target region by the Earth Mover's Distance. Ross et al. [10] introduce incremental PCA (Principal Component Analysis) to capture the full range of appearances of the target in the past frames. Mei et al. [11,27] employ sparse representation to build a dictionary that contains the appearances from past frames, and then select the optimal appearance from this dictionary to estimate the target's position. Li et al. [18] further extend the $\ell_1$ tracker [11] by using the orthogonal matching pursuit algorithm to solve the optimization problems efficiently. In [19], an accelerated version of the $\ell_1$ tracker [11] is proposed based on the accelerated proximal gradient (APG) approach. Jia et al. [13] and Wang et al. [12] introduce patch-based sparse representation to enhance the tracker's robustness. In [21], Kwon et al. combine multiple observation and motion models to handle large appearance and motion variation.
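Table 1: Comparison of the proposed exemplar regression with the popular regression based tracking approach Struck [9] and correlation filters (CFs) [1,4,3].

| Regression models | Searching strategies | Regressors | Updating algorithms |
|---|---|---|---|
| Struck [9] | Sliding window | $f(\mathbf{x}) = \sum_i \omega_i k(\mathbf{x}_i, \mathbf{x})$ | LaRank [28] |
| Exemplar regression | Particle filtering | $f(\mathbf{x}) = \frac{1}{\lVert \mathbf{x} \rVert} \sum_i \omega_i \langle \mathbf{x}_i, \mathbf{x} \rangle$ | Non-negative least square |
| Correlation filters [4,3] | Convolution | $f(X) = \mathcal{F}^{-1}\left( \frac{\mathcal{F}(X) \odot (\mathcal{F}(X_o))^{*}}{\mathcal{F}(X_o) \odot (\mathcal{F}(X_o))^{*} + \lambda} \right)$ | Linear ridge regression |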
The traditional discriminative approaches pose the tracking problem as a binary classification
task with local search and determine the decision boundary for separating the target object from
the background. Collins et al. [26] provide an approach which learns the discriminative features
online to separate the target object from the background. In [25], an online boosting algorithm
is proposed to select features for tracking. Babenko et al. [7] introduce the multiple instance
learning approach to decide the labels of the training samples, and integrate an online boosting
approach to select the Haar-like features for tracking. A semi-supervised learning approach [6] is
proposed in which positive and negative samples are selected via an online classifier with structural
constraints. Zhang et al. [5] represent the target's appearance with randomly projected, compressed features drawn from a very high-dimensional feature space, and a binary classifier is used to find the optimal image sample in this compressed domain.
One of the typical regression based trackers is Struck [9], which builds a regression model upon the Structural Support Vector Machine (SSVM) framework and embeds the LaRank [28] algorithm into the updating stage. In [9], a regressor is formulated as a weighted linear combination of a set of support vectors. These support vectors may come from different frames during the tracking process. Therefore, Struck adapts well to appearance variations. However, Struck uses Haar-like features and a sliding-window searching strategy, which make it difficult to handle scale variations of the target. In [8], the authors integrate Lie group theory to improve Struck's scale awareness.
Another popular regression model is Correlation Filters (CFs) [1,2,4,3]. Within a few years, CFs have proved competitive with far more complicated approaches. The use of the Fast Fourier Transform (FFT) makes them run at very high speed online (hundreds of frames per second). CFs take advantage of the convolution theorem: the convolution of two image samples in the spatial domain is equivalent to an element-wise product in the Fourier domain. By formulating the CF objective function in the Fourier domain, dense sampling can be achieved during the updating and searching stages without much extra computation time.
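The convolution theorem that CFs rely on can be checked numerically. The following is a minimal Python sketch for illustration only, not part of any tracker discussed here: it uses a naive 1-D discrete Fourier transform instead of a 2-D FFT over image patches, and the two toy sequences are arbitrary.

```python
import cmath

def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform of a 1-D sequence."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t, value in enumerate(signal):
            acc += value * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out

def circular_convolution(a, b):
    """Direct spatial-domain circular convolution of two equal-length sequences."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]

def convolution_via_dft(a, b):
    """The same convolution computed as an element-wise product in the Fourier domain."""
    product = [fa * fb for fa, fb in zip(dft(a), dft(b))]
    return [value.real for value in dft(product, inverse=True)]

# Toy data (arbitrary values chosen for illustration).
signal = [1.0, 2.0, 0.0, -1.0]
kernel = [0.5, 0.0, 0.0, 0.25]
direct = circular_convolution(signal, kernel)
spectral = convolution_via_dft(signal, kernel)
```

Both routes give the same sequence up to floating-point error. The spectral route is what lets CFs evaluate every cyclic shift of a patch at once instead of scoring each shift separately.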
A comparison of the paradigms of the exemplar regression model, Struck and CFs is shown in Table 1. The original Struck [9] applies a sliding window to find the optimal sample. The regressor in Struck is a linear combination of weighted kernel functions between support samples and the test sample, which is very similar to ours. Struck uses LaRank [28] to construct its regressor. CFs use the convolution between the constructed regressor (or filter) and the image patch of the search region to find the optimal position. However, a single correlation filter cannot handle the scale variation of the target. To make CFs scale-aware, DSST [3] creates another filter using the target's features at different scales to estimate the target's scale. CFs utilize ridge regression in the Fourier domain to construct the filter.
The formulation of our regressor is similar to Struck [9], since both consider the support samples from past frames and assign positive weights only to samples at the estimated positions. However, Struck relies on the complicated LaRank [28] algorithm to choose which samples should be added to the regressor's linear combination and how large their weights should be. Instead of LaRank, we suggest a more effective and easy-to-implement approach for this selection and weight assignment: non-negative least square.
We borrow the concept of exemplar from Exemplar-SVM [29]. For the detection problem,
Exemplar-SVM trains a separate linear SVM classifier for every exemplar in the training set. Each
Exemplar-SVM is defined by a single positive instance and millions of negatives.
3. Exemplar Regression Tracking
As a discriminative approach, our exemplar regression tracker has a searching stage and an updating stage during the tracking process. In the searching stage (Section 3.1), we analyze the probability of the observation model in the Bayesian tracking framework. The observation model is provided by the updating stage and takes the form of a linear combination of weighted similarities between chosen support samples and the test sample. The formulation of this exemplar regression model, and how to choose the support samples and the weights of the similarities, are introduced in Section 3.2 and Section 3.3.
3.1. The searching stage
Visual tracking can be cast as a sequential Bayesian inference problem [10]. Given a set of observed image patches $I_t$ up to the $t$-th frame, we aim to estimate the value of the state variable $Z_t \in \mathbb{R}^4$, which describes the bounding box of the observed image patch. The true posterior state distribution $\Pr(Z_t \mid I_t)$ is approximated by a set of $N_s$ samples, called tracking candidates, $\mathcal{Z}_t = \{Z_t^1, Z_t^2, \cdots, Z_t^{N_s}\}$, and the optimal state is estimated by the MAP (Maximum A Posteriori) formulation:

$$\hat{Z}_t = \arg\max_{Z_t^i} \Pr(Z_t^i \mid I_t), \tag{1}$$

where $Z_t^i$ denotes the state variable of the $i$-th candidate sample at the $t$-th frame. Using Bayes' theorem, the posterior probability $\Pr(Z_t^i \mid I_t)$ is inferred by:

$$\Pr(Z_t^i \mid I_t) \propto \Pr(I_t^i \mid Z_t^i) \int \Pr(Z_t^i \mid Z_{t-1}^i) \Pr(Z_{t-1}^i \mid I_{t-1}) \, dZ_{t-1}^i, \tag{2}$$

where $\Pr(Z_t^i \mid Z_{t-1}^i)$ is the dynamic model and $I_t^i$ is the $i$-th observed image patch of the $t$-th frame. We use the same dynamic model as described in [30,10]. The observation model $\Pr(I_t^i \mid Z_t^i)$ is determined by our proposed exemplar regression model.
Given an observed image patch $I_t^i$, we use $\mathbf{x}_t^i \in \mathbb{R}^d$ to represent the features extracted from $I_t^i$, where $d$ is the dimension of this feature vector. In the rest of this paper, $\mathbf{x}_t^i$ is the sample in state $Z_t^i$. We introduce a function of $\mathbf{x}_t^i$ to approximate $\Pr(I_t^i \mid Z_t^i)$:

$$\Pr(I_t^i \mid Z_t^i) \propto f(\mathbf{x}_t^i). \tag{3}$$

The function $f(\cdot)$ helps us find, among the observed image patches $I_t$, the one that is most likely to be the target, and gives this optimal patch the highest output. Therefore, the searching stage only needs to do two things: 1) apply the particle filtering strategy to generate $N_s$ state variables $\mathcal{Z}_t$ and their corresponding samples; 2) use $f(\cdot)$ to find the optimal state variable.
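The two steps of the searching stage can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the dynamic model of [30,10]: the candidates are drawn by an independent Gaussian random walk on a hypothetical state layout (x, y, width, height), the standard deviations are made-up values, and `toy_score` merely stands in for the regressor $f(\cdot)$.

```python
import random

def draw_candidates(prev_state, num_samples, sigma, rng):
    """Draw candidate states around the previous estimate.

    Assumption: an independent Gaussian random walk on each of the four
    state entries (x, y, width, height)."""
    return [
        tuple(value + rng.gauss(0.0, s) for value, s in zip(prev_state, sigma))
        for _ in range(num_samples)
    ]

def search(prev_state, score_fn, num_samples=600,
           sigma=(4.0, 4.0, 0.5, 0.5), seed=0):
    """Return the candidate state with the highest score (the argmax of Eq. 1)."""
    rng = random.Random(seed)
    candidates = draw_candidates(prev_state, num_samples, sigma, rng)
    return max(candidates, key=score_fn)

# Toy score standing in for f(x): it peaks at a known "true" box.
true_state = (52.0, 40.0, 20.0, 30.0)

def toy_score(state):
    return -sum((a - b) ** 2 for a, b in zip(state, true_state))

previous = (50.0, 41.0, 20.0, 30.0)
best = search(previous, toy_score)
```

In a real tracker, `score_fn` would extract features from the image patch described by each candidate state and evaluate the learned regressor on them.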
3.2. The exemplar regression model
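For a common kernelized ridge regression model, we want to find a linear function (the regressor) in a certain projected Hilbert space $S$:

$$f(\mathbf{x}) = \langle \mathbf{W}, \Phi(\mathbf{x}) \rangle_k, \tag{4}$$

where $\mathbf{W}$ is the unknown coefficient and $\Phi(\cdot): \mathbb{R}^d \to S$ maps the $d$-dimensional feature vector to the Hilbert space $S$. $\langle \cdot, \cdot \rangle_k$ denotes the inner product in $S$. We use the following objective function to solve for $\mathbf{W}$ in the ridge regression setting:

$$\sum_i \left\| \langle \mathbf{W}, \Phi(\mathbf{x}_i) \rangle_k - y_i \right\|_2^2 + \frac{\lambda}{2} \langle \mathbf{W}, \mathbf{W} \rangle_k, \tag{5}$$

where $\{y_i\}$, $y_i \in \mathbb{R}$, and $\{\mathbf{x}_i\}$ form the training set, and $\lambda$ is the regularization parameter.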
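From the representer theorem [31], the regressor $f$ can be constructed as a linear combination of weighted kernel functions between the support samples and the test sample [32]:

$$f(\mathbf{x}) = \sum_{i=1}^{N} \omega_i \, \mathrm{Ker}(\mathbf{x}_i, \mathbf{x}), \tag{6}$$

where $\mathbf{x}_i$ denotes the support samples. We call $X = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N\}$ the support sample set. $\omega_i$ represents the weight of the $i$-th kernel function value. In this paper, we choose the normalized inner product in $\mathbb{R}^d$ as $\mathrm{Ker}(\cdot,\cdot)$:

$$f(\mathbf{x}) = \sum_{i=1}^{N} \omega_i \frac{\langle \mathbf{x}_i, \mathbf{x} \rangle}{\|\mathbf{x}_i\|_2 \, \|\mathbf{x}\|_2}. \tag{7}$$

Suppose all the support samples are normalized, i.e. $\|\mathbf{x}_i\|_2 = 1$; then we have:

$$f(\mathbf{x}) = \frac{1}{\|\mathbf{x}\|_2} \sum_{i=1}^{N} \omega_i \langle \mathbf{x}_i, \mathbf{x} \rangle. \tag{8}$$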
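As a concrete reading of Eq. 8, the following minimal Python sketch evaluates the score of a test sample against unit-norm support samples. The two-dimensional vectors and the weights are toy values chosen for illustration; real samples would be the feature vectors described in Section 4.

```python
import math

def normalize(vec):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def score(support_samples, weights, x):
    """Eq. 8: f(x) = (1 / ||x||_2) * sum_i w_i <x_i, x>, with unit-norm x_i."""
    x_norm = math.sqrt(sum(v * v for v in x))
    total = 0.0
    for x_i, w_i in zip(support_samples, weights):
        total += w_i * sum(a * b for a, b in zip(x_i, x))
    return total / x_norm

# One positive support sample (weight > 0) and one negative (weight < 0).
support = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
weights = [1.0, -0.5]

target_like = score(support, weights, [3.0, 0.0])      # aligned with the positive sample
background_like = score(support, weights, [0.0, 2.0])  # aligned with the negative sample
```

Because of the division by $\|\mathbf{x}\|_2$, the score depends only on the direction of the test sample, not on its magnitude: here `target_like` is 1.0 and `background_like` is -0.5.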
During the $t$-th frame's updating stage, we draw $N_u$ samples around the estimated state variable $\hat{Z}_t$: $X_t = [\mathbf{x}_t^1, \mathbf{x}_t^2, \cdots, \mathbf{x}_t^{N_u}] \in \mathbb{R}^{d \times N_u}$, and we call $X_t$ the training sample set. We define a score value for each sample in $X_t$ using the following function:

$$s(\mathbf{x}_t^i) = \mathrm{overlap}(\hat{Z}_t, Z_t^i), \tag{9}$$

where the $\mathrm{overlap}(\cdot,\cdot)$ function calculates the overlap between the two bounding boxes of the state variables, and $Z_t^i$ is the state variable of $\mathbf{x}_t^i$. $s(\cdot)$ gives a higher output to a sample that has a larger overlap with the estimated state $\hat{Z}_t$, and a lower output to a sample with a smaller overlap. Using $s(\cdot)$, we can generate the outputs for $X_t$ in a regression model: $Y_t = [y_t^1, y_t^2, \cdots, y_t^{N_u}]^T = [s(\mathbf{x}_t^1), s(\mathbf{x}_t^2), \cdots, s(\mathbf{x}_t^{N_u})]^T \in \mathbb{R}^{N_u}$.
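The labels $Y_t$ of Eq. 9 can be illustrated with a short Python sketch. Two assumptions are made here that the text does not state: the overlap is taken to be the intersection-over-union ratio, and a state is laid out as (x, y, width, height) with (x, y) the top-left corner.

```python
def overlap(box_a, box_b):
    """Overlap ratio of two boxes given as (x, y, width, height).

    Assumption: intersection area divided by union area."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Estimated state and three training states around it (toy values).
estimated = (10.0, 10.0, 20.0, 20.0)
training_states = [
    estimated,                   # the estimated box itself
    (15.0, 10.0, 20.0, 20.0),    # shifted by 5 pixels
    (40.0, 40.0, 20.0, 20.0),    # disjoint box
]
labels = [overlap(estimated, z) for z in training_states]
```

With these toy boxes the labels are 1.0, 0.6 and 0.0, so the regression targets fall off smoothly with distance from the estimate rather than being hard binary labels.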
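We define a loss function for the regressor $f(\cdot)$:

$$E(f(X_t), Y_t) = \sum_{i=1}^{N_u} \left( y_t^i - f(\mathbf{x}_t^i) \right)^2. \tag{10}$$

Substituting Eq. 8 into Eq. 10 and adding the exemplar constraint, we obtain the objective function of the exemplar regression model:

$$\begin{aligned} E(f(X_t), Y_t) &= \sum_{i=1}^{N_u} \left( y_t^i - \frac{1}{\|\mathbf{x}_t^i\|_2} \sum_{j=1}^{N} \omega_j \langle \mathbf{x}_j, \mathbf{x}_t^i \rangle \right)^2 \\ \text{s.t.} \quad & \delta(\mathbf{x}_j)\, \omega_j > 0, \end{aligned} \tag{11}$$

where

$$\delta(\mathbf{x}) = \begin{cases} 1 & \mathbf{x} \text{ is a positive support sample} \\ -1 & \text{otherwise} \end{cases} \tag{12}$$

is the indicator that implements the exemplar constraint. Note that, for each frame, only the sample at the estimated state variable is positive, and the rest of the samples are negative.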
3.3. Solving the exemplar regression using non-negative least square
In the updating stage, we want to add or remove support samples from the candidate sample set $\mathcal{X}_t$, which consists of the old support sample set $X$ and the normalized training sample set $\bar{X}_t$: $\mathcal{X}_t = \{X, \bar{X}_t\} = \{\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_N, \bar{\mathbf{x}}_t^1, \bar{\mathbf{x}}_t^2, \cdots, \bar{\mathbf{x}}_t^{N_u}\}$, where the first $N$ samples are the old normalized support samples of the un-updated regressor and the last $N_u$ samples are the normalized samples from $X_t$. This arrangement means that the updating process reselects the old support samples and adds new support samples from the training samples, and the corresponding weights are re-allocated as well. This formulation helps the regression model adapt smoothly to the target's appearance variations. For every candidate sample $\mathbf{x}_i \in \mathcal{X}_t$ ($i = 1, 2, \cdots, N+N_u$), we define a weight $\omega_i$ and a label that indicates whether it is positive or not. Note that the labels of the first $N$ samples are determined by the un-updated regressor, and among the last $N_u$ samples, only $\bar{\mathbf{x}}_t^1$ is positive, while the rest are all negative.
For now, the new regressor is represented by:

$$f(\mathbf{x}) = \frac{1}{\|\mathbf{x}\|_2} \sum_{i=1}^{N+N_u} \delta(\mathbf{x}_i)\, |\omega_i|\, \langle \mathbf{x}_i, \mathbf{x} \rangle. \tag{13}$$

Note that we want to use the training sample set $X_t$ to learn an exemplar regression model. This means we want to find a set of support samples in the candidate sample set $\mathcal{X}_t$ and their corresponding weights that minimize Eq. 11.
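For every sample $\mathbf{x}_t^i$ in the training sample set $X_t$, we can obtain a loss between the regressor output and the required output $y_t^i$:

$$E(f(\mathbf{x}_t^i), y_t^i) = \left( y_t^i - \frac{1}{\|\mathbf{x}_t^i\|_2} \sum_{j=1}^{N+N_u} \delta(\mathbf{x}_j)\, |\omega_j|\, \langle \mathbf{x}_j, \mathbf{x}_t^i \rangle \right)^2, \tag{14}$$

where $\mathbf{x}_t^i$ can be represented by the normalized $\bar{\mathbf{x}}_t^i$, which is $\mathbf{x}_{N+i}$ in the candidate sample set $\mathcal{X}_t$. Then Eq. 14 can be simplified to:

$$E(f(\mathbf{x}_t^i), y_t^i) = \left( y_t^i - \sum_{j=1}^{N+N_u} \delta(\mathbf{x}_j)\, |\omega_j|\, \langle \mathbf{x}_j, \mathbf{x}_{N+i} \rangle \right)^2. \tag{15}$$

The overall loss of the exemplar regression model on the training set $X_t$ can be calculated by:

$$E(f(X_t), Y_t) = \sum_{i=1}^{N_u} \left( y_t^i - \sum_{j=1}^{N+N_u} \delta(\mathbf{x}_j)\, |\omega_j|\, \langle \mathbf{x}_j, \mathbf{x}_{N+i} \rangle \right)^2. \tag{16}$$

Let:

$$K_{j,N+i} = \delta(\mathbf{x}_j) \langle \mathbf{x}_j, \mathbf{x}_{N+i} \rangle, \tag{17}$$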
Figure 1: Overview of the proposed exemplar regression tracking (ERT) approach: the regressor is formulated as a linear combination of weighted inner products of the support samples with the test sample. This combination is built in two steps: 1) construct an inner-product matrix $K$ using the samples from the current frame and past frames, where the signs of those inner products are constrained by the pre-defined positive and negative settings ($\delta(X)$); 2) solve a non-negative least square problem to find the weights $\Omega$.
then we can construct an inner-product matrix $K$:

$$K = \begin{bmatrix} K_{1,N+1} & K_{2,N+1} & \cdots & K_{N+N_u,N+1} \\ K_{1,N+2} & K_{2,N+2} & \cdots & K_{N+N_u,N+2} \\ \vdots & \vdots & \ddots & \vdots \\ K_{1,N+N_u} & K_{2,N+N_u} & \cdots & K_{N+N_u,N+N_u} \end{bmatrix}. \tag{18}$$

We use a more compact way to formulate Eq. 16:

$$E(f(X_t), Y_t) = \| Y_t - K\Omega \|_2^2, \tag{19}$$

where:

$$\Omega = [\, |\omega_1|, |\omega_2|, \cdots, |\omega_{N+N_u}| \,]^T. \tag{20}$$

Since $\Omega \geq 0$, the solution of Eq. 19 can be obtained by the standard non-negative least square (NNLS) method [33]. An overview of the proposed method is shown in Fig. 1.
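To make Eqs. 17 to 20 concrete, the following Python sketch builds the sign-constrained matrix $K$ for a toy candidate set and minimizes Eq. 19 under the non-negativity constraint. Note the swapped-in solver: for brevity it uses plain projected gradient descent, which is not the standard NNLS method of [33]; in practice a library routine such as `scipy.optimize.nnls` would be the usual choice. All sample values and the iteration count are illustrative assumptions.

```python
def build_K(candidates, signs, num_old):
    """Eq. 17/18: row i holds delta(x_j) * <x_j, x_{N+i}> for every candidate j."""
    training = candidates[num_old:]
    return [
        [s * sum(a * b for a, b in zip(x_j, x_t)) for x_j, s in zip(candidates, signs)]
        for x_t in training
    ]

def nnls_projected_gradient(K, y, iterations=2000):
    """Minimize ||y - K w||^2 subject to w >= 0 by projected gradient descent."""
    rows, cols = len(K), len(K[0])
    # 1 / ||K||_F^2 never exceeds the inverse Lipschitz constant of the gradient.
    step = 1.0 / sum(v * v for row in K for v in row)
    w = [0.0] * cols
    for _ in range(iterations):
        residual = [sum(K[i][j] * w[j] for j in range(cols)) - y[i] for i in range(rows)]
        grad = [sum(K[i][j] * residual[i] for i in range(rows)) for j in range(cols)]
        w = [max(0.0, w[j] - step * grad[j]) for j in range(cols)]  # project onto w >= 0
    return w

def loss(K, y, w):
    """Eq. 19: squared error between the labels and K w."""
    return sum((y[i] - sum(k * wj for k, wj in zip(K[i], w))) ** 2 for i in range(len(K)))

# Toy setting: no old support samples (N = 0), three unit-norm training samples.
candidates = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
signs = [1, -1, -1]          # delta: only the first sample is positive
y = [1.0, 0.5, 0.0]          # overlap-based labels

K = build_K(candidates, signs, num_old=0)
omega = nnls_projected_gradient(K, y)
```

The returned entries of `omega` play the role of $|\omega_j|$; the signed weight used by the regressor in Eq. 13 is `signs[j] * omega[j]`.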
3.4. The updating stage
In the updating stage of the tracking process, we draw $N_u$ training samples to update the old regressor using the approach described above. First, we use the training samples and the old support samples to construct the inner-product matrix by Eq. 18. Then a standard NNLS approach is applied to solve Eq. 19. This gives us a vector of weights $\Omega$ (Eq. 20). We retain the samples with large weights and discard the samples with small weights. The retained samples become the new support samples of the regressor.
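The retain-or-discard step can be sketched in a few lines of Python. The threshold value and the toy weights below are hypothetical; the text only states that samples with small weights are discarded.

```python
def select_support(candidates, signs, weights, threshold):
    """Keep candidates whose non-negative weight reaches the threshold.

    Returns (sample, signed_weight) pairs, where signed_weight = delta * |w|."""
    return [
        (x, s * w)
        for x, s, w in zip(candidates, signs, weights)
        if w >= threshold
    ]

# Toy candidate set: one positive sample followed by three negative samples.
candidates = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [0.6, 0.8]]
signs = [1, -1, -1, -1]
weights = [0.9, 0.001, 0.3, 0.0]   # pretend output of the NNLS solver

support = select_support(candidates, signs, weights, threshold=0.01)
```

Here only the first and third candidates survive, with signed weights 0.9 and -0.3. Because NNLS solutions tend to contain many exact zeros, this pruning usually keeps the support set small.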
4. Implementation details
During the tracking process, the image patches are all resized to 32×32. In the searching stage, the number of observed image patches $N_s$ is set to 600, of which 500 samples are used to approximate translations and 100 samples are used to estimate the scale and stretch transforms. That is, for accurate scale estimation, after the translation of the target is estimated using the 500 samples, we draw another 100 samples around this estimated position to obtain the scale transforms. We use PCA-HOG (Histogram of Oriented Gradient) features [34] to describe the observed image patches, with the cell size set to 4 and the number of orientation bins set to 9, which gives us a 2048-dimensional feature vector for each image patch.
In the updating stage, we obtain the training sample set $X_t$ on a polar grid and, like Struck [9], we set the number of radial divisions to 5 and angular divisions to 16, giving us 81 training samples. The candidate sample set $\mathcal{X}_t$ is constructed as shown in Fig. 2, and consists of the 81 training samples and the $N$ old support samples from the un-updated regressor. These samples are arranged into different parts of $\mathcal{X}_t$ according to whether they are positive or not. This arrangement helps us assign the labels for the inner-product matrix $K$. We use Matlab's built-in function lsqnonneg() to solve Eq. 19. A summary of the ERT method is shown in Algorithm 1.
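The count of 81 training samples follows from 5 radial times 16 angular divisions plus the centre position. A minimal Python sketch of such a polar grid is given below; the maximum radius of 30 pixels is a hypothetical value, since the text does not specify the sampling radius.

```python
import math

def polar_grid_offsets(num_radial=5, num_angular=16, max_radius=30.0):
    """Offsets (dx, dy) on a polar grid, plus the centre itself.

    max_radius (in pixels) is a hypothetical value used only for illustration."""
    offsets = [(0.0, 0.0)]   # the estimated position, i.e. the positive sample
    for r in range(1, num_radial + 1):
        radius = max_radius * r / num_radial
        for a in range(num_angular):
            angle = 2.0 * math.pi * a / num_angular
            offsets.append((radius * math.cos(angle), radius * math.sin(angle)))
    return offsets

offsets = polar_grid_offsets()
num_training_samples = len(offsets)   # 5 * 16 + 1
```

Placing the centre first matches the convention of Section 3.3, where the first training sample is the only positive one.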
Algorithm 1 The exemplar regression tracking
1: Inputs: $Z_0$: the initial state variable; $N_u$: the number of samples during the updating stage; $N_s$: the number of samples during the searching stage; a weight threshold for selecting the support samples.
2: In every frame, we maintain the regressor of Eq. 13 involving $N$ support samples;
3: Outputs: the estimated state variable $\hat{Z}_t$ in every frame.
4: while the video sequence has not ended do
5:     Draw $N_s$ samples using particle filtering, and use Eq. 13 to find the optimal state variable;
6:     Construct the inner-product matrix $K$ as described in Section 3.3;
7:     Find the weights of the samples in $\mathcal{X}_t$ by solving Eq. 19;
8:     Select the support samples by removing the samples whose weights are less than the threshold;
9:     Update the regressor using the new support samples $\{\mathbf{x}_i\}$ and the corresponding weights $\{\omega_i\}$.
10: end while
4.1. Lazy update strategies
The exemplar model of the visual tracker can be updated frame by frame over the video sequence to adapt to the target's changing appearance. However, this continuous update strategy makes the support sample set of the model grow very large, and the tracking process becomes very slow. Moreover, for most video sequences, the appearance of the target does not undergo large variations over a small number of consecutive frames. Therefore, we provide three lazy update strategies to decide dynamically when to update the model:
4.1.1. Checking the positive samples
This strategy checks the similarity between the tracking result of the current frame and the positive samples of the support sample set. If the similarity to at least one positive sample is less than a threshold, the update process is executed. This strategy lets the tracker avoid updating the model when the appearance does not change greatly.
4.1.2. Checking the contexts
The contexts are the image patches within the target's bounding box enlarged by a certain padding ratio; they include the target and its surroundings. This strategy calculates the similarity of the contexts in two consecutive frames, and avoids repetitive updating when the target is motionless.
Figure 2: Construction of the candidate sample set $\mathcal{X}_t$.
4.1.3. Checking the variance of test samples
This strategy computes the variance of the regressor's outputs over the test samples, and compares it with a given threshold to decide when to update the model.
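The three checks can be sketched as separate Python predicates. Several details are assumptions, because the text leaves them open: cosine similarity is used as the similarity measure, all threshold values are hypothetical, and for the variance check it is assumed that a low variance of the outputs triggers the update. How the three checks are combined is also not specified, so they are kept independent here.

```python
import math
import statistics

def cosine(a, b):
    """Cosine similarity of two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def positives_changed(result, positives, threshold):
    """4.1.1: update if some positive support sample is dissimilar to the result."""
    return any(cosine(result, p) < threshold for p in positives)

def context_changed(context, prev_context, threshold):
    """4.1.2: skip the update when two consecutive contexts are nearly identical."""
    return cosine(context, prev_context) < threshold

def scores_uncertain(scores, threshold):
    """4.1.3: compare the variance of the regressor outputs with a threshold.

    Assumption: a low variance (flat responses) triggers the update."""
    return statistics.pvariance(scores) < threshold

# Toy feature vectors and scores, chosen for illustration only.
check_1 = positives_changed([1.0, 0.0], [[1.0, 0.0], [0.6, 0.8]], threshold=0.8)
check_2 = context_changed([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], threshold=0.95)
check_3 = scores_uncertain([0.5, 0.5, 0.5], threshold=0.01)
```

With these toy inputs the first check fires because one stored positive sample has similarity 0.6 to the result, the second does not fire because the context is unchanged, and the third fires because the scores are flat.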
5. Discussion
In this section, we discuss our exemplar regression model and explain it from the template construction perspective. We believe this explanation helps to clarify the formulations provided above.
5.1. The templates construction perspective
A visual tracking problem can be formulated as a template matching problem. The templates are image samples that represent the target's appearances. They are used to calculate similarities to the test image samples in the new frame, and the test sample with the greatest similarity is chosen as the tracking result. The templates can consist of a single image or multiple images. Our approach, Struck [9] and correlation filters [4,1,3] are typical single-template models. Sparse coding based approaches [11,13] preserve a set of templates and choose the most representative template for the matching process.
In every frame, our exemplar regression model generates a set of image samples around the estimated position. This process gives us a large number of samples: the positive samples describe the target's appearances, while the negative samples contain the background information. One naive approach to creating the single template is to cast all the samples as training samples into a regression setting, such as ridge regression or support vector regression (SVR). However, many of the samples are superfluous, and this redundancy makes the construction of the regression model so expensive that it cannot be used for online visual tracking. Moreover, the template created by the naive approach has a weak discriminative ability, because the number of samples that contain background information is much larger than the number of target samples.
To understand this, we must clarify the role of templates in the regression setting. Consider the score function of a linear regression model:

$$y = \omega^T \mathbf{x} + b. \tag{21}$$

The template $\omega$ can be solved by ridge regression or support vector regression, depending on the loss term in the objective function. Eq. 21 can also be written simply as:

$$y = \mathrm{sim}(\omega, \mathbf{x}) + b = \langle \omega, \mathbf{x} \rangle + b, \tag{22}$$

which means that the score is determined by the similarity between a certain template $\omega$ and the test sample $\mathbf{x}$. Note that this template interpretation has already been used for visualizing convolutional networks [35] and in the context of Bayesian classification [36]. From the representer theorem [31], we know that the template $\omega$ is spanned by the training samples:

$$\omega = \sum_i \alpha_i \mathbf{x}_i, \tag{23}$$

where $\mathbf{x}_i$ is a training sample and $\alpha_i$ is the corresponding coefficient. Since a large number of training samples contain background information, if we do not constrain the values of those coefficients, some coefficients of background image samples could be positive, which would weaken the discriminative ability of the constructed template $\omega$. Recall the exemplar constraint of our approach, which gives the positive samples (the estimated target samples) positive coefficients and the negative samples (containing background information) negative coefficients.
Figure 3: For each frame of the given image sequences, we can obtain a set of image samples consisting of only
one positive sample and many negative samples. This gives us a large number of samples. Since many of them are
redundant, we provide an efficient sample selection approach to filter the important sample set, and use this sample
set to span the discriminative template.
This constraint enhances the discriminative ability of the template $\omega$ because it suppresses the
influence of the background information carried by the negative samples. Meanwhile, it strengthens
the effect of the positive samples in the linear combination that forms the template $\omega$.
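The template view of Eqs. 22 and 23 can be illustrated with a minimal sketch. The toy feature vectors, the coefficient values and the helper names (`build_template`, `score`, `satisfies_exemplar_constraint`) are assumptions made for illustration only; they are not taken from the tracker's implementation.

```python
def build_template(samples, coeffs):
    """Span the template: omega = sum_i alpha_i * x_i (Eq. 23)."""
    omega = [0.0] * len(samples[0])
    for x, alpha in zip(samples, coeffs):
        for k, value in enumerate(x):
            omega[k] += alpha * value
    return omega


def score(omega, x, b=0.0):
    """Linear score y = <omega, x> + b (Eq. 22)."""
    return sum(w * v for w, v in zip(omega, x)) + b


def satisfies_exemplar_constraint(coeffs, labels):
    """True if positives (+1) have alpha >= 0 and negatives (-1) have alpha <= 0."""
    return all(alpha * label >= 0 for alpha, label in zip(coeffs, labels))


# Toy data (hypothetical): one target sample followed by two background samples.
samples = [[1.0, 0.9, 0.1], [0.1, 0.2, 1.0], [0.0, 0.3, 0.8]]
labels = [+1, -1, -1]
coeffs = [1.0, -0.5, -0.5]  # sign pattern required by the exemplar constraint

omega = build_template(samples, coeffs)
target_like = [0.9, 1.0, 0.0]
background_like = [0.0, 0.2, 0.9]
print(score(omega, target_like), score(omega, background_like))
```

With this sign pattern, the feature dimension dominated by the background receives a negative weight in the template, so a background-like test sample scores lower than a target-like one.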
The sample selection effect of the exemplar regression model is also worth noting. Since
we discard the samples with small weights during the updating stage (see Section 3.4), only
the more representative samples are preserved. This sample selection helps
the tracker decide which of the samples generated frame by frame are important for representing the
appearance of the target, and avoids the large sample set problem of the regression model. See Fig. 3
for a summary of the template construction perspective.
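The pruning step described above can be sketched as follows. The update rule itself is defined in Section 3.4 and is not reproduced here, so the threshold `min_abs_weight` and the function name are hypothetical; the sketch only shows the idea of keeping samples whose coefficients are not negligible.

```python
def prune_small_weights(samples, coeffs, min_abs_weight=1e-3):
    """Keep only samples whose coefficient magnitude reaches the threshold.

    min_abs_weight is an illustrative value, not a parameter from the paper.
    """
    kept = [(x, a) for x, a in zip(samples, coeffs) if abs(a) >= min_abs_weight]
    kept_samples = [x for x, _ in kept]
    kept_coeffs = [a for _, a in kept]
    return kept_samples, kept_coeffs


# Toy example: sample identifiers stand in for feature vectors.
sample_ids = ["frame1_pos", "frame1_neg", "frame2_neg", "frame3_neg"]
weights = [0.8, -0.0001, -0.3, 0.0]
kept_ids, kept_weights = prune_small_weights(sample_ids, weights)
print(kept_ids, kept_weights)
```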
6. Experiments
We evaluate our exemplar regression tracking approach in two experiments. In the first experiment,
we evaluate our approach on a large benchmark [14] that contains 50 videos, with comparisons
to state-of-the-art trackers. The overall experimental results are illustrated by both precision
plots and success plots. The results on eleven challenging attributes (fast motion, background
clutter, motion blur, deformation, illumination variation, in-plane rotation, low resolution, occlusion,
out-of-view, out-of-plane rotation and scale variation) are also provided. In the second experiment, to
further demonstrate our approach's robustness, we select another 50 challenging sequences and
compare our method with the state-of-the-art regression based trackers: Struck [9], KCF [4] and
DSST [3]. The results of this experiment are also illustrated by both precision plots and success
plots. In addition, a detailed analysis of some typical sequences is provided.

Figure 4: Distance precision and overlap success plots (panels (a)-(f)) over the standard 50 benchmark sequences [14]
using OPE, TRE and SRE. The legend contains the area-under-the-curve score for the top-10 trackers.
6.1. Experimental settings
Our ERT tracker is implemented in Matlab and runs at approximately 4 frames per second on
an Intel Core i5 3.30GHz CPU with 8GB RAM. For all of the experiments, we use the
parameters provided in Section 4, and all of them are kept fixed.
Figure 5: Overlap success plots over the eleven challenges of fast motion, background clutter, blur, deformation,
illumination, in-plane rotation, low resolution, occlusion, out-of-view, and out-of-plane rotation and scale variations
on the standard CVPR2013 benchmark. The legend contains the AUC score for the top-10 trackers. Our method
performs favorably against the state-of-the-art trackers.
6.2. Evaluation criteria
In this section, we introduce the evaluation methodology used in the two experiments. We use
the precision and success rate for quantitative evaluation:
6.2.1. Precision plot
The first evaluation metric is the CLE (Center Location Error), which is defined as the
Euclidean distance between the center location of the tracked target and the manually labeled
ground truth. The CLE averaged over all the frames of one sequence is used to
evaluate the overall performance for that sequence. The precision plot shows the ratio of frames whose
CLE is within a given threshold; we use the score at a CLE threshold of 20 pixels as the
representative precision score for each tracker.
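As an illustration of this metric, the sketch below computes per-frame center location errors and the precision score at a threshold. The function names and the toy trajectories are assumptions for illustration; centers are given as (x, y) pixel coordinates.

```python
import math


def center_location_errors(predicted_centers, ground_truth_centers):
    """Per-frame Euclidean distance between predicted and ground-truth centers."""
    return [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted_centers, ground_truth_centers)
    ]


def precision_at(errors, threshold=20.0):
    """Fraction of frames whose CLE is within the threshold (in pixels)."""
    return sum(1 for e in errors if e <= threshold) / len(errors)


# Toy trajectories (hypothetical values).
predicted = [(10.0, 10.0), (50.0, 50.0), (100.0, 100.0)]
ground_truth = [(10.0, 10.0), (53.0, 54.0), (100.0, 130.0)]

errors = center_location_errors(predicted, ground_truth)
mean_cle = sum(errors) / len(errors)
print(errors, mean_cle, precision_at(errors))
```

Sweeping `threshold` over a range of pixel values and plotting `precision_at` against it gives one curve of a precision plot.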
6.2.2. Success plot
Another evaluation metric is the bounding box overlap. We use the typical Pascal VOR (VOC
Overlap Ratio) criterion [37]. Given the bounding box $B_R$ of the result and the bounding box
$B_G$ of the ground truth, the VOR is computed as
$$\mathrm{VOR} = \frac{|B_R \cap B_G|}{|B_R \cup B_G|},$$
where $\cap$ and $\cup$ represent the intersection and union of two regions, respectively, and $|\cdot|$
denotes the number of pixels in the region. To measure the performance on a given sequence, we count
the number of successful frames whose VOR is larger than a given threshold $t_o$. The success plot
shows the ratio of successful frames as the threshold varies from 0 to 1. We use the AUC (Area Under
Curve) of each success plot to rank the compared tracking approaches.
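The overlap criterion and the success-plot AUC can be sketched as below. Boxes are assumed to be axis-aligned and given as (x, y, width, height); the number of thresholds and the trapezoidal integration are illustrative choices, not details taken from the benchmark toolkit.

```python
def vor(box_r, box_g):
    """Overlap ratio |B_R ∩ B_G| / |B_R ∪ B_G| for boxes given as (x, y, w, h)."""
    xr, yr, wr, hr = box_r
    xg, yg, wg, hg = box_g
    inter_w = max(0.0, min(xr + wr, xg + wg) - max(xr, xg))
    inter_h = max(0.0, min(yr + hr, yg + hg) - max(yr, yg))
    inter = inter_w * inter_h
    union = wr * hr + wg * hg - inter
    return inter / union if union > 0 else 0.0


def success_curve(overlaps, num_thresholds=21):
    """Ratio of frames whose VOR exceeds each threshold in [0, 1]."""
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(1 for o in overlaps if o > t) / len(overlaps) for t in thresholds]
    return thresholds, rates


def area_under_curve(thresholds, rates):
    """Trapezoidal approximation of the area under the success curve."""
    area = 0.0
    for i in range(1, len(thresholds)):
        area += 0.5 * (rates[i] + rates[i - 1]) * (thresholds[i] - thresholds[i - 1])
    return area


# Toy sequence of three frames (hypothetical boxes).
results = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
truths = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
overlaps = [vor(r, g) for r, g in zip(results, truths)]
thresholds, rates = success_curve(overlaps)
print(overlaps, area_under_curve(thresholds, rates))
```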
6.2.3. Precision plot on a single challenging sequence
To compare the tracking approaches on some typical challenging sequences, we plot the center
location error of each tracker on every frame of the sequence.
To evaluate the robustness of the visual tracking approaches, we use three different evaluations:
• OPE (One Pass Evaluation): this evaluation runs the compared trackers throughout a test
sequence, initialized with the ground truth position in the first frame;
• SRE (Spatial Robustness Evaluation): to evaluate whether a tracking method is sensitive to
initialization errors, this evaluation generates the initial object states by slightly shifting or scaling
the ground truth bounding box of a target object. In our experiments, we use eight spatial
shifts (four center shifts and four corner shifts) and four scale variations, according to the
benchmark [14]. We run the SRE evaluation 12 times, and the final SRE scores are
averaged to rank the compared tracking approaches.
• TRE (Temporal Robustness Evaluation): in this evaluation, each compared tracking approach
is evaluated numerous times from different starting frames across an image sequence.
In each test, an algorithm is evaluated from a particular starting frame, initialized with
the corresponding ground truth object state, until the end of the image sequence. In our
experiments, we run TRE 20 times and use the averaged results to generate the TRE scores.
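The SRE and TRE protocols listed above can be sketched as follows. The shift ratio of 10% of the target size, the scale factors, the reading of "corner shifts" as diagonal shifts, and the evenly spaced TRE starting frames are all assumptions made for illustration; the exact values are defined by the benchmark [14] and are not given in this section.

```python
def sre_initializations(box, shift_ratio=0.1, scales=(0.8, 0.9, 1.1, 1.2)):
    """Return 12 perturbed boxes: 4 center shifts, 4 corner shifts, 4 scalings.

    box is (x, y, w, h). shift_ratio and scales are assumed values.
    """
    x, y, w, h = box
    dx, dy = shift_ratio * w, shift_ratio * h
    center_shifts = [(-dx, 0.0), (dx, 0.0), (0.0, -dy), (0.0, dy)]
    corner_shifts = [(-dx, -dy), (dx, -dy), (-dx, dy), (dx, dy)]
    boxes = [(x + sx, y + sy, w, h) for sx, sy in center_shifts + corner_shifts]
    cx, cy = x + w / 2.0, y + h / 2.0
    for s in scales:
        sw, sh = s * w, s * h
        # Scale about the box center so the target stays centered.
        boxes.append((cx - sw / 2.0, cy - sh / 2.0, sw, sh))
    return boxes


def tre_start_frames(num_frames, segments=20):
    """Evenly spaced starting frame indices; each run tracks to the sequence end."""
    return [(i * num_frames) // segments for i in range(segments)]


initial_boxes = sre_initializations((100.0, 50.0, 40.0, 80.0))
starts = tre_start_frames(200)
print(len(initial_boxes), starts[:3], starts[-1])
```

Each perturbed box (SRE) or starting frame (TRE) initializes one tracker run, and the per-run scores are averaged as described in the list above.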
6.3. Experiment on CVPR2013 benchmark
In this section, we give the detailed experimental results on the standard CVPR2013 benchmark
[14]. We compare our exemplar regression tracking approach with 29 trackers, including CPF [38], LOT
[39], IVT [10], ASLA [13], SCM [40], L1APG [19], MTT [41], VTD [21], VTS [42], LSK [43],
ORIA [44], DFT [45], KMS [46], SMS [47], VR-V [26], Frag [22], OAB [25], SemiT [48], BSBT
[49], MIL [7], CT [5], TLD [6], Struck [9], CSK [2] and CXT [50]. These 29 trackers cover
the classical discriminative and generative visual tracking approaches, and some of them achieve
state-of-the-art tracking results.
The CVPR2013 benchmark provides 50 sequences. To evaluate the trackers' robustness, these
sequences are annotated with 11 challenging attributes: fast motion (FM), background clutter
(BC), motion blur (MB), deformation (DEF), illumination variation (IV), in-plane rotation (IPR), low
resolution (LR), occlusion (OCC), out-of-view (OV), out-of-plane rotation (OPR) and scale
variation (SV). Each sequence has several attributes. Among the 50 sequences, there are
39 sequences with the OPR attribute, 31 with IPR, 29 with OCC, 28 with SV, 25 with IV, 21 with BC,
19 with DEF, 17 with FM, 12 with MB, 6 with OV and 4 with LR. In Fig. 5, we give the
detailed results for each of the 11 challenging attributes. Our approach outperforms all 29
classical visual trackers on every attribute except OV.
Figure 6: Comparison of our method with the state-of-the-art regression based visual trackers Struck [9],
DSST [3] and KCF [4] on the selected 50 sequences using OPE, SRE and TRE. Our method outperforms the other
trackers on all three evaluations, and by nearly 10% on OPE.
In Fig. 4, we give the overall OPE, TRE and SRE results on the benchmark sequences. Our
approach outperforms the second best approach, Struck [9], by a large margin. The average CLE
and VOR results of our approach and 7 visual trackers are presented in Tab. 2.
Table 2: Average comparison of our approach with the 7 trackers on the 50 sequences of the CVPR2013 benchmark,
in CLE at a threshold of 20 pixels and VOR at a threshold of 0.5. The value format of each table cell is "CLE/VOR".
The best result of each sequence is highlighted by bold red and the second best is highlighted by bold blue.
| Tracker | Ours | Struck [9] | MIL [7] | ASLA [13] | TLD [6] | CSK [2] | SCM [40] | L1APG [19] |
|---|---|---|---|---|---|---|---|---|
| CLE/VOR | 0.73/0.66 | 0.66/0.56 | 0.47/0.37 | 0.53/0.51 | 0.61/0.52 | 0.54/0.44 | 0.65/0.62 | 0.48/0.44 |
6.4. Experiment on the 50 challenging sequences
To further demonstrate the robustness of our approach against the state-of-the-art regression
trackers Struck [9], KCF [4] and DSST [3], we run our approach and these three trackers on the
selected 50 challenging sequences: BlurBody, BlurCar1, BlurCar4, BlurFace, Board, Box, Boy,
Car1, Car4, CarDark, CarScale, Coke, Couple, Crossing, Crowds, David, David2, David3Outdoor,
Deer, Dog, Dog1, Doll, Dudek, FaceOcc1, FaceOcc2, Fish, FleetFace, Football1, Freeman1,
Freeman3, Human3, Human4, Human5, Human6, Human7, Human9, Jumping, Lemming, Man,
Singer1, Skating1, Subway, Surfer, Sylvester, Tiger1, Tiger2, Trellis, Walking, Walking2 and
Woman.

Figure 7: Tracking results of four regression based trackers: our approach, structured output SVM (Struck
[9]) and correlation filters (KCF [4] and DSST [3]) on 16 challenging sequences (from left to right and top to bottom:
Human6, Human3, Human5, BlurBody, Lemming, Singer2, Car1, Couple, Freeman1, Freeman3, Human7, Human9,
Trellis, David3Outdoor, CarScale, Jumping). Legend: Ours, Struck, KCF, DSST.
The overall experiment results are shown in Fig. 6, where our method is better than the three
trackers by nearly 10% on OPE. The average comparisons are presented in Tab. 3.
Table 3: Average comparison of our approach with the 3 state-of-the-art regression based trackers on the selected
50 sequences. The format is the same as in Tab. 2.
| Tracker | Ours | KCF [4] | DSST [3] | Struck [9] |
|---|---|---|---|---|
| CLE/VOR | 0.86/0.81 | 0.77/0.62 | 0.76/0.70 | 0.74/0.58 |

We compare our approach in detail with the three state-of-the-art regression based approaches Struck
[9], KCF [4] and DSST [3] on 16 challenging sequences chosen from the 50 sequences (Fig. 7). The
KCF approach is based on a correlation filter learned from HOG features [51]. It
performs well in handling significant deformation and fast motion (BlurBody, Trellis and Singer2)
due to the robust representation of HOG features and the effectiveness of the temporal context
correlation model. However, it fails to handle large scale variations (Human6, Human5, Car1, Freeman3
and CarScale) because of its fixed-size convolution searching strategy. Besides, it drifts when
there is large motion blur (Couple, Human7 and Jumping), since the searching regions of KCF
are determined by the initial bounding box and fixed during the tracking process. In addition,
because of the linear updating of the filter, KCF can handle short-term occlusions (David3Outdoor)
smoothly, but cannot handle long-term occlusions (Human5, Human3 and Lemming). The Struck [9]
approach cannot handle large scale variations (Human6, Human5, Car1, Freeman3 and CarScale)
either, because of its traditional fixed-size Haar-like feature representation [52]. In addition, it
drifts when the targets are not distinct in the gray images, owing to the limitation of the
Haar-like feature representation (Singer2). DSST [3] enhances the scale awareness of KCF by
integrating a new 1D filter to predict the scale variations of the targets. Therefore, it obtains
better experimental results than KCF on Car1, Trellis, CarScale and Singer2. However, it does not
resolve the inherent drawbacks of correlation filters: the linear updating strategy
makes the tracker fail on long-term occlusions (Lemming, Human3 and Human5), and the
limited size of the searching regions means that it cannot handle large motion
blur (Human9, Couple and Jumping).
Our approach achieves remarkable results on all 16 challenging sequences (Fig. 8). The
PCA-HOG feature representation makes our approach robust to
illumination variations (Singer2 and Trellis). The sample selection effect of the exemplar
regression model helps the tracker handle long-term and short-term occlusions (Lemming,
David3Outdoor and Human5) and obtain good results under significant deformation (Singer2 and
Freeman1). By integrating particle filtering with fixed-size templates (32 × 32), our approach
can handle large scale variations (Freeman3, CarScale and Human6).

Figure 8: Frame-by-frame comparison of center location errors on the 16 challenging sequences illustrated in Figure
7 (panels (a)-(p)). Generally, our method is able to track targets accurately and stably. The meanings of the line colors
are the same as in Figure 7.
7. Conclusion
Although few researchers have noticed it, imposing a very simple negative-positive constraint
on the training sample set of a common kernelized regression model yields a state-of-the-art
robust visual tracker. We constrain the linear combination in the score function
of the kernelized regression model so that only the support samples at the estimated
positions from past frames have positive weights, while all the remaining weights are negative. We show that
this novel linear combination can be solved by a simple off-the-shelf non-negative least squares
method.
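To make the last point concrete, the sketch below shows one way a sign-constrained combination can be reduced to a non-negative least squares problem: substituting $\alpha_i = s_i \beta_i$ with $s_i = +1$ for positive samples, $s_i = -1$ for negative samples and $\beta_i \ge 0$. This reduction, the absence of a regularization term, the plain projected gradient solver and all names are assumptions for illustration; the actual objective and solver are those described in the earlier sections.

```python
def nnls_projected_gradient(A, y, iterations=500):
    """Minimise 0.5 * ||A beta - y||^2 subject to beta >= 0 (projected gradient)."""
    rows, cols = len(A), len(A[0])
    # The squared Frobenius norm upper-bounds the largest eigenvalue of A^T A,
    # so its reciprocal is a safe step size.
    frob_sq = sum(a * a for row in A for a in row)
    step = 1.0 / frob_sq if frob_sq > 0 else 0.0
    beta = [0.0] * cols
    for _ in range(iterations):
        residual = [
            sum(A[j][i] * beta[i] for i in range(cols)) - y[j] for j in range(rows)
        ]
        grad = [sum(A[j][i] * residual[j] for j in range(rows)) for i in range(cols)]
        # Gradient step followed by projection onto the non-negative orthant.
        beta = [max(0.0, b - step * g) for b, g in zip(beta, grad)]
    return beta


def exemplar_coefficients(K, signs, y, iterations=500):
    """Fit f(x_j) = sum_i alpha_i K[j][i] to y with sign(alpha_i) fixed by signs[i]."""
    n = len(signs)
    A = [[signs[i] * K[j][i] for i in range(n)] for j in range(len(K))]
    beta = nnls_projected_gradient(A, y, iterations)
    return [s * b for s, b in zip(signs, beta)]


# Toy problem (hypothetical): one positive and one negative sample,
# with target overlap ratios 1.0 and 0.2.
K = [[1.0, 0.5], [0.5, 1.0]]  # K[j][i] = k(x_i, x_j)
signs = [+1, -1]
y = [1.0, 0.2]
alpha = exemplar_coefficients(K, signs, y)
print(alpha)
```

An off-the-shelf NNLS routine could replace `nnls_projected_gradient`; the hand-written solver is used here only to keep the sketch free of external dependencies.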
8. Acknowledgments
This study was supported by the Science and Technology Project of Shenzhen (No. JSGG20150331152017052).
This study was also supported in part by the Shenzhen IOT key technology and application systems
integration engineering laboratory.
References
[1] D. Bolme, J. Beveridge, B. Draper, Y. M. Lui, Visual object tracking using adaptive correlation filters, in:
Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010, pp. 2544–2550. 2,3,4,5,
14
[2] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with
kernels, in: Computer Vision–ECCV 2012, Springer, 2012, pp. 702–715. 2,3,4,5,20,21
[3] M. Danelljan, G. Häger, F. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: British
Machine Vision Conference, Nottingham, September 1-5, 2014, BMVA Press, 2014. 2,3,4,5,6,14,17,21,
22,23
[4] J. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, Pattern
Analysis and Machine Intelligence, IEEE Transactions on 37 (3) (2015) 583–596. 2,3,4,5,14,17,21,22,23
[5] K. Zhang, L. Zhang, M.-H. Yang, Fast compressive tracking, Pattern Analysis and Machine Intelligence, IEEE
Transactions on 36 (10) (2014) 2002–2015. 2,4,5,20
[6] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 34 (7) (2012) 1409–1422. 2,4,5,20,21
[7] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Computer
Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 2009, pp. 983–990. 2,4,5,20,21
[8] G. Zhu, F. Porikli, Y. Ming, H. Li, Lie-struck: Affine tracking on lie groups using structured svm, in: Applica-
tions of Computer Vision (WACV), 2015 IEEE Winter Conference on, 2015, pp. 63–70. 2,3,4,5
[9] S. Hare, A. Saffari, P. Torr, Struck: Structured output tracking with kernels, in: Computer Vision (ICCV), 2011
IEEE International Conference on, 2011, pp. 263–270. 2,3,4,5,6,12,14,17,20,21,22,23
[10] D. A. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, International Journal
of Computer Vision 77 (1-3) (2008) 125–141. 2,4,7,20
[11] X. Mei, H. Ling, Robust visual tracking using l1 minimization, in: Computer Vision, 2009 IEEE 12th
International Conference on, 2009, pp. 1436–1443. 2,4,14
[12] Q. Wang, F. Chen, W. Xu, M.-H. Yang, Online discriminative object tracking with local sparse representation,
in: Applications of Computer Vision (WACV), 2012 IEEE Workshop on, 2012, pp. 425–432. 2,4
[13] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: Computer
Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 1822–1829. 2,4,14,20,21
[14] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), 2013. 2,3,16,17,20
[15] A. Yilmaz, O. Javed, M. Shah, Object tracking: A survey, Acm Computing Surveys 38 (1) (2006) 8193. 4
[16] M.-H. Yang, J. Ho, Toward robust online visual tracking, in: Distributed Video Sensor Networks, Springer,
2011, pp. 119–136. 4
[17] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: an exper-
imental survey, Pattern Analysis and Machine Intelligence, IEEE Transactions on 36 (7) (2014) 1442–1468.
4
[18] H. Li, C. Shen, Q. Shi, Real-time visual tracking using compressive sensing, in: Computer Vision and Pattern
Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 1305–1312. 4
[19] H. Ji, Real time robust l1 tracker using accelerated proximal gradient approach, in: 2012 IEEE Conference on
Computer Vision and Pattern Recognition, 2012, pp. 1830–1837. 4,20,21
[20] M. J. Black, A. D. Jepson, Eigentracking: Robust matching and tracking of articulated objects using a view-
based representation, International Journal of Computer Vision 26 (1) (1998) 63–84. 4
[21] J. Kwon, K. M. Lee, Visual tracking decomposition, in: Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, IEEE, 2010, pp. 1269–1276. 4,20
[22] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: Computer
vision and pattern recognition, 2006 IEEE Computer Society Conference on, Vol. 1, IEEE, 2006, pp. 798–805.
4,20
[23] X. Li, Z. He, X. You, C. P. Chen, A novel joint tracker based on occlusion detection, Knowledge-Based Systems
71 (2014) 409–418. 4
[24] N. V. Lopes, P. Couto, A. Jurio, P. Melo-Pinto, Hierarchical fuzzy logic based approach for object tracking,
Knowledge-Based Systems 54 (2013) 255–268. 4
[25] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting., in: BMVC, Vol. 1, 2006, p. 6. 4,
5,20
[26] R. T. Collins, Y. Liu, M. Leordeanu, Online selection of discriminative tracking features, Pattern Analysis and
Machine Intelligence, IEEE Transactions on 27 (10) (2005) 1631–1643. 4,5,20
[27] X. Mei, H. Ling, Robust visual tracking and vehicle classification via sparse representation, Pattern Analysis
and Machine Intelligence, IEEE Transactions on 33 (11) (2011) 2259–2272. 4
[28] A. Bordes, L. Bottou, P. Gallinari, J. Weston, Solving multiclass support vector machines with larank, in:
Z. Ghahramani (Ed.), Proceedings of the 24th International Machine Learning Conference, OmniPress, Cor-
vallis, Oregon, 2007, pp. 89–96. 5,6
[29] T. Malisiewicz, A. Gupta, A. Efros, Ensemble of exemplar-svms for object detection and beyond, in: Computer
Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 89–96. 6
[30] M. Isard, A. Blake, Condensation–conditional density propagation for visual tracking, International journal of
computer vision 29 (1) (1998) 5–28. 7
[31] B. Scholkopf, R. Herbrich, A. J. Smola, R. Williamson, A generalized representer theorem, Proceedings of
Annual Conference on Computational Learning Theory 42 (3) (2000) 416–426. 8,15
[32] C. M. Bishop, Pattern recognition and machine learning, springer, 2006. 8
[33] C. L. Lawson, R. J. Hanson, Solving least squares problems, Vol. 161, SIAM, 1974. 11
[34] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-
based models, Pattern Analysis and Machine Intelligence, IEEE Transactions on 32 (9) (2010) 1627–1645. 12
[35] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification
models and saliency maps, CoRR abs/1312.6034.
URL http://arxiv.org/abs/1312.6034 15
[36] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, K.-R. Müller, How to explain individual
classification decisions, Journal of Machine Learning Research 11 (9) (2010) 1803–1831. 15
[37] L. Cehovin, M. Kristan, A. Leonardis, Is my new tracker really better than yours?, in: Applications of Computer
Vision (WACV), 2014 IEEE Winter Conference on, IEEE, 2014, pp. 540–547. 19
[38] P. Pérez, C. Hue, J. Vermaak, M. Gangnet, Color based probabilistic tracking, in: Computer Vision–ECCV 2002,
Springer, 2002, pp. 661–675. 20
[39] S. Oron, A. Bar Hillel, D. Levi, S. Avidan, Locally orderless tracking, in: Computer Vision and Pattern Recog-
nition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1940–1947. 20
[40] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in: Computer
vision and pattern recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1838–1845. 20,21
[41] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via multi-task sparse learning, in: Computer
Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 2042–2049. 20
[42] J. Kwon, K. M. Lee, Tracking by sampling trackers, in: Computer Vision (ICCV), 2011 IEEE International
Conference on, IEEE, 2011, pp. 1195–1202. 20
[43] B. Liu, J. Huang, L. Yang, C. Kulikowsk, Robust tracking using local sparse appearance model and k-selection,
in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 1313–1320.
20
[44] Y. Wu, B. Shen, H. Ling, Online robust image alignment via iterative convex optimization, in: Computer Vision
and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1808–1814. 20
[45] L. Sevilla-Lara, E. Learned-Miller, Distribution fields for tracking, in: Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1910–1917. 20
[46] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 25 (5) (2003) 564–577. 20
[47] R. T. Collins, Mean-shift blob tracking through scale space, in: Computer Vision and Pattern Recognition, 2003.
Proceedings. 2003 IEEE Computer Society Conference on, Vol. 2, IEEE, 2003, pp. II–234. 20
[48] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Computer Vision–
ECCV 2008, Springer, 2008, pp. 234–247. 20
[49] S. Stalder, H. Grabner, L. Van Gool, Beyond semi-supervised tracking: Tracking should be as simple as detec-
tion, but not simpler than recognition, in: Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th
International Conference on, IEEE, 2009, pp. 1409–1416. 20
[50] T. B. Dinh, N. Vo, G. Medioni, Context tracker: Exploring supporters and distracters in unconstrained environ-
ments, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1177–1184. 20
[51] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern
Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1, IEEE, 2005, pp. 886–893. 23
[52] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Computer Vision and
Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, Vol. 1,
IEEE, 2001, pp. I–511. 23
28
... Salient object detection (SOD) aims to achieve a unique, clear edge, completed object area which mostly can attract human visual attention. The SOD usually acts as an important role in the area of computer vision, such as object importance [1], visual object tracking [2,3], visual perception [4], contour 5 detection [5,6], motion detection [7,8], image denoising [9], and visual tracking [10,11,12,13] etc. Although some achievements have been obtained for SOD in the past, it is still a challenge problem in the community. ...
... This reverse transfer is only the first side structure that has all the boundary fusion information. On the contrary, we use a reverse side connect subnet to connect extract block's outputs, such as (1, 2) with (3,4,5), and connect (3, 4) with 5. ...
... Especially, compared to other methods, we can observe some characters in the first and second rows(transparent water and large size background), where our BCN can identify all parts of the salient object without missing parts. Moreover, we can find the third, fourth and eighth rows (human, large object size) 530 and multiple object scenarios(rows in 3,4,5,6,7,8), no more non-salient object region is detected in BCN. Our BCN precisely detects influenced salient object better than others in light reflection and ground reflection, such as the last two rows. ...
Article
As a challenging task for pixel-wise image analysis, salient object detection has made huge progress in recent years. However, there still exists a difficult problem: detection of distinguishing a salient and non-salient object in multiple objects under complex background (e.g. blur, translucent, light reflection, etc.). Our proposed method cast this difficulty as information dissolve problem in deep convolutional network, which is manifested as: first, the model cannot grab whole details of a salient object at training phrase; second, due to the isolation between layers and blocks, the valued information is blocked within a block, which leads to the difficulty in obtaining the position and the edge of salient objects simultaneously; third, the output of the network is a low-resolution saliency map, which cannot accurately express the edge of salient objects. To address information dissolve problems, we construct a Bi-Connect Net (BCN) composed of forward connection subnet and reverse side connection subnet. Besides, the proposed adaptive learning fusion method not only stress all blocks contribution but also combine multiple features with different scale, so that grab more details on the right salient location and precise edges at the same time. Extensive experiments show that our proposed bi-connect net can outperform the state-of-the-art methods.
... Visual tracking has been one of the most fundamental topics in computer vision due to its important roles in numerous applications such as surveillance, humancomputer interaction, and automatic driving [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20]. It aims to estimate the states (e.g., location, scale, rotation) of a target in a video after specifying the target in the first frame usually using a rectangle. ...
Article
Full-text available
Abstract In recent years, deep convolutional neural networks (CNNs) have achieved great success in visual tracking. To learn discriminative representations, most of existing methods utilize information of image region category, namely target or background, and/or of target motion among consecutive frames. Although these methods demonstrated to be effective, they ignore the importance of the ranking relationship among samples, which is able to distinguish one positive sample better than another positive one or not. This is especially crucial for visual tracking because there is only one best target candidate among all positive candidates, which tightly bounds the target. In this paper, we propose to take advantage of the ranking relationship among positive samples to learn more discriminative features so as to distinguish closely similar target candidates. In addition, we also propose to make use of the normalized spatial location information to distinguish spatially neighboring candidates. Extensive experiments on challenging image sequences demonstrate the effectiveness of the proposed algorithm against several state-of-the-art methods.
... Various feature descriptors with effective appearance models have been proposed in numerous literatures [21,22,43,48,51]. Single feature descriptor has been widely used in appearance based visual tracking models [10,20,24,42,44] for their computational convenience. The single feature is easily disturbed by noise, however, can not describe the appearance of the object target clearly. ...
Article
Full-text available
Common tracking algorithms only use a single feature to describe the target appearance, which makes the appearance model easily disturbed by noise. Furthermore, the tracking performance and robustness of these trackers are obviously limited. In this paper, we propose a novel multiple feature fused model into a correlation filter framework for visual tracking to improve the tracking performance and robustness of the tracker. In different tracking scenarios, the response maps generated by the correlation filter framework are different for each feature. Based on these response maps, different features can use an adaptive weight-ing function to eliminate noise interference and maintain their respective advantages. It can enhance the tracking performance and robustness of the tracker efficiently. Meanwhile, the correlation filter framework can provide a fast training and accurate locating mechanism. In addition, we give a simple yet effective scale variation detection method, which can appropriately handle scale variation of the target in the tracking sequences. We evaluate our tracker on OTB2013/OTB50/OBT2015 benchmarks, which are including more than 100 video sequences. Extensive experiments on these benchmark datasets demonstrate that the proposed MFFT tracker performs favorably against the state-of-the-art trackers.
... Graphical representations, which characterize the affinities among data points, have played an important role in machine learning, image processing [1][2][3][4], writer identification [5][6][7], visual tracking [8][9][10][11][12], and especially for clustering problems [13][14][15][16]. For graph-based clustering methods, the graph construction is guided under certain learned or pre-defined pairwise similarities [17,18]. ...
Article
Full-text available
Graph-based methods have been widely applied in clustering problems. The mainstream pipeline for these methods is to build an affinity matrix first, and then use the spectral clustering methods to construct a graph. The existing studies about such a pipeline mainly focus on how to build a good affinity matrix, while the spectral method has only been considered as an end-up step to achieve the clustering tasks. However, the quality of the constructed graph has significant influences on the clustering results. Unlike most of the existing works, our studies in this paper focus on how to refine the original graph to construct a good graph by giving the number of clusters. We show that spectral clustering method has a good property of block structure preserving by giving the priori knowledge about number of clusters. Based on the property, we provide an iterative regularization framework to refine the original graph. The regularization framework is based on a well-designed reproducing kernel Hilbert spaces for vector-valued (RKHSvv) functions, which is in favor of doing kernel tricks on graph reconstruction. The elements in RKHSvv are multiple outputs affinity functions. We show that finding an optimal multiple outputs function is equivalent to construct a graph, and the associated affinity matrix of such a graph can be obtained in a form of multiplication between a kernel matrix and an unknown coefficient matrix.
... Multiple object tracking (MOT) plays an important role in the field of computer vision and video surveillance, and it has attracted more and more attention in recent years [1][2][3][4][5][6][7][8][9][10][11][12]. The task is to find a trajectory with a unique ID for every moving object. ...
Article
Full-text available
This paper proposes a novel network flow model for multi-target tracking, which uses short and highly reliable detection responses as the basic unit, namely the tracklet, in the model. Our model exploits the local information of the tracklet and deploys the global strategy of data association in tracking. The method is divided into two phases: a local phase and a global phase. In the local phase, our method is used to track targets using the detection results, namely the tracking by detection, where the boosted particle filter is used to generate high-confidence detection responses and they are connected into reliable tracklets. In the global phase, the multi-object tracking is modeled as data association problem, and the problem is represented by the maximum posterior probability. Finally, the model is solved by the minimum cost flow algorithm. When dealing with the target occlusion, this paper designs a two-step optimization algorithm to solve the long-term occlusion that affects tracking. A large number of experimental results show that our method is more effective in multi-object tracking than other state-of-the-art methods.
Article
Human action recognition based on 3D data is attracting increasing attention because it can provide richer spatial and temporal information than RGB videos. The challenge for depth-map-based methods is to capture the cues linking spatial appearance and temporal motion. In this paper, we propose a straightforward and efficient framework for modeling human actions from depth map sequences, considering both short-term and long-term dependencies. A frame-level feature, termed the depth-oriented gradient vector (DOGV), is developed to capture appearance and motion over short durations. For long-term dependence, we construct a convolutional neural network (CNN) based backbone to aggregate frame-level features in space and time. The proposed method is comprehensively evaluated on four public benchmark datasets: NTU RGB+D, NTU RGB+D 120, PKU-MMD and UOW LSC. The experimental results demonstrate that the proposed approach can solve the problem of 3D human action recognition efficiently and achieve state-of-the-art performance.
Article
Single-image representation (SIR) matching and cross-image representation (CIR) classification are two significant solutions to the person re-identification (Re-ID) task. Combining the two categories of methods has been regarded as an effective way to improve discriminative performance in Re-ID. Previous combination methods mainly focus on fusing the SIR learning and CIR learning losses, which represents cross-image features only in a limited way and does not transfer to different network structures for feature learning. This paper proposes an efficient joint SIR and CIR learning strategy, the cross-image features fusion strategy (CFFS), which fuses cross-image information through parallel training. Specifically, CFFS consists of one shared sub-network and two branches, one for learning single-image features and the other for fusing cross-image features. CFFS requires pairwise data-parallel training for each identity to learn the CIR, which is applied to metric learning. As a result, cross-image features are fused more effectively and Re-ID performance improves. Experiments on the Market-1501, DukeMTMC-reid, and CUHK03 datasets show that the proposed method achieves favorable accuracy compared with state-of-the-art methods.
Article
Closed-circuit television and sensor-based intelligent surveillance systems have attracted considerable attention in the field of public security affairs. To react in real time to huge volumes of surveillance data, researchers have proposed event-reasoning frameworks for modeling and inferring events of interest. However, these frameworks do not support decision-making, which is very important for surveillance operators. To this end, this paper incorporates a decision-making function into an event-reasoning framework, so that our model can not only perform event reasoning but also predict, rank, and raise alarms on threats according to uncertain information from multiple heterogeneous sources. In particular, we propose a multiattribute decision-making model in which an object being watched is modeled as a multiattribute event, where each attribute corresponds to a specific source, and the information from each source can be used to elicit a local threat degree of different malicious situations with respect to the corresponding attribute. Moreover, to assess the overall threat degree of an object being observed, we also propose a method to fuse the conflicting threat degrees across all the relevant attributes. Finally, we demonstrate the effectiveness of our framework with an airport security surveillance scenario.
Article
With the emergence of a large number of artificial intelligence technologies, deep learning has become the key technology in computer vision. Object tracking is one of the most important technologies in the field. We therefore studied tracking algorithms and propose a method that mainly aims to solve the occlusion problem in complex tracking scenes. The method uses deep-learning-based object detection algorithms to speed up association and improve tracking quality, and it returns the position of the tracked object without supervision. Features are then extracted and stored in a feature library, so that trajectory prediction is more accurate for trajectories whose features match closely, and the associations are more reliable. Experiments show that our tracking algorithm, combined with a detection algorithm based on depthwise separable convolution networks, not only has a smaller and faster model but also achieves robust, real-time tracking in scenes where objects are occluded.
Article
Correlation filters (CF) have demonstrated good performance in visual tracking. However, the base training sample region is larger than the object region and therefore includes an interference region (IR). IRs in the training samples generated by cyclic shifts of the base training sample severely degrade the quality of the tracking model. In this paper, a region-filtering correlation tracking (RFCT) algorithm is proposed to address this problem. In this algorithm, we filter training samples by introducing a spatial map into the standard CF formulation. Compared with existing correlation filter trackers, the proposed tracker has the following advantages. (1) Using a spatial map, the correlation filter can be learned on a larger search region without interference from the IR. (2) Processing training samples with a spatial map is a more general way to control background information and target information in training samples. In addition, better spatial maps can be explored, since the map values are not restricted. Quantitative evaluations are performed on four benchmark datasets: OTB-2013, OTB-2015, VOT2015, and VOT2016. Experimental results demonstrate that the proposed RFCT algorithm performs favorably against several state-of-the-art methods.
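To make the "training samples from cyclic shifts" idea concrete, the sketch below trains a plain one-dimensional ridge-regression correlation filter in the Fourier domain and checks it against the spatial-domain normal equations. It covers only a standard CF baseline, not the spatial map of RFCT; the signal `x`, the labels `y`, the regularizer `lam`, and all function names are illustrative assumptions.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (O(n^2)), enough for a toy signal."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * k * j / n) for j in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse discrete Fourier transform."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]

def shifted_samples(x):
    """Row i is the base sample x cyclically shifted by i positions."""
    n = len(x)
    return [[x[(j - i) % n] for j in range(n)] for i in range(n)]

def train_filter(x, y, lam):
    """Ridge regression over all cyclic shifts of x, solved per frequency.

    With rows X[i][j] = x[(j - i) % n], the closed form is
    w_hat = x_hat * y_hat / (|x_hat|^2 + lam).
    """
    xf, yf = dft(x), dft(y)
    wf = [a * b / (abs(a) ** 2 + lam) for a, b in zip(xf, yf)]
    return [v.real for v in idft(wf)]

def normal_equation_residual(x, y, w, lam):
    """Largest absolute entry of X^T (X w - y) + lam * w (zero at the ridge optimum)."""
    X = shifted_samples(x)
    n = len(x)
    pred = [sum(X[i][j] * w[j] for j in range(n)) for i in range(n)]
    return max(abs(sum(X[i][j] * (pred[i] - y[i]) for i in range(n)) + lam * w[j])
               for j in range(n))

# Toy base sample and desired response, peaked at zero shift (hypothetical values).
x = [0.0, 1.0, 3.0, 1.0, 0.0, 0.5, 0.2, 0.1]
y = [1.0, 0.6, 0.1, 0.0, 0.0, 0.0, 0.1, 0.6]
lam = 0.1
w = train_filter(x, y, lam)
print(normal_equation_residual(x, y, w, lam))
```

Because the sample matrix is circulant, the ridge problem decouples into one scalar problem per frequency. The same structure means every shifted sample, including those dominated by background, contributes to the filter.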
Conference Paper
The problem of visual tracking evaluation is sporting an abundance of performance measures, which are used by various authors, and largely suffers from a lack of consensus about which measures should be preferred. This hampers cross-paper tracker comparison and slows the advancement of the field. In this paper we provide an overview of the popular measures and performance visualizations, together with a critical theoretical and experimental analysis of them. We show that several measures are equivalent from the point of view of the information they provide for tracker comparison and, crucially, that some are more brittle than others. Based on our analysis we narrow down the set of potential measures to only two complementary ones that can be intuitively interpreted and visualized, thus pushing towards homogenization of the tracker evaluation methodology.
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
Article
This paper presents a novel and reliable tracking-by-detection method for image regions that undergo affine transformations such as translation, rotation, scale, dilatation and shear deformations, which span the six degrees of freedom of motion. Our method takes advantage of the intrinsic Lie group structure of the 2D affine motion matrices and imposes this motion structure on a kernelized structured output SVM classifier. The classifier provides an appearance-based prediction function that directly estimates the object transformation between frames using geodesic distances on manifolds, unlike existing methods that proceed by linearizing the motion. We demonstrate that these combined motion and appearance model structures greatly improve tracking performance, while an incorporated particle filter on the motion hypothesis space keeps the computational load feasible. Experimentally, we show that our algorithm is able to outperform state-of-the-art affine trackers in various scenarios.
Article
This paper describes an approach for tracking rigid and articulated objects using a view-based representation. The approach builds on and extends work on eigenspace representations, robust estimation techniques, and parameterized optical flow estimation. First, we note that the least-squares image reconstruction of standard eigenspace techniques has a number of problems and we reformulate the reconstruction problem as one of robust estimation. Second we define a "subspace constancy assumption" that allows us to exploit techniques for parameterized optical flow estimation to solve for both the view of an object and the affine transformation between the eigenspace and the image. To account for large affine transformations between the eigenspace and the image we define a multi-scale eigenspace representation and a coarse-to-fine matching strategy. Finally, we use these techniques to track objects over long image sequences in which the objects simultaneously undergo both affine image motions and changes of view. In particular we use this "EigenTracking" technique to track and recognize the gestures of a moving hand.
Conference Paper
Being embedded in the physical world, wireless sensor networks (WSNs) exhibit a wide range of failures due to environmental conditions, hardware limitations, software uncertainties, and so on. Once a WSN is deployed, its interactivity decreases greatly, which limits the visibility of network performance that managers need to investigate sensor behavior. Existing evidence-based approaches aim to explain particular network symptoms based on expert knowledge and heuristic experience, which degrades diagnosis accuracy and makes them unreliable. These diagnosis models define a limited group of network failures, overemphasize expert knowledge, and thus fail to adapt to different applications. In this work, we propose VN2, a novel tool to enhance the visibility of network performance. VN2 quantifies a node's state in terms of the variation of 43 metrics, and trains a representative matrix of network exceptions with a Non-negative Matrix Factorization (NMF) model. With this matrix, when a new network state arises, VN2 automatically attributes abnormal symptoms to one or more root causes. We implement VN2 on a test bed and on real system traces. Experimental results show that VN2 models network exceptions involving small subsets of root causes, and that the interpretation of root causes helps us understand network behavior in detail.
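The NMF step mentioned in this abstract can be sketched generically as follows. This is not VN2's implementation: it uses the standard multiplicative update rules for the squared-error objective on a small made-up non-negative matrix, where each row of `V` stands in for a node state, the rows of `H` play the role of a representative matrix of patterns, and `W` holds the non-negative weights attributing each state to those patterns. All names and values are assumptions.

```python
import random

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frob_error(V, W, H):
    """Squared Frobenius norm of V - W H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m, non-negative) as W (n x rank) times H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    errors = [frob_error(V, W, H)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
        errors.append(frob_error(V, W, H))
    return W, H, errors

# Toy data (hypothetical): 4 "node states" over 5 "metrics", built from 2 patterns.
V = [
    [1.0, 2.0, 0.0, 1.0, 3.0],
    [2.0, 5.0, 2.0, 3.0, 6.0],
    [0.0, 3.0, 6.0, 3.0, 0.0],
    [1.0, 3.0, 2.0, 2.0, 3.0],
]
W, H, errors = nmf(V, rank=2)
print(errors[0], errors[-1])
```

The updates only ever multiply by non-negative ratios, so `W` and `H` stay non-negative. That property is what makes each row of `W` readable as an additive attribution to a few patterns.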