
Visual Tracking via Exemplar Regression Model

Xiao Maa, Qiao Liub, Xiaohuan Lua, Zhenyu Hea,∗

aSchool of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, China

bSchool of Mathematics and Computer Science, Guizhou Normal University, China

Abstract

Visual tracking remains a challenging problem in computer vision due to the intricate variation of target appearances. Progress made in recent years has revealed that correlation filters, which formulate the tracking process by creating a regressor in the frequency domain, achieve remarkable experimental results on a large number of video tracking sequences. By contrast, building the regressor directly in the spatial domain has been considered a limited approach, since the number of training samples is restricted, and without sufficient training samples the regressor has less discriminability. In this paper, we demonstrate that, by giving a very simple positive-negative prior knowledge for the training samples, the performance of the ridge regression model can be improved by a large margin, even beyond its frequency domain competitors, the correlation filters, on most challenging sequences. In particular, we build a regressor (or a score function) by learning a linear combination of some selected training samples. The selected samples consist of a large number of negative samples and a few positive ones. We constrain the combination such that only the coefficients of positive samples are positive, while all coefficients of negative samples are negative. The coefficients are learnt under a regression setting that makes the outputs fit the overlap ratios of the bounding boxes, where the overlap ratios are measured by calculating the overlaps between the inputs and the position estimated in the last frame. We call this regression exemplar regression because of the novel positive-negative arrangement of the linear combination. In addition, we provide an off-the-shelf method, a non-negative least squares approach, to solve this regression model efficiently. We evaluate our approach on both the standard CVPR2013 benchmark and 50 selected challenging sequences, involving dozens of state-of-the-art trackers and more than 70 sequences in total. In both experiments, our algorithm achieves a promising performance, outperforming the state-of-the-art approaches.

Keywords: visual tracking, kernelized ridge regression, exemplar regression model.

1. Introduction

Given an initial bounding box of a certain target in the first frame, a visual tracker estimates this target's state, e.g. location and scale, in each frame of the image sequence. Visual tracking is a key component in numerous applications, such as vision-based control, visual surveillance, human-computer interfaces, intelligent transportation, and augmented reality. Although significant progress has been made over several decades of visual tracking research, most trackers are still prone to failure in challenging scenarios such as partial occlusion, deformation, motion blur, fast motion, illumination variation, background clutter and scale variation.

Among the achievements in visual tracking research, the discriminative approaches [1,2,3,4,5,6,7,8,9] provide an online mechanism that adapts to appearance variations of the target, and gain better results than their generative rivals [10,11,12,13] on some hard sequences [14]. Traditional discriminative approaches [5,6,7] maintain a classifier trained online to distinguish the target object from its surrounding background. This process can be divided into two stages: searching and updating. During the searching stage, the classifier is used to estimate the target's location in a certain search region around the position estimated in the previous frame, typically using a sliding-window [5,7] or particle filtering approach [10]. In the updating stage, the traditional discriminative approaches generate a set of binary-labelled training samples which are used to update the classifier online.

Although the traditional discriminative approaches gain convincing results on some hard sequences, it is difficult to arrange the binary labels for training samples in the updating stage, because it is difficult to determine a pre-defined threshold and rule (e.g. the Euclidean distance of a sample from the target location estimated in the last frame) to decide whether a sample should be positive or negative.

∗Corresponding author

Email address: zyhe@hitsz.edu.cn (Zhenyu He)

Preprint submitted to Knowledge-Based Systems December 30, 2016

To avoid this label-arrangement confusion, some discriminative approaches [9,8,1,2,4,3] use a regression model instead of the traditional binary classification model during the updating stage. The regression models output a real-valued score for each training sample to fit some pre-defined distribution, such as the bounding box overlap ratios [9,8] or a Euclidean distance [1,2,4,3]. However, the small-sample-size training problem makes it hard for traditional regression models, such as ridge regression, to create a robust regressor.

Progress made in recent years [1,2,4,3] has revealed that solving the ridge regression in the frequency domain can achieve dense sampling, since it avoids the small-sample-size training problem. These approaches are called correlation filters. In this paper, different from correlation filters, we propose a simple approach to handle the small-sample-size training problem directly in the spatial domain. Our approach provides a positive-negative prior knowledge for the training samples. We demonstrate that, by adding such prior knowledge, the performance of the conventional spatial domain ridge regression can be improved by a large margin, even beyond its frequency domain competitors, the correlation filters, on most challenging sequences.

Our ridge regression model computes its score as a linear combination of weighted similarities to some selected samples, called support samples in this paper. Among the weights of these similarities, we constrain that only the support samples at the positions estimated in past frames have positive weights, and the rest of the weights are all negative. This constraint gives us a large number of negative weights in the linear combination, but only a few positive ones. We call the modified ridge regression exemplar regression. We show that the support samples and the weights can be solved by a non-negative least squares method.

The main contributions of this paper are summarized as follows:

•We provide a simple positive-negative constraint method for a common kernelized ridge regression model to construct a robust visual tracker, which we call exemplar regression tracking (ERT). The experiments show that the proposed ERT approach gains state-of-the-art results on the standard CVPR2013 benchmark 1 [14] and other challenging sequences.

1This benchmark was first published at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2013: https://sites.google.com/site/trackerbenchmark/benchmarks/v10


•We provide an easy-to-implement approach to solve the ERT, based on the off-the-shelf non-negative least squares method.

The rest of the paper is organized as follows: Section 2 reviews the related work; Section 3 describes the proposed ERT approach; the implementation details of the proposed approach are introduced in Section 4; in Section 5 we discuss the method; in Section 6 we perform an extensive experimental comparison with state-of-the-art visual trackers; and we draw a conclusion in Section 7.

2. Related Work

We refer the reader to [15,16,17] for detailed visual tracking surveys. In this section, we briefly review the online single-object tracking work most related to ours, especially the regression-based approaches. Visual tracking approaches can be generally categorized as either generative [10,11,18,13,19,20,21,22,23,24] or discriminative [5,6,7,8,9,1,2,3,4,25,26], based on their appearance models.

Generative approaches build an appearance model and then use this model to find, within a certain region of the image frame, the candidate sample with the minimum reconstruction error. Black et al. [20] learn an off-line subspace model to represent the object of interest for tracking. In [22], every image sample is fragmented into several patches, each of which is represented by an intensity histogram and compared to the corresponding patch in the target region using the Earth Mover's Distance. Ross et al. [10] introduce incremental PCA (Principal Component Analysis) to capture the full range of appearances of the target in past frames. Mei et al. [11,27] employ sparse representation to build a dictionary which contains the appearances from past frames, then select the optimal appearance from this dictionary to estimate the target's position. Li et al. [18] further extend the ℓ1 tracking [11] by using the orthogonal matching pursuit algorithm to solve the optimization problems efficiently. In [19], an accelerated version of ℓ1 tracking [11] is proposed based on the accelerated proximal gradient (APG) approach. Jia et al. [13] and Wang et al. [12] introduce patch-based sparse representation to enhance the tracker's robustness. In [21], Kwon et al. combine multiple observation and motion models to handle large appearance and motion variations.


Table 1: Comparison of the proposed exemplar regression with the popular regression-based tracking approach Struck [9] and correlation filters (CFs) [1,4,3].

Regression model | Searching strategy | Regressor | Updating algorithm
Struck [9] | Sliding window | $f(x) = \sum_i \omega_i k(x_i, x)$ | LaRank [28]
Exemplar regression | Particle filtering | $f(x) = \frac{1}{\|x\|}\sum_i \omega_i \langle x_i, x \rangle$ | Non-negative least squares
Correlation filters [4,3] | Convolution | $f(X) = \mathcal{F}^{-1}\!\left(\frac{\mathcal{F}(X)(\mathcal{F}(X_o))^*}{\mathcal{F}(X_o)(\mathcal{F}(X_o))^* + \lambda}\right)$ | Linear ridge regression

The traditional discriminative approaches pose the tracking problem as a binary classification task with local search, and determine the decision boundary that separates the target object from the background. Collins et al. [26] provide an approach which learns discriminative features online to separate the target object from the background. In [25], an online boosting algorithm is proposed to select features for tracking. Babenko et al. [7] introduce the multiple instance learning approach to decide the labels of the training samples, and integrate an online boosting approach to select Haar-like features for tracking. A semi-supervised learning approach [6] is proposed in which positive and negative samples are selected via an online classifier with structural constraints. Zhang et al. [5] employ random projections of a very high-dimensional feature space to represent the target's appearance in a compressed form, and a binary classifier is used to find the optimal image sample in this compressed domain.

One of the typical regression-based trackers is Struck [9], which builds a regression model upon the Structural Support Vector Machine (SSVM) framework and embeds the LaRank [28] algorithm into the updating stage. In [9], the regressor is formulated as a weighted linear combination of a set of support vectors. These support vectors may come from different frames during the tracking process; therefore, Struck has a good ability to adapt to appearance variations. Struck [9] uses Haar-like features and a sliding-window searching strategy, which makes it difficult to handle scale variations of the target. In [8], the authors integrate Lie group theory to improve Struck's scale-aware ability.

Another popular regression model is Correlation Filters (CFs) [1,2,4,3]. Within a few years, CFs have proved themselves competitive with far more complicated approaches. The use of the Fast Fourier Transform (FFT) makes them run at a very high speed online (hundreds of frames per second). CFs take advantage of the convolution theorem: the convolution of two image samples in the spatial domain is equivalent to an element-wise product in the Fourier domain. By formulating the CF objective function in the Fourier domain, dense sampling can be achieved during the updating and searching stages without excessive computational cost.

A comparison of the paradigms of the exemplar regression model, Struck and CFs is shown in Table 1. The original Struck [9] applies a sliding window to find the optimal sample. The regressor in Struck is a linear combination of weighted kernel functions between the support samples and the test sample, which is very similar to ours. Struck introduces LaRank [28] to construct its regressor. CFs use the convolution between the constructed regressor (or filter) and the search region's image patch to find the optimal position. However, a single correlation filter cannot handle the scale variation of the target. To improve the CFs' scale-aware ability, DSST [3] creates another filter using the target's features at different scales to estimate the target's scale. CFs utilize ridge regression in the Fourier domain to construct the filter.

The formulation of our regressor is similar to Struck [9], since both consider support samples from past frames and assign positive weights only to samples at the estimated positions. However, Struck introduces the complicated LaRank [28] to choose which samples should be added to the regressor's linear combination and how large their weights should be. Instead of LaRank, we suggest a more effective and easy-to-implement approach to perform this selection and weighting: non-negative least squares.

We borrow the concept of exemplar from Exemplar-SVM [29]. For the detection problem, Exemplar-SVM trains a separate linear SVM classifier for every exemplar in the training set, where each Exemplar-SVM is defined by a single positive instance and millions of negatives.

3. Exemplar Regression Tracking

As a discriminative approach, our exemplar regression tracker has a searching stage and an updating stage during the tracking process. In the searching stage (Section 3.1), we analyze the probability of the observation model in the Bayesian tracking framework. The observation model is provided by the updating stage, and takes the form of a linear combination of weighted similarities between the chosen support samples and the test sample. The formulation of this exemplar regression model, and how to choose the support samples and the weights of the similarities, are introduced in Sections 3.2 and 3.3.

3.1. The searching stage

Visual tracking can be cast as a sequential Bayesian inference problem [10]. Given a set of observed image patches $\mathcal{I}_t$ up to the $t$-th frame, we aim to estimate the value of the state variable $Z_t \in \mathbb{R}^4$, which describes the bounding box of the observed image patch. The true posterior state distribution $Pr(Z_t \mid \mathcal{I}_t)$ is approximated by a set of $N_s$ samples, called tracking candidates, $\mathcal{Z}_t = \{Z_t^1, Z_t^2, \cdots, Z_t^{N_s}\}$, and the optimal state is estimated by the MAP (Maximum A Posteriori) formulation:

$$\hat{Z}_t = \operatorname*{argmax}_{Z_t^i} Pr(Z_t^i \mid \mathcal{I}_t), \qquad (1)$$

where $Z_t^i$ denotes the state variable of the $i$-th candidate sample at the $t$-th frame. Using Bayesian theory, the posterior probability $Pr(Z_t^i \mid \mathcal{I}_t)$ is inferred by:

$$Pr(Z_t^i \mid \mathcal{I}_t) \propto Pr(I_t^i \mid Z_t^i) \int Pr(Z_t^i \mid Z_{t-1}^i)\, Pr(Z_{t-1}^i \mid \mathcal{I}_{t-1})\, dZ_{t-1}^i, \qquad (2)$$

where $Pr(Z_t^i \mid Z_{t-1}^i)$ is the dynamic model and $I_t^i$ is the $i$-th observed image patch at the $t$-th frame. We use the same dynamic model as described in [30,10]. The observation model $Pr(I_t^i \mid Z_t^i)$ is determined by our proposed exemplar regression model.

Given an observed image patch $I_t^i$, we use $x_t^i \in \mathbb{R}^d$ to represent the features extracted from $I_t^i$, where $d$ is the dimension of this feature vector. In the rest of this paper, $x_t^i$ is the sample in state $Z_t^i$. We introduce a function of $x_t^i$ to approximate $Pr(I_t^i \mid Z_t^i)$:

$$Pr(I_t^i \mid Z_t^i) \propto f(x_t^i). \qquad (3)$$

The function $f(\cdot)$ helps us find, among the observed image patches $\mathcal{I}_t$, the one that is most likely to be the target, and gives this optimal one the highest output. Therefore, the searching stage amounts to two steps: 1) apply the particle filtering strategy to generate $N_s$ state variables $\mathcal{Z}_t$ and their corresponding samples; 2) use $f(\cdot)$ to find the optimal state variable.
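To make the searching stage concrete, the following is a minimal Python sketch of the MAP search under a particle-filtering strategy. The helper names (feature_fn, score_fn) and the Gaussian perturbation magnitudes are illustrative assumptions, not the paper's exact implementation; the paper's actual dynamic model follows [30,10].

```python
import numpy as np

def search_stage(frame, prev_state, feature_fn, score_fn, n_samples=600,
                 rng=np.random.default_rng()):
    """One searching stage (Section 3.1): approximate the posterior with N_s
    particles around the previous state, then return the MAP estimate (Eq. 1),
    using the regressor output as the observation model (Eq. 3).

    prev_state : np.ndarray [cx, cy, w, h], the estimated box of frame t-1
    feature_fn : maps (frame, state) -> feature vector x in R^d (e.g. PCA-HOG)
    score_fn   : the regressor f(.) of Eq. 8 / Eq. 13
    """
    # Gaussian dynamic model over translation and scale; the standard
    # deviations below are illustrative, not the paper's actual values.
    noise = rng.normal(size=(n_samples, 4)) * np.array([8.0, 8.0, 0.5, 0.5])
    candidates = prev_state[None, :] + noise
    # Score every candidate with f(.) and keep the best one (Eq. 1).
    scores = np.array([score_fn(feature_fn(frame, z)) for z in candidates])
    return candidates[np.argmax(scores)]
```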


3.2. The exemplar regression model

For a common kernelized ridge regression model, we want to find a linear function (the regressor) in a certain projected Hilbert space $\mathcal{S}$:

$$f(x) = \langle W, \Phi(x) \rangle_k, \qquad (4)$$

where $W$ is the unknown coefficient, $\Phi(\cdot): \mathbb{R}^d \to \mathcal{S}$ maps the $d$-dimensional feature vector to the Hilbert space $\mathcal{S}$, and $\langle \cdot, \cdot \rangle_k$ denotes the inner product in $\mathcal{S}$. We use the following objective function to solve for $W$ in the ridge regression setting:

$$\sum_i \| \langle W, \Phi(x_i) \rangle_k - y_i \|_2^2 + \frac{\lambda}{2} \langle W, W \rangle_k, \qquad (5)$$

where $\{y_i\}, y_i \in \mathbb{R}$, and $\{x_i\}$ form the training set, and $\lambda$ is the regularization parameter.

From the representer theorem [31], the regressor $f$ can be constructed as a linear combination of weighted kernel function values between the support samples and the test sample [32]:

$$f(x) = \sum_{i=1}^{N} \omega_i\, \mathrm{Ker}(x_i, x), \qquad (6)$$

where $x_i$ denotes the support samples. We call $\mathcal{X} = \{x_1, x_2, \cdots, x_N\}$ the support sample set, and $\omega_i$ represents the weight of the $i$-th kernel function value. In this paper, we choose the normalized inner product in $\mathbb{R}^d$ as $\mathrm{Ker}(\cdot,\cdot)$:

$$f(x) = \sum_{i=1}^{N} \omega_i \frac{\langle x_i, x \rangle}{\|x_i\|_2 \|x\|_2}. \qquad (7)$$

Suppose all the support samples are normalized, i.e. $\|x_i\|_2 = 1$; then we have:

$$f(x) = \frac{1}{\|x\|_2} \sum_{i=1}^{N} \omega_i \langle x_i, x \rangle. \qquad (8)$$
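Once the support samples are stacked as rows of a matrix, Eq. 8 reduces to a single matrix-vector product; a minimal numpy sketch (the variable names are ours):

```python
import numpy as np

def regressor_score(x, support_X, omega):
    """Evaluate f(x) of Eq. 8. support_X is an N x d matrix whose rows are
    the unit-norm support samples x_i; omega is the signed weight vector."""
    x = np.asarray(x, dtype=float)
    # (1 / ||x||) * sum_i omega_i <x_i, x>
    return (support_X @ x) @ omega / np.linalg.norm(x)
```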

During the $t$-th frame's updating stage, we draw $N_u$ samples around the estimated state variable $\hat{Z}_t$: $X_t = [x_t^1, x_t^2, \cdots, x_t^{N_u}] \in \mathbb{R}^{d \times N_u}$, and we call $X_t$ the training sample set. We define a score value for each sample in $X_t$ using the following function:

$$s(x_t^i) = \mathrm{overlap}(\hat{Z}_t, Z_t^i), \qquad (9)$$

where the $\mathrm{overlap}(\cdot,\cdot)$ function calculates the overlap between the bounding boxes of two state variables, and $Z_t^i$ is the state variable of $x_t^i$. $s(\cdot)$ gives a higher output to samples with a larger overlap with the estimated state $\hat{Z}_t$, and a lower output to those with smaller overlaps. Using $s(\cdot)$, we can generate the outputs for $X_t$ in a regression model: $Y_t = [y_t^1, y_t^2, \cdots, y_t^{N_u}]^T = [s(x_t^1), s(x_t^2), \cdots, s(x_t^{N_u})]^T \in \mathbb{R}^{N_u}$.
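Concretely, the overlap of Eq. 9 is the intersection-over-union of the two bounding boxes. A small sketch of generating the regression targets $Y_t$, under the assumption that boxes are given as [x, y, w, h]:

```python
import numpy as np

def overlap(a, b):
    """VOC-style overlap ratio (intersection over union) of two boxes
    given as [x, y, w, h]."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def regression_targets(z_hat, training_states):
    """Y_t of Eq. 9: one overlap score per training sample state Z_t^i."""
    return np.array([overlap(z_hat, z) for z in training_states])
```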

We define a loss function for the regressor $f(\cdot)$:

$$E(f(X_t), Y_t) = \sum_{i=1}^{N_u} \big( y_t^i - f(x_t^i) \big)^2. \qquad (10)$$

Substituting Eq. 8 into Eq. 10 and adding the exemplar constraint, we obtain the exemplar regression model's objective function:

$$E(f(X_t), Y_t) = \sum_{i=1}^{N_u} \Big( y_t^i - \frac{1}{\|x_t^i\|_2} \sum_{j=1}^{N} \omega_j \langle x_j, x_t^i \rangle \Big)^2 \quad \text{s.t.} \quad \forall j:\; \delta(x_j)\,\omega_j > 0, \qquad (11)$$

where

$$\delta(x) = \begin{cases} 1 & x \text{ is a positive support sample} \\ -1 & \text{otherwise} \end{cases} \qquad (12)$$

is the indicator that implements the exemplar constraint. Note that, for each frame, only the sample with the estimated state variable is positive, and the rest of the samples are negative.

3.3. Solving the exemplar regression using non-negative least squares

In the updating stage, we want to add or remove support samples from the candidate sample set, which consists of the old support sample set $\mathcal{X}$ and the normalized training sample set $\bar{X}_t$: $\mathcal{X}_t = \{\mathcal{X}, \bar{X}_t\} = \{x_1, x_2, \cdots, x_N, \bar{x}_t^1, \bar{x}_t^2, \cdots, \bar{x}_t^{N_u}\}$, where the first $N$ samples are the old normalized support samples of the un-updated regressor, and the last $N_u$ samples come from $X_t$ and are also normalized. This arrangement reveals that the updating process reselects the old support samples, adds new support samples from the training samples, and re-allocates the corresponding weights. This formulation helps the regression model adapt smoothly to the target's appearance variations. For every candidate sample $x_i \; (i = 1, 2, \cdots, N+N_u)$ in $\mathcal{X}_t$, we define a weight $\omega_i$ and a label indicating whether it is positive or not. Note that the labels of the first $N$ samples are determined by the un-updated regressor, and among the last $N_u$ samples, only $\bar{x}_t^1$ is positive, while the rest are all negative.

For now, the new regressor is represented by:

$$f(x) = \frac{1}{\|x\|_2} \sum_{i=1}^{N+N_u} \delta(x_i)\, |\omega_i|\, \langle x_i, x \rangle. \qquad (13)$$

Note that we want to use the training sample set $X_t$ to learn an exemplar regression model; that is, we want to find a set of support samples in the candidate sample set $\mathcal{X}_t$ and their corresponding weights that minimize Eq. 11.

For every sample $x_t^i$ in the training sample set $X_t$, we can obtain a loss between the regressor output and the required one $y_t^i$:

$$E(f(x_t^i), y_t^i) = \Big( y_t^i - \frac{1}{\|x_t^i\|_2} \sum_{j=1}^{N+N_u} \delta(x_j)\,|\omega_j|\, \langle x_j, x_t^i \rangle \Big)^2, \qquad (14)$$

where $x_t^i$ can be represented by the normalized $\bar{x}_t^i$, which is $x_{N+i}$ in the candidate sample set $\mathcal{X}_t$. Eq. 14 can then be simplified to:

$$E(f(x_t^i), y_t^i) = \Big( y_t^i - \sum_{j=1}^{N+N_u} \delta(x_j)\,|\omega_j|\, \langle x_j, x_{N+i} \rangle \Big)^2. \qquad (15)$$

The overall loss of the exemplar regression model on the training set $X_t$ can be calculated by:

$$E(f(X_t), Y_t) = \sum_{i=1}^{N_u} \Big( y_t^i - \sum_{j=1}^{N+N_u} \delta(x_j)\,|\omega_j|\, \langle x_j, x_{N+i} \rangle \Big)^2. \qquad (16)$$

Let:

$$K_{j, N+i} = \delta(x_j)\, \langle x_j, x_{N+i} \rangle, \qquad (17)$$


Figure 1: Overview of the proposed exemplar regression tracking (ERT) approach: the regressor is a linear combination of weighted inner products between the support samples and the test sample. This combination is formulated in two steps: 1) construct an inner-product matrix $K$ using the samples from the current frame and past frames, with the signs of those inner products constrained by the pre-defined positive and negative settings ($\delta(X)$); 2) solve a non-negative least squares problem to find the weights $\Omega$.

then we can construct an inner-product matrix $K$:

$$K = \begin{bmatrix} K_{1,N+1} & K_{2,N+1} & \cdots & K_{N+N_u,\,N+1} \\ K_{1,N+2} & K_{2,N+2} & \cdots & K_{N+N_u,\,N+2} \\ \vdots & \vdots & & \vdots \\ K_{1,N+N_u} & K_{2,N+N_u} & \cdots & K_{N+N_u,\,N+N_u} \end{bmatrix}. \qquad (18)$$

We can write Eq. 16 more compactly:

$$E(f(X_t), Y_t) = \| Y_t - K\Omega \|_2^2, \qquad (19)$$

where:

$$\Omega = \big[\, |\omega_1|, |\omega_2|, \cdots, |\omega_{N+N_u}| \,\big]^T. \qquad (20)$$

Since $\Omega \succeq 0$, the solution of Eq. 19 can be obtained by the standard non-negative least squares (NNLS) method [33]. An overview of the proposed method is shown in Fig. 1.
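Since every entry of $\Omega$ is non-negative, Eq. 19 is a textbook NNLS problem. The paper's implementation uses Matlab's lsqnonneg(); as a minimal sketch of the same step, scipy.optimize.nnls is the equivalent solver in Python (the variable names and matrix layout below are our own):

```python
import numpy as np
from scipy.optimize import nnls

def solve_exemplar_regression(C, delta, Y):
    """Solve Eq. 19 for the weight magnitudes |omega_i|.

    C     : (N + N_u) x d matrix of normalized candidate samples (rows)
    delta : length N + N_u vector of +1 / -1 labels (Eq. 12)
    Y     : length N_u vector of overlap targets (Eq. 9)
    """
    n_u = len(Y)
    # K[i, j] = delta(x_j) <x_j, x_{N+i}> (Eq. 17/18): rows index the N_u
    # training samples, columns index all N + N_u candidates.
    K = (C[-n_u:] @ C.T) * delta[None, :]
    omega_abs, _ = nnls(K, Y)   # min ||Y - K w||_2  s.t.  w >= 0 (Eq. 19)
    return omega_abs            # signed weights are delta * omega_abs (Eq. 13)
```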


3.4. The updating stage

In the updating stage of the tracking process, we draw $N_u$ training samples to update the old regressor using the approach provided above. First, we use the training samples and the old support samples to construct the inner-product matrix via Eq. 18. Then a standard NNLS approach is applied to solve Eq. 19, which gives us a vector of weights (Eq. 20). We retain the samples with large weights, discard the samples with small weights, and take the retained samples to be the new support samples of the regressor.
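Continuing the NNLS sketch from Section 3.3, the support-sample selection then reduces to thresholding the recovered magnitudes; eps stands in for the selection threshold of Algorithm 1, and its value here is an arbitrary assumption:

```python
def update_support_set(C, delta, omega_abs, eps=1e-3):
    """Keep only the candidates whose NNLS weight magnitude exceeds eps; they
    become the new support samples of the regressor (Section 3.4)."""
    keep = omega_abs > eps
    # New support samples, their labels, and the signed weights delta_i * |omega_i|.
    return C[keep], delta[keep], delta[keep] * omega_abs[keep]
```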

4. Implementation details

During the tracking process, the image patches are all resized to 32×32. In the searching stage, the number of observed image patches $N_s$ is set to 600, of which 500 samples are used to estimate translations, and 100 samples are arranged to estimate the scale and stretch transforms. This means that, for the accuracy of scale estimation, after the translation of the target is estimated using the 500 samples, we draw another 100 samples around this estimated position to obtain the scale transforms. We use PCA-HOG (Histogram of Oriented Gradients) features [34] to describe the observed image patches, with the cell size set to 4 and the number of orientation bins set to 9, which gives a 2048-dimensional feature vector for each image patch.

In the updating stage, we obtain the training sample set $X_t$ on a polar grid; like Struck [9], we set the radial divisions to 5 and the angular divisions to 16, giving us 81 training samples. The candidate sample set $\mathcal{X}_t$ is constructed as in Fig. 2, and consists of the 81 training samples and the $N$ old support samples from the un-updated regressor. These samples are arranged in different parts of $\mathcal{X}_t$ according to whether they are positive or not. This arrangement helps us assign the labels for the inner-product matrix $K$. We use Matlab's built-in function lsqnonneg() to solve Eq. 19. A summary of this ERT method is given in Algorithm 1.

4.1. Lazy update strategies

The exemplar model of the visual tracker can be updated frame-by-frame over the video sequence to adapt to the target's changing appearance. However, this continuous update strategy makes the support sample set of the model grow huge and the tracking process become very slow.


Algorithm 1 The exemplar regression tracking
1: Inputs: $Z_0$: the initial state variable; $N_u$: the number of samples during the updating stage; $N_s$: the number of samples during the searching stage; $\epsilon$: the threshold for selecting the support samples.
2: In every frame, we maintain the regressor of Eq. 13 involving $N$ support samples.
3: Outputs: the estimated state variable $\hat{Z}_t$ in every frame.
4: while the video sequence is not ended do
5:   Draw $N_s$ samples using particle filtering, and use Eq. 13 to find the optimal state variable;
6:   Construct the inner-product matrix $K$ as described in Section 3.3;
7:   Find the weights of the samples in $\mathcal{X}_t$ by solving Eq. 19;
8:   Select the support samples by removing every sample whose weight is less than $\epsilon$;
9:   Update the regressor using the new support samples $\{x_i\}$ and the corresponding weights $\{\omega_i\}$.
10: end while

Moreover, for most video sequences, the appearance of the target does not undergo large variations within a small number of consecutive frames. Therefore, we provide three lazy update strategies to decide dynamically when to update the model:

4.1.1. Checking the positive samples

This strategy checks the similarity of the current frame's tracking result with the positive samples of the support sample set. If at least one positive sample yields a similarity below a threshold, the update process is executed. In other words, this strategy triggers a model update only when the appearance has changed noticeably.

4.1.2. Checking the contexts

The contexts are the image patches within the target's bounding box enlarged by a certain padding ratio, which include the target and its surroundings. This strategy calculates the similarity of the contexts in two consecutive frames, and avoids repetitive updating when the target is motionless.

13

Figure 2: Construction of the candidate sample set $\mathcal{X}_t$.

4.1.3. Checking the variance of test samples

This strategy computes the variance of the regressor's outputs on the test samples, and compares it with a given threshold to decide when to update the model.
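As a sketch of how the three triggers could be combined in code: the normalized inner-product similarity, the thresholds, and the order in which the checks are applied are all our assumptions, since the paper does not specify how the strategies are composed:

```python
import numpy as np

def cosine(a, b):
    """Normalized inner-product similarity between two feature vectors."""
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def should_update(result_x, positive_X, prev_ctx, cur_ctx, test_scores,
                  tau_pos=0.8, tau_ctx=0.95, tau_var=1e-3):
    # 4.1.1 Checking the positive samples: trigger an update when the current
    # tracking result has drifted away from some positive support sample.
    if any(cosine(result_x, p) < tau_pos for p in positive_X):
        return True
    # 4.1.2 Checking the contexts: skip updating when the padded context patch
    # barely changed between two consecutive frames (target motionless).
    if cosine(prev_ctx, cur_ctx) > tau_ctx:
        return False
    # 4.1.3 Checking the variance of test samples: only update when the
    # regressor outputs on the test samples vary enough to be informative.
    return np.var(test_scores) > tau_var
```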

5. Discussion

In this section, we present some discussion of our exemplar regression model. We explain the model from the template-construction perspective; we believe this explanation complements the formulations provided above.

5.1. The template-construction perspective

A visual tracking problem can be formulated as a template matching problem. The templates are image samples that represent the target's appearance. They are used to calculate similarities with the test image samples in the new frame, and the test sample with the greatest similarity is chosen as the tracking result. The templates can consist of a single image or multiple images. Our approach, Struck [9] and correlation filters [4,1,3] are typical single-template models. Sparse-coding-based approaches [11,13] preserve a set of templates and choose the most representative template for the matching process.


In every frame, our exemplar regression model generates a set of image samples around the estimated position. This process gives us a large number of samples: the positive samples describe the target's appearance, while the negative samples contain the background information. A naive approach to creating the single template is to cast all the samples as training samples into a regression setting, such as ridge regression or support vector regression (SVR). However, many of the samples are superfluous, and this redundancy makes the construction of the regression model nearly infeasible for online visual tracking. Moreover, the template created by the naive approach has weak discriminative ability, because the number of samples containing background information is much larger than the number of target samples.

To understand this, we must clarify what the template is in the regression setting. Consider a linear regression model's score function:

$$y = \omega^T x + b, \qquad (21)$$

where $\omega$ can be solved by ridge regression or support vector regression, depending on the loss term in the objective function. Eq. 21 can also be simply represented as:

$$y = \mathrm{sim}(\omega, x) + b = \langle \omega, x \rangle + b, \qquad (22)$$

which means that the score is determined by the similarity between a certain template $\omega$ and the test sample $x$. Note that this template interpretation has already been used to visualize convolutional networks [35] and in the context of Bayesian classification [36]. From the representer theorem [31], we know that the template $\omega$ is spanned by the training samples:

$$\omega = \sum_i \alpha_i x_i, \qquad (23)$$

where $x_i$ is a training sample and $\alpha_i$ is the corresponding coefficient. Since a large number of training samples contain background information, if we do not constrain those coefficients, some coefficients of background image samples can be positive, which makes the constructed template $\omega$ weakly discriminative. Recall that the exemplar constraint of our approach gives positive coefficients to the positive samples (the estimated target samples) and negative coefficients to the negative samples (which contain background information).
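As a tiny synthetic illustration of Eq. 23 and the sign constraint: when background samples are forced to enter the combination with negative coefficients, background structure is subtracted from the template $\omega$ instead of being added to it (the data and coefficient magnitudes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=64)              # the single positive sample
background = rng.normal(size=(50, 64))    # many negative samples

# Eq. 23 with the exemplar constraint: a positive coefficient for the target
# sample, negative coefficients for all background samples.
alpha_pos, alpha_neg = 1.0, 0.02
omega = alpha_pos * target - alpha_neg * background.sum(axis=0)

# The constrained template scores the target well above the typical
# background sample (Eq. 22), i.e. it stays discriminative.
print(omega @ target, (background @ omega).mean())
```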


Figure 3: For each frame of the given image sequence, we obtain a set of image samples consisting of only one positive sample and many negative samples. This gives us a large number of samples. Since many of them are redundant, we provide an efficient sample selection approach to filter out an important sample set, and use this sample set to span the discriminative template.

This constraint enhances the discriminative ability of the template $\omega$ because it penalizes the influence of the background information within the negative samples. Meanwhile, it strengthens the effect of the positive samples in the linear combination of the template $\omega$. The sample selection effect of the exemplar regression model is also worth noting. Since we discard the samples with small weights during the updating stage (see Section 3.4), the sample selection preserves the more representative samples. This sample selection helps the tracker decide which of the samples generated frame-by-frame are important for representing the target's appearance, and avoids the large-sample-set problem of the regression model. See Fig. 3 for a summary of the template-construction perspective.

6. Experiments

We evaluate our exemplar regression tracking approach in two experiments. In the first experiment, we evaluate our approach on a large benchmark [14] that contains 50 videos, with comparisons to state-of-the-art trackers. The overall experimental results are illustrated by both precision plots and success plots.


Figure 4: Distance precision and overlap success plots over the standard 50 benchmark sequences [14] using OPE, TRE and SRE. The legend contains the area-under-the-curve score for the top-10 trackers.

Results for the eleven challenging attributes of fast motion, background clutter, motion blur, deformation, illumination variation, in-plane rotation, low resolution, occlusion, out-of-view, out-of-plane rotation and scale variation are also provided. In the second experiment, to further demonstrate our approach's robustness, we select another 50 challenging sequences and compare our method with the state-of-the-art regression-based trackers: Struck [9], KCF [4] and DSST [3]. The results of this experiment are also illustrated by both precision plots and success plots. In addition, a detailed analysis of some typical sequences is provided.

6.1. Experimental settings

Our ERT tracker is implemented in Matlab and runs at approximately 4 frames per second on an Intel Core i5 3.30GHz CPU with 8GB RAM. Note that for all of the experiments, we use the parameters provided in Section 4, and all of them are fixed.


Figure 5: Overlap success plots for the eleven challenges of fast motion, background clutter, motion blur, deformation, illumination variation, in-plane rotation, low resolution, occlusion, out-of-view, out-of-plane rotation and scale variation on the standard CVPR2013 benchmark. The legend contains the AUC score for the top-10 trackers. Our method performs favorably against the state-of-the-art trackers.


6.2. Evaluation criteria

In this section, we introduce the evaluation methodology used in the two experiments. We use precision and success rate for quantitative evaluation.

6.2.1. Precision plot

The first evaluation metric is the CLE (Center Location Error), defined as the average Euclidean distance between the center locations of the tracked targets and the manually labeled ground truth. The average center location error over all the frames of one sequence is used to evaluate the overall performance on that sequence. We use the score at a CLE of 20 pixels as the representative precision score for each tracker.

6.2.2. Success plot

Another evaluation metric is the bounding box overlap. We use the typical Pascal VOR (VOC Overlap Ratio) criterion [37]. Given the bounding box $B_R$ of the result and the bounding box $B_G$ of the ground truth, the VOR is computed as $\mathrm{VOR} = \frac{|B_R \cap B_G|}{|B_R \cup B_G|}$, where $\cap$ and $\cup$ represent the intersection and union of two regions, respectively, and $|\cdot|$ denotes the number of pixels in a region. To measure the performance on a given sequence, we count the number of successful frames whose VOR is larger than a given threshold $t_o$. The success plot shows the ratio of successful frames as the threshold varies from 0 to 1. We use the AUC (Area Under Curve) of each success plot to rank the compared tracking approaches.
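Both metrics are straightforward to compute from per-frame results; a small sketch, under the assumption that the centers are given as (x, y) pairs and the per-frame VOR values have already been computed (e.g. with the overlap() helper of Section 3.2):

```python
import numpy as np

def precision_at(result_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose center location error (CLE) is within
    `threshold` pixels; the value at 20 px is the reported precision score."""
    cle = np.linalg.norm(np.asarray(result_centers) - np.asarray(gt_centers),
                         axis=1)
    return float(np.mean(cle <= threshold))

def success_auc(vor_per_frame, num_thresholds=101):
    """Area under the success plot: success rate as the VOR threshold t_o
    sweeps from 0 to 1."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    rates = [float(np.mean(np.asarray(vor_per_frame) > t)) for t in thresholds]
    return float(np.mean(rates))
```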

6.2.3. Precision plot on single challenging sequence

To compare the tracking approaches on some typical challenging sequences, we plot the tracking result of each tracker on every frame of the sequence.

To evaluate the robustness of the visual tracking approaches, we use three different evaluations:

•OPE (One Pass Evaluation): this evaluation runs the compared trackers throughout a test sequence, initialized with the ground truth position in the first frame;

•SRE (Spatial Robustness Evaluation): to evaluate whether a tracking method is sensitive to initialization errors, this evaluation generates the object states by slightly shifting or scaling the ground truth bounding box of a target object. In our experiments, we use eight spatial shifts (four center shifts and four corner shifts) and four scale variations, according to the benchmark [14]. We run the SRE evaluation 12 times, and the final SRE scores are averaged to rank the compared tracking approaches;

•TRE (Temporal Robustness Evaluation): in this evaluation, each compared tracking approach is evaluated numerous times from different starting frames across an image sequence. In each test, an algorithm is evaluated from a particular starting frame, initialized with the corresponding ground truth object state, until the end of the image sequence. In our experiments, we run TRE 20 times and use the averaged results to generate the TRE scores.

6.3. Experiment on CVPR2013 benchmark

In this section, we give the detailed experimental results on the standard CVPR2013 benchmark [14]. We compare our exemplar regression tracking approach with 29 trackers: CPF [38], LOT [39], IVT [10], ASLA [13], SCM [40], L1APG [19], MTT [41], VTD [21], VTS [42], LSK [43], ORIA [44], DFT [45], KMS [46], SMS [47], VR-V [26], Frag [22], OAB [25], SemiT [48], BSBT [49], MIL [7], CT [5], TLD [6], Struck [9], CSK [2] and CXT [50]. These trackers cover the classical discriminative and generative visual tracking approaches, and some of them achieve state-of-the-art tracking results.

The CVPR2013 benchmark provides 50 sequences. To evaluate the trackers' robustness, these sequences are annotated with 11 challenging attributes: fast motion (FM), background clutter (BC), motion blur (MB), deformation (DEF), illumination variation (IV), in-plane rotation (IPR), low resolution (LR), occlusion (OCC), out-of-view (OV), out-of-plane rotation (OPR) and scale variation (SV). Each sequence carries several attributes. Among the 50 sequences, there are 39 sequences with the OPR attribute, 31 with IPR, 29 with OCC, 28 with SV, 25 with IV, 21 with BC, 19 with DEF, 17 with FM, 12 with MB, 6 with OV and 4 with LR. In Fig. 5, we give the detailed results for each of the 11 challenging attributes. Our approach achieves the best results among the 29 classical visual trackers on all attributes except OV.


Figure 6: Comparison of our method with the state-of-the-art regression-based visual trackers Struck [9], DSST [3] and KCF [4] on the selected 50 sequences using OPE, SRE and TRE. Our method outperforms the other trackers on all three evaluations, and by nearly 10% on OPE.

In Fig. 4, we give the overall OPE, TRE and SRE results on the benchmark sequences. Our approach outperforms the second-best approach, Struck [9], by a large margin. The average CLE and VOR results of our approach and 7 visual trackers are presented in Tab. 2.

Table 2: The average comparison of our approach with 7 trackers on the 50 sequences of the CVPR2013 benchmark, in CLE at a threshold of 20 pixels and VOR at a threshold of 0.5. The value format of each table cell is "CLE/VOR". The best result is highlighted in bold red and the second best in bold blue.

 | Ours | Struck[9] | MIL[7] | ASLA[13] | TLD[6] | CSK[2] | SCM[40] | L1APG[19]
CLE/VOR | 0.73/0.66 | 0.66/0.56 | 0.47/0.37 | 0.53/0.51 | 0.61/0.52 | 0.54/0.44 | 0.65/0.62 | 0.48/0.44

6.4. Experiment on the 50 challenging sequences

To further demonstrate our approach's robustness in comparison with the state-of-the-art regression trackers Struck [9], KCF [4] and DSST [3], we run our approach and these three trackers on the


Figure 7: Tracking results of four kinds of regression-based trackers on 16 challenging sequences: our approach, structured output SVM (Struck [9]), and correlation filters (KCF [4] and DSST [3]). From left to right and top to bottom: Human6, Human3, Human5, BlurBody, Lemming, Singer2, Car1, Couple, Freeman1, Freeman3, Human7, Human9, Trellis, David3Outdoor, CarScale, Jumping.

selected 50 challenging sequences: BlurBody, BlurCar1, BlurCar4, BlurFace, Board, Box, Boy,

Car1, Car4, CarDark, CarScale, Coke, Couple, Crossing, Crowds, David, David2, David3Outdoor,

Deer, Dog, Dog1, Doll, Dudek, FaceOcc1, FaceOcc2, Fish, FleetFace, Football1, Freeman1,

Freeman3, Human3, Human4, Human5, Human6, Human7, Human9, Jumping, Lemming, Man,

Singer1, Skating1, Subway, Surfer, Sylvester, Tiger1, Tiger2, Trellis, Walking, Walking2 and

Woman.

The overall experimental results are shown in Fig. 6, where our method outperforms the three trackers by nearly 10% on OPE. The average comparisons are presented in Tab. 3.

We compare our approach with the three state-of-the-art regression-based approaches Struck

Table 3: The average comparison of our approach with the 3 state-of-the-art regression-based trackers on the selected 50 sequences. The format is the same as in Tab. 2.

 | Ours | KCF[4] | DSST[3] | Struck[9]
CLE/VOR | 0.86/0.81 | 0.77/0.62 | 0.76/0.70 | 0.74/0.58

[9], KCF [4] and DSST [3] on 16 of the 50 challenging sequences in detail (Fig. 7). The KCF approach is based on a correlation filter learned from HOG features [51]. KCF performs well in handling significant deformation and fast motion (BlurBody, Trellis, and Singer2) due to the robust representation of the HOG features and the effectiveness of the temporal context correlation model. However, it fails to handle large scale variation (Human6, Human5, Car1, Freeman3 and CarScale) because of its fixed-size convolution searching strategy. It also drifts under large motion blur (Couple, Human7 and Jumping), since the search regions of KCF are determined by the initial bounding box and fixed during the tracking process. In addition, because of the linear updating of the filter, KCF can handle short-term occlusions (David3Outdoor) smoothly, but cannot handle long-term ones (Human5, Human3 and Lemming). The Struck [9] approach cannot handle large scale variation (Human6, Human5, Car1, Freeman3 and CarScale) either, because of its traditional fixed-size Haar-like feature [52] representation. In addition, it drifts when the target is not salient in gray-scale images, owing to the limitation of the Haar-like feature representation (Singer2). DSST [3] enhances the scale-aware ability of KCF by integrating a new 1-D filter to predict the scale variation of the target; it therefore obtains better experimental results on Car1, Trellis, CarScale and Singer2 than KCF. However, it does not resolve the inherent drawbacks of correlation filters: the linear updating strategy can make the tracker fail on long-term occlusions (Lemming, Human3 and Human5), while the limited size of the search regions means that it cannot handle large motion blur (Human9, Couple and Jumping).

Our approach achieves remarkable results on all 16 challenging sequences (Fig. 8). The use of the PCA-HOG feature representation makes our approach robust when handling illumination variations (Singer2 and Trellis). The sample selection effect of the exemplar regression model helps the tracker handle long-term and short-term occlusions (Lemming,


Figure 8: Frame-by-frame comparison of center location errors on the 16 challenging sequences illustrated in Figure 7. Generally, our method is able to track targets accurately and stably. The meanings of the line colors are the same as in Figure 7.

David3Outdoor, and Human5) and achieve good results under significant deformation (Singer2 and Freeman1). By integrating particle filtering with the fixed-size templates (32×32), our approach can handle large scale variations (Freeman3, CarScale and Human6).


7. Conclusion

Although few researchers have noticed this, by giving a very simple positive-negative constraint to the training sample set of a common kernelized regression model, a state-of-the-art robust visual tracker can be constructed. We constrain the linear combination in the score function of the kernelized regression model so that only the support samples at the positions estimated in past frames have positive weights, while the rest of the weights are all negative. We show that this novel linear combination can be solved by a simple off-the-shelf non-negative least squares method.

8. Acknowledgements

This study was supported by the Science and Technology Project of Shenzhen (No. JSGG20150331152017052), and in part by the Shenzhen IOT key technology and application systems integration engineering laboratory.

References

[1] D. Bolme, J. Beveridge, B. Draper, Y. M. Lui, Visual object tracking using adaptive correlation filters, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010, pp. 2544–2550.
[2] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in: Computer Vision – ECCV 2012, Springer, 2012, pp. 702–715.
[3] M. Danelljan, G. Häger, F. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: British Machine Vision Conference, Nottingham, September 1-5, 2014, BMVA Press, 2014.
[4] J. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, Pattern Analysis and Machine Intelligence, IEEE Transactions on 37 (3) (2015) 583–596.
[5] K. Zhang, L. Zhang, M.-H. Yang, Fast compressive tracking, Pattern Analysis and Machine Intelligence, IEEE Transactions on 36 (10) (2014) 2002–2015.
[6] Z. Kalal, K. Mikolajczyk, J. Matas, Tracking-learning-detection, Pattern Analysis and Machine Intelligence, IEEE Transactions on 34 (7) (2012) 1409–1422.
[7] B. Babenko, M.-H. Yang, S. Belongie, Visual tracking with online multiple instance learning, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, 2009, pp. 983–990.
[8] G. Zhu, F. Porikli, Y. Ming, H. Li, Lie-Struck: Affine tracking on Lie groups using structured SVM, in: Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on, 2015, pp. 63–70.
[9] S. Hare, A. Saffari, P. Torr, Struck: Structured output tracking with kernels, in: Computer Vision (ICCV), 2011 IEEE International Conference on, 2011, pp. 263–270.
[10] D. A. Ross, J. Lim, R.-S. Lin, M.-H. Yang, Incremental learning for robust visual tracking, International Journal of Computer Vision 77 (1-3) (2008) 125–141.
[11] X. Mei, H. Ling, Robust visual tracking using ℓ1 minimization, in: Computer Vision, 2009 IEEE 12th International Conference on, 2009, pp. 1436–1443.
[12] Q. Wang, F. Chen, W. Xu, M.-H. Yang, Online discriminative object tracking with local sparse representation, in: Applications of Computer Vision (WACV), 2012 IEEE Workshop on, 2012, pp. 425–432.
[13] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012, pp. 1822–1829.
[14] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[15] A. Yilmaz, O. Javed, M. Shah, Object tracking: A survey, ACM Computing Surveys 38 (4) (2006) Article 13.
[16] M.-H. Yang, J. Ho, Toward robust online visual tracking, in: Distributed Video Sensor Networks, Springer, 2011, pp. 119–136.
[17] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: an experimental survey, Pattern Analysis and Machine Intelligence, IEEE Transactions on 36 (7) (2014) 1442–1468.
[18] H. Li, C. Shen, Q. Shi, Real-time visual tracking using compressive sensing, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 1305–1312.
[19] H. Ji, Real time robust ℓ1 tracker using accelerated proximal gradient approach, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1830–1837.
[20] M. J. Black, A. D. Jepson, EigenTracking: Robust matching and tracking of articulated objects using a view-based representation, International Journal of Computer Vision 26 (1) (1998) 63–84.
[21] J. Kwon, K. M. Lee, Visual tracking decomposition, in: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 1269–1276.
[22] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using the integral histogram, in: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, Vol. 1, IEEE, 2006, pp. 798–805.
[23] X. Li, Z. He, X. You, C. P. Chen, A novel joint tracker based on occlusion detection, Knowledge-Based Systems 71 (2014) 409–418.
[24] N. V. Lopes, P. Couto, A. Jurio, P. Melo-Pinto, Hierarchical fuzzy logic based approach for object tracking, Knowledge-Based Systems 54 (2013) 255–268.
[25] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: BMVC, Vol. 1, 2006, p. 6.
[26] R. T. Collins, Y. Liu, M. Leordeanu, Online selection of discriminative tracking features, Pattern Analysis and Machine Intelligence, IEEE Transactions on 27 (10) (2005) 1631–1643.
[27] X. Mei, H. Ling, Robust visual tracking and vehicle classification via sparse representation, Pattern Analysis and Machine Intelligence, IEEE Transactions on 33 (11) (2011) 2259–2272.
[28] A. Bordes, L. Bottou, P. Gallinari, J. Weston, Solving multiclass support vector machines with LaRank, in: Z. Ghahramani (Ed.), Proceedings of the 24th International Machine Learning Conference, OmniPress, Corvallis, Oregon, 2007, pp. 89–96.
[29] T. Malisiewicz, A. Gupta, A. Efros, Ensemble of exemplar-SVMs for object detection and beyond, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 89–96.
[30] M. Isard, A. Blake, Condensation: conditional density propagation for visual tracking, International Journal of Computer Vision 29 (1) (1998) 5–28.
[31] B. Scholkopf, R. Herbrich, A. J. Smola, R. Williamson, A generalized representer theorem, Proceedings of the Annual Conference on Computational Learning Theory 42 (3) (2000) 416–426.
[32] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[33] C. L. Lawson, R. J. Hanson, Solving Least Squares Problems, Vol. 161, SIAM, 1974.
[34] P. Felzenszwalb, R. Girshick, D. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, Pattern Analysis and Machine Intelligence, IEEE Transactions on 32 (9) (2010) 1627–1645.
[35] K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, CoRR abs/1312.6034. URL http://arxiv.org/abs/1312.6034
[36] D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, K. R. Müller, How to explain individual classification decisions, Journal of Machine Learning Research 11 (9) (2010) 1803–1831.
[37] L. Cehovin, M. Kristan, A. Leonardis, Is my new tracker really better than yours?, in: Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, IEEE, 2014, pp. 540–547.
[38] P. Pérez, C. Hue, J. Vermaak, M. Gangnet, Color-based probabilistic tracking, in: Computer Vision – ECCV 2002, Springer, 2002, pp. 661–675.
[39] S. Oron, A. Bar-Hillel, D. Levi, S. Avidan, Locally orderless tracking, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1940–1947.
[40] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1838–1845.
[41] T. Zhang, B. Ghanem, S. Liu, N. Ahuja, Robust visual tracking via multi-task sparse learning, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 2042–2049.
[42] J. Kwon, K. M. Lee, Tracking by sampling trackers, in: Computer Vision (ICCV), 2011 IEEE International Conference on, IEEE, 2011, pp. 1195–1202.
[43] B. Liu, J. Huang, L. Yang, C. Kulikowsk, Robust tracking using local sparse appearance model and K-selection, in: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, IEEE, 2011, pp. 1313–1320.
[44] Y. Wu, B. Shen, H. Ling, Online robust image alignment via iterative convex optimization, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1808–1814.
[45] L. Sevilla-Lara, E. Learned-Miller, Distribution fields for tracking, in: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1910–1917.
[46] D. Comaniciu, V. Ramesh, P. Meer, Kernel-based object tracking, Pattern Analysis and Machine Intelligence, IEEE Transactions on 25 (5) (2003) 564–577.
[47] R. T. Collins, Mean-shift blob tracking through scale space, in: Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, Vol. 2, IEEE, 2003, pp. II–234.
[48] H. Grabner, C. Leistner, H. Bischof, Semi-supervised on-line boosting for robust tracking, in: Computer Vision – ECCV 2008, Springer, 2008, pp. 234–247.
[49] S. Stalder, H. Grabner, L. Van Gool, Beyond semi-supervised tracking: Tracking should be as simple as detection, but not simpler than recognition, in: Computer Vision Workshops (ICCV Workshops), 2009 IEEE 12th International Conference on, IEEE, 2009, pp. 1409–1416.
[50] T. B. Dinh, N. Vo, G. Medioni, Context tracker: Exploring supporters and distracters in unconstrained environments, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 1177–1184.
[51] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1, IEEE, 2005, pp. 886–893.
[52] P. Viola, M. Jones, Rapid object detection using a boosted cascade of simple features, in: Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, Vol. 1, IEEE, 2001, pp. I–511.