A multi-view model for visual tracking via correlation filters
Xin Li^a,*, Qiao Liu^b,*, Zhenyu He^a,**, Xinge You^c,**, C. L. Philip Chen^d,c
^a School of Computer Science, Harbin Institute of Technology Shenzhen Graduate School, China
^b School of Big Data and Computer Science, Guizhou Normal University, Guiyang 550001, China
^c Department of Electronics and Information Engineering, Huazhong University of Science and Technology, China
^d Faculty of Science and Technology, University of Macau, Macau
* These authors contributed equally to this work and should be considered co-first authors.
** Corresponding authors. Email addresses: zyhe@hitsz.edu.cn (Zhenyu He), youxg@hust.edu.cn (Xinge You), Philip.Chen@ieee.org (C. L. Philip Chen)
Abstract
Robustness and efficiency are the two main goals of existing trackers. Most robust trackers combine multiple features or models, which incurs a high computational cost. To achieve robust and efficient tracking, we propose a multi-view correlation tracker. On one hand, the robustness of the tracker is enhanced by the multi-view model, which fuses several features and selects the more discriminative ones for tracking. On the other hand, the correlation filter framework provides fast training and efficient target locating. The multiple features are fused at the model level of the correlation filter, which is both effective and efficient. In addition, we propose a simple but effective scale-variation detection mechanism, which strengthens the stability of tracking under scale variation. We evaluate our tracker on the online tracking benchmark (OTB) and two visual object tracking benchmarks (VOT2014, VOT2015). These three datasets contain more than 100 video sequences in total. On all three datasets, the proposed approach achieves promising performance.
Keywords: Visual object tracking, Multi-view, Correlation filters, Robust tracking
1. Introduction
Visual object tracking is a basic and challenging topic in the computer vision area. Its goal is to track a specified target in video sequences automatically. Visual object tracking can be applied to many tasks, such as video surveillance, autonomous driving, and video analysis. Despite the enormous progress made in recent years, object tracking remains a challenging task. The main challenging aspects [1] include illumination variation, background clutter, scale variation, deformation, and occlusion. These challenging aspects are caused by random variations of the video sequence, including variations of both the background and the target.
An effective tracker should handle the variations of both the target and the background well. At the same time, the efficiency of the tracker should also be considered. These two points (robustness
and efficiency) indicate the two development directions of the recent tracking literature. One direction is to develop more robust trackers that tackle several challenging aspects. The other is to develop fast trackers that achieve real-time tracking.
To be robust to variations, trackers should be updated online. Generative methods reconstruct
each candidate of the current frame with updated templates. Then the tracking result is selected
as the candidate with the minimum reconstruction error. In early studies, holistic templates were
widely used for tracking [2]. These methods adapt to the variations by newly updated templates. In
order to update the template more effectively, subspace learning approaches [3] were proposed for
tracking. Besides, a sparse representation based method, which handles the corrupted appearance,
was proposed to do tracking by [4]. After that, several improved sparse representation based meth-
ods [5, 6, 7, 8] were proposed. To make templates more robust, effective visual features [9, 10]
were adopted. Statistical features, such as color histograms [11] and histograms of oriented gradients (HOG) [12], are insensitive to pose and scale variation. To deal with long-term tracking, Ma et al. [13] proposed to learn discriminative correlation filters for accurate tracking and to train an online random fern classifier to re-detect objects. Another class of online trackers is the discriminative approaches [14], which train a binary classifier online to discriminate the target from the background. At first, a support vector machine (SVM) tracker [15] and a boosting tracker [16] were proposed. To tackle label noise, more prior information and re-validation mechanisms were exploited by learning methods [17, 18, 19]. Semi-supervised methods [20, 21] and a dictionary learning method [22] also benefit the tracking problem. To take advantage of both the generative and discriminative models, several hybrid trackers [23, 24, 25] were proposed. To deal with drastic appearance changes, Zhong et al. [23] developed a sparse discriminative classifier and a sparse generative model for object tracking. Yin et al. [24] combined a
structured support vector machine with incremental principal component analysis to do tracking.
These methods obtain robust performance based on the clues from two different models.
The above trackers usually apply one kind of representation or model for tracking. In video sequences, especially those captured from real scenarios, trackers with one feature or model are prone to failures. To handle more challenging aspects, combination methods at the feature level, model level and tracker level were proposed. Multiple features [26, 27] are exploited to tackle different target variations. The multi-feature tracker improves the robustness to diverse variations. However, directly putting different features together may make some features mutually exclusive. Most combination methods [28, 29] combine different models for tracking. Each model corresponds to one kind of feature, which can be an appearance, motion or spatial feature. With training from different views, the combined models can complement each other. Besides, combination methods [30, 31] at the tracker level were presented. These trackers obtain the final result through the votes of each tracker; however, the weight of each sub-tracker must be set carefully. Generally speaking, combination trackers become more robust, but at a higher computational cost.
Efficiency is another important characteristic of a good tracker. The efficiency can be achieved
by an efficient framework [32, 33, 34], which has a low computational cost and an optimized
search strategy. The mean-shift tracker [35] is a classic real-time tracker, which finds the target by
several shifts. Recently, regression based trackers have become popular [36, 37]. Correlation filter based trackers only need several element-wise products and a Fast Fourier Transform (FFT) operation to get the tracking result in the current frame. Based on the correlation filter, a real-time tracker [12] has been proposed: Henriques et al. [12] introduced the kernel trick into the correlation filter tracking framework, named the kernelized correlation filter (KCF). However, the existing real-time trackers with simplified models are prone to failures when diverse challenging variations occur. In
summary, the real-time trackers run fast with simple models, while they are not stable enough to
handle the variations in video sequences.
In order to achieve a robust and efficient tracker, we propose a multi-view correlation filter tracker (MvCFT). The multi-view model extracts features from diverse views, and applying multi-view features is crucial for video processing [38]. For visual tracking, features from one single view usually vary with the variations of both the target and the background. To adapt to the changes, trackers have to update the model frequently, which often results in drifting failures. With one kind of feature, it is infeasible to balance the update rate so as to adapt to the changes while keeping the original features. What is worse, a discriminative feature, such as color, shape or texture, may become indistinguishable as the background changes. Therefore, a robust tracker needs features from different views. In the proposed model, the features from diverse views handle the variations well and the more discriminative feature contributes more to the tracking result. The fusion of multiple views has the ability to select the more discriminative and stable view to track the target. At the same time, our algorithm fuses the models from multiple views efficiently, thanks to the efficient operations of the correlation filter framework, which ensures the efficiency of the proposed tracker. In addition, we present a simple and efficient mechanism to detect scale variations based on the affine transformation, which handles the scale variation well; the accurate scale evaluation benefits the online update with less background noise. In extensive experiments, our approach obtains competitive results at a fast speed on the comprehensive benchmark datasets OTB [1], VOT2014 [39], and VOT2015 [40].
The contributions of this paper can be summarized as follows.
First, we present a robust and efficient tracking framework, which applies a multi-view model to correlation filters. On one hand, the multi-view model enhances the robustness of the tracker to diverse variations of both the target and the background. On the other hand, the correlation filter provides an efficient tracking framework, in which the multi-view models can be fused effectively. The fused model selects the more discriminative view for tracking, which yields a stable performance. To the best of our knowledge, we are the first to fuse a multi-view model within the correlation filter framework in this way.
Second, we propose a simple yet effective scale detection mechanism based on the affine transformation, which handles the scale variation problem well. This mechanism enhances the robustness of our tracker and alleviates the influence of sample noise compared with previous correlation filter based trackers that use a fixed scale size. What is more, this scale detection mechanism can be easily applied to other tracking frameworks.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the multi-view tracker via correlation filters, including the fundamentals of KCF, the multi-view model and the scale evaluation. Section 4 describes the implementation details. Section 5 evaluates our approach on comprehensive benchmark datasets and compares it with state-of-the-art trackers. Finally, Section 6 concludes our work.
2. Related Work
As an extensive review of multi-view learning and correlation filters is beyond the scope of this paper, we review only the works most related to our approach, including multi-view based trackers and correlation filter based trackers.
To deal with variations of target and background from diverse aspects, several methods [28,
41, 31] exploit information from multiple views to enhance their robustness, due to the limitations
of one single view. Kwon and Lee [28] formulated the tracking problem as several basic observa-
tion and motion models corresponding to several types of features. The multiple basic models are
constructed by sparse principal component analysis (SPCA) and each basic tracker of each model
is integrated with an interactive Markov Chain Monte Carlo scheme. Hong et al. [41] proposed a
robust tracker via a multi-task multi-view joint sparse framework. The cues from multiple views
including various types of visual features are exploited. Each feature observation is sparsely repre-
sented by a linear combination from a feature dictionary and the method is integrated in a particle
filter framework where every view in each particle is regarded as an individual task. However,
fusing the features from different views in the original space, namely stacking the features from
different views as one vector, may degrade the power of the discriminative views due to some
indistinguishable views. Yoon et al. [31] integrated multiple trackers based on different feature
representations within a probabilistic framework to cope with various challenging factors. Each
view is concerned with one particular feature representation of the target object. Robust tracking
results are obtained by exploiting tracker interaction and selection. However, these trackers also bring high computational costs. Compared with these multi-view trackers, our tracker fuses the views in an efficient framework, namely correlation filters. What is more, our fused tracker has the abilities of interaction and selection, which gives a robust performance.
In order to achieve fast tracking, Bolme et al. [37] proposed the Minimum Output Sum of Squared Error (MOSSE) filter. The MOSSE filter produces stable correlation filters when initialized using a single frame. The MOSSE filter based tracker is robust to variations in pose and non-rigid deformations. Henriques et al. [12] diagonalized the circulant data matrix with the discrete Fourier transform to reduce both storage and computation by several orders of magnitude, and found that the linear regression of their formulation is equivalent to the correlation filter. Further, they proposed a kernel regression version of the correlation filter, which has exactly the same complexity as its linear counterpart. The above two trackers are robust to some variations, but a single view has its limitations, and the target sizes of these two trackers are fixed, which prevents them from locating the object accurately. Compared with these correlation filter based trackers, our tracker exploits features from several views to enhance the robustness to various changes and selects the more discriminative features to ensure accuracy.
3. The multi-view tracking with correlation filters
In order to achieve an efficient and robust tracker, we propose a multi-view tracker via correlation filters. Features from one view have limitations in handling the diverse variations of tracking scenarios, so it is crucial to apply multi-view features to strengthen the robustness of a tracker to such variations. In order to deal with the high computational cost of processing multiple
[Figure 1 pipeline: original image and search window → multi-view feature extraction → correlation filtering (per-view response maps) → view fusion (fused response map) → tracking result.]
Figure 1: The main tracking framework of our approach. For each tracking frame, features from different views are
extracted from the original image within the preset search window. In the proposed approach, we exploit the features
including HOG, color and intensity. With correlation filters, the response maps of the three views are obtained. The
response maps are shown in gray image, where bright area stands for high value and dark for low value. As we can
see, the response maps from different views show different response values, due to the different discriminatory powers
in different scenes. After that, the response maps are fused with the proposed fusion method to get a more accurate
and more discriminative response map. With the fused response map, the tracking result of the current frame can be
obtained with the target size and the target center point which is the point with the highest response value in the fused
response map.
views, we process the features from the different views efficiently with correlation filters. Afterwards, the responses from the different views are fused under a probabilistic framework to obtain a more accurate and stable result. The main per-frame tracking procedure is shown in Fig. 1. In the remainder of this section, we present our approach in three parts: the theoretical foundation of the correlation filter, the multi-view model, and a simple but effective scale evaluation.
3.1. Kernelized Correlation filter
The KCF [12] tracker trains a classifier with all sub-windows of an image, which is called dense sampling. With the kernel trick, the data matrix of samples becomes highly structured and amenable to manipulation with cyclic shifts. Besides, the image is transformed into the Fourier domain through the FFT, where the convolution of two images can be computed by an efficient element-wise product. With the circulant data matrices and the efficient element-wise products, KCF achieves a fast and satisfying performance. However, features from a single view have limitations in dealing with the various changes in tracking. In order to obtain a tracker more robust to the various challenges in tracking, we introduce the multi-view model into the efficient KCF tracking framework.
In the following paragraphs, we briefly introduce the KCF approach. More details can be
found in [12]. The KCF tracker casts the tracking problem as a classification problem. The classifier $f(x)$ is trained on an image patch $x$ of size $M \times N$. The target-centered image patch serves as the positive example, and the negative samples are generated by cyclic shifts $x_{m,n}$ of $x$, $(m,n) \in \{0,1,\ldots,M-1\} \times \{0,1,\ldots,N-1\}$. The regression labels $y(m,n)$ are set as a Gaussian function, which takes the highest value of 1 at the center and decreases towards 0 as the shift approaches the border. The classifier $f(x) = w^{T}x$ is trained by minimizing the regularized risk. The minimization problem is

$$\arg\min_{w} \sum_{m,n} \left| \langle \varphi(x_{m,n}), w \rangle - y(m,n) \right|^{2} + \lambda \|w\|^{2} \qquad (1)$$

where $\langle \cdot,\cdot \rangle$ denotes the inner product, $\varphi$ is the mapping to the Hilbert space, and $\lambda$ is a regularization parameter. Here the kernel trick is applied: the inner product of $x$ and $x'$ is computed as $\kappa(x, x') = \langle \varphi(x), \varphi(x') \rangle$.

As the solution $w$ is expressed as a linear combination of samples, $w = \sum_{m,n} \alpha(m,n)\,\varphi(x_{m,n})$, under the kernel trick it can be computed from

$$\alpha = \mathcal{F}^{-1}\!\left( \frac{\mathcal{F}(y)}{\mathcal{F}(k^{x}) + \lambda} \right) \qquad (2)$$

where $\mathcal{F}$ and $\mathcal{F}^{-1}$ are the discrete Fourier transform operator and its inverse, respectively, and $k^{x} = \kappa(x_{m,n}, x)$. The vector $\alpha$ contains all the $\alpha_{i}$ coefficients. The model of the KCF tracker consists of both the learned target appearance $\hat{x}$ and the coefficients $\mathcal{F}(\alpha)$.

In the tracking process, a candidate $z$ of the same size as $x$ is scored by

$$f(z) = \mathcal{F}^{-1}\big( \mathcal{F}(k^{z}) \odot \mathcal{F}(\alpha) \big) \qquad (3)$$

where $\odot$ denotes the element-wise product, $k^{z} = \kappa(z, \hat{x})$, and $\hat{x}$ is the learned target appearance.
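For concreteness, a minimal single-channel sketch of this training and detection step is given below in Python/NumPy. It is a hedged illustration under our own naming (gaussian_correlation, train, detect), not the authors' MATLAB implementation: the kernel correlation is computed for all cyclic shifts in the Fourier domain, train implements Eq. 2 and detect implements Eq. 3.

```python
import numpy as np

def gaussian_correlation(x, z, sigma=0.5):
    """Gaussian kernel correlation kappa(z, x) for all cyclic shifts, computed via the FFT."""
    xf, zf = np.fft.fft2(x), np.fft.fft2(z)
    cross = np.real(np.fft.ifft2(xf * np.conj(zf)))           # circular cross-correlation
    dist = np.sum(x ** 2) + np.sum(z ** 2) - 2.0 * cross      # squared distance for every shift
    return np.exp(-np.maximum(dist, 0.0) / (sigma ** 2 * x.size))

def train(x, y, lam=1e-2):
    """Eq. 2: alpha = F^{-1}( F(y) / (F(k^x) + lambda) ); keep F(alpha) and the appearance x_hat."""
    kx = gaussian_correlation(x, x)
    alpha_f = np.fft.fft2(y) / (np.fft.fft2(kx) + lam)
    return alpha_f, x

def detect(alpha_f, x_hat, z):
    """Eq. 3: response f(z) = F^{-1}( F(k^z) * F(alpha) ); the peak gives the target displacement."""
    kz = gaussian_correlation(x_hat, z)
    return np.real(np.fft.ifft2(np.fft.fft2(kz) * alpha_f))
```

Here y is the Gaussian regression target described above (its peak placed so that it is aligned with the zero cyclic shift), and the location of the peak of the detection response gives the displacement of the target between frames.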
3.2. The Multi-view Model
A multi-view model should exploit the complementary information of different views and fuse
the different views in a subspace where they have the same statistical property. The selection
of features and fusion method directly affect the performance of the multi-view model. For a
tracking problem, features should have the capability of dealing with variations of tracking target
and the background, such as deformation, motion blur, rotation, and illumination variation. In
our approach, we select the features from views of edge, color and intensity, which correspond
to features of HOG [42], Color Names [43] and gray value. The HOG features are robust to illumination variation and deformation and obtain excellent results in human detection [42]. Color Names and gray value are robust to motion blur and give good results in image retrieval [43].
The fusion operation can be done on feature level, model level or tracker level. For the correlation
filter, the model level of different views is suitable for fusion, due to the shared probability space.
We propose to unify the different views under a probabilistic framework. For each view $t$, $t = 1, \ldots, v$, a probability distribution $p^{t}_{ij}$, with $\sum_{i,j} p^{t}_{ij} = 1$ and $(i,j) \in \{1, \ldots, M\} \times \{1, \ldots, N\}$, which is the probability of position $(i,j)$ being the center of the target, can be obtained with the correlation filter. To find the optimal fusion model, the Kullback-Leibler (KL) divergences between the distribution of each view and the fused distribution are minimized. The fused optimal probability distribution $Q$ is obtained by

$$\arg\min_{Q} \sum_{t} \mathrm{KL}(P^{t} \,\|\, Q) \quad \text{s.t.} \; \sum_{i,j} q_{ij} = 1 \qquad (4)$$

where $p^{t}_{ij}$ and $q_{ij}$ are the $(i,j)$-th elements of $P^{t}$ and $Q$, respectively. The KL divergence is calculated as

$$\mathrm{KL}(P^{t} \,\|\, Q) = \sum_{i,j} p^{t}_{ij} \log \frac{p^{t}_{ij}}{q_{ij}} \qquad (5)$$
In tracking video sequences with various changes, the response from a single view may contain noise, which should be filtered before the fusion. To filter the noise, some trackers exploit a Gaussian distribution; here we apply the distribution of another view to filter the noise, which is more accurate and adaptive. We denote the HOG, Color Names and gray value features as $H$, $C$ and $G$, respectively. Given the current image $I$ and the correlation filters $F$, we get the response maps of the three features as $P_{H} = F_{H}(I)$, $P_{C} = F_{C}(I)$ and $P_{G} = F_{G}(I)$. The response of view HOG filtered by the response of view Color Names is

$$P_{HC} = P_{H} \odot P_{C} \qquad (6)$$

where $\odot$ stands for the element-wise product. The above equation means that the response of view HOG is weighted by the response of view Color Names. With this weighting, a response map with noise is filtered by another response map, which achieves area selection of the original response map. Namely, a response map with a mass of noise or errors can be filtered by the response map from a more discriminative view. Further, the fused response map of the two views depends more on the more salient or discriminative view. After this filtering operation, applied over all pairwise combinations of the views, we get three fused response maps ($P_{HC}$, $P_{HG}$ and $P_{CG}$) with less noise and more salient responses. With Eq. 4, Eq. 5 and Eq. 6, the new objective function for the multi-view model is

$$\arg\min_{Q} \sum_{o \in \{HC,\, HG,\, CG\}} \sum_{i,j} p^{o}_{ij} \log \frac{p^{o}_{ij}}{q_{ij}} \quad \text{s.t.} \; \sum_{i,j} q_{ij} = 1 \qquad (7)$$

The above equation can be solved with the Lagrange multiplier method, and the final solution $Q$ is calculated as

$$Q = \frac{P_{HC}(I) \oplus P_{HG}(I) \oplus P_{CG}(I)}{3} \qquad (8)$$

where $\oplus$ stands for element-wise addition. As we can see, our method only needs several element-wise product and addition operations, which is very efficient. From an intuitive point of view, the final probability distribution map is obtained by summing all the filtered response maps, which strengthens the final response with the responses from all the views. Each filtered response map has less noise and more confident response values, as can be seen in Fig. 2. Therefore, the proposed multi-view model has the abilities of selection and strengthening.
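The fusion step itself reduces to a few element-wise operations. A small Python/NumPy sketch of Eq. 6 and Eq. 8 follows; it is our own illustration, and the normalization of each response map into a probability distribution (with negative responses clamped to zero) is an assumption made to match the probabilistic formulation above.

```python
import numpy as np

def fuse_response_maps(p_h, p_c, p_g, eps=1e-12):
    """Fuse the HOG, Color Names and gray-value response maps (Eq. 6 and Eq. 8)."""
    # Treat each response map as a probability distribution over positions.
    maps = []
    for p in (p_h, p_c, p_g):
        p = np.maximum(p, 0.0)
        maps.append(p / (p.sum() + eps))
    p_h, p_c, p_g = maps

    # Pairwise filtering (Eq. 6): each view's response is weighted by another view's response.
    p_hc, p_hg, p_cg = p_h * p_c, p_h * p_g, p_c * p_g

    # Fused distribution (Eq. 8): element-wise average of the three filtered maps.
    return (p_hc + p_hg + p_cg) / 3.0

# The target centre is then the position with the highest fused response, e.g.:
#   cy, cx = np.unravel_index(np.argmax(q), q.shape)
```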
For the visual tracking problem, the selection and strengthening abilities of the multi-view model are the essential points of a robust and stable tracking performance. Features from an individual view can become indistinguishable when challenging variations occur. The variations of the target or the background may degrade the discriminative power of a feature, which may result in losing the object. For example, the HOG feature is sensitive to motion blur and rotation, and the color feature may be invalid under illumination changes. To deal with various changes from different aspects, we need to consider features from different views. We want to use multiple views in a way that satisfies the following two rules. When the features of some views fail, we hope to exploit the more discriminative views to obtain the result. When the discriminative powers of all the views are degraded by the variations, we hope to get the result by the votes of all the views, which acts as a strengthening mechanism over all the views. These two rules can be seen in Fig. 2.
[Figure 2: (a) a discriminative view and an indiscriminative view; (b) two indiscriminative views. Each case shows the HOG and color response maps, the fused response, and the current target.]
Figure 2: The selection and enhancement abilities of the fusion method. The tracking target in (a) has just undergone
dramatic rotation by turning around. As we can see the response map of the HOG feature is indistinguishable with
much noise around the center and the response map of color feature is more discriminative. The fusion of these two
features achieves the area selection of the HOG response map by the color response map. The tracking target in (b)
undergoes the illumination variation and motion blur problems. Both of the two views are disturbed. However, the
fusion method can achieve the mutual selection between the two views, which enhances the response area where both
two maps have a high confidence.
3.3. Scale Evaluation
In the correlation filter tracking framework, we can obtain the center position of the target by finding the maximum value of the response map, but the scale change of the tracking target cannot be detected from the response map. In object tracking, scale change is one of the most common challenging aspects, which directly influences the tracking accuracy. In this section, we propose a simple yet effective scale change detection mechanism based on the affine transformation.
For most tracking approaches, the template size or model size of the target is fixed, which can be a manually set size or the initial target size. In order to deal with candidate images of different sizes, the candidate images are usually resized to the model size with the affine transformation. Inspired by this strategy, we also resize the original image to the fixed size of the correlation filter model. The difference is that we do not calculate precisely how much larger or smaller the target becomes, which would be unnecessary and complicated. Since the size of the target changes gradually in a tracking video sequence, we only check the changing direction of the target size, namely whether the target becomes larger, becomes smaller or stays unchanged. Given the change direction, we change the target size by a small step, which accommodates the changes between two adjacent frames. The experimental results verify the effectiveness of this method.
In the correlation filter framework, the model size is set as the initial target size $sz^{1} = (H^{1}, W^{1})$. For notational convenience, we use a superscript to denote the frame number and a subscript to denote the scale change direction. Given the current frame index $t$, the scale change direction $d^{t} \in D = \{-1, 0, 1\}$ stands for zoom out, unchanged and zoom in, respectively, relative to the target size of the previous frame. Given the resized target image patches $p^{t}_{d}$, the change direction of the $t$-th frame $\hat{d}^{t}$ can be calculated as

$$\hat{d}^{t} = \arg\max_{d^{t} \in D} \; \rho_{d}\, \psi(p^{t}_{d}) \qquad (9)$$

where $\psi(p^{t}_{d})$ gives the maximum value of the correlation filter response map of image $p^{t}_{d}$, and $\rho_{d} \in \{\rho_{-1}, \rho_{0}, \rho_{1}\}$ can be regarded as the weights of the three change directions, which give a higher weight to the unchanged case to keep the scale change stable. In other words, we prefer to keep the target size unchanged when the response map is less confident. This becomes significant when the tracking target undergoes drastic changes. The image patches $p^{t}_{d}$ are obtained by resizing the candidate image patches $p^{t}_{s}$ of the three scales to the model size $(H^{1}, W^{1})$. The $p^{t}_{s}$ are cropped from the $t$-th frame image at the center position of the target, with the three candidate scales in $sz^{t}$. The three candidate scales are calculated as

$$sz^{t} = sz^{1} \times S^{t} \qquad (10)$$

where $S^{t}$ is the scale factor of frame $t$, which contains three candidate sizes. $S^{t}$ is calculated as

$$S^{t} = \hat{S}^{t-1} \times R \qquad (11)$$

where $\hat{S}^{t-1}$ is the optimal scale factor of frame $t-1$ and $R \in \{r_{-1}, 1, r_{1}\}$, $(0 < r_{-1} < 1,\; 1 < r_{1})$, contains the change ratios of the three directions. With Eq. 9 and Eq. 11, we obtain the scale change direction and the optimal scale of the target in frame $t$. The details of the parameter settings can be found in Section 4.
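A minimal sketch of this scale search is given below (Python). The function and argument names, the use of cv2.resize for the affine resizing and the omitted image-boundary handling are our own illustrative choices; the direction weights and change ratios are the values given in Section 4.

```python
import numpy as np
import cv2  # used only to resize candidate patches to the fixed model size

def evaluate_scale(frame, center, model_size, prev_scale, peak_response,
                   ratios=(0.95, 1.0, 1.05), weights=(0.99, 1.0, 0.98)):
    """Eq. 9-11: choose among zoom out / unchanged / zoom in and return the new scale factor.
    peak_response(patch) must return the maximum of the correlation response for a patch
    already resized to the fixed model size (H1, W1)."""
    h1, w1 = model_size
    cy, cx = center
    best_score, best_scale = -np.inf, prev_scale
    for rho, r in zip(weights, ratios):
        scale = prev_scale * r                                     # candidate scale factor S^t (Eq. 11)
        ch, cw = int(round(h1 * scale)), int(round(w1 * scale))    # candidate size sz^t (Eq. 10)
        y0, x0 = max(0, cy - ch // 2), max(0, cx - cw // 2)
        patch = frame[y0:y0 + ch, x0:x0 + cw]
        patch = cv2.resize(patch, (w1, h1))                        # resize back to the model size
        score = rho * peak_response(patch)                         # weighted peak response (Eq. 9)
        if score > best_score:
            best_score, best_scale = score, scale
    return best_scale
```

The returned value corresponds to the optimal scale factor and is passed back in as prev_scale for the next frame.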
4. Implementation Details
In this section, we describe the overall tracking process of the proposed tracker and the pa-
rameter settings. Our tracker begins with the initial bounding box, which locates the position of
the tracking target in the first frame. Filter models of each view are trained with the first frame
and then the tracker runs iteratively on each frame. In each iteration, we first evaluate the optimal
scale size and then locate the center position of the target with the multi-view model. Finally, we
update the correlation filter models in a linear incremental way. The overall process can be seen
in Algorithm 1.
The features we use are set as follows. The HOG features are 32-dimensional, the same as in KCF. There are 18 contrast-sensitive orientation channels, 9 contrast-insensitive orientation channels, 4 texture channels and 1 all-zero channel. The Color Names feature maps RGB values to a probabilistic 11-dimensional color representation that sums to 1.
The parameters are set as follows. The size of the search window is set as sz_window = 2 sz, i.e., twice the target size. We use the same parameters for the correlation filter as in [12]. For the HOG features, we set the padding size as 1 and the learning rate of the model as 0.02. The learning rates of the Color Names model and the gray value model are set as 0.15. The regularization parameter λ of the correlation filter model is set as 0.01. In the scale evaluation part, the weights of the three scale change directions ρ_d are set as {0.99, 1, 0.98}, which correspond to zoom out, stay unchanged and zoom in. The scale change ratios are set as r_{-1} = 0.95 and r_{1} = 1.05.
Algorithm 1 The MvCFT tracker
1: Input: the initial target bounding box b_0, the target size sz, the search window size sz_window, the initial tracking frame I_0, the model learning rate γ;
2: Extract the target features of each view from I_0 within area b_0;
3: Train the initial models M_0 of each view with Eq. 2 and b_0;
4: while the video sequence is not ended do
5:    Evaluate the scale change and get the optimal scale factor Ŝ^t with Eq. 9, Eq. 10 and Eq. 11;
6:    Crop out the search window of size sz_window × Ŝ^t from the current frame I_t and extract the features of each view from the search window;
7:    Compute the correlation filter response maps F_H, F_C, F_G of each view with Eq. 3;
8:    Calculate the fused probability distribution Q with Eq. 8 and Eq. 6;
9:    Get the target position in the current frame t from Q, with the current target size sz × Ŝ^t;
10:   Get the current correlation filter model M'_t from the current target and update the correlation model of each view as M_t = (1 − γ) M_{t−1} + γ M'_t;
11: end while
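Step 10 of Algorithm 1, the linear incremental update, can be sketched as follows (Python). Representing each view's model as the pair (F(α), x̂) and interpolating both parts with that view's learning rate is an assumption consistent with the KCF update rule, not a detail spelled out in the algorithm.

```python
def update_model(model, new_model, gamma):
    """M_t = (1 - gamma) * M_{t-1} + gamma * M'_t, applied to one view's correlation filter model."""
    alpha_f, x_hat = model             # previous model: Fourier-domain coefficients and appearance
    new_alpha_f, new_x_hat = new_model
    return ((1.0 - gamma) * alpha_f + gamma * new_alpha_f,
            (1.0 - gamma) * x_hat + gamma * new_x_hat)
```

With the learning rates of this section, gamma would be 0.02 for the HOG view and 0.15 for the Color Names and gray value views.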
5. Experiment
In order to evaluate the proposed tracker objectively and comprehensively, we test it using both the OTB [1] and the VOT [39] protocols, which are the most popular and widely used evaluation methodologies. The OTB provides a library and benchmark containing 29 trackers and 50 video sequences, and has been cited more than 300 times. The VOT challenges provide the visual tracking community with a precisely defined and repeatable way of comparing trackers; detailed information can be found at http://www.votchallenge.net/challenges.html. In the following sections, we first introduce the experimental settings of the proposed tracker, and then two subsections, each including the datasets, compared trackers, evaluation criteria, performance results and performance analysis, are given.
5.1. Experiment Settings
The proposed tracker is implemented in MATLAB with MEX files using a single thread without further optimization. The experiments run on a PC with an Intel i5-4590 (3.3 GHz, quad-core) CPU and 8 GB RAM. For all the experiments, we use the same parameters detailed in Section 4.
The average overlap measure is the most appropriate one for tracker comparison [44], as it accounts for both position and size. To this end, we use the typical Pascal VOC Overlap Ratio (VOR) [45] criterion. Given the bounding box $B_R$ of the result and the bounding box $B_G$ of the ground truth, the VOR is computed as

$$\mathrm{VOR} = \frac{\mathrm{Area}(B_R \cap B_G)}{\mathrm{Area}(B_R \cup B_G)}$$
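A direct implementation of this overlap measure is straightforward (Python sketch; boxes are assumed to be axis-aligned (x, y, w, h) rectangles, as in OTB, whereas VOT ground truth may use rotated bounding boxes):

```python
def voc_overlap_ratio(box_r, box_g):
    """Pascal VOC overlap ratio: Area(B_R ∩ B_G) / Area(B_R ∪ B_G) for (x, y, w, h) boxes."""
    xr, yr, wr, hr = box_r
    xg, yg, wg, hg = box_g
    iw = max(0.0, min(xr + wr, xg + wg) - max(xr, xg))   # width of the intersection
    ih = max(0.0, min(yr + hr, yg + hg) - max(yr, yg))   # height of the intersection
    inter = iw * ih
    union = wr * hr + wg * hg - inter
    return inter / union if union > 0 else 0.0
```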
5.2. Evaluation with OTB
The OTB builds a large tracking dataset with 50 fully annotated sequences and categorizes the sequences with 11 attributes, which are out-of-plane rotation (OR), in-plane rotation (IR), occlusion (OCC), scale variation (SV), illumination variation (IV), background clutter (BC), deformation (DEF), fast motion (FM), motion blur (MB), out of view (OOV) and low resolution (LR). Besides, it is the first work to show tracking performance with the area under curve (AUC) score. The AUC score summarizes the success rates at different thresholds, which is more comprehensive than a single overlap-ratio figure. There are three protocols to measure the performance: one-pass evaluation (OPE), temporal robustness evaluation (TRE), and spatial robustness evaluation (SRE). TRE runs 20 times with different start frames on each video sequence. SRE runs 12 times with different spatial perturbations.
Datasets: The full dataset includes 50 video sequences, and the sequence 'jogging' includes 2 targets; therefore the total number of tracking results is 51. The OTB collects and annotates the most commonly used tracking sequences. For a better evaluation and analysis of the strengths and weaknesses of trackers, the sequences are categorized with 11 attributes, and each sequence is associated with several attributes. Among all the 50 video sequences, there are 39 sequences with the OR attribute, 31 with IR, 29 with OCC, 28 with SV, 25 with IV, 21 with BC, 19 with DEF, 17 with FM, 12 with MB, 6 with OOV and 4 with LR, which can also be seen in Fig. 3. In Fig. 3, we give both the overall performance on the 50 sequences and the results on the subsets corresponding to each attribute to show specific challenging conditions.
Trackers: We compare our tracker with the top 4 trackers among the 29 trackers in OTB. The 4 trackers are Struck [18], SCM [30], TLD [19] and ASLSA [46]. Struck uses a kernelized structured output SVM to provide adaptive tracking. SCM develops a sparsity-based discriminative classifier and a sparsity-based generative model for tracking. TLD decomposes the long-term tracking task into tracking, learning, and detection. ASLSA performs tracking with a structured local sparse appearance model. Besides, two other classical trackers, KCF [12] and DSST [47], are also compared with our tracker. KCF is basically a kernelized correlation filter operating on simple HOG features; in our experiments, the KCF exploits the HOG feature. DSST extends the MOSSE tracker with robust scale estimation. The codes and settings are all the same as in OTB, which are widely approved. Our proposed tracker also runs in the same OTB framework for data loading and result calculation. The comparison results can be seen in Fig. 3.
Evaluation: Three experiments are designed on OTB to validate the proposed model comprehensively. First, we show comparisons with state-of-the-art trackers in terms of OPE, TRE, and SRE. Second, we give the results of feature fusion and scale evaluation separately to show the contribution of each part. Third, we give comparisons of the fusion model and each single feature.
The OPE performance of the 7 trackers on each attribute is shown in Fig. 3. The overall performance of the seven trackers can be seen in the top-left subfigure of Fig. 3. The proposed tracker achieves the second best performance among the seven trackers, and DSST achieves the best. For the attributes OR, OCC, FM, MB and DEF, our tracker achieves the best or close to the best performance. These advantages benefit from the multi-view model. On DEF and OCC, our tracker achieves the best performance, which benefits from the combined color and HOG model; the fused model is more robust to occlusion and deformation. The multi-view model also makes the target more discriminative from the background, which is verified by the performance on the occlusion and background clutter attributes. For scale variation, our result is very close to the best result (DSST), which explicitly estimates scale. Illumination variation, out of view and low resolution are not directly related to the target appearance, so the advantage of the multi-view model is not obvious there. For the attributes of fast motion and motion blur, the proposed tracker is the second best and Struck achieves the best performance. Fast motion results in a large distance between the target locations in adjacent frames, and motion blur corrupts the appearance of the target; the proposed multi-view model may be influenced by the corrupted appearance. In summary, our tracker achieves the best or close to the best results on almost all the attributes.
To give sufficient comparison results, we also show the overall comparison performance on SRE and TRE, which can be seen in Fig. 4. For TRE, our tracker almost achieves the best performance with an average overlap score of 0.578 (the best is 0.579). TRE runs 20 times with different start frames, so the results on TRE show the robustness of our tracker to initialization. For SRE, our tracker achieves the second best performance, which is lower than DSST and higher than KCF. Since our model is based on KCF, the results show the robustness of the fused model.
To validate each part of the proposed model, the comparison of the whole model and the model without scale evaluation on the scale variation sequences is given in the left subfigure of Fig. 5. The whole model achieves 0.5 and the model without scale evaluation achieves 0.414 on the scale variation sequences; the gap between the two shows the effectiveness of the scale evaluation strategy. To validate the combination model, the comparisons of the fused model and each single feature are shown in the right subfigure of Fig. 5.
[Figure 3: twelve success plots of OPE (success rate vs. overlap threshold), for the overall dataset (51) and the 11 attribute subsets, comparing DSST, the proposed tracker (OursS), KCF, Struck, SCM, ASLA and TLD. Overall AUC scores: DSST 0.555, Ours 0.532, KCF 0.514, Struck 0.481, SCM 0.466, ASLA 0.454, TLD 0.408.]
Figure 3: The results on OTB. MvCFT (shown as OursS in the legends) is the proposed tracker. In each subfigure, the plots show the ratios of successful frames at thresholds varying from 0 to 1. The top-left subfigure shows the overall performance on the 50 data sequences. The other 11 subfigures, titled with their corresponding attributes and the numbers of corresponding data sequences, show the performance of each tracker under different conditions. Our tracker achieves the best or close to the best performance on the overall dataset and almost all the attributes.
[Figure 4: success plots of TRE (DSST 0.579, Ours 0.578, KCF 0.557, Struck 0.514, SCM 0.514, ASLA 0.485, TLD 0.448) and SRE (DSST 0.495, Ours 0.481, KCF 0.462, Struck 0.439, ASLA 0.421, SCM 0.420, TLD 0.402).]
Figure 4: The comparisons on SRE and TRE.
Table 1: The average running speed of the compared trackers on our computer.
Trackers     TLD    Struck  ASLSA  SCM   Ours
Speed (fps)  25.0   21.3    7.1    0.47  25.5
As we can see, the result of the fused model is significantly better than that of any single feature. The HOG feature alone obtains a low performance, worse than the original KCF. This is because the scale evaluation may be inaccurate when it depends only on the HOG feature, as our evaluation only considers the direction of the scale change; small errors in scale evaluation lead to poor tracking performance. In summary, the fusion model and the scale evaluation model are effective and achieve promising results.
The running times can be seen in Table 1. We run the algorithms on the 50 sequences and calculate the average running speed. Our tracker achieves a real-time speed of 25.5 fps, which benefits from the correlation filter framework.
5.3. Evaluation with VOT
The VOT challenge provides a common platform for evaluation in the field of visual tracking. For ease of use, the VOT provides an improved version of its cross-platform evaluation kit. The VOT is held once a year. Here, we utilize the VOT2014 and VOT2015 benchmarks to evaluate our tracker.
The VOT2014 challenge includes two experiments. One experiment, called baseline, runs the tracker on all sequences initialized with the ground truth bounding boxes. The other experiment, called region_noise, performs the baseline but initializes with a noisy bounding box rather than the ground truth bounding box. The noisy bounding box acts as a random perturbation of about ten percent of the ground truth bounding box size in position and size.
[Figure 5, left: success plots of OPE on the scale variation subset (28), MvCFT 0.500 vs. MvCFT without scale evaluation 0.414. Right: success plots of OPE for the fused model and each single view, MvCFT 0.532, color 0.493, HOG 0.433, gray 0.344.]
Figure 5: The result comparisons of each part of the proposed model.
Datasets: The VOT2014 dataset comprises 25 sequences showing various objects against challenging backgrounds. Among the 25 sequences, 8 are from VOT2013 and the others are newly added. The new sequences, such as a fish underwater or a surfer riding a big wave, show complementary objects and backgrounds. The 25 sequences were carefully chosen using a methodology based on clustering visual features of the object and background, so that the sequences represent most tracking problems well. The ground truth of the 25 sequences was annotated by the VOT committee using rotated bounding boxes, which provides highly accurate values for comparing results. To show the performance from different perspectives, each frame in each selected sequence has been labeled manually or semi-manually with five visual attributes, each reflecting a particular challenge. The five attributes are Occlusion, Illumination Change, Motion Change, Size Change and Camera Motion. When a frame does not correspond to any of the five attributes, it is denoted as neutral. More detailed information about the VOT2014 dataset can be found in [39]. The VOT2015 dataset includes 60 sequences and is an extension of VOT2014.
Trackers: In VOT2014, the committee contributed 5 baseline trackers and 33 entries were submitted by participants; thus VOT2014 includes 38 trackers in total. For all these trackers, the default parameters are selected; when default parameters are not available, reasonable values are chosen. The 38 trackers can be roughly divided into several categories: several trackers track the target by parts, while others apply global generative visual models for target localization. The 38 trackers cover most popular and classical algorithms or trackers modified from them. Our tracker achieves the best performance compared with the 38 trackers. In order to show the performance clearly, we select 11 top and classical trackers, i.e., ACT [11], CMT [48], DSST [47], IIVT [49], KCF [12], LGT [50], MIL [17], OGT [51], PT [52], Struck [18] and CT [53], for display. We took part in VOT2015, whose results are published in [40].
Table 2: VOT2014 results with 12 top and classical algorithms (AR: accuracy rank, RR: robustness rank).
                  MIL    CT     PT     IIVT   OGT    CMT    Struck  ACT    LGT    KCF    DSST   Ours
baseline      AR  10.42  9.88   10.00  7.82   4.83   5.92   6.42    6.50   8.81   2.50   2.50   2.42
              RR  8.25   9.29   6.69   8.35   9.73   8.41   6.78    4.85   3.40   4.67   3.92   3.81
region_noise  AR  10.69  9.09   9.08   8.00   5.33   6.75   6.75    6.83   7.89   2.42   2.17   2.75
              RR  8.42   8.92   6.62   8.06   9.75   8.49   6.42    5.43   3.18   4.88   4.65   3.17
We select 6 classical trackers, MEEM [54], MUSTer [55], TGPR [56], KCFv2, DSST and sKCF, for comparison. MEEM uses an online SVM with a re-detection scheme based on the entropy of the score function. MUSTer is a dual-component approach to object tracking. TGPR models the probability of target appearance using Gaussian process regression. KCFv2 enhances the robustness of KCF by examining the similarity between each candidate patch generated by the KCF tracker and the restore-point patch. sKCF extends the KCF framework by introducing an adjustable Gaussian window function and a keypoint-based model for scale estimation.
Figure 6: The VOT2014 results on baseline and region noise. Our tracker achieves the best performance both in the
experiment baseline and region noise.
Evaluation: The VOT evaluation method focuses on two measures: accuracy and robustness. Accuracy measures how well the predicted bounding box overlaps with the ground truth. Robustness measures how many times the tracker loses the target during tracking; a failure is indicated when the overlap area becomes zero. It should be noted that the tracker is re-initialized five frames after each failure, and the ten frames after re-initialization are ignored in the computation. These settings reduce the bias in the robustness measure. Trackers run 15 times on each sequence to obtain better statistics on the performance measures. The per-frame accuracy is averaged over these runs, and the robustness on each sequence is computed by averaging failure rates over the different runs. The two measures of each attribute are calculated only on the subset of frames in the sequences that contain the corresponding attribute. After the above calculation, trackers are ranked with respect to each measure separately on each attribute, and the final ranking is obtained by averaging the ranks.
Figure 7: The VOT2014 per-attribute results in the experiment region_noise. Our tracker achieves the best performance in camera motion, illumination change, size change, motion change and neutral, and the third best in occlusion.
The overall results and the per-attribute results of VOT2014 are shown in Fig. 6 and Fig. 7, in which each tracker is represented as a point in the joint accuracy-robustness rank space. In this space, a tracker is better if it is closer to the top-right corner. As we can see from Fig. 6, in both the baseline and region_noise experiments, our tracker is the closest to the top-right corner and better than the DSST tracker, which ranked first in the VOT2014 challenge. In the baseline experiment, our tracker is better than DSST both in terms of accuracy rank and robustness rank. This is because our tracker exploits more information about the target from different views. In the region_noise experiment with perturbed initializations, our tracker is a little worse than DSST and KCF in terms of accuracy but much better than both in terms of robustness. Considering the combination of the two measures, our tracker is the closest to the top-right corner among the 12 state-of-the-art trackers. The result of the region_noise experiment shows the robustness of our tracker, which benefits from the multi-view model. The multi-view model makes our tracker more discriminative and robust to perturbed initializations. The summarized data can be seen in Table 2.
In order to better show the robustness of the trackers, we select the results of the experiment region_noise to show the per-attribute performance, which can be seen in Fig. 7. As can be seen, our tracker achieves the best performance on all the attributes except occlusion. For the attributes of camera motion, illumination change and motion change, our tracker, KCF and DSST all achieve the best performance in terms of accuracy; however, the proposed multi-view tracker is more robust than KCF and DSST. This benefits from the multi-view model, which exploits more information about the target from diverse views and makes the tracker robust in complicated scenarios. For the attribute of size change, our tracker, OGT, KCF and DSST all achieve the best performance in terms of accuracy; however, our tracker has a great advantage in robustness compared with OGT, KCF and DSST. These results show the effectiveness of our scale evaluation method, which achieves the same accuracy with a simpler mechanism than the image pyramid based scale evaluation in DSST. For the attribute of neutral, our tracker, CMT and DSST perform the best in terms of accuracy, and most of the trackers obtain the same results in terms of robustness; this may be because frames without any challenges are easy to track with few failures. For the attribute of occlusion, our tracker achieves the third best performance, worse than DSST and KCF in terms of accuracy and worse than KCF in terms of robustness. However, in the baseline experiment on the occlusion attribute, our tracker achieves the best performance. This is because the noisy initialization may interfere with our tracker when occlusion occurs; when the initialization is precise, our tracker achieves the best performance for occlusion. In summary, the proposed multi-view correlation filter tracker achieves the best or third best performance on the per-attribute results and the best overall performance on the VOT2014 benchmark.
The results of VOT2015 are shown in Table 3, which is excerpted from the official results [40].
In Table 3, A is the raw accuracy, R is the average number of failures, and Φ is the expected average overlap. As we can see, our tracker achieves the least average number of failures, which shows
the robustness of our fusion model. For raw accuracy, our tracker achieves the second best among
all the 7 trackers. DSST achieves the best raw accuracy, but our tracker is more robust. For Φ,
a combined measure of accuracy and robustness, our tracker is the second best and is better than
DSST and other KCF based trackers.
6. Conclusion
We propose a multi-view tracker with correlation filters to achieve real-time and robust performance. The multi-view model, which applies features from different views, shows its superiority in dealing with various changes of the tracking target and the background. The fusion method, with its selection and strengthening abilities, exploits well the complementary information between different views. The correlation filter not only provides an efficient tracking framework, but also well supports the
Table 3: VOT2015 results with 6 classical algorithms. (M is Matlab, C is C or C++.)
Tracker        A     R     Φ     Impl.
MEEM           0.50  1.85  0.22  M
MvCFT (ours)   0.52  1.72  0.21  binary
MUSTer         0.52  2.00  0.19  M, C
TGPR           0.48  2.31  0.19  M, C
KCFv2          0.48  2.17  0.17  M
DSST           0.54  2.56  0.17  M, C
sKCF           0.48  2.68  0.16  C
fusion of the models from different views. Besides, the simple scale evaluation mechanism shows its effectiveness on a large number of data sequences with scale changes. The experimental results on 75 selected sequences show the competitiveness of our tracker compared with 29 state-of-the-art or widely used trackers. The exhaustive analyses of the experimental results with different attributes present the capabilities of our tracker.
References
[1] Y. Wu, J. Lim, M. H. Yang, Online object tracking: A benchmark, in: 2013 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2013, pp. 2411–2418.
[2] I. Matthews, T. Ishikawa, S. Baker, The template update problem, IEEE Transactions on Pattern Analysis and
Machine Intelligence 26 (6) (2004) 810 – 815.
[3] J. Lim, R. S. Lin, M. H. Yang, D. A. Ross, Incremental learning for robust visual tracking, International Journal
of Computer Vision 77 (1-3) (2008) 125–141.
[4] X. Mei, H. Ling, Robust visual tracking using l1 minimization, in: 2009 IEEE 12th International Conference on
Computer Vision (ICCV), 2009, pp. 1436 – 1443.
[5] S. Yi, Y.-M. Cheung, Single object tracking via robust combination of particle filter and sparse representation,
in: Signal Processing, Vol. 110, 2015, pp. 178–187.
[6] X. You, X. Li, X.F.Zhang, A robust local sparse tracker with global consistency constraint, in: Signal Processing,
Vol. 111, 2014, pp. 308–318.
[7] Z. He, S. Yi, Y. M. Cheung, X. You, Y. Y. Tang, Robust object tracking via key patch sparse representation,
IEEE Transactions on Cybernetics PP (99) (2016) 1–11.
[8] X. Li, Z. He, X. You, C. P. Chen, A novel joint tracker based on occlusion detection, Knowledge-Based Systems
71 (2014) 409–418.
[9] X. You, Q. Chen, B. Fang, Y. Y. Tang, Thinning character using modulus minima of wavelet transform, Interna-
tional Journal of Pattern Recognition and Artificial Intelligence 20 (03) (2006) 361–375.
[10] J. Huang, X. You, Y. Yuan, F. Yang, L. Lin, Rotation invariant iris feature extraction using gaussian markov
random fields with non-separable wavelet, Neurocomputing 73 (4) (2010) 883–894.
[11] M. Danelljan, F. S. Khan, M. Felsberg, J. van de Weijer, Adaptive color attributes for real-time visual tracking,
in: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1090–1097.
[12] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE
Transactions on Pattern Analysis and Machine Intelligence 37 (3) (2015) 583–596.
[13] C. Ma, X. Yang, C. Zhang, M.-H. Yang, Long-term correlation tracking, in: 2015 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2015, pp. 5388–5396.
[14] R. T. Collins, Y. Liu, M. Leordeanu, Online selection of discriminative tracking features, IEEE Transactions on
Pattern Analysis and Machine Intelligence 27 (10) (2005) 1631–1643.
[15] S. Avidan, Support vector tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (8)
(2004) 1064–1072.
[16] H. Grabner, M. Grabner, H. Bischof, Real-time tracking via on-line boosting, in: British Machine Vision Con-
ference (BMVC), Vol. 1, 2006, p. 6.
[17] B. Babenko, M.-H. Yang, S. Belongie, Robust object tracking with online multiple instance learning, IEEE
Transactions on Pattern Analysis and Machine Intelligence 33 (8) (2011) 1619–1632.
[18] S. Hare, A. Saffari, P. H. Torr, Struck: Structured output tracking with kernels, in: 2011 IEEE International
Conference on Computer Vision (ICCV), 2011, pp. 263–270.
[19] Z. Kalal, J. Matas, K. Mikolajczyk, P-N learning: Bootstrapping binary classifiers by structural constraints, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 49–56.
[20] X. You, Q. Peng, Y. Yuan, Y.-m. Cheung, J. Lei, Segmentation of retinal blood vessels using the radial projection
and semi-supervised approach, Pattern Recognition 44 (10) (2011) 2314–2324.
[21] Z. Pan, X. You, H. Chen, D. Tao, B. Pang, Generalization performance of magnitude-preserving semi-supervised
ranking with graph-based regularization, Information Sciences 221 (2013) 284–296.
[22] W. Ou, X. You, D. Tao, P. Zhang, Y. Tang, Z. Zhu, Robust face recognition via occlusion dictionary learning,
Pattern Recognition 47 (4) (2014) 1559–1572.
[23] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparse collaborative appearance model, IEEE Trans-
actions on Image Processing 23 (5) (2014) 2356–2368.
[24] Y. Yin, D. Xu, X. Wang, M. Bai, Online state-based structured svm combined with incremental pca for robust
visual tracking, IEEE transactions on cybernetics 45 (9) (2015) 1988–2000.
[25] S. Chen, S. Li, R. Ji, Y. Yan, S. Zhu, Discriminative local collaborative representation for online object tracking,
Knowledge-Based Systems 100 (2016) 13–24.
[26] J. H. Yoon, D. Y. Kim, K.-J. Yoon, Visual tracking via adaptive tracker selection with multiple features, in: 2012
European Conference on Computer Vision (ECCV), Springer, 2012, pp. 28–41.
[27] X. Lan, A. J. Ma, P. C. Yuen, Multi-cue visual tracking using robust feature-level fusion based on joint sparse
representation, in: 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp.
1194–1201.
[28] J. Kwon, K. M. Lee, Visual tracking decomposition, in: 2010 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), 2010, pp. 1269–1276.
[29] N. V. Lopes, P. Couto, A. Jurio, P. Melo-Pinto, Hierarchical fuzzy logic based approach for object tracking,
Knowledge-Based Systems 54 (2013) 255–268.
[30] W. Zhong, H. Lu, M.-H. Yang, Robust object tracking via sparsity-based collaborative model, in: 2012 IEEE
Conference on Computer vision and pattern recognition (CVPR), 2012, pp. 1838–1845.
[31] J. H. Yoon, M. H. Yang, K. J. Yoon, Interacting multiview tracker, IEEE Transactions on Pattern Analysis and
Machine Intelligence 38 (5) (2016) 903–917.
[32] Y. Zhang, H. Ji, A robust and fast partitioning algorithm for extended target tracking using a gaussian inverse
wishart phd filter, Knowledge-Based Systems 95 (2016) 125–141.
[33] Z. He, Y. Cui, H. Wang, X. You, C. P. Chen, One global optimization method in network flow model for multiple
object tracking, Knowledge-Based Systems 86 (2015) 21–32.
[34] Z. He, X. Li, X. You, D. Tao, Y. Y. Tang, Connected component model for multi-object tracking, IEEE Transac-
tions on Image Processing 25 (8) (2016) 3698–3711.
[35] R. T. Collins, Mean-shift blob tracking through scale space, in: 2003 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition (CVPR), Vol. 2, 2003.
[36] X. Ma, Q. Liu, Z. He, X. Zhang, W.-S. Chen, Visual tracking via exemplar regression model, Knowledge-Based
Systems 106 (2016) 26 – 37.
[37] D. S. Bolme, J. R. Beveridge, B. Draper, Y. M. Lui, et al., Visual object tracking using adaptive correlation
filters, in: 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010, pp. 2544–2550.
[38] C. Xu, D. Tao, C. Xu, A survey on multi-view learning, arXiv preprint arXiv:1304.5634.
[39] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, L. Čehovin, G. Nebehay, T. Vojíř, G. Fernandez, A. Lukežič, A. Dimitriev, et al., The visual object tracking VOT2014 challenge results, in: 2014 European Conference on Computer Vision Workshops (ECCVW), Springer, 2014, pp. 191–217.
[40] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, R. Pflugfelder, A. Gupta, A. Bibi, A. Lukezic, A. Garcia-Martin, A. Saffari, A. Petrosino, A. S. Montero, The visual object tracking VOT2015 challenge results, in: 2015 IEEE International Conference on Computer Vision Workshops (ICCVW), 2015, pp. 564–586.
[41] Z. Hong, X. Mei, D. Prokhorov, D. Tao, Tracking via robust multi-task multi-view joint sparse representation,
in: 2013 IEEE International Conference on Computer Vision (ICCV), 2013, pp. 649–656.
[42] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: 2005 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition(CVPR), Vol. 1, 2005, pp. 886–893.
[43] J. Van De Weijer, C. Schmid, J. Verbeek, D. Larlus, Learning color names for real-world applications, IEEE
Transactions on Image Processing 18 (7) (2009) 1512–1523.
[44] L. Cehovin, M. Kristan, A. Leonardis, Is my new tracker really better than yours?, in: 2014 IEEE Winter
Conference on Applications of Computer Vision (WACV), 2014, pp. 540–547.
[45] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, A. Zisserman, The pascal visual object classes (voc)
challenge, International journal of computer vision(IJCV) 88 (2) (2010) 303–338.
[46] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local sparse appearance model, in: 2012 IEEE
Conference on Computer vision and pattern recognition (CVPR), 2012, pp. 1822–1829.
[47] M. Danelljan, G. Häger, F. Khan, M. Felsberg, Accurate scale estimation for robust visual tracking, in: British Machine Vision Conference (BMVC), Nottingham, September 1–5, 2014, BMVA Press, 2014.
[48] G. Nebehay, R. Pflugfelder, Consensus-based matching and tracking of keypoints for object tracking, in: 2014
IEEE Winter Conference on Applications of Computer Vision (WACV), 2014, pp. 862–869.
[49] K. M. Yi, H. Jeong, B. Heo, H. J. Chang, J. Y. Choi, Initialization-insensitive visual tracking through voting
with salient local features, in: 2013 IEEE International Conference on Computer Vision (ICCV), 2013, pp.
2912–2919.
[50] L. Cehovin, M. Kristan, A. Leonardis, Robust visual tracking using an adaptive coupled-layer visual model,
IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (4) (2013) 941–953.
[51] H. Nam, S. Hong, B. Han, Online graph-based tracking, in: 2014 European Conference on Computer Vision
(ECCV), Springer, 2014, pp. 112–126.
[52] S. Duffner, C. Garcia, Pixeltrack: a fast adaptive algorithm for tracking non-rigid objects, in: 2013 IEEE Inter-
national Conference on Computer Vision (ICCV), 2013, pp. 2480–2487.
[53] K. Zhang, L. Zhang, M.-H. Yang, Real-time compressive tracking, in: 2012 European Conference on Computer
Vision (ECCV), Springer, 2012, pp. 864–877.
[54] J. Zhang, S. Ma, S. Sclaroff, Meem: robust tracking via multiple experts using entropy minimization, in: 2014
European Conference on Computer Vision (ECCV), Springer, 2014, pp. 188–203.
[55] Z. Hong, Z. Chen, C. Wang, X. Mei, D. Prokhorov, D. Tao, Multi-store tracker (muster): A cognitive psychology
inspired approach to object tracking, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2015, pp. 749–758.
[56] J. Gao, H. Ling, W. Hu, J. Xing, Transfer learning based visual tracking with gaussian processes regression, in:
2014 European Conference on Computer Vision (ECCV), Springer, 2014, pp. 188–203.