IEEE TRANSACTIONS ON IMAGE PROCESSING 1
Self-supervised Deep Correlation Tracking
Di Yuan, Xiaojun Chang, Po-Yao Huang, Qiao Liu, and Zhenyu He, Senior Member, IEEE
Abstract—The training of a feature extraction network typ-
ically requires abundant manually annotated training samples,
making this a time-consuming and costly process. Accordingly,
we propose an effective self-supervised learning-based tracker in
a deep correlation framework (named: self-SDCT). Motivated by
the forward-backward tracking consistency of a robust tracker,
we propose a multi-cycle consistency loss as self-supervised
information for learning a feature extraction network from adjacent
video frames. At the training stage, we generate pseudo-labels
of consecutive video frames by forward-backward prediction
under a Siamese correlation tracking framework and utilize the
proposed multi-cycle consistency loss to learn a feature extrac-
tion network. Furthermore, we propose a similarity dropout
strategy to enable some low-quality training sample pairs to
be dropped and also adopt a cycle trajectory consistency loss
in each sample pair to improve the training loss function. At
the tracking stage, we employ the pre-trained feature extraction
network to extract features and utilize a Siamese correlation
tracking framework to locate the target using forward tracking
alone. Extensive experimental results indicate that the proposed
self-supervised deep correlation tracker (self-SDCT) achieves
competitive tracking performance compared with state-of-the-art
supervised and unsupervised tracking methods on standard
evaluation benchmarks.
Index Terms—Visual tracking, self-supervised learning, multi-
cycle consistency loss.
I. INTRODUCTION
SINGLE target tracking is an active and important
topic, with numerous applications including video surveillance,
autonomous vehicles, and man-machine interaction. Given the
ground-truth of a selected target in the first frame, the core
task of a tracker is to accurately predict the target position
in all subsequent video frames. Recently, trackers that depend on
deep convolutional neural networks (CNNs) trained on manually
annotated images have shown promising tracking performance.
However, it is still difficult to train an efficient feature
extraction network in the deep learning-based tracking framework,
because of the limited amount of labeled training data.
Tracking methods based on deep CNN structure have re-
cently achieved remarkable performance and have become
increasingly popular in the tracking community [1], [2], [3],
D. Yuan and Q. Liu are with the School of Computer Science and
Technology, Harbin Institute of Technology, Shenzhen, 518055 China. (e-
mail: dyuanhit@gmail.com; liuqiao@stu.hit.edu.cn).
X. Chang is with the Faculty of Information Technology, Monash Uni-
versity and a Distinguished Adjunct Professor with the Faculty of Com-
puting and Information Technology, King Abdulaziz University. (e-mail:
cxj273@gmail.com).
P-Y. Huang is with the School of Computer Science, Carnegie Mellon
University, Pittsburgh, PA 15213 USA (e-mail: poyaoh@cs.cmu.edu).
Z. He (Corresponding author) is with the School of Computer Science
and Technology, Harbin Institute of Technology, Shenzhen, 518055 China,
and also with Peng Cheng Laboratory, Shenzhen, 518055 China (e-mail:
zhenyuhe@hit.edu.cn).
Fig. 1. Comparison of tracking speed and AUC score of our self-SDCT
tracker and other deep learning based trackers on OTB-100 dataset.
[4], [5]. Usually, these deep CNN-based trackers exploit a
pre-trained network for feature extraction, then use
a correlation or similarity function to calculate a similarity
score between the template sample and each candidate sample, and
finally choose the candidate with the maximum score as
the target in the current image frame [1], [2], [6], [7], [8].
While these methods have improved performance relative to
trackers based on hand-crafted features, online tracking without
updating limits their generalization capabilities. Several
trackers have attempted to employ deep networks
for feature representation; however, when the target is unknown
during training, it is necessary to adapt the weights of the network
online by executing Stochastic Gradient Descent (SGD), which
significantly reduces the tracking speed [9], [10], [11]. In [12],
Bertinetto et al. propose a SiamFC tracker that focuses on
learning a similarity function of target and candidates in the
offline phase, which achieves remarkable tracking performance
compared with other trackers from the same period. The
ECO [4] tracker introduces a factorized convolution operator
into the discriminative correlation filter model and proposes
a generative model to enhance sample diversity, which can
improve both tracking accuracy and speed.
However, there are two major shortcomings of these deep
CNN-based trackers. The first one is that the feature extraction
network requires numerous manually annotated samples for
training. These manually annotated training samples are very
limited and obtaining them is also time-consuming and costly,
meaning that a trained feature extraction network based on
limited labeled samples is unable to represent the target
features well. The second one is that most deep convolutional
network-based trackers require a network with multiple layers
to extract features and fine-tune their pre-trained networks in
online tracking phases, which results in high computational
complexity. Some deep CNN-based trackers are unable to achieve
a real-time tracking speed because of the high dimension of
the feature extraction network [7], [9], [13]. For example, the
MDNet [9] tracker needs to pre-train a deep CNN architecture
for the similarity-matching task. In the tracking stage, the
MDNet tracker uses an SGD strategy to learn a detector
with candidates extracted from the current sequence. This
approach cannot reach a real-time tracking speed because
of its high computational cost. As shown in Fig. 1, the
computational overhead prevents trackers with deep features from
achieving real-time performance (e.g., SINT [2], MCPF [14],
and CREST [15]).
To solve the above two problems, in this work, we develop
a robust and efficient deep correlation-based tracker with
two key components: a self-supervised learning-based pre-
trained deep feature extraction network and an efficient deep
correlation tracking framework. Unlike most supervised and
unsupervised deep trackers, our self-supervised self-SDCT
tracker obtains a competitive tracking performance (see
Fig. 1). Despite the limited number of labeled training
samples, there are abundant unlabeled video sequences
available for self-supervised learning. In light of this
observation, we propose to train the feature extraction network
via self-supervised learning, so that only the label of the
target in the initial frame is needed. After the initial target’s
ground-truth is provided, we use a correlation filter approach
to generate pseudo-labels for the other samples, and use
a cycle-consistency loss for network training. The
cycle-consistency loss of most network training methods only
calculates the difference between the original state and the
final state after a forward-backward prediction. Different from
these methods, we use a multi-cycle consistency loss for our
network training, which considers both the final result (Fig. 4:
Final-Loss) and intermediate results (Fig. 4: Mid-Loss). The
multi-cycle consistency can improve the robustness of our
feature extraction network. In addition, to alleviate the impact
of low-quality training sample pairs, we propose a low
similarity dropout strategy to drop these training sample pairs.
Besides, the target and background can be better distinguished
through the cycle trajectory consistency of the target, which
can reduce the influence of background information on the
feature extraction network. Both the low similarity dropout
strategy and the cycle trajectory consistency loss can improve
the feature extraction network effectively. Once the network
training is completed, we apply it to an efficient Siamese
correlation tracking framework to track the target, and the
average tracking speed is around 48 fps. Compared to other
supervised tracking methods (such as CFNet [16] and SiamFC
[12]) and unsupervised tracking methods (such as UDT [17]),
our self-SDCT tracker can achieve competitive tracking results
(see Fig. 2).
The main contributions of this paper are as follows:
• We formulate a self-supervised learning scheme based on a
multi-cycle consistency loss to pre-train the deep feature
extraction network, which can take advantage of
extensive unlabeled video samples rather than limited
manually annotated samples.
• We use a multi-cycle consistency loss, a low similarity
dropout, and a cycle trajectory consistency loss to jointly
pre-train our feature extraction network, which can
effectively improve the representational ability and reduce
the overfitting risk.
• We conduct extensive experimental evaluations to demonstrate
the competitiveness of our self-SDCT tracker with
state-of-the-art supervised and unsupervised trackers on large
benchmarks: OTB-2013 [18], OTB-100 [19], UAVDT
[20], TColor-128 [21], and UAV-123 [22].
Fig. 2. Tracking examples of the proposed self-SDCT tracker and other
supervised and unsupervised trackers.
II. RELATED WORKS
In this section, we present some reviews of the relevant
literature regarding deep correlation tracking algorithms, self-
supervised learning for feature representation algorithms, and
cycle consistency in time series.
Deep Correlation Tracking. Trackers based on a deep corre-
lation structure have gained increasing attention. The Siamese
architecture-based tracking methods formulate the tracking
task as a cross-correlation problem [2], [12], [16], [23],
[24], [25], [26]. The SINT [2] tracker proposes to train a
Siamese network that determines the target location by finding
the maximum similarity between candidate samples and the
initial target. The SiamFC [12] tracker incorporates a fully-
convolutional network for tracking tasks, which demonstrates
powerful representation ability of the offline training feature
extraction network. Currently, Siamese network-based trackers
[27], [28], [29], [30] enhance their tracking accuracy by adding
a region proposal network (RPN) module. In [27], in order to
obtain high accuracy and real-time tracking performance, Li et
al. present a SiamRPN tracker, which can discard the
multi-scale test and online fine-tuning. However, the SiamRPN
tracker is susceptible to interference from similar objects in
the tracking scene, which reduces tracking performance.
Fan et al. [30] propose a Siamese network-based cascaded
RPN tracker (SiamCRPN). The SiamCRPN tracker gradually
refines the target’s position in each RPN through adjusted
anchor boxes, thereby making the target positioning more
accurate. Besides, the correlation filter (CF)-based tracking
methods train a linear template to discriminate between an
image patch and its translations. Benefiting from their formulation
in the Fourier domain, CF-based trackers can achieve a fast
tracking speed [31], [32], [33]. To further improve the
tracking performance of CF-based trackers, research has
been carried out from different aspects, e.g., scale estimation
[34], spatio-temporal context [35], [36], learning models [37],
non-linear kernels [38] and boundary effects [39], [40], [41].
Meanwhile, some deep learning-based tracking
methods have attempted to treat the correlation filter as an
additional layer in their network structure to achieve a faster
tracking speed. The CFNet [16] tracker integrates the
correlation filter into the Siamese network-based tracking framework
and gives a closed-form solution. The C-COT [42] tracker
introduces an effective expression for training continuous
convolution filters; moreover, the ECO [4] tracker proposes
a factorized convolution operator, which significantly reduces
the number of parameters in the C-COT [42] tracker. All of these
deep correlation tracking methods use either an off-the-shelf
feature extraction network (e.g. VGGNet or AlexNet) that is
fine-tuned on the tracking task or a large number of manually labeled
datasets to train their feature extraction networks. However, the
former usually brings high computational complexity to the
tracker and makes the tracking speed very slow, while the latter
typically produces some unsatisfactory tracking results due
to insufficient labeled training data. Although some changes
in network structure can improve the feature representation
capacity [43], [44], [45], [46], the insufficient labeled training
data is still a major constraint on network performance.
Accordingly, unlike the above deep trackers that use a pre-trained
feature extraction network with numerous manually labeled
training samples, we adopt a self-supervised learning
method to train the network by using training data that just
requires the initial target’s ground-truth, as the tracking task
does.
Self-supervised Learning for Feature Representation. The
learning of feature representations from numerous videos or
images has been extensively studied. Wu and Huang [47]
proposed a self-supervised learning approach using both supervised
and unsupervised training data. Based on this, they
present a discriminant-EM method that automatically
selects good classification features. The human visual system
often pays more attention to motion information; inspired
by this situation, Pathak et al. [48] propose a motion-based
segmentation on videos to obtain particular segments, which
are then used as a pseudo label to train a segmentation
convolutional network. Vondrick et al. [49] consider video
coloring as a self-supervised learning problem. This method
involves learning to associate an area of a color reference
frame with a region of a gray frame by learning an embedding
then copying the reference color of the specified area to the
gray image. This represents a departure from other methods
that use an off-the-shelf approach for tracking, to provide a
supervisory signal for training [48], [7], [50]. In [51], the
authors try to jointly learn optical flow and tracking and
consequently point out that these two problems are comple-
mentary. Lai et al. [52] proposed a memory-based method to
learn a feature representation, which can guarantee the pixel-
wise correspondences between frames. Our work is inspired
by the unsupervised representation learning method of UDT
[17], which integrates the tracking algorithm into unsupervised
training. We train our deep network for feature representation
using a self-supervised learning approach, which only requires
an initial target location without any additional information.
The supervisory signal we use to train the deep feature
extraction network comes from the pseudo-labels generated
by forward-backward tracking.
Cycle Consistency in Time Series. Cycle consistency
in time series has been widely explored in the
literature [53], [54], [55]. Wang et al. [51] propose to
use cycle consistency to learn visual representations,
mainly focusing on unifying optical flow and tracking
in a single video to achieve a better embedding representation
in a self-supervised way. Dwibedi et al. [56] train
a network using a differentiable temporal cycle-consistency
loss to seek correspondences across time in multiple videos
[57]. Li et al. [58] propose to track large image patches and
establish associations between consecutive video frames. As
a representative of cycle consistency in time series, forward-
backward consistency has been widely used in tracking tasks.
The TLD [59] tracker proposes a forward-backward error to
estimate the reliability of a tracking trajectory. Their tracking
result is corrected by verifying the trajectory backward and
comparing it to the relevant trajectory. The MTA [60] tracker
performs forward tracking by predicting the forward-backward
consistency of multiple component trackers and identifying the
best tracker through a maximum robustness score. The UDT
[17] tracker revisits the forward-backward tracking framework
and trains a deep tracker in an unsupervised way. However,
the above-mentioned cycle consistency in time series only
focuses on the final result; this can lead to inaccurate inter-
mediate results while the final result is accurate. Therefore,
we propose a multi-cycle consistency that also considers the
intermediate tracking results in the forward-backward tracking
process, yielding improved tracking performance.
III. SELF-SUPERVISED DEEP CORRELATION TRACKING
In this section, we propose the self-supervised deep cor-
relation tracking network for the tracking task. Firstly, we
provide a brief review of deep correlation network-based methods
in Sec. III-A. We then present the self-supervised learning
approach designed to pre-train the feature extraction network
with numerous non-labeled data in Sec. III-B. Furthermore,
we adopt multi-cycle consistency loss, low similarity dropout
and cycle trajectory consistency loss to improve the pre-trained
network effectively. Finally, we outline the training details in
Sec. III-C. The architecture of our self-supervised tracker is
illustrated in Fig.3.
A. Revisiting Deep Correlation Tracker
Tracking arbitrary targets can be addressed by using a cor-
relation learning method in a deep tracking framework (such
as the Siamese framework [16], [24]). Siamese correlation
trackers learn a function f(x, z) = g(ϕ(x), ϕ(z))
that compares an exemplar image z to a candidate image x
and returns a score that indicates their similarity. Since the
discriminative correlation filters framework can be efficiently
calculated in the Fourier domain, it is often added to the deep
tracking framework as a network layer to improve tracking
speed. Motivated by this, we use the discriminative correlation
filters framework for forward-backward tracking to generate
pseudo-labels of training sample pairs.
Fig. 3. An overview of the self-SDCT architecture. We use a Siamese
correlation filters tracking framework as the baseline. The feature extraction
network is trained through a forward-backward tracking task under a Siamese
correlation framework with a multi-cycle consistency loss. Once the training
is complete, we, like other Siamese-based trackers, use only forward tracking
to locate the target.
The discriminative correlation filters framework uses a target
sample $X$ and its label $Y$ to train a filter $W$:
$$W = \arg\min_{W} \|W \star X - Y\|^{2} + \lambda \|W\|^{2}, \tag{1}$$
where $\star$ is the circular convolution and $\lambda$ is a regularization
parameter. Since the label $Y$ is Gaussian-shaped, the filter
$W$ trained from the data $X$ contains the coefficients of the
Gaussian ridge regression. By using a Fourier transformation
to compute this Gaussian ridge regression model, the solution of
Eq. (1) can be obtained as follows:
$$W = \mathcal{F}^{-1}\left(\frac{\mathcal{F}(X) \odot \mathcal{F}^{\ast}(Y)}{\mathcal{F}^{\ast}(X) \odot \mathcal{F}(X) + \lambda}\right), \tag{2}$$
where $\mathcal{F}$ is the Fourier transformation, $\mathcal{F}^{-1}$ is its inverse
transformation, and the superscript $\ast$ denotes the complex
conjugate. At the tracking stage, an image patch $Z$ with
the same size as $X$ is cropped out of the current frame, and its
response score can be computed as:
$$f(Z) = \mathcal{F}^{-1}\left(\mathcal{F}(Z) \odot \mathcal{F}^{\ast}(W)\right), \tag{3}$$
where $f(Z)$ is the response map of image patch $Z$, while
$\odot$ means the element-wise product. Once $f(Z)$ is obtained,
we can select the location with the maximum response value in
$f(Z)$ as the target center, and treat it as the label center
to generate the pseudo-Gaussian label. The next step is to
train the new filter using the pseudo-Gaussian label and image
patch $Z$. After that, these steps are repeated to generate the
pseudo-Gaussian labels for other samples. Finally, the feature
extraction network is improved by repeated forward-backward
tracking.
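The closed-form training of Eq. (2) and the response of Eq. (3) can be sketched in a few lines of NumPy. This is a minimal single-channel illustration on raw pixels rather than learned features, not the authors' implementation: the function names (`gaussian_label`, `train_filter`, `response`, `peak`), the label bandwidth `sigma`, and the value of `lam` are assumptions made for the example, and the filter is kept in the Fourier domain instead of being transformed back.

```python
import numpy as np

def gaussian_label(shape, center, sigma=2.0):
    """Gaussian-shaped label Y peaking at `center` (row, col), with circular wrap."""
    h, w = shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    # Circular distances, matching the circular-convolution assumption of Eq. (1).
    dr = np.minimum(np.abs(rows - center[0]), h - np.abs(rows - center[0]))
    dc = np.minimum(np.abs(cols - center[1]), w - np.abs(cols - center[1]))
    return np.exp(-(dr ** 2 + dc ** 2) / (2.0 * sigma ** 2))

def train_filter(x, y, lam=1e-4):
    """Fourier-domain filter F(W) of Eq. (2); the inverse FFT is skipped."""
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    return (X * np.conj(Y)) / (np.conj(X) * X + lam)

def response(w_hat, z):
    """Response map f(Z) of Eq. (3): F^{-1}(F(Z) * conj(F(W)))."""
    Z = np.fft.fft2(z)
    return np.real(np.fft.ifft2(Z * np.conj(w_hat)))

def peak(resp):
    """(row, col) of the maximum response, used as the pseudo-label center."""
    r, c = np.unravel_index(np.argmax(resp), resp.shape)
    return int(r), int(c)

# Toy example: the "next frame" is the template shifted by (3, -2) pixels.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
y = gaussian_label(x.shape, center=(16, 16))
w_hat = train_filter(x, y)
z = np.roll(x, shift=(3, -2), axis=(0, 1))
new_center = peak(response(w_hat, z))
pseudo_label = gaussian_label(x.shape, new_center)  # label for the next step
```

In this toy setting the response peak moves with the shift of the patch, and the pseudo-label built around the new peak is what the next filter in the forward-backward chain would be trained on.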
B. Cycle Consistency Regression
Our work is motivated by forward-backward consistency in
time, which has been used to evaluate consistency in some
tracking methods [59], [60]. Considering that the tracking
task involves predicting and locating the target’s state in
subsequent frames given the initial ground-truth, we
propose a self-supervised learning method that uses massive
unlabeled data to pre-train our feature extraction network.
In each video sequence, we choose 4 image frames as one
training sample pair. With the ground-truth in the initial image
frame, we use forward-backward tracking under the
Siamese correlation framework to generate the pseudo-labels
of the other frames for multi-cycle consistency training. To further
enhance the capabilities of the feature extraction network, we
also use a similarity function to drop out some low-quality
training pairs and use a cycle trajectory consistency loss to
highlight the role of moving targets in the training process.
Fig. 4. Example of the proposed multi-cycle consistency loss. The multi-cycle
consistency loss not only takes into account the final loss in the forward and
backward movement of the target (Final-Loss), but also the loss in the middle
of the movement (Mid-Loss).
Multi-Cycle Consistency Loss. Conventional forward-backward
tracking is concerned only with the final tracking
result; in other words, it only cares
about the result of starting from the first frame and finally
returning to the first frame (Fig. 4: Final-Loss). The
accuracy of the tracking results in the intermediate frames
is not directly addressed by existing work. In fact, many
trackers may relocate the target a long time after
losing it. However, the performance
of such a tracker is unacceptable. We accordingly propose
that both the final tracking result (Fig. 4: Final-Loss) and the
result of the intermediate frame (Fig. 4: Mid-Loss) should
be considered in the forward-backward tracking process.
Therefore, we implement a multi-cycle consistency loss for
the training stage. Fig. 4 presents an overview of the proposed
method. The multi-cycle consistency loss can be written as
follows:
$$L^{i}_{total} = \sum_{t} L_{t}, \tag{4}$$
where $L^{i}_{total}$ is the multi-cycle consistency loss of the $i$-th
training sample pair, $L_{t} = \|R_{t} - R'_{t}\|^{2}$ is the forward-backward
loss of each training sample in the same pair, and $R_{t}$ and
$R'_{t}$ denote the response maps of forward and backward tracking of
the $t$-th image patch.
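As a small illustration of Eq. (4), the sketch below sums the squared differences between forward and backward response maps over the frames of one training pair. The function name `multi_cycle_loss` and the toy 2×2 maps are hypothetical, and which frames contribute a term (the first frame for the Final-Loss, intermediate frames for the Mid-Loss) is an assumption about how the lists are filled.

```python
import numpy as np

def multi_cycle_loss(forward_maps, backward_maps):
    """Return (L_total, per-frame losses) with L_t = ||R_t - R'_t||^2."""
    per_frame = [float(np.sum((rf - rb) ** 2))
                 for rf, rb in zip(forward_maps, backward_maps)]
    return sum(per_frame), per_frame

# Toy response maps: frame 0 disagrees everywhere, frame 1 agrees exactly.
forward_maps = [np.ones((2, 2)), np.zeros((2, 2))]
backward_maps = [np.zeros((2, 2)), np.zeros((2, 2))]
l_total, l_frames = multi_cycle_loss(forward_maps, backward_maps)
```

Keeping the per-frame terms separate makes it easy to see whether the disagreement comes from the final frame or from an intermediate one.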
Low Similarity Dropout. The quality of the training samples
also greatly affects feature learning for tracking. In the
training dataset, sample pairs may contain targets with different
similarities (as shown in Fig. 5(a)). Without any selection, pairs
with different similarities have the same influence on the training
process, which harms the representational ability of the trained
network. Moreover, if the samples in a training pair do not
all contain the target, this constitutes a fatal
blow to the trained feature extraction network. Therefore, in
the training process, we take the similarity between samples
in each training pair into account to improve the robustness
of the feature extraction network. High similarity indicates that
the sample pair is more important; thus, we retain it in the
training process. Sample pairs with low similarity may
not contain the moving object in every frame, and adding them to
the training process will undermine the representational ability
of the feature extraction network. Therefore, we drop
the low-similarity training pairs to solve this problem.
The similarity between the samples in each training pair can be
calculated as:
$$f_{s} = \mathrm{Similarity}(x, y), \tag{5}$$
where $x$ and $y$ denote training samples in the same pair: $x$ is the
first frame and $y = \{y_{1}, y_{2}, y_{3}\}$ are the other frames. The
similarity function can be the Euclidean function, Mahalanobis
function, Cosine function, etc. In this paper, we use the
Euclidean function. To ensure the quality of these training
samples and avoid overfitting, we drop 10% of the
training sample pairs:
$$f_{drop} = \begin{cases} 1, & f_{vs} > \alpha \\ 0, & \text{otherwise} \end{cases} \tag{6}$$
where $f_{drop}$ denotes the dropout condition, while $\alpha$ is a
threshold determined by the similarity ranking result of all
training sample pairs and the dropout rate.
$f_{vs} = (f_{s}(x, y_{1}) + f_{s}(x, y_{2}) + f_{s}(x, y_{3}))/3$ is the average similarity of each
sample pair. After dropping the 10% of training sample pairs with
the lowest similarity, our network becomes more suitable for
tracking tasks.
(a) training sample pairs
(b) cycle trajectory consistency loss
Fig. 5. (a) Examples of training sample pairs with different similarities.
Dropping out some samples with low similarity can reduce training loss and
avoid overfitting. (b) Examples of cycle trajectory consistency loss. The target
has forward and backward consistency during the movement. In adjacent
image frames, the moving part is more likely to be target than background.
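The ranking-based threshold of Eq. (6) can be sketched as follows. The text does not spell out how the Euclidean function is turned into a similarity score, so this example assumes the negative Euclidean distance (higher means more similar); the helper names `average_similarity` and `dropout_mask` are hypothetical, and α is taken as the largest average similarity among the pairs to be dropped.

```python
import numpy as np

def average_similarity(x, others):
    """f_vs: mean similarity between the first frame x and the other frames.

    Assumption: similarity = negative Euclidean distance.
    """
    return float(np.mean([-np.linalg.norm(x - y) for y in others]))

def dropout_mask(avg_sims, drop_rate=0.10):
    """f_drop per pair: 1 keeps the pair, 0 drops it (lowest-similarity pairs)."""
    avg_sims = np.asarray(avg_sims, dtype=float)
    n_drop = int(np.floor(drop_rate * avg_sims.size))
    if n_drop == 0:
        return np.ones(avg_sims.size, dtype=int)
    # alpha is set by the similarity ranking and the dropout rate.
    alpha = np.sort(avg_sims)[n_drop - 1]
    return (avg_sims > alpha).astype(int)

# Ten toy pairs; the pair with average similarity -5.0 is the least similar.
sims = [-1.0, -5.0, -2.0, -3.0, -4.0, -1.5, -2.5, -3.5, -4.5, -0.5]
mask = dropout_mask(sims, drop_rate=0.10)
f_vs = average_similarity(np.zeros(4), [np.ones(4)] * 3)
```

With tied similarities at the threshold, this sketch drops all tied pairs, so slightly more than the nominal rate can be removed.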
Cycle Trajectory Consistency Loss. In addition to the low
similarity of sample pairs, which degrades the performance
of the feature extraction network, the inclusion of a large
amount of background information in the training samples
also affects the network’s performance. The trajectory
of the target can draw an effective distinction between the
target and the background [17], [60]. More specifically, the
trajectory of the target moving from the current $t$-th frame to
the next $(t+1)$-th frame is consistent with the trajectory from
the $(t+1)$-th frame back to the $t$-th frame. Meanwhile, the relative
target position between these two frames is also consistent.
Considering this trajectory consistency, we design a
cycle trajectory consistency loss $L_{tc}$ that reduces the
background impact on the tracking performance, and apply it to all
training sample pairs. Every element $L^{i}_{tc}$ can be calculated
as follows:
$$L^{i}_{tc} = \sum_{t} L^{i}_{t \to t+1}, \qquad L^{i}_{t \to t+1} = \frac{1}{2}\left(\|R_{t} - R_{t+1}\|_{2}^{2} + \|R_{t+1} - R'_{t}\|_{2}^{2}\right), \tag{7}$$
where $L^{i}_{tc}$ is the cycle trajectory consistency loss of the
$i$-th training sample pair, $L^{i}_{t \to t+1}$ is the cycle trajectory
consistency loss from the $t$-th frame to the $(t+1)$-th frame in the $i$-th training
pair (see Fig. 5(b)), $R_{t}$ is the label of the $t$-th frame, $R_{t+1}$ is the
label of the $(t+1)$-th frame generated by forward tracking, and $R'_{t}$ is
the label of the $t$-th frame generated by backward tracking.
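A minimal sketch of the per-step term and the sum in Eq. (7) is given below. The function names and the toy 2×2 labels are hypothetical; `labels[t]` stands for the label of frame t along the forward pass, and `backward_labels[t]` for the label of frame t obtained by tracking back from frame t+1.

```python
import numpy as np

def step_trajectory_loss(r_t, r_next, r_t_back):
    """L_{t->t+1} = 0.5 * (||R_t - R_{t+1}||^2 + ||R_{t+1} - R'_t||^2)."""
    forward_term = np.sum((r_t - r_next) ** 2)
    backward_term = np.sum((r_next - r_t_back) ** 2)
    return 0.5 * float(forward_term + backward_term)

def cycle_trajectory_loss(labels, backward_labels):
    """L_tc for one pair: sum of the step losses over consecutive frames."""
    return sum(step_trajectory_loss(labels[t], labels[t + 1], backward_labels[t])
               for t in range(len(labels) - 1))

# Toy labels for two frames: the label changes completely between them.
labels = [np.zeros((2, 2)), np.ones((2, 2))]
backward_labels = [np.zeros((2, 2))]
l_tc = cycle_trajectory_loss(labels, backward_labels)
```

The value grows when the labels of adjacent frames differ, which is the quantity that appears in the denominator of the combined objective.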
Cycle Consistency Regression Loss. Taking into account the
multi-cycle consistency loss, the low similarity dropout and
the cycle trajectory consistency loss, our cycle consistency
regression objective function can be written as follows:
$$L_{cc} = \sum_{i} L^{i}_{cc}, \tag{8}$$
where $L_{cc}$ is the total cycle consistency regression loss and
$L^{i}_{cc} = (L^{i}_{total} \cdot f^{i}_{drop}) / (L^{i}_{tc} + \varepsilon)$
is the cycle consistency regression loss of the $i$-th training
sample pair; moreover, $\varepsilon$
is a parameter used to ensure that the denominator is not 0
(we set $\varepsilon = 1$ in this work). We ensure the tracking accuracy
by narrowing the difference between the forward-backward
tracking results of the same image frame. Furthermore, in
the case that the motion probability of the target is greater
than that of the background, we make sure that the tracked
position is the target rather than the background by increasing
the difference between adjacent frames.
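To make the interaction of the three terms in Eq. (8) concrete, the sketch below combines per-pair values of the multi-cycle loss, the dropout flag, and the cycle trajectory loss. The function names and the numbers are hypothetical; only ε = 1 is taken from the text.

```python
def pair_loss(l_total, l_tc, f_drop, eps=1.0):
    """L_cc^i = (L_total^i * f_drop^i) / (L_tc^i + eps)."""
    return (l_total * f_drop) / (l_tc + eps)

def cycle_consistency_loss(l_totals, l_tcs, f_drops, eps=1.0):
    """L_cc: sum of the per-pair losses over all training pairs."""
    return sum(pair_loss(lt, ltc, fd, eps)
               for lt, ltc, fd in zip(l_totals, l_tcs, f_drops))

# Two toy pairs: the second one is dropped (f_drop = 0) and contributes nothing.
l_cc = cycle_consistency_loss(l_totals=[4.0, 2.0],
                              l_tcs=[3.0, 1.0],
                              f_drops=[1, 0])
```

A dropped pair has zero loss and therefore no gradient, while a larger trajectory term in the denominator lowers the loss of a retained pair.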
C. Self-supervised Training Details
Network Structure. With reference to the DCFNet [24]
tracker and the UDT [17] tracker, we used a network with
only two convolutional layers to extract features and track
the target under the Siamese framework. The filter sizes are
3×3×3×32 and 3×3×32×32, respectively. Since there
are only two convolutional layers in the feature extraction
network, the magnitude of parameters in this network is very
small. The training process only needs 30 iterations, and the
model can reach convergence. This lightweight network (less
than 40KB) thus provides competitive and real-time tracking
speed.
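The stated model size can be checked with a short calculation from the two filter shapes. The sketch assumes 32-bit floating-point weights and counts convolution weights only (no biases or other stored data), which the text does not specify.

```python
def conv_weights(k_h, k_w, c_in, c_out):
    """Number of weights in a convolution layer with the given filter shape."""
    return k_h * k_w * c_in * c_out

layer1 = conv_weights(3, 3, 3, 32)    # first layer: 3x3x3x32
layer2 = conv_weights(3, 3, 32, 32)   # second layer: 3x3x32x32
total_weights = layer1 + layer2

bytes_per_weight = 4                  # assumption: float32 storage
size_kib = total_weights * bytes_per_weight / 1024
```

Under these assumptions the two layers hold 864 + 9216 = 10080 weights, about 39.4 KiB, which is consistent with the reported size of less than 40KB.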
Training Data. We choose the ILSVRC2015 [61] as our
training dataset just like other supervised [16], [12], [24] and
unsupervised [17] trackers. Unlike the supervised trackers,
however, we do not require the labels for each image frame
[16], [24]; instead, we follow the unsupervised UDT [17]
tracker, which does not pre-process any training data but rather
only crops the center patch in every image frame and resizes
it to 125×125. For each video, we choose four cropped
patches from continuous frames, then set one as the template
image and the others as the search images. We take the target
at the center of the template image as the tracking target and give
its ground-truth.
D. Model Update
To adapt to target appearance variations in the tracking
stage, a linear model update strategy is adopted to update
the correlation filter parameters:
$$W = (1 - \delta) W_{t-1} + \delta W_{t}, \tag{9}$$
where $\delta$ is the learning rate, $W_{t-1}$ is the previous filter, and $W_{t}$ is the current correlation filter.
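The linear update of Eq. (9) is a running average of the filter. A minimal sketch, using the learning rate of 0.025 reported in the experimental details and a hypothetical function name `update_filter`:

```python
import numpy as np

def update_filter(w_prev, w_curr, delta=0.025):
    """W = (1 - delta) * W_{t-1} + delta * W_t."""
    return (1.0 - delta) * w_prev + delta * w_curr

# Toy filters: the update moves 2.5% of the way from the old to the new filter.
w_prev = np.zeros((2, 2))
w_curr = np.ones((2, 2))
w_new = update_filter(w_prev, w_curr)
```

A small learning rate keeps the model close to earlier appearances of the target, so a single poor frame changes the filter only slightly.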
IV. EXPERIMENTS
We first introduce some experimental details and the
evaluation criterion, then analyze the effectiveness of each
component of the proposed self-supervised learning-based
pre-trained feature extraction network. Finally, we evaluate
our self-supervised learning-based self-SDCT
tracker alongside some state-of-the-art supervised and
unsupervised trackers on the OTB-2013 [18], OTB-100 [19], UAVDT
[20], TColor-128 [21], and UAV-123 [22] datasets.
A. Experimental Details and Evaluation Criterion
Experimental Details. We follow UDT [17] and DCFNet
[24] which apply the stochastic gradient descent (SGD) with
a momentum of 0.9 to train the feature extraction network.
The weight decay is set to 5e-4, and the learning rate is set to
1e-5. The network is trained for 30 epochs with a mini-batch
size of 32. The model update learning rate δis set to 0.025.
Our experiments are performed in Matlab2019 on a PC with
an i7 4.2GHz CPU and an NVIDIA GTX 2080Ti GPU. The
tracking speed is around 48 fps.
Evaluation Criterion. We mainly use the precision and
success indices [62], which are introduced in the OTB
benchmark [18], [19], to evaluate the tracking performance of
our self-SDCT tracking method. The precision index refers to the average
distance precision of the predicted position and the ground-
truth under different thresholds. Meanwhile, the success index
is measured by an average overlap of the tracking result and
the ground-truth, and trackers are ranked using area-under-the-
curve (AUC). Moreover, tracking speed is also a significant
index for evaluating a tracker.
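The two indices can be sketched as follows for boxes given as (x, y, w, h). This is an illustrative reading of the description above, not the benchmark toolkit: the function names are hypothetical, and the 20-pixel precision threshold and the 21 evenly spaced overlap thresholds are assumptions based on common OTB practice.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between predicted and ground-truth box centers."""
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_c - gt_c, axis=1)

def overlap(pred, gt):
    """Intersection-over-union of predicted and ground-truth boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_score(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    return float(np.mean(center_error(pred, gt) <= threshold))

def success_auc(pred, gt, thresholds=None):
    """Area under the success curve (mean success rate over thresholds)."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)
    ious = overlap(pred, gt)
    return float(np.mean([np.mean(ious > t) for t in thresholds]))

# Two toy frames: one perfect prediction and one complete miss.
gt = np.array([[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0]])
pred = np.array([[0.0, 0.0, 10.0, 10.0], [100.0, 100.0, 10.0, 10.0]])
prec = precision_score(pred, gt)
auc = success_auc(pred, gt)
```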
B. Ablation Study
We carry out ablation studies on the OTB-2013 [18] and
OTB-100 [19] benchmarks to analyze the effect of each component
in the training process. The comparison results are shown
in Table I. Note that self-SDCT denotes the tracking
result of the pre-trained network with every component included;
self-SDCTsccl denotes the tracking result of the pre-trained
network with only the final consistency loss (shown in Fig. 4);
self-SDCTolsd denotes the tracking result of the pre-trained
network without low similarity dropout; and self-SDCTotcl
denotes the tracking result of the pre-trained network without
cycle trajectory consistency loss. From Table I we can see that
the tracking performance of our self-SDCT tracking method is
clearly better than that of the self-SDCTsccl tracker (e.g.,
63.8% versus 61.6% AUC on OTB-100), which benefits from the
multi-cycle consistency loss. Moreover, if any of these three
components is removed, the tracking performance is reduced.
This directly reflects the effectiveness of multi-cycle
consistency, low similarity dropout and cycle trajectory
consistency in the network training process.

TABLE I
ABLATION STUDY RESULTS ON THE OTB-2013 AND OTB-100 DATASETS.
| Trackers | OTB-2013 [18] Prec. scores | OTB-2013 [18] AUC scores | OTB-100 [19] Prec. scores | OTB-100 [19] AUC scores |
|---|---|---|---|---|
| self-SDCT | 84.9 | 64.1 | 84.6 | 63.8 |
| self-SDCTsccl | 82.3 | 62.0 | 81.9 | 61.6 |
| self-SDCTolsd | 83.5 | 63.5 | 82.5 | 62.7 |
| self-SDCTotcl | 83.0 | 63.2 | 82.1 | 62.4 |

We also report the tracking performance of the pre-trained
network under different dropout rates in Table II. From this
table, we can see that an appropriate dropout rate (e.g., 5%,
10%) brings certain tracking performance improvements to the
pre-trained network. However, a larger dropout rate (e.g., 20%)
reduces the diversity of the training samples and causes the
tracking performance of the pre-trained network to decline.
Unless otherwise specified, the dropout rate in this paper is
set to 10%.

TABLE II
TRACKING RESULTS UNDER DIFFERENT DROPOUT RATES.
| Dropout rate | OTB-2013 [18] Prec. scores | OTB-2013 [18] AUC scores | OTB-100 [19] Prec. scores | OTB-100 [19] AUC scores |
|---|---|---|---|---|
| No dropout | 83.5 | 63.5 | 82.5 | 62.7 |
| 5% dropout | 84.0 | 63.5 | 83.8 | 63.2 |
| 10% dropout | 84.9 | 64.1 | 84.6 | 63.8 |
| 15% dropout | 83.5 | 63.1 | 83.7 | 63.5 |
| 20% dropout | 81.4 | 62.0 | 81.9 | 62.3 |
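The low similarity dropout studied in Table II can be sketched as a simple filtering step. The sketch assumes each training pair already carries a scalar similarity score (how that score is computed is defined elsewhere in the paper and is not reproduced here); `drop_low_similarity` and the toy scores are hypothetical.

```python
def drop_low_similarity(pairs, similarities, dropout_rate=0.10):
    """Remove the `dropout_rate` fraction of pairs with the lowest similarity."""
    if len(pairs) != len(similarities):
        raise ValueError("each pair needs exactly one similarity score")
    if not 0.0 <= dropout_rate < 1.0:
        raise ValueError("dropout_rate must be in [0, 1)")
    num_drop = int(len(pairs) * dropout_rate)
    # Indices sorted from least to most similar; the first num_drop are dropped.
    order = sorted(range(len(pairs)), key=lambda i: similarities[i])
    dropped = set(order[:num_drop])
    return [pair for i, pair in enumerate(pairs) if i not in dropped]


# Toy batch of ten pairs (assumption): pair "d" has the lowest similarity.
batch = list("abcdefghij")
scores = [0.9, 0.8, 0.7, 0.1, 0.6, 0.95, 0.85, 0.75, 0.65, 0.55]
kept = drop_low_similarity(batch, scores, dropout_rate=0.10)
```

Raising `dropout_rate` removes more pairs per batch, which matches the trade-off reported in Table II: a moderate rate filters unreliable pairs, while a large rate discards useful training diversity.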
C. State-of-the-art Comparison
To verify the proposed self-SDCT tracker, we compare it
with some state-of-the-art trackers on standard benchmark
datasets, including OTB-100 [19], UAVDT [20], TColor-128 [21],
and UAV-123 [22].
Experiment on OTB-100 Benchmark. We compare our self-SDCT
tracker with other trackers, including ATOM [63], SiamRPN [27],
MetaCREST [64], UDT+ [17], TRACA [65], ARCF [66], ACFN [67],
SiamTri [68], SiamFC [12], DCFNet [24], CFNet [16], CNT [69]
and UDT [17], on the OTB-100 [19] dataset. Fig. 6 presents the
results of this comparison. It shows that our self-SDCT tracker
is comparable with the baseline fully-supervised trackers [16],
[12], [24]. Compared to CFNet [16], our proposed tracker
achieves a 5.2% improvement in terms of the AUC index.
Moreover, the accuracy of our self-SDCT tracker is comparable
to that of the SiamRPN [27] tracker. This tracking result
demonstrates that the self-supervised learning method for
feature extraction network training is very effective. The
accuracy of our self-SDCT tracker is slightly worse than that
of the ATOM [63] tracker, mainly because the ATOM tracker
benefits from an accurate target estimation strategy. The
feature extraction network of the ATOM tracker is trained in a
supervised manner, which requires a large amount of labeled
training data. Instead, the feature extraction network of our
self-SDCT tracker is trained in a self-supervised manner, which
means our network can be trained without labels for the
training data.

Fig. 6. Precision and success plots on OTB-100 [19] dataset.
Fig. 7. Precision and success plots on UAVDT [20] dataset.

TABLE III
STATE-OF-THE-ART COMPARISON ON THE UAVDT [20] DATASET IN
TERMS OF PRECISION SCORES, SUCCESS SCORES AND TRACKING SPEED.
THE FIRST, SECOND AND THIRD BEST ARE HIGHLIGHTED IN RED, BLUE
AND GREEN, RESPECTIVELY.
| Trackers | Precision (%) | AUC (%) | Tracking Speed (fps) |
|---|---|---|---|
| SINT [2] | 57.0 | 29.0 | 4 |
| HDT [7] | 59.6 | 30.3 | 10 |
| STRCF [41] | 63.8 | 41.7 | 29 |
| CREST [15] | 65.0 | 39.6 | 1 |
| MCPF [14] | 66.0 | 39.8 | 1 |
| PTAV [70] | 67.5 | 38.1 | 26 |
| SRDCF [35] | 67.9 | 43.2 | 5 |
| CFNet [16] | 68.1 | 42.8 | 65 |
| SiamFC [12] | 68.2 | 44.7 | 58 |
| UDT+ [17] | 68.7 | 43.6 | 35 |
| UDT [17] | 68.8 | 44.6 | 45 |
| Staple-CA [71] | 69.7 | 39.5 | 35 |
| ARCF [66] | 74.0 | 47.0 | 15 |
| self-SDCT (Ours) | 71.1 | 45.3 | 48 |
Experiment on UAVDT Benchmark. Fig. 7 and Table III
illustrate the experimental results of the proposed self-SDCT
tracker against other trackers, including ARCF [66], Staple-CA
[71], UDT+ [17], UDT [17], SRDCF [35], SiamFC [12], CFNet
[16], CREST [15], MCPF [14], PTAV [70], SINT [2], STRCF
[41], and HDT [7], on the UAVDT [20] dataset. Among the
compared tracking methods, our self-SDCT tracker achieves
the second-best scores in both precision and AUC metrics.
Although the tracking accuracy of our self-SDCT tracker is
worse than that of the ARCF [66] tracker, the tracking speed of
ARCF is far lower than that of our tracker (15 fps versus
48 fps) and cannot even meet the requirements for real-time
tracking (as shown in Table III). Fig. 8 shows the
comparative performance of these trackers on the long-term
attribute of the UAVDT dataset, which demonstrates that our
self-SDCT tracker can achieve competitive tracking results on
long-term tracking sequences. These experimental results
demonstrate that the self-supervised learning-based tracker
can also achieve competitive tracking results.

Fig. 8. Precision and success plots for long-term attribute on UAVDT dataset.
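The speed and accuracy trade-off discussed above can be checked directly from the Table III values. In the sketch below, the 25 fps real-time threshold is an assumption (the paper does not state one), and the helper names are illustrative.

```python
# (tracker, precision %, AUC %, speed in fps), copied from Table III.
UAVDT_RESULTS = [
    ("SINT", 57.0, 29.0, 4), ("HDT", 59.6, 30.3, 10),
    ("STRCF", 63.8, 41.7, 29), ("CREST", 65.0, 39.6, 1),
    ("MCPF", 66.0, 39.8, 1), ("PTAV", 67.5, 38.1, 26),
    ("SRDCF", 67.9, 43.2, 5), ("CFNet", 68.1, 42.8, 65),
    ("SiamFC", 68.2, 44.7, 58), ("UDT+", 68.7, 43.6, 35),
    ("UDT", 68.8, 44.6, 45), ("Staple-CA", 69.7, 39.5, 35),
    ("ARCF", 74.0, 47.0, 15), ("self-SDCT", 71.1, 45.3, 48),
]

REAL_TIME_FPS = 25.0  # assumed threshold, not taken from the paper


def real_time_trackers(results, min_fps=REAL_TIME_FPS):
    """Keep only the rows whose tracking speed reaches `min_fps`."""
    return [row for row in results if row[3] >= min_fps]


def best_by(results, column):
    """Name of the tracker with the highest value in the given column index."""
    return max(results, key=lambda row: row[column])[0]


fast = real_time_trackers(UAVDT_RESULTS)
fast_names = [row[0] for row in fast]
best_auc_overall = best_by(UAVDT_RESULTS, 2)
best_auc_real_time = best_by(fast, 2)
best_precision_real_time = best_by(fast, 1)
```

Under this assumed threshold, ARCF leads overall but is filtered out, and self-SDCT has the highest precision and AUC among the remaining trackers.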
Experiment on UAV-123 Benchmark. Table IV presents
the comparison results of our self-SDCT tracker and other
state-of-the-art trackers, including UDT [17], CFNet [16],
SRDCF [35], MUSTer [72], SAMF [73], MEEM [74], SiamFC
[12], DSST [75], ARCF [66], ARCFH [66], BACF [39] and
CNT [69], on the UAV-123 [22] dataset. Among these compared
tracking methods, our self-SDCT tracker achieves the best
scores in both precision and AUC metrics. Compared with
CF-based tracking methods (e.g., SAMF, DSST), the
proposed self-SDCT tracker achieves a remarkable improvement
in tracking performance (72.6% versus 59.2% and 58.6%
precision). Compared with deep learning-based tracking methods
(e.g., CFNet, CNT), the proposed self-SDCT tracker also
improves tracking performance. By taking the intermediate
state into account, our multi-cycle consistency loss-based
self-SDCT tracker outperforms the unsupervised learning-based
UDT [17] tracker. In general, our self-SDCT tracking method
performs favorably against these state-of-the-art trackers in
terms of tracking performance.
Experiment on TColor-128 Benchmark. We also evaluate
our self-SDCT tracker on the TColor-128 [21] benchmark
against 12 state-of-the-art trackers, including UDT [17], BACF
[39], SRDCF [35], Staple [71], MEEM [74], SiamFC [12],
CFNet [16], HDT [7], DSST [75], ARCF [66], CNT [69] and
SITUP [76]. The experimental comparison results are presented
in Table V. Among these 12 compared trackers, the correlation
filter-based BACF, ARCF, SRDCF and DSST trackers
achieve precision and AUC scores of (66.0%/49.6%),
(70.9%/52.5%), (69.6%/51.6%) and (54.9%/38.7%),
respectively. By contrast, our self-SDCT tracker performs well
in both precision and AUC metrics (72.9%/54.0%). Moreover, our
self-SDCT tracker also achieves competitive tracking
performance compared to some supervised learning-based tracking
methods (e.g., CFNet, SiamFC). Compared with the SiamFC
[12] tracker, our tracker does not require a large amount of
labeled data for training and still achieves an improvement of
more than 3% in tracking performance. In summary, our
self-SDCT tracking method achieves competitive tracking
performance compared with other state-of-the-art trackers.

TABLE IV
PRECISION AND AUC SCORES OF THE PROPOSED SELF-SDCT TRACKER AND OTHER TRACKERS ON THE UAV-123 [22] DATASET. THE FIRST, SECOND
AND THIRD BEST SCORES ARE HIGHLIGHTED IN RED, BLUE AND GREEN, RESPECTIVELY.
| Trackers | Precision (%) | AUC (%) |
|---|---|---|
| self-SDCT (Ours) | 72.6 | 50.1 |
| UDT [17] | 67.2 | 48.0 |
| CFNet [16] | 65.1 | 43.6 |
| SRDCF [35] | 67.6 | 46.4 |
| MUSTer [72] | 59.1 | 39.1 |
| SAMF [73] | 59.2 | 39.6 |
| MEEM [74] | 62.7 | 39.2 |
| SiamFC [12] | 72.6 | 49.8 |
| DSST [75] | 58.6 | 35.6 |
| ARCF [66] | 67.6 | 47.0 |
| ARCFH [66] | 65.6 | 45.3 |
| BACF [39] | 65.4 | 45.7 |
| CNT [69] | 52.4 | 36.9 |

TABLE V
PRECISION AND AUC SCORES OF OUR SELF-SDCT TRACKER AND OTHER TRACKERS ON THE TCOLOR-128 [21] DATASET. THE FIRST, SECOND AND
THIRD BEST SCORES ARE HIGHLIGHTED IN RED, BLUE AND GREEN, RESPECTIVELY.
| Trackers | Precision (%) | AUC (%) |
|---|---|---|
| self-SDCT (Ours) | 72.9 | 54.0 |
| UDT [17] | 65.8 | 50.7 |
| BACF [39] | 66.0 | 49.6 |
| SRDCF [35] | 69.6 | 51.6 |
| Staple [71] | 66.8 | 50.9 |
| MEEM [74] | 70.8 | 50.0 |
| SiamFC [12] | 69.4 | 50.5 |
| CFNet [16] | 60.7 | 45.6 |
| HDT [7] | 68.6 | 48.0 |
| DSST [75] | 53.4 | 40.5 |
| ARCF [66] | 70.9 | 52.5 |
| CNT [69] | 44.9 | 33.5 |
| SITUP [76] | 63.9 | 47.0 |

Fig. 9. Qualitative comparison of the self-SDCT tracker and other trackers
on some tracking sequences (from top to bottom are skiing, soccer, matrix,
skating2-1 and liquor).
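As a worked check of the comparison with the supervised trackers, the sketch below computes the absolute gain (in percentage points) of self-SDCT over SiamFC and CFNet from the Table V scores. The dictionary and the helper `gain_over` are illustrative names, not part of the original method.

```python
# (precision %, AUC %) on TColor-128, copied from Table V.
TCOLOR128 = {
    "self-SDCT": (72.9, 54.0),
    "SiamFC": (69.4, 50.5),
    "CFNet": (60.7, 45.6),
}


def gain_over(baseline, ours="self-SDCT", table=TCOLOR128):
    """Absolute (precision, AUC) gain of `ours` over `baseline`, in percentage points."""
    base_precision, base_auc = table[baseline]
    our_precision, our_auc = table[ours]
    return round(our_precision - base_precision, 1), round(our_auc - base_auc, 1)


siamfc_gain = gain_over("SiamFC")
cfnet_gain = gain_over("CFNet")
```

The gain over SiamFC is 3.5 points in both precision and AUC, which is the "more than 3%" improvement referred to in the text, read as an absolute difference.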
D. Qualitative Comparison
We give a qualitative comparison of our self-supervised
learning-based self-SDCT tracker with other state-of-the-art
tracking methods, including UDT [17], SiamFC [12], CFNet
[16], DCFNet [24], and SiamTri [68]. Fig. 9 shows the
comparison results of these trackers on some challenging video
sequences. The unsupervised learning-based UDT [17] tracker
is easily disturbed in scenes with occlusion and fast motion
(e.g., matrix and skiing). One explanation for this drawback
is that it adopts a feature extraction network trained with a
single-cycle consistency loss under an unsupervised learning
approach, meaning that it cannot model a suitable target
appearance in some complex scenes. In contrast, the proposed
self-SDCT tracker adopts a multi-cycle consistency loss to
train the feature extraction network, which can extract more
robust features. Compared with other trackers, such as SiamFC
[12] and CFNet [16], our self-SDCT tracker also achieves
competitive tracking results. Compared to other trackers, with
only a limited amount of labeled data and a large number of
self-supervised pairs to train the feature extraction network,
our self-SDCT tracker is still able to achieve competitive
tracking performance.
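The difference between checking one cycle and checking several can be illustrated with a toy example. This is not the loss used in the paper, which is defined on correlation responses earlier in the text; it is a sketch assuming a 1-D target position, per-frame displacements predicted by forward and backward tracking, and backward predictions that are shared across cycles. All names and values are hypothetical.

```python
def cycle_error(forward_moves, backward_moves):
    """Distance from the start position after tracking forward and then backward."""
    return abs(sum(forward_moves) + sum(backward_moves))


def single_cycle_error(forward_moves, backward_moves):
    """Consistency error of the one full cycle through all frames."""
    return cycle_error(forward_moves, backward_moves)


def multi_cycle_error(forward_moves, backward_moves):
    """Sum of consistency errors over cycles that turn back at each intermediate frame."""
    num_steps = len(forward_moves)
    total = 0.0
    for k in range(1, num_steps + 1):
        # Cycle k goes forward k steps, then back over the same k steps.
        # backward_moves is ordered from the last frame to the first.
        total += cycle_error(forward_moves[:k], backward_moves[num_steps - k:])
    return total


# Forward displacements for frames 1->2->3->4, and backward ones for 4->3->2->1.
# The backward errors on 4->3 and 3->2 cancel over the full cycle.
forward = [2.0, 3.0, -1.0]
backward = [1.5, -3.5, -2.0]
full_cycle = single_cycle_error(forward, backward)
all_cycles = multi_cycle_error(forward, backward)
```

In this toy case the full cycle returns exactly to the start, so a single-cycle check reports no error, while the cycle that turns back at the intermediate frame exposes the inconsistent step.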
V. CONCLUSIONS
We propose an effective multi-cycle consistency loss-based
self-supervised learning method to train a deep feature
extraction network without the need for numerous manually
labeled samples. In the proposed self-SDCT tracker, we use
forward-backward prediction under a Siamese correlation-based
tracking framework to generate pseudo-labels for the training
samples, and adopt the multi-cycle consistency loss to train
the feature extraction network. Meanwhile, we propose a low
similarity dropout strategy and a cycle trajectory consistency
loss to enhance the robustness of the feature extraction
network. The ablation studies validate the effectiveness of
each component in the proposed self-SDCT tracker. Moreover,
the Siamese correlation-based tracking architecture provides a
fast tracking speed, which allows the proposed self-SDCT
tracker to run in real time. Extensive experiments show the
effectiveness of our proposed self-SDCT tracker.
ACKNOWLEDGMENT
This study was supported by the National Natural Science
Foundation of China (Grant No. 61672183), by the Shenzhen
Research Council (Grant No. JCYJ2017041310455226946,
JCYJ20170815113552036), partially by the projects 'PCL
Future Greater-Bay Area Network Facilities for Large-scale
Experiments and Applications (PCL2018KP001)' and 'The
Verification Platform of Multi-tier Coverage Communication
Network for Oceans (PCL2018KP002)'. Di Yuan was supported
by a scholarship from China Scholarship Council. Dr
Xiaojun Chang was partially supported by Australian Research
Council (ARC) Discovery Early Career Researcher Award
(DECRA) under grant no. DE190100626.
REFERENCES
[1] H. Li, Y. Li, and F. Porikli, “DeepTrack: Learning discriminative feature
representations by convolutional neural networks for visual tracking,” in
BMVC, 2014, pp. 1–12.
[2] R. Tao, E. Gavves, and A. W. M. Smeulders, “Siamese instance search
for tracking,” in CVPR, 2016, pp. 1420–1429.
[3] Q. Liu, X. Li, Z. He, N. Fan, D. Yuan, W. Liu, and Y. Liang, “Multi-task
driven feature models for thermal infrared tracking,” in AAAI, 2020, pp.
11 604–11 611.
[4] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ECO: Efficient
convolution operators for tracking,” in CVPR, 2017, pp. 6638–6646.
[5] J. Choi, J. Kwon, and K. M. Lee, “Deep meta learning for real-time
target-aware visual tracking,” in ICCV, 2019, pp. 911–920.
[6] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, “Convolutional
features for correlation filter based visual tracking,” in ICCV Workshops,
2015, pp. 621–629.
[7] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M. H. Yang,
“Hedged deep tracking,” in CVPR, 2016, pp. 4303–4311.
[8] K. Li, Y. Kong, and Y. Fu, “Visual object tracking via multi-stream deep
similarity learning networks,” IEEE Transactions on Image Processing,
vol. 29, pp. 3311–3320, 2019.
[9] H. Nam and B. Han, “Learning multi-domain convolutional neural
networks for visual tracking,” in CVPR, 2016, pp. 4293–4302.
[10] L. Wang, W. Ouyang, X. Wang, and H. Lu, “STCT: Sequentially training
convolutional networks for visual tracking,” in CVPR, 2016, pp. 1373–
1381.
[11] Y. Song, M. Chao, X. Wu, L. Gong, and M. H. Yang, “VITAL: Visual
tracking via adversarial learning,” in CVPR, 2018, pp. 8990–8999.
[12] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr,
“Fully-convolutional Siamese networks for object tracking,” in ECCV
Workshop, 2016, pp. 850–865.
[13] D. Yuan, X. Li, Z. He, Q. Liu, and S. Lu, “Visual object tracking with
adaptive structural convolutional network,” Knowledge-Based Systems,
vol. 194, p. 105554, 2020.
[14] T. Zhang, C. Xu, and M.-H. Yang, “Multi-task correlation particle filter
for robust object tracking,” in CVPR, 2017, pp. 4819–4827.
[15] Y. Song, C. Ma, L. Gong, J. Zhang, R. W. Lau, and M.-H. Yang,
“CREST: Convolutional residual learning for visual tracking,” in ICCV,
2017, pp. 2574–2583.
[16] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr,
“End-to-end representation learning for correlation filter based tracking,”
in CVPR, 2017, pp. 2805–2813.
[17] N. Wang, Y. Song, C. Ma, W. Zhou, W. Liu, and H. Li, “Unsupervised
deep tracking,” in CVPR, 2019, pp. 1308–1317.
[18] Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,”
in CVPR, 2013, pp. 2411–2418.
[19] Y. Wu, J. Lim, and M. H. Yang, “Object tracking benchmark,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 37,
no. 9, pp. 1834–1848, 2015.
[20] D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang,
and Q. Tian, “The unmanned aerial vehicle benchmark: Object detection
and tracking,” in ECCV, 2018, pp. 370–386.
[21] P. Liang, E. Blasch, and H. Ling, “Encoding color information for visual
tracking: Algorithms and benchmark,” IEEE Transactions on Image
Processing, vol. 24, no. 12, pp. 5630–5644, 2015.
[22] M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for
UAV tracking,” in ECCV, 2016, pp. 445–461.
[23] Z. Liang and J. Shen, “Local semantic Siamese networks for fast
tracking,” IEEE Transactions on Image Processing, vol. 29, pp. 3351–
3364, 2020.
[24] Q. Wang, J. Gao, J. Xing, M. Zhang, and W. Hu, “DCFNet:
Discriminant correlation filters network for visual tracking,” CoRR, vol.
abs/1704.04057, 2017. [Online]. Available: http://arxiv.org/abs/1704.04057
[25] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang, “Learning
dynamic Siamese network for visual object tracking,” in ICCV, 2017,
pp. 1763–1771.
[26] Y. Zhang, L. Wang, J. Qi, D. Wang, M. Feng, and H. Lu, “Structured
Siamese network for real-time visual tracking,” in ECCV, 2018, pp.
351–366.
[27] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visual
tracking with Siamese region proposal network,” in CVPR, 2018, pp.
8971–8980.
[28] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu, “Distractor-aware
Siamese networks for visual object tracking,” in ECCV, 2018, pp. 101–
117.
[29] M. H. Abdelpakey and M. S. Shehata, “DP-Siam: Dynamic policy
Siamese network for robust object tracking,” IEEE Transactions on
Image Processing, vol. 29, pp. 1479–1492, 2019.
[30] H. Fan and H. Ling, “Siamese cascaded region proposal networks for
real-time visual tracking,” in CVPR, 2019, pp. 7952–7961.
[31] T. Zhang, S. Liu, C. Xu, B. Liu, and M.-H. Yang, “Correlation particle
filter for visual tracking,” IEEE Transactions on Image Processing,
vol. 27, no. 6, pp. 2676–2687, 2017.
[32] F. Liu, C. Gong, X. Huang, T. Zhou, J. Yang, and D. Tao, “Robust visual
tracking revisited: From correlation filter to template matching,” IEEE
Transactions on Image Processing, vol. 27, no. 6, pp. 2777–2790, 2018.
[33] Z. He, S. Yi, Y.-M. Cheung, X. You, and Y. Y. Tang, “Robust object
tracking via key patch sparse representation,” IEEE transactions on
cybernetics, vol. 47, no. 2, pp. 354–364, 2016.
[34] G. Ding, W. Chen, S. Zhao, J. Han, and Q. Liu, “Real-time scalable
visual tracking via quadrangle kernelized correlation filters,” IEEE
Transactions on Intelligent Transportation Systems, vol. 19, no. 1, pp.
140–150, 2017.
[35] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, “Learning spatially
regularized correlation filters for visual tracking,” in ICCV, 2015, pp.
4310–4318.
[36] D. Yuan, X. Shu, and Z. He, “TRBACF: Learning temporal regularized
correlation filters for high performance online visual object tracking,”
Journal of Visual Communication and Image Representation, vol. 72, p.
102882, 2020.
[37] B. Zhang, S. Luan, C. Chen, J. Han, W. Wang, A. Perina, and L. Shao,
“Latent constrained correlation filter,” IEEE Transactions on Image
Processing, vol. 27, no. 3, pp. 1038–1048, 2017.
[38] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed
tracking with kernelized correlation filters,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596,
2014.
[39] H. K. Galoogahi, A. Fagg, and S. Lucey, “Learning background-aware
correlation filters for visual tracking,” in ICCV, 2017, pp. 1135–1143.
[40] W. Feng, R. Han, Q. Guo, J. Zhu, and S. Wang, “Dynamic saliency-
aware regularization for correlation filter-based object tracking,” IEEE
Transactions on Image Processing, vol. 28, no. 7, pp. 3232–3245, 2019.
[41] F. Li, T. Cheng, W. Zuo, L. Zhang, and M. H. Yang, “Learning spatial-
temporal regularized correlation filters for visual tracking,” in CVPR,
2018, pp. 4904–4913.
[42] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg, “Beyond
correlation filters: Learning continuous convolution operators for visual
tracking,” in ECCV, 2016, pp. 472–488.
[43] C. Tian, Y. Xu, W. Zuo, B. Zhang, L. Fei, and C.-W. Lin, “Coarse-to-fine
CNN for image super-resolution,” IEEE Transactions on Multimedia,
2020.
[44] S. Luan, C. Chen, B. Zhang, J. Han, and J. Liu, “Gabor convolutional
networks,” IEEE Transactions on Image Processing, vol. 27, no. 9, pp.
4357–4366, 2018.
[45] C. Tian, Y. Xu, Z. Li, W. Zuo, L. Fei, and H. Liu, “Attention-guided
CNN for image denoising,” Neural Networks, vol. 124, pp. 117–129,
2020.
[46] X. Li, C. Ma, B. Wu, Z. He, and M.-H. Yang, “Target-aware deep
tracking,” in CVPR, 2019, pp. 1369–1378.
[47] Y. Wu and T. S. Huang, “Self-supervised learning for visual tracking
and recognition of human hand,” in AAAI, 2000, pp. 243–248.
[48] D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning
features by watching objects move,” in CVPR, 2017, pp. 2701–2710.
[49] C. Vondrick, A. Shrivastava, A. Fathi, S. Guadarrama, and K. Murphy,
“Tracking emerges by colorizing videos,” in ECCV, 2018, pp. 391–408.
[50] X. Wang, K. He, and A. Gupta, “Transitive invariance for self-supervised
visual representation learning,” in ICCV, 2017, pp. 1329–1338.
[51] X. Wang, A. Jabri, and A. A. Efros, “Learning correspondence from the
cycle-consistency of time,” in CVPR, 2019, pp. 2566–2576.
[52] Z. Lai, E. Lu, and W. Xie, “MAST: A memory-augmented self-
supervised tracker,” in CVPR, 2020, pp. 6479–6488.
[53] P. Huang, G. Kang, W. Liu, X. Chang, and A. G. Hauptmann, “Annota-
tion efficient cross-modal retrieval with adversarial attentive alignment,”
in ACM MM, 2019, pp. 1758–1767.
[54] C. Liu, X. Chang, and Y. Shen, “Unity style transfer for person re-
identification,” in CVPR, 2020, pp. 6887–6896.
[55] X. Chang, Y. Yu, Y. Yang, and E. P. Xing, “Semantic pooling for
complex event analysis in untrimmed videos,” IEEE Trans. Pattern Anal.
Mach. Intell., vol. 39, no. 8, pp. 1617–1632, 2017.
[56] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman,
“Temporal cycle-consistency learning,” in CVPR, 2019, pp. 1801–1810.
[57] P. Sermanet, C. Lynch, Y. Chebotar, J. Hsu, E. Jang, S. Schaal, S. Levine,
and G. Brain, “Time-contrastive networks: Self-supervised learning from
video,” in ICRA, 2018, pp. 1134–1141.
[58] X. Li, S. Liu, S. D. Mello, X. Wang, J. Kautz, and M.-H. Yang, “Joint-
task self-supervised learning for temporal correspondence,” in NIPS,
2019, pp. 317–327.
[59] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,”
IEEE Transactions on Pattern Analysis and Machine Intelligence,
vol. 34, no. 7, pp. 1409–1422, 2011.
[60] D.-Y. Lee, J.-Y. Sim, and C.-S. Kim, “Multihypothesis trajectory analysis
for robust visual tracking,” in CVPR, 2015, pp. 5088–5096.
[61] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large
scale visual recognition challenge,” International Journal of Computer
Vision, vol. 115, no. 3, pp. 211–252, 2015.
[62] M. Luo, X. Chang, Z. Li, L. Nie, A. G. Hauptmann, and Q. Zheng,
“Simple to complex cross-modal learning to rank,” Comput. Vis. Image
Underst., vol. 163, pp. 67–77, 2017.
[63] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ATOM: Accurate
tracking by overlap maximization,” in CVPR, 2019, pp. 4660–4669.
[64] E. Park and A. C. Berg, “Meta-Tracker: Fast and robust online adaptation
for visual object trackers,” in ECCV, 2018, pp. 1–17.
[65] J. Choi, H. J. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris,
and Y. C. Jin, “Context-aware deep feature compression for high-speed
visual tracking,” in CVPR, 2018, pp. 479–488.
[66] Z. Huang, C. Fu, Y. Li, F. Lin, and P. Lu, “Learning aberrance repressed
correlation filters for real-time UAV tracking,” in ICCV, 2019, pp. 2891–
2900.
[67] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and Y. C. Jin,
“Attentional correlation filter network for adaptive visual tracking,” in
CVPR, 2017, pp. 4828–4837.
[68] X. Dong and J. Shen, “Triplet loss in Siamese network for object
tracking,” in ECCV, 2018, pp. 459–474.
[69] K. Zhang, Q. Liu, Y. Wu, and M. H. Yang, “Robust visual tracking via
convolutional networks without training,” IEEE Transactions on Image
Processing, vol. 25, no. 4, pp. 1779–1792, 2016.
[70] H. Fan and H. Ling, “Parallel tracking and verifying: A framework for
real-time and high accuracy visual tracking,” in ICCV, 2017, pp. 5486–
5494.
[71] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. Torr, “Staple:
Complementary learners for real-time tracking,” in CVPR, 2016, pp.
1401–1409.
[72] Z. Hong, C. Zhe, C. Wang, M. Xue, D. Prokhorov, and D. Tao, “Multi-
store tracker (MUSTer): a cognitive psychology inspired approach to
object tracking,” in CVPR, 2015, pp. 749–758.
[73] Y. Li and J. Zhu, “A scale adaptive kernel correlation filter tracker with
feature integration,” in ECCV Workshops, 2014, pp. 254–265.
[74] J. Zhang, S. Ma, and S. Sclaroff, “MEEM: Robust tracking via multiple
experts using entropy minimization,” in ECCV, 2014, pp. 188–203.
[75] M. Danelljan, G. Hager, F. S. Khan, and M. Felsberg, “Discriminative
scale space tracking,” IEEE Transactions on Pattern Analysis and
Machine Intelligence, vol. 39, no. 8, pp. 1561–1575, 2017.
[76] H. Ma, S. T. Acton, and Z. Lin, “SITUP: Scale invariant tracking
using average peak-to-correlation energy,” IEEE Transactions on Image
Processing, vol. 29, pp. 3546–3557, 2020.
Di Yuan received the M.S. degree in Applied
Mathematics from Harbin Institute of Technology,
Shenzhen, China in 2017. He is pursuing the Ph.D.
degree in Computer Science with the Research
Institute of Biocomputing, School of Computer Science
and Technology, Harbin Institute of Technology,
Shenzhen, China. His current research interests
include object tracking, machine learning and
self-supervised learning.
Xiaojun Chang received his Ph.D. degree in Centre
for Artificial Intelligence & Faculty of Engineering
and Information Technology, University of Technol-
ogy Sydney, Sydney, in 2016. He is currently a Se-
nior Lecturer at Faculty of Information Technology,
Monash University Clayton Campus, Australia. He
is also a Distinguished Adjunct Professor with the
Faculty of Computing and Information Technology,
King Abdulaziz University. He is an ARC Discovery
Early Career Researcher Award (DECRA) Fellow
from 2019 to 2021. Before joining Monash, he was
a Postdoc Research Associate in School of Computer Science, Carnegie
Mellon University, working with Prof. Alex Hauptmann. He has spent most
of his time working on exploring multiple signals (visual, acoustic, textual)
for automatic content analysis in unconstrained or surveillance videos. He
has achieved top performance in various international competitions, such as
TRECVID MED, TRECVID SIN, and TRECVID AVS.
Po-Yao Huang is a Ph.D. student at the School
of Computer Science at Carnegie Mellon University.
His research interest is in multimodal machine
learning. He is particularly interested in bridging
computer vision and natural language processing
for the tasks of multimodal machine translation,
cross-modal search and retrieval, and large-scale
multimodal data mining and analysis.
Qiao Liu received the B.E. degree in computer
science from Guizhou Normal University, Guiyang,
China, in 2016. He is currently working toward
the Ph.D. degree with the Department of Com-
puter Science and Technology, Harbin Institute of
Technology, Shenzhen, China. His current research
interests include thermal infrared object tracking and
machine learning.
Zhenyu He (SM’12) received his Ph.D. degree from
the Department of Computer Science, Hong Kong
Baptist University, Hong Kong, in 2007. From 2007
to 2009, he worked as a postdoctoral researcher in
the department of Computer Science and Engineer-
ing, Hong Kong University of Science and Technol-
ogy. He is currently a full professor in the School
of Computer Science and Technology, Harbin Insti-
tute of Technology, Shenzhen, China. His research
interests include machine learning, computer vision,
image processing and pattern recognition.
... e DCF-based trackers [5][6][7] attempt to train filters to learn the correlation between the target and background appearance. e target is then detected in consecutive frames by convolving the trained filter via the fast Fourier transform (FFT) [5]. ...
Article
Full-text available
Visual object tracking takes an important role in realistic applications, such as video understanding, unmanned auto vehicles, and autonomous robots. Although the Siamese-based tracker has achieved good performance in tracking tasks, the existing methods using initial template or updating template with simple strategy result in the performance degradation of the model when the target varies in realistic scenarios such as target occlusion, scale variation, and deformation. In this paper, we propose a visual tracking framework with adaptive template update and spatiotemporal attention, named SiamAttnAT. Specially, we propose a historical template selecting strategy and a template adaptively generating method for robust tracking. In addition, we apply the proposed mechanisms to the employed baseline SiamRPN++. Extensive experiments and comparisons with state-of-the-art trackers on short-term and long-term visual tracking benchmarks including VOT2018, OTB-100, UAV123, NFS, and LaSOT show that the proposed framework achieves the outstanding performance with a considerable real-time speed, verifying its efficiency and effectiveness.
Article
Recent advances in self-supervised learning (SSL) have made remarkable progress, especially for contrastive methods that target pulling two augmented views of one image together and pushing the views of all other images away. In this setting, negative pairs play a key role in avoiding collapsed representation. Recent studies, such as those on bootstrap your own latent (BYOL) and SimSiam, have surprisingly achieved a comparable performance even without contrasting negative samples. However, a basic theoretical issue for SSL arises: how can different SSL methods avoid collapsed representation, and is there a common design principle? In this study, we look deep into current non-contrastive SSL methods and analyze the key factors that avoid collapses. To achieve this goal, we present a new indicator of uniformity metric and study the local dynamics of the indicator to diagnose collapses in different scenarios. Moreover, we present some principles for choosing a good predictor, such that we can explicitly control the optimization process. Our theoretical analysis result is validated on some widely used benchmarks spanning different-scale datasets. We also compare recent SSL methods and analyze their commonalities in avoiding collapses and some ideas for future algorithm designs.
Article
The emergent studies and success on contrastive self-supervised learning have been well verified in the pretext task of instance discrimination, which learns visual representations by maximizing agreement between different augmented views of the same image sample (positive pairs). However, randomly cropping on original images may cause that the augmented view contains interference from a large proportion of the backgrounds, referred to as noisy data. Aiming to optimize the data augmentation and improve positive pairs, a Saliency-Augmented Module is proposed to obtain the augmented views, which only contain the ”latent” object area, referred to as clean data. Furthermore, a Saliency-Guided Self-Supervised Learning Network (SiSL-Net) is constructed as a new pattern of contrastive learning. A symmetric structure of trunk net and branch net is trained to learn a feature mapping from the clean data space and the noisy data space. Besides, a novel loss function is designed, including the embedding contrastive loss and distribution consistency loss, to optimize the feature representations during network training. The linear classification performance of our SiSL-Net is evaluated on the miniImageNet dataset with ResNet-50. Experiments show that our method achieves the top-1 accuracy from 64.67% to 69.02%, outperforming the state-of-the-art performance.
Article
Visual object tracking is an extremely challenging task. Many existing trackers cannot handle various challenges simultaneously. In this paper, we propose a novel tracking framework based on an occlusion recognition mechanism to improve the performance in occlusion situations. Firstly, we design an occlusion recognition mechanism based on patch pool and local correlation to describe the occlusion of objects in each frame of an image sequence. Secondly, taking advantage of the occlusion recognition mechanism, we construct a specific training set to train the filter. Thirdly, combining global correlation, we implement our own tracker based on the traditional discriminative correlation filters. Finally, we evaluate it on both OTB and VOT platforms, and the experimental results demonstrate that our design is advanced and effective.
Article
Vehicle tracking in satellite videos poses a challenge for existing object tracking algorithms because of sparse features, object occlusion, and the appearance of similar objects. To improve tracking performance, a historical-model-based tracker intended for satellite videos is proposed in this study. It updates the tracker using the historical model of each frame in the video, which contains plenty of object and background information, so as to improve tracking ability on few-feature objects. Furthermore, a historical model evaluation scheme is designed to obtain reliable historical models, which ensures that the tracker is sensitive to the object in the current frame, thus avoiding the impact of changes in object appearance and background. In addition, to solve the tracker drift caused by object occlusion and the appearance of similar objects, an antidrift tracker correction scheme is proposed. In comparative experiments conducted on the satellite video dataset SatSOT, our tracker achieves excellent performance. Moreover, sensitivity analysis, comparative experiments with varying criteria, and ablation experiments demonstrate that the proposed schemes are effective in improving the precision and success rate of the tracker.
Article
Deep cross-modal hashing enables flexible and efficient large-scale cross-modal retrieval. Existing cross-modal retrieval methods based on deep hashing aim to learn a unified hashing representation for different modalities under the supervision of pair-wise correlation, and then encode out-of-sample data via modality-specific hashing networks. However, the semantic gap and distribution shift have not been sufficiently considered, so the hash codes cannot be unified as expected across modalities. At the same time, hashing remains a discrete problem that has not been solved well in deep neural networks. Therefore, we propose the Discrete Fusion Adversarial Hashing (DFAH) network for cross-modal retrieval to address these issues. In DFAH, the Modality-Specific Feature Extractor is designed to capture image and text features with pair-wise supervision. In particular, the Fusion Learner is proposed to learn the unified hash code, which enhances the correlation of heterogeneous modalities via the embedding strategy. Meanwhile, the Modality Discriminator is designed to adapt to the distribution shift, cooperating with the Modality-Specific Feature Extractor in an adversarial way. In addition, we design an efficient discrete optimization strategy to avoid the quantization errors caused by relaxation in the deep neural framework. Finally, experimental results and analysis on several popular datasets show that DFAH outperforms state-of-the-art methods for cross-modal retrieval.
Article
Deep learning-based fully supervised visual trackers require large-scale, frame-wise annotation, which is a laborious and tedious process. To reduce the labeling effort, this work proposes a self-supervised learning framework, ETC, that exploits temporal coherence as a self-supervised signal and uses a visual transformer to capture the relationships among unlabeled video frames. We design a cycle-consistent transformer architecture to cast self-supervised tracking as a cycle prediction problem. With carefully designed and targeted configurations for the cycle-consistent transformer, including temporal sampling strategies, tracking initialization, and data augmentation, our approach is applicable to two tracking settings, i.e., the unlabeled sample (ULS) scene and the few labeled sample (FLS) scene. To learn richer and more discriminative representations, we utilize not only the inter-frame correspondence but also the intra-frame correspondence to effectively model the target-to-frame and long-range correspondence. Extensive experiments are conducted on the popular benchmark datasets OTB2015, VOT2018, UAV123, TColor-128, NFS, and LaSOT, and the results show that our approach achieves competitive results in the ULS setting and offers a trade-off between performance and annotation cost in the FLS setting.
Article
Correlation filter-based trackers (CFTs) have recently shown remarkable performance in the field of visual object tracking. The advantage of these trackers originates from their ability to convert time-domain calculations into frequency-domain calculations. However, a significant problem with these CFTs is that the model is insufficiently robust when the tracking scenarios are too complicated, so the ideal tracking performance cannot be achieved. Recent work has attempted to resolve this problem by reducing the boundary effects through effective modeling of the foreground and background of the target (e.g., CFLB, BACF, and CACF). Although these methods have demonstrated reasonable performance, they are often affected by occlusion, deformation, scale variation, and other challenging scenes. In this study, considering the relationship between the current frame and the previous frame of a moving target in a time series, we propose a temporal regularization strategy to improve the BACF tracker (denoted as TRBACF), a typical representative of the aforementioned trackers. The TRBACF tracker can efficiently adjust the model to adapt to changes in the tracking scene, thereby enhancing its robustness and accuracy. Moreover, the objective function of our TRBACF tracker can be solved by an improved alternating direction method of multipliers, which can speed up the calculation in the Fourier domain. Extensive experimental results demonstrate that the proposed TRBACF tracker achieves competitive tracking performance compared with state-of-the-art trackers.
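The frequency-domain advantage mentioned in this abstract rests on the correlation theorem: circular cross-correlation in the signal domain equals an element-wise product in the Fourier domain. The following is a minimal 1-D sketch of that identity only, not the TRBACF or BACF implementation. All function names are hypothetical, and a naive O(n^2) DFT stands in for an FFT so that the example needs nothing beyond the Python standard library.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustrative stand-in for an FFT)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circular_correlation_direct(template, signal):
    """Direct evaluation: response[s] = sum_t template[t] * signal[(t + s) % n]."""
    n = len(signal)
    return [sum(template[t] * signal[(t + s) % n] for t in range(n)) for s in range(n)]

def circular_correlation_fourier(template, signal):
    """Same response via the correlation theorem:
    IDFT(conj(DFT(template)) * DFT(signal)), valid for a real-valued template."""
    T = dft(template)
    S = dft(signal)
    product = [a.conjugate() * b for a, b in zip(T, S)]
    return [v.real for v in dft(product, inverse=True)]

# Toy data (assumed for illustration): the signal is the template shifted by 3 samples.
template = [1.0, 2.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
signal = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 0.0, 0.0]

direct = circular_correlation_direct(template, signal)
fourier = circular_correlation_fourier(template, signal)

# The location of the response peak is the estimated displacement of the target.
peak = max(range(len(fourier)), key=lambda s: fourier[s])
print("estimated shift:", peak)
```

With an actual FFT the Fourier route costs O(n log n) instead of O(n^2) for the direct sum, which is the source of the speed that correlation filter trackers rely on.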
Conference Paper
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models, learned on RGB images, are neither effective in representing TIR objects nor able to take fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects at the inter-class and intra-class levels, respectively. The two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to train the network and adapt the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against state-of-the-art methods. Codes and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Article
Convolutional Neural Networks (CNNs) have been demonstrated to achieve state-of-the-art performance in the visual object tracking task. However, existing CNN-based trackers usually use holistic target samples to train their networks. Once the target undergoes complicated situations (e.g., occlusion, background clutter, and deformation), the tracking performance degrades badly. In this paper, we propose an adaptive structural convolutional filter model to enhance the robustness of deep regression trackers (named: ASCT). Specifically, we first design a mask set to generate local filters that capture local structures of the target. Meanwhile, we adopt an adaptive weighting fusion strategy for these local filters to adapt to changes in the target appearance, which effectively enhances the robustness of the tracker. In addition, we develop an end-to-end trainable network comprising feature extraction, decision making, and model updating modules for effective training. Extensive experimental results on large benchmark datasets demonstrate that the proposed ASCT tracker performs favorably against state-of-the-art trackers.