Multi-Task Driven Feature Models for Thermal Infrared Tracking
Qiao Liu,1 Xin Li,1 Zhenyu He,1,3 Nana Fan,1 Di Yuan,1 Wei Liu,2,3 Yongsheng Liang1
1Harbin Institute of Technology, Shenzhen
2Shenzhen Institute of Information Technology
3Peng Cheng Laboratory
liuqiao@stu.hit.edu.cn, {xinlihitsz, nanafanhit, dyuanhit}@gmail.com
{zhenyuhe, liangyongsheng}@hit.edu.cn, liuwei@sziit.edu.cn
Qiao Liu and Xin Li contribute equally. Zhenyu He is the corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor able to take fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects at the inter-class and intra-class levels, respectively. The two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to train the network for adapting the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against the state-of-the-art methods. Codes and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Introduction
TIR object tracking is an important task in artificial intelligence. It has been widely used in maritime rescue, video surveillance, and driver assistance at night (Gade and Moeslund 2014), as it can track objects in total darkness. Despite much progress, TIR tracking still faces several challenging problems, such as distractors, occlusion, size change, and thermal crossover (Liu et al. 2019a).
Inspired by the success of Convolutional Neural Networks (CNNs) in visual tracking, there have been several attempts to use CNNs to improve the performance of TIR trackers. These methods can be roughly divided into two categories: deep feature based TIR trackers and matching-based deep TIR trackers. Deep feature based TIR trackers, e.g., DSST-tir (Gundogdu et al. 2016), MCFTS (Liu et al. 2017), and LMSCO (Gao et al. 2018), use a pre-trained classification network for extracting deep features and then integrate them into conventional trackers. Despite the demonstrated success, their performance is limited by the pre-trained deep features, which are learned from RGB images and are less effective in representing TIR objects. Matching-based deep TIR tracking methods, e.g., HSSNet (Li et al. 2019a) and MLSSNet (Liu et al. 2019b), cast tracking as a matching problem and train a matching network off-line for online tracking. These methods have received much attention recently because of their high efficiency and simplicity. However, they are also limited by the weak discriminative capacity of the learned features, for the following reasons. First, they do not learn how to separate samples belonging to different classes; that is, the learned features are sensitive to all semantic objects. Second, their features are insensitive to similar objects, as they are usually learned in a global semantic feature space without fine-grained information. Note that fine-grained information is crucial for distinguishing TIR objects, as similar semantic patterns are generated by intra-class TIR objects. Third, their features are often learned from RGB or small TIR datasets, and thus do not capture the specific patterns of TIR objects.
To address the above-mentioned issues, we propose to learn TIR-specific discriminative features and fine-grained correlation features. Specifically, we use a classification network, which targets distinguishing TIR objects of different classes, to guide the generation of the TIR-specific discriminative features. In addition, we design a fine-grained aware network, which consists of a holistic correlation module and a pixel-level correlation module, to obtain the fine-grained correlation features. When the TIR-specific discriminative features are unable to distinguish similar distractors, the fine-grained correlation features provide more detailed information for distinguishing them.
To integrate these two complementary features effectively, we design a multi-task matching framework for learning them simultaneously. To better adapt the feature model to the TIR domain, we construct a large-scale TIR image sequence dataset to train the proposed network. The dataset includes 30 classes, over 1,100 image sequences, over 450,000 frames, and over 530,000 annotated bounding boxes. As far as we know, this is the largest TIR dataset to date. Extensive experimental results on the VOT-TIR2015 (Felsberg et al. 2015), VOT-TIR2017 (Kristan et al. 2017), and PTB-TIR (Liu et al. 2019a) benchmarks show that the proposed method performs favorably against the state-of-the-art methods.
In this work, we make the following contributions:
• We propose a feature model comprising TIR-specific discriminative features and fine-grained correlation features for TIR object representation. We develop a classification network and a fine-grained aware network to generate the TIR-specific discriminative features and fine-grained correlation features, respectively. Furthermore, we design a multi-task matching framework for integrating these two features effectively.
• We construct a large-scale TIR video dataset with annotations. The dataset can be easily used in TIR-based applications, and we believe it will contribute to the development of the TIR vision field.
• We explore how to better use grayscale and TIR training datasets for improving a TIR tracking framework and test several strategies.
• We conduct extensive experiments on three benchmarks and demonstrate that the proposed algorithm achieves favorable performance against the state-of-the-art methods.
Related Work
Deep feature based TIR trackers. Existing deep TIR trackers usually use pre-trained features for representation and combine them with conventional frameworks for tracking. DSST-tir (Gundogdu et al. 2016) investigates classification-based deep features with Correlation Filters (CFs) for TIR tracking and shows that deep features achieve better performance than hand-crafted features. MCFTS (Liu et al. 2017) combines features from different layers of VGGNet (Simonyan and Zisserman 2014) to construct an ensemble TIR tracker. LMSCO (Gao et al. 2018) uses deep appearance and motion features in a structured support vector machine for TIR tracking. ECO-tir (Zhang et al. 2019) trains a Siamese network on a large amount of synthetic TIR images to extract deep features and then combines them with ECO (Danelljan et al. 2017) for tracking. Different from these methods, we propose to learn TIR-specific discriminative features and fine-grained correlation features for representing TIR objects more effectively.
Matching-based deep trackers. A key issue for matching-based deep trackers is how to enhance their discriminative ability. Several methods address this problem from different aspects. DSiam (Guo et al. 2017) online updates the Siamese network via two linear regression models to adapt to variations of the object. CFNet (Valmadre et al. 2017) updates the target template by incorporating a CF module into the network. SA-Siam (He et al. 2018) learns a twofold matching network by introducing complementary semantic features, while FlowTrack (Zhu et al. 2018) combines optical flow features for matching. SiamFC-tri (Dong and Shen 2018) learns more discriminative deep features by formulating the triplet relationship with a triplet loss. StructSiam (Zhang et al. 2018) learns fine-grained features for matching using a local structure detector and a context relation model. RASNet (Wang et al. 2018a) introduces three kinds of attention mechanisms to adapt the model for online matching. TADT (Li et al. 2019b) online selects target-aware features using two auxiliary tasks for compact matching. DWSiam (Zhipeng et al. 2019) uses a deeper and wider backbone network in a Siamese framework to obtain more accurate tracking results. Different from these methods, we use multiple complementary tasks to learn more powerful TIR features for representing TIR objects. The proposed multi-task matching network distinguishes TIR objects based on both inter-class and intra-class differences.
Multi-task learning. When different tasks are sufficiently related, multi-task learning can achieve better generalization and benefit all of the tasks. This has been demonstrated in several applications, including person re-identification, image retrieval, and object tracking. MTDnet (Chen et al. 2017) simultaneously takes a binary classification task and a ranking task into account to boost the performance of person re-identification. MSP-CNN (Shen et al. 2017) uses three kinds of task constraints to learn more discriminative features in a Siamese framework for person re-identification. Cp-mtML (Bhattarai et al. 2016) simultaneously learns face identity, age recognition, and expression recognition on heterogeneous datasets for face retrieval. SiamRPN (Li et al. 2018a) exploits a classification task and a regression task on a Siamese network to boost the accuracy and efficiency of object tracking. EDCF (Wang et al. 2018b) jointly trains a low-level fine-grained matching task and a high-level semantic matching task in a Siamese framework for object tracking. Different from the above methods, we jointly train a classification task, a discriminative matching task, and a fine-grained matching task for robust TIR tracking.
TIR dataset. A TIR training dataset is crucial for training a deep TIR tracker. Most deep TIR trackers only use RGB datasets to train the model, since there is no suitable large-scale TIR dataset. This hinders the development of CNN-based TIR tracking. To this end, several methods attempt to use TIR data to train a network for tracking. DSST-tir (Gundogdu et al. 2016) uses a small TIR dataset to train a classification network for feature extraction and then combines it with the DSST tracker for TIR tracking. ECO-tir (Zhang et al. 2019) explores a Generative Adversarial Network (GAN) to generate synthetic TIR images and then uses them to train a Siamese network for feature extraction. The model trained on these synthetic TIR images achieves favorable results. MLSSNet (Liu et al. 2019b) trains a multi-level similarity based Siamese network on RGB and TIR datasets simultaneously. Despite the promising performance they have achieved, the TIR datasets used are not large enough, which hinders further improvement. In this paper, we construct a larger TIR dataset to train the proposed network for adapting the model to the TIR domain.
Figure 1: Architecture of the proposed Multi-task Matching Network (MMNet). It comprises a shared feature extraction network, a classification branch, a discriminative matching branch, and a fine-grained matching branch. In this figure, every box denotes a network layer or a subnetwork. Conv, GAP, CF, Corr, and FANet denote the convolution, global average pooling, correlation filter, cross-correlation, and fine-grained aware network (see Fig. 2), respectively.
Multi-Task Matching Network
In this section, we show how to learn TIR-specific features and integrate them into a multi-task matching network for TIR tracking. First, we present the overall multi-task matching network and introduce the TIR-specific discriminative feature module and the fine-grained correlation feature module. Then, we introduce the constructed TIR dataset and analyze three multi-domain aggregation learning strategies. Finally, we describe the flow of the tracking algorithm using the proposed model.
Multi-task architecture
We propose a multi-task matching network to integrate the TIR-specific discriminative features and the fine-grained correlation features for TIR tracking. The network consists of a shared feature extraction network, a discriminative matching branch, a classification branch, and a fine-grained matching branch, as shown in Fig. 1. Different from existing trackers that use features pre-trained on visible images, the proposed multi-task network uses both TIR-specific discriminative features and fine-grained correlation features for TIR object localization under a matching framework. In the following, we present the details of each component.
Discriminative matching. Considering tracking efficiency, we use a general matching architecture, the same as that of CFNet (Valmadre et al. 2017), to perform tracking. As deeper convolution layers contain more discriminative features, we construct the discriminative matching module on top of the last convolution layer of the shared feature extraction network. Given a target example Z and a search image Y, the discriminative similarity f_{dis}(Z, Y) can be formulated as:

f_{dis}(Z, Y) = g(\sigma(\phi_{conv5}(Z)), \phi_{conv5}(Y)),   (1)

where \phi_{conv5}(\cdot) extracts features using the last convolutional layer of the shared feature extraction network, g(\cdot, \cdot) denotes the cross-correlation operator, and \sigma(\cdot) is the CF block, which improves the discriminative capacity by online updating the target template. We adopt a logistic loss to train this branch:

L_{dis}(y, o) = \frac{1}{|D|} \sum_{u \in D} \log(1 + \exp(-y[u]\, o[u])),   (2)

where D \in \mathbb{R}^{M \times M} is the similarity map generated by Eq. 1, o[u] denotes the real-valued score of a single target-candidate pair, and y[u] is the ground truth of this pair.
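For concreteness, the following is a minimal sketch of the discriminative matching step and its logistic loss. The paper's implementation uses MatConvNet, so the PyTorch framework, the tensor shapes, and the function names here are illustrative assumptions; the CF block is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def cross_correlate(z_feat, y_feat):
    """g(., .) in Eq. 1: slide the template features over the search features.

    z_feat: (1, C, Hz, Wz) template features, e.g. sigma(phi_conv5(Z)).
    y_feat: (1, C, Hy, Wy) search-region features, phi_conv5(Y).
    Returns a (1, 1, M, M) similarity map.
    """
    return F.conv2d(y_feat, z_feat)  # the template acts as the correlation kernel

def logistic_loss(scores, labels):
    """Eq. 2: mean element-wise logistic loss over the similarity map.

    labels are +1 at target positions and -1 elsewhere.
    """
    return torch.log1p(torch.exp(-labels * scores)).mean()

# Toy usage with assumed feature sizes.
z = torch.randn(1, 256, 6, 6)      # template features
y = torch.randn(1, 256, 22, 22)    # search-region features
scores = cross_correlate(z, y)     # (1, 1, 17, 17)
labels = torch.where(torch.rand_like(scores) > 0.9,
                     torch.tensor(1.0), torch.tensor(-1.0))
loss = logistic_loss(scores, labels)
```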
TIR-specific discriminative features. We use a classification branch as an auxiliary task to obtain the TIR-specific discriminative features and then use them in the discriminative matching branch. The classification task, which aims to distinguish TIR objects belonging to different classes, learns features that focus on class-level differences.

In the auxiliary network, we first use a global average pooling layer instead of a fully connected layer to avoid over-fitting. Then, a 1×1 convolution layer is used to adapt the output to the number of classes in the training set. Finally, we use a cross-entropy loss to train it:

L_{cls}(y, p) = -\sum_{k=0}^{K} y_k \log p_k,   (3)

where y is the ground truth, p is the predicted label, and K denotes the total number of classes.
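A minimal sketch of this auxiliary head follows, again assuming PyTorch; the channel width and the use of 30 classes (the number of classes in the proposed dataset) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Auxiliary classification branch: GAP followed by a 1x1 convolution."""
    def __init__(self, in_channels=256, num_classes=30):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                # GAP instead of an FC layer
        self.cls_conv = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, x):                                 # x: (B, C, H, W) conv5 features
        logits = self.cls_conv(self.gap(x))               # (B, K, 1, 1)
        return logits.flatten(1)                          # (B, K)

# Toy usage: the cross-entropy loss of Eq. 3 over a batch of conv5 features.
head = ClassificationHead()
feats = torch.randn(8, 256, 6, 6)
labels = torch.randint(0, 30, (8,))
loss = nn.CrossEntropyLoss()(head(feats), labels)
```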
Fine-grained matching. Intra-class TIR objects often have similar visual patterns, as they lack color information. Coupled with the TIR-specific discriminative branch, we construct a fine-grained matching branch to distinguish intra-class TIR objects. We note that the fine-grained correlation features are helpful for distinguishing distractors. We compute the fine-grained correlation feature on a shallow convolution layer, since shallow convolution features mainly contain more detailed information. The fine-grained similarity can be formulated as:

f_{fin}(Z, Y) = g(\sigma(\omega(\phi_{conv3}(Z))), \omega(\phi_{conv3}(Y))),   (4)

where \phi_{conv3}(\cdot) extracts features using the third convolutional layer of the shared feature extraction network and \omega(\cdot) denotes the proposed fine-grained aware module. We use a logistic loss, the same as Eq. 2, to train this branch.
Fine-grained correlation features. To get the fine-grained correlation features, we design a fine-grained aware network which consists of a holistic correlation module and a pixel-level correlation module. Fig. 2 depicts the architecture. Given an input feature map X \in \mathbb{R}^{H \times W \times C}, the fine-grained aware module can be formulated as:

\omega(X) = f_c(\varphi_h(X), \varphi_p(X)),   (5)

where \varphi_h(\cdot) denotes the holistic correlation module, which models the relationships between local regions; \varphi_p(\cdot) denotes the pixel-level correlation module, which models the relationships between all feature units; and f_c(\cdot, \cdot) is a concatenation followed by a 1×1 convolutional layer, which integrates these two complementary correlations. Fig. 3 compares the TIR-specific discriminative feature and the fine-grained correlation feature using visualizations of the feature maps.

Figure 2: Architecture of the proposed Fine-grained Aware Network (FANet). It consists of a holistic correlation module and a pixel-level correlation module; the input and output are H×W×C feature maps. Its operators are broadcast element-wise multiplication, batch matrix multiplication, and broadcast element-wise addition.

Figure 3: Visualization of the TIR-specific discriminative features and fine-grained correlation features. The visualized feature maps are generated by summing all channels. From left to right, each column shows the original images, the TIR-specific discriminative features (Conv5), the low-level features (Conv3), and the learned fine-grained correlation features from Conv3, respectively. The TIR-specific discriminative feature is too coarse to achieve accurate localization, while the fine-grained correlation feature map focuses on local prominent regions, which contributes to accurate localization.
To model the relationships between local regions, we use an encoder-decoder architecture based on a self-attention mechanism. We first exploit two large convolution kernels to find discriminative local regions. Then, we use two deconvolution layers to locate them. After that, a correlation map is generated using a Sigmoid activation function. The map denotes the importance of every local region. Finally, we weight the original feature map with this correlation map so that it focuses on the locally discriminative regions. The weighted feature map is computed as:

\varphi_h(X) = X \otimes \frac{\exp(WX)}{\exp(WX) + 1},   (6)

where W denotes the transform matrix constituted by two convolution and two deconvolution layers.
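A minimal PyTorch sketch of this module follows. The exact kernel sizes, strides, and channel widths are assumptions, since the text only specifies two large convolution kernels and two deconvolution layers; note that exp(WX)/(exp(WX)+1) is exactly the Sigmoid of WX.

```python
import torch
import torch.nn as nn

class HolisticCorrelation(nn.Module):
    """Eq. 6: an encoder-decoder attention map that re-weights the input."""
    def __init__(self, channels=256):
        super().__init__()
        self.transform = nn.Sequential(   # W in Eq. 6 (layer sizes assumed)
            nn.Conv2d(channels, channels, kernel_size=5, padding=2, stride=2),
            nn.Conv2d(channels, channels, kernel_size=5, padding=2, stride=2),
            nn.ConvTranspose2d(channels, channels, kernel_size=4, padding=1, stride=2),
            nn.ConvTranspose2d(channels, channels, kernel_size=4, padding=1, stride=2),
        )

    def forward(self, x):
        # exp(WX) / (exp(WX) + 1) == sigmoid(WX)
        attention = torch.sigmoid(self.transform(x))
        return x * attention              # broadcast element-wise multiplication

x = torch.randn(1, 256, 24, 24)
out = HolisticCorrelation()(x)            # same shape as x
```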
As pixel-level context information is crucial for representing TIR objects, we exploit a pixel-level correlation module to model the relationships between all feature units, obtaining more fine-grained correlation information. The pixel-level correlation module is similar to the non-local network (Wang et al. 2018c), which captures long-range dependencies. Specifically, we first formulate the pixel-level relationships with a spatial correlation map S \in \mathbb{R}^{HW \times HW}, which is computed as:

s_{ij} = \frac{\exp\big((W_q x_i)^{\top} (W_k x_j)\big)}{\sum_{n=1}^{N} \exp\big((W_q x_i)^{\top} (W_k x_n)\big)},   (7)

where s_{ij} \in S denotes the relationship between the i-th feature unit and the j-th feature unit, W_q and W_k represent two 1×1 convolutional layers respectively, x_i is the i-th feature unit in X = \{x_i\}_{i=1}^{N}, and N = HW. Then, we apply this correlation map to the input feature map to obtain the pixel-level correlation feature, which can be formulated as:

S_p = \sum_{j=1}^{N} \sum_{i=1}^{N} s_{ij} (W_g x_j),   (8)

where W_g is a transform matrix implemented with a 1×1 convolutional operator. Finally, we perform a weighted sum of the pixel-level correlation feature map and the original low-level feature map to get the comprehensive correlation feature map using a residual-like connection:

\varphi_p(X) = X + \delta S_p,   (9)

where \delta is a scale factor that can be learned automatically.
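The following sketch implements Eqs. 7-9 as a non-local-style block in PyTorch; the embedding width of the query/key projections is an assumption not stated in the text.

```python
import torch
import torch.nn as nn

class PixelLevelCorrelation(nn.Module):
    """Eqs. 7-9: non-local-style pixel-level correlation with a residual."""
    def __init__(self, channels=256, embed=32):
        super().__init__()
        self.w_q = nn.Conv2d(channels, embed, 1)     # W_q in Eq. 7
        self.w_k = nn.Conv2d(channels, embed, 1)     # W_k in Eq. 7
        self.w_g = nn.Conv2d(channels, channels, 1)  # W_g in Eq. 8
        self.delta = nn.Parameter(torch.zeros(1))    # learnable scale in Eq. 9

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w
        q = self.w_q(x).reshape(b, -1, n)            # (B, E, N)
        k = self.w_k(x).reshape(b, -1, n)            # (B, E, N)
        # Eq. 7: softmax over j of the query-key dot products -> (B, N, N)
        s = torch.softmax(q.transpose(1, 2) @ k, dim=-1)
        g = self.w_g(x).reshape(b, c, n)             # (B, C, N)
        # Eq. 8: each output unit i aggregates sum_j s_ij * (W_g x_j)
        s_p = (g @ s.transpose(1, 2)).reshape(b, c, h, w)
        return x + self.delta * s_p                  # Eq. 9: residual connection

x = torch.randn(1, 256, 24, 24)
out = PixelLevelCorrelation()(x)                     # same shape as x
```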
TIR dataset
To better adapt the proposed model to the TIR domain, we construct a large-scale TIR dataset for training the proposed network. The dataset consists of 30 classes and over 1,100 sequences. We annotate the object in every frame of each sequence with a bounding box and class label using a semi-automatic tracking application, following the VID2015 (Russakovsky et al. 2015) style. Some examples of the annotated videos and a comparison with existing tracking datasets are shown in the supplementary material. The dataset includes more than 450,000 frames and 530,000 bounding boxes. Since most of our sequences are collected from the YouTube website, the dataset covers a wide range of shooting devices, shooting scenes, and shooting view angles, which ensures its diversity. For example, there are four kinds of shooting devices and view angles: hand-held, vehicle-mounted, surveillance (static), and drone-mounted. We store the TIR images in white-hot style with 8-bit depth.
Multi-domain aggregation
We find that grayscale image samples can provide rich detailed information, e.g., texture and structure, which is helpful to the TIR tracking task. As such, we explore using both the grayscale and TIR domains to boost TIR tracking performance. To find an effective way to combine them, we test three multi-domain aggregation learning strategies.

• Re-training. We first train the proposed network on the VID2015 (Russakovsky et al. 2015) grayscale dataset with a multi-task loss (a code sketch follows at the end of this subsection):

L = \lambda_1 L_{dis} + \lambda_2 L_{cls} + \lambda_3 L_{fin},   (10)

where L_{fin} denotes the fine-grained similarity loss, which has the same form as L_{dis}. Then, we re-train the overall network on the TIR dataset.

• Fine-tuning. We use the model trained on VID2015 to initialize the parameters of the proposed network and freeze the first three layers of the shared feature extraction network and the fine-grained matching branch to retain the detailed information. Then, we use a smaller learning rate to fine-tune the network on the TIR dataset.

• Mix-training. We first mix the VID2015 and TIR datasets together to get a new mixed dataset. Then, we freeze the classification branch and train the proposed network from scratch on the mixed dataset.

In the Ablation studies section, we report and analyze the results of each strategy.
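As referenced above, here is a minimal sketch of the multi-task objective of Eq. 10, reusing the branch losses introduced earlier; the function interfaces are illustrative assumptions, and the paper sets all weights to 1.

```python
import torch
import torch.nn as nn

def logistic_loss(scores, labels):
    """Element-wise logistic loss (Eq. 2), shared by L_dis and L_fin."""
    return torch.log1p(torch.exp(-labels * scores)).mean()

def multi_task_loss(dis_scores, dis_labels, cls_logits, cls_labels,
                    fin_scores, fin_labels, lambdas=(1.0, 1.0, 1.0)):
    """Eq. 10 with lambda1 = lambda2 = lambda3 = 1, as used in the paper."""
    l_dis = logistic_loss(dis_scores, dis_labels)            # discriminative matching
    l_cls = nn.CrossEntropyLoss()(cls_logits, cls_labels)    # auxiliary classification
    l_fin = logistic_loss(fin_scores, fin_labels)            # fine-grained matching
    return lambdas[0] * l_dis + lambdas[1] * l_cls + lambdas[2] * l_fin
```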
Tracking process
Once the multi-task matching network is learned, we prune the classification branch and use the remaining parts for online TIR tracking without updating. Fig. 1 shows the testing framework. Given a target instance Z_{t-1} at the (t-1)-th frame and a search image Y_t at the t-th frame, the prediction in the t-th frame can be computed as:

\hat{y}_{t,i} = \arg\max_{y_{t,i}} f_{dis}(Z_{t-1}, Y_t) + f_{fin}(Z_{t-1}, Y_t),   (11)

where y_{t,i} \in Y_t is the i-th candidate in the search region Y_t. We use a scale-pyramid mechanism (Bertinetto et al. 2016) to estimate the size change of the object.
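A minimal sketch of this localization step, assuming the two branches produce same-sized response maps; the response upsampling, cosine windowing, and scale pyramid commonly used by Siamese trackers are omitted here.

```python
import torch

def locate_target(f_dis, f_fin):
    """Eq. 11: sum the two response maps and take the peak position.

    f_dis, f_fin: (1, 1, M, M) response maps from the discriminative and
    fine-grained matching branches.
    """
    response = f_dis + f_fin
    m = response.shape[-1]
    flat_idx = torch.argmax(response)          # index into the flattened map
    row, col = flat_idx // m, flat_idx % m     # peak position in the M x M map
    return int(row), int(col)
```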
Experimental Results
Implementation details
We conduct the experiments using the MatConvNet (Vedaldi and Lenc 2015) toolbox on a PC with an i7 4.0 GHz CPU and a GTX-1080 GPU. The average speed is about 19 FPS. We remove all the paddings of AlexNet (Krizhevsky et al. 2012) and use it as the base feature extractor. We train the proposed network using Stochastic Gradient Descent (SGD) with a batch size of 8 and a momentum of 0.9. At the first stage, we train the network for 60 epochs on the VID2015 dataset, with the learning rate exponentially decaying from 10^{-2} to 10^{-5}. We set \lambda_1 = \lambda_2 = \lambda_3 = 1 in Eq. 10 at all training stages. At the re-training and fine-tuning stages, we train the network for 30 epochs on the constructed TIR dataset, with the learning rate exponentially decaying from 10^{-3} to 10^{-5}. In the mix-training process, we train the network for 70 epochs using the same parameters as the training on the VID2015 dataset.
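For concreteness, one way to realize the exponentially decaying schedule described above is a geometric interpolation between the start and end rates; the exact decay form used in the paper is an assumption.

```python
def lr_schedule(epoch, num_epochs=60, lr_start=1e-2, lr_end=1e-5):
    """Exponential (geometric) decay from lr_start to lr_end over num_epochs."""
    return lr_start * (lr_end / lr_start) ** (epoch / (num_epochs - 1))

lrs = [lr_schedule(e) for e in range(60)]  # lrs[0] == 1e-2, lrs[-1] ~= 1e-5
```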
Ablation studies
Datasets. The VOT-TIR2015 (Felsberg et al. 2015) and VOT-TIR2017 (Kristan et al. 2017) benchmarks are widely used for evaluating TIR trackers. These two datasets contain six kinds of challenges, such as dynamics change, camera motion, and occlusion. Each challenge has a corresponding subset which can be used to evaluate the ability of a tracker to handle that challenge. In addition to the VOT-TIR2015 and VOT-TIR2017 datasets, we also use a TIR pedestrian tracking dataset, PTB-TIR (Liu et al. 2019a), to evaluate the proposed algorithm. PTB-TIR is a recently published tracking benchmark that contains 60 sequences with 9 different challenges, such as background clutter, occlusion, out-of-view, and scale variation.
Evaluation criteria. VOT-TIR2015 and VOT-TIR2017 use Accuracy (Acc) and Robustness (Rob) (Kristan et al. 2016) to evaluate the performance of a tracker from two aspects. Accuracy is the average overlap rate between the predicted bounding box and the ground-truth bounding box. Robustness denotes the average frequency of tracking failure over the whole dataset. In addition, the Expected Average Overlap (EAO), which is computed from Acc and Rob, is often used to evaluate the overall performance of a tracker. PTB-TIR uses Precision (Pre) and Success (Suc) plots to evaluate the performance of a tracker. The precision plot measures the percentage of frames whose Center Location Error (CLE) is within a given threshold (20 pixels), while the success plot measures the percentage of frames whose Overlap Ratio (OR) is larger than a given threshold. The Areas Under the Curve (AUC) of the precision and success plots are often used to rank methods.
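The following sketch computes these criteria from per-frame boxes; the (x, y, w, h) box format and the 21-point threshold grid for the success AUC are assumptions about the benchmark toolkit.

```python
import numpy as np

def center_errors(pred, gt):
    """CLE per frame; pred, gt: (N, 4) arrays of (x, y, w, h) boxes."""
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(pc - gc, axis=1)

def overlaps(pred, gt):
    """Overlap ratio (IoU) per frame."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_at(pred, gt, threshold=20.0):
    """Precision score: fraction of frames with CLE within the threshold."""
    return (center_errors(pred, gt) <= threshold).mean()

def success_auc(pred, gt):
    """AUC of the success plot over overlap thresholds in [0, 1]."""
    ious = overlaps(pred, gt)
    return np.mean([(ious > t).mean() for t in np.linspace(0, 1, 21)])
```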
Network architecture. Table 1 shows the results of the ablation study. From the first two rows, we can see that the classification branch (Cls) improves the robustness of the tracker, with gains of more than 2% in EAO score on both benchmarks. This shows the effectiveness of the TIR-specific discriminative features. From the second to fourth rows, we can see that the fine-grained matching branch using the holistic correlation module (Fine-Hc) improves the accuracy by 7% and 3% on the two benchmarks respectively, while the fine-grained matching branch using the pixel-level correlation module (Fine-Pc) improves the accuracy by 6% and 4% respectively. The last row shows that the fine-grained matching branch using both the holistic and pixel-level correlation modules further improves the accuracy by more than 2% on both benchmarks. We attribute these gains to the fine-grained correlation features, which are effective in distinguishing similar objects, and to the complementary advantages of the holistic correlation and pixel-level correlation modules, which provide more powerful features for target localization.

Table 1: Ablation studies of the proposed model on the VOT-TIR2015 and VOT-TIR2017 benchmarks. Dis, Cls, Fine-Hc, and Fine-Pc denote the discriminative matching branch, the classification branch, the fine-grained matching with the holistic correlation module, and the fine-grained matching with the pixel-level correlation module respectively.

Dis  Cls  Fine-Hc  Fine-Pc | VOT-TIR2015 (EAO / Acc / Rob) | VOT-TIR2017 (EAO / Acc / Rob)
 ✓                         | 0.282 / 0.55 / 2.82           | 0.254 / 0.52 / 3.45
 ✓    ✓                    | 0.307 / 0.51 / 2.41           | 0.274 / 0.52 / 3.20
 ✓    ✓     ✓              | 0.322 / 0.58 / 2.14           | 0.296 / 0.55 / 2.96
 ✓    ✓              ✓     | 0.326 / 0.57 / 2.30           | 0.279 / 0.56 / 3.24
 ✓    ✓     ✓        ✓     | 0.332 / 0.60 / 2.26           | 0.320 / 0.58 / 2.91

Table 2: Comparison of the different models using two single-domain learning methods and three multi-domain aggregation learning strategies on the VOT-TIR2015 and PTB-TIR benchmarks.

Strategy      | VOT-TIR2015 (EAO / Acc / Rob) | PTB-TIR (Pre / Suc)
Only-VID      | 0.332 / 0.60 / 2.26           | 0.661 / 0.502
Only-TIR      | 0.311 / 0.55 / 2.47           | 0.694 / 0.519
Re-training   | 0.300 / 0.58 / 2.37           | 0.730 / 0.521
Fine-tuning   | 0.322 / 0.58 / 2.16           | 0.729 / 0.525
Mix-training  | 0.344 / 0.61 / 2.09           | 0.759 / 0.539
Multi-domain aggregation. Table 2 shows the results of the proposed model using different training strategies. Compared with only training on the VID2015 dataset (Only-VID), the mix-training learning strategy achieves a 1.2% EAO score gain on VOT-TIR2015 and a 3.7% success rate gain on PTB-TIR. Compared with only training on the TIR dataset (Only-TIR), the mix-training strategy also improves the EAO score by 3% on VOT-TIR2015 and the success rate by 2% on PTB-TIR. These results demonstrate that mix-training can make full use of the properties of grayscale and TIR images to obtain more powerful features for TIR tracking. Compared with Only-VID, the fine-tuning strategy achieves a 2.3% gain in success rate and a 6.8% gain in precision on PTB-TIR. It also improves the robustness on VOT-TIR2015. These results demonstrate that the fine-grained features learned from the grayscale dataset are useful for TIR tracking. Re-training on the TIR dataset does not improve the performance significantly on either dataset. This is because TIR images lack the detailed features needed for precise localization.
Comparison with the state-of-the-art
Compared trackers. We compare the proposed method with state-of-the-art trackers, including hand-crafted feature based correlation filter trackers, such as DSST (Danelljan et al. 2014), SRDCF (Danelljan et al. 2015), and Staple-TIR (Felsberg et al. 2016); deep feature based correlation filter trackers, such as HDT (Qi et al. 2016), deepMKCF (Tang and Feng 2015), CREST (Song et al. 2017), MCFTS (Liu et al. 2017), ECO-deep (Danelljan et al. 2017), and DeepSTRCF (Li et al. 2018b); matching based deep trackers, such as CFNet (Valmadre et al. 2017), Siamese-FC (Bertinetto et al. 2016), SiamRPN (Li et al. 2018a), HSSNet (Li et al. 2019a), MLSSNet (Liu et al. 2019b), and TADT (Li et al. 2019b); and other deep trackers, such as TCNN (Nam et al. 2016), MDNet-N (Felsberg et al. 2016), and VITAL (Song et al. 2018).

Figure 4: Comparison of ten trackers on the PTB-TIR benchmark (precision and success plots of OPE).

Figure 5: EAO scores of the top ten trackers on two challenges of the VOT-TIR2017 benchmark (expected overlap over sequence length for the dynamics change and camera motion challenges).
Results on PTB-TIR. Fig. 4 shows that the proposed algorithm achieves the best success rate of 0.539 and precision of 0.759 on PTB-TIR. Compared with CFNet, which uses just a single matching branch, the proposed method (MMNet-Mix) achieves a 10% relative gain in success rate. Although the proposed method (MMNet-VID) is not trained on the TIR dataset, it still improves the success rate by 6%. This demonstrates the effectiveness of the proposed network architecture and the mix-training learning strategy. Compared with the correlation filter based deep trackers, the proposed method obtains a better success rate. We attribute the good performance to the specifically designed TIR feature model and the constructed large-scale TIR dataset.

Table 3: Comparison of our tracker and the state-of-the-art methods on VOT-TIR2017 and VOT-TIR2015. The bold and underline denote the best and the second-best scores, respectively. The notation "*" denotes that the speed is reported by the authors.

Category | Tracker | VOT-TIR2017 (EAO / Acc / Rob) | VOT-TIR2015 (EAO / Acc / Rob) | Speed (FPS)
Hand-crafted feature based CF | SRDCF (Danelljan et al. 2015) | 0.197 / 0.59 / 3.84 | 0.225 / 0.62 / 3.06 | 12.3
Hand-crafted feature based CF | Staple-TIR (Felsberg et al. 2016) | 0.264 / 0.65 / 3.31 | - / - / - | 80.0*
Deep feature based CF | MCFTS (Liu et al. 2017) | 0.193 / 0.55 / 4.72 | 0.218 / 0.59 / 4.12 | 4.7
Deep feature based CF | HDT (Qi et al. 2016) | 0.196 / 0.51 / 4.93 | 0.188 / 0.53 / 5.22 | 10.6
Deep feature based CF | deepMKCF (Tang and Feng 2015) | 0.213 / 0.61 / 3.90 | - / - / - | 5.0*
Deep feature based CF | CREST (Song et al. 2017) | 0.252 / 0.59 / 3.26 | 0.258 / 0.62 / 3.11 | 0.6
Deep feature based CF | DeepSTRCF (Li et al. 2018b) | 0.262 / 0.62 / 3.32 | 0.257 / 0.63 / 2.93 | 5.5
Deep feature based CF | ECO-deep (Danelljan et al. 2017) | 0.267 / 0.61 / 2.73 | 0.286 / 0.64 / 2.36 | 16.3
Other deep tracker | MDNet-N (Felsberg et al. 2016) | 0.243 / 0.57 / 3.33 | - / - / - | 1.0*
Other deep tracker | VITAL (Song et al. 2018) | 0.272 / 0.64 / 2.68 | 0.289 / 0.63 / 2.18 | 4.7
Other deep tracker | TCNN (Nam et al. 2016) | 0.287 / 0.62 / 2.79 | - / - / - | 1.5*
Matching based deep tracker | Siamese-FC (Bertinetto et al. 2016) | 0.225 / 0.57 / 4.29 | 0.219 / 0.60 / 4.10 | 66.9
Matching based deep tracker | SiamRPN (Li et al. 2018a) | 0.242 / 0.60 / 3.19 | 0.267 / 0.63 / 2.53 | 160.0*
Matching based deep tracker | CFNet (Valmadre et al. 2017) | 0.254 / 0.52 / 3.45 | 0.282 / 0.55 / 2.82 | 37.0
Matching based deep tracker | HSSNet (Li et al. 2019a) | 0.262 / 0.58 / 3.33 | 0.311 / 0.67 / 2.53 | 10.0*
Matching based deep tracker | TADT (Li et al. 2019b) | 0.262 / 0.60 / 3.18 | 0.234 / 0.61 / 3.33 | 42.7
Matching based deep tracker | MLSSNet (Liu et al. 2019b) | 0.278 / 0.56 / 2.95 | 0.316 / 0.57 / 2.32 | 18.0
Matching based deep tracker | MMNet (Ours) | 0.320 / 0.58 / 2.91 | 0.344 / 0.61 / 2.09 | 18.9
Results on VOT-TIRs. As shown in Table 3, the proposed method (MMNet) achieves the best EAO scores of 0.320 and 0.344 on VOT-TIR2017 and VOT-TIR2015, respectively. Compared with the other matching based deep trackers, the proposed multi-task matching network learns more effective TIR features for matching. Although TADT online selects more compact and target-aware features from a pre-trained CNN for matching, the proposed method still obtains better performance on both benchmarks. Compared with the best correlation filter based deep tracker, ECO-deep, which uses classification-based pre-trained features, the proposed method obtains better robustness on VOT-TIR2015. This benefits from the learned fine-grained correlation features, which help the multi-task matching network distinguish similar distractors. Compared with the best other deep tracker, TCNN, which uses multiple CNNs to represent objects, the proposed method achieves better performance on VOT-TIR2017 while running faster. We attribute the good performance to the proposed TIR-specific feature model, which is more effective in representing TIR objects.

Fig. 5 shows that our method achieves the best EAO on the dynamics change and camera motion challenges of VOT-TIR2017. Compared with the second-best matching based tracker, CFNet, the proposed method achieves a 9.3% EAO score gain on the dynamics change challenge. This shows that the proposed TIR-specific feature model is more robust to appearance variations of the target. Furthermore, the proposed method achieves a higher EAO score than the second-best method (TCNN) by 4.2% on the camera motion challenge. More attribute-based results can be found in the supplementary material. These results demonstrate the effectiveness of the proposed algorithm.
Conclusions
In this paper, we propose to learn a TIR-specific feature model for robust TIR tracking. The feature model includes a TIR-specific discriminative feature module and a fine-grained correlation feature module. To use these two feature models simultaneously, we integrate them into a multi-task matching framework. The TIR-specific discriminative features, generated with an auxiliary multi-class classification task, are able to distinguish inter-class TIR objects. The fine-grained correlation features are obtained with a fine-grained aware network consisting of a holistic correlation module and a pixel-level correlation module. These two kinds of features complement each other and distinguish TIR objects at the inter-class and intra-class levels, respectively. In addition, we develop a large-scale TIR training dataset for adapting the model to the TIR domain, which can also be easily applied to other TIR tasks. Extensive experimental results on three benchmarks demonstrate that the proposed method performs favorably against the state-of-the-art methods.
Acknowledgment
This work is supported by the National Natural Science Foundation of China (Grant No. 61672183), by the Natural Science Foundation of Guangdong Province (Grant No. 2015A030313544), by the Shenzhen Research Council (Grant Nos. JCYJ20170413104556946 and JCYJ20170815113552036), and by the project "The Verification Platform of Multi-tier Coverage Communication Network for Oceans" (PCL2018KP002).
References
Bertinetto, L.; Valmadre, J.; Henriques, J. F.; Vedaldi, A.; and Torr, P. H. 2016. Fully-convolutional siamese networks for object tracking. In ECCV Workshops, 850-865.
Bhattarai, B.; Sharma, G.; Jurie, F.; et al. 2016. Cp-mtml: Coupled projection multi-task metric learning for large scale face retrieval. In CVPR, 4226-4235.
Chen, W.; Chen, X.; Zhang, J.; and Huang, K. 2017. A multi-task deep network for person re-identification. In AAAI, 3988-3994.
Danelljan, M.; Häger, G.; Khan, F.; and Felsberg, M. 2014. Accurate scale estimation for robust visual tracking. In BMVC.
Danelljan, M.; Hager, G.; Shahbaz Khan, F.; and Felsberg, M. 2015. Learning spatially regularized correlation filters for visual tracking. In ICCV, 4310-4318.
Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; and Felsberg, M. 2017. Eco: Efficient convolution operators for tracking. In CVPR, 6638-6646.
Dong, X., and Shen, J. 2018. Triplet loss in siamese network for object tracking. In ECCV, 459-474.
Felsberg, M.; Berg, A.; Hager, G.; Ahlberg, J.; et al. 2015. The thermal infrared visual object tracking VOT-TIR2015 challenge results. In ICCV Workshops, 76-88.
Felsberg, M.; Kristan, M.; Matas, J.; Leonardis, A.; et al. 2016. The thermal infrared visual object tracking VOT-TIR2016 challenge results. In ECCV Workshops, 824-849.
Gade, R., and Moeslund, T. B. 2014. Thermal cameras and applications: A survey. Machine Vision and Applications 25(1):245-262.
Gao, P.; Ma, Y.; Song, K.; Li, C.; Wang, F.; and Xiao, L. 2018. Large margin structured convolution operator for thermal infrared object tracking. In ICPR, 2380-2385.
Gundogdu, E.; Koc, A.; Solmaz, B.; et al. 2016. Evaluation of feature channels for correlation-filter-based visual object tracking in infrared spectrum. In CVPR Workshops, 24-32.
Guo, Q.; Feng, W.; Zhou, C.; et al. 2017. Learning dynamic siamese network for visual object tracking. In ICCV, 1763-1771.
He, A.; Luo, C.; Tian, X.; and Zeng, W. 2018. A twofold siamese network for real-time object tracking. In CVPR, 4834-4843.
Kristan, M.; Matas, J.; Leonardis, A.; et al. 2016. A novel performance evaluation methodology for single-target trackers. IEEE TPAMI 38(11):2137-2155.
Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; et al. 2017. The visual object tracking VOT2017 challenge results. In ICCV Workshops, 1949-1972.
Krizhevsky, A.; Sutskever, I.; Hinton, G. E.; et al. 2012. Imagenet classification with deep convolutional neural networks. In NeurIPS, 1097-1105.
Li, B.; Yan, J.; Wu, W.; Zhu, Z.; and Hu, X. 2018a. High performance visual tracking with siamese region proposal network. In CVPR, 8971-8980.
Li, F.; Tian, C.; Zuo, W.; Zhang, L.; and Yang, M.-H. 2018b. Learning spatial-temporal regularized correlation filters for visual tracking. In CVPR, 4904-4913.
Li, X.; Liu, Q.; Fan, N.; et al. 2019a. Hierarchical spatial-aware siamese network for thermal infrared object tracking. Knowledge-Based Systems 166:71-81.
Li, X.; Ma, C.; Wu, B.; He, Z.; and Yang, M.-H. 2019b. Target-aware deep tracking. In CVPR.
Liu, Q.; Lu, X.; He, Z.; et al. 2017. Deep convolutional neural networks for thermal infrared object tracking. Knowledge-Based Systems 134:189-198.
Liu, Q.; He, Z.; Li, X.; and Zheng, Y. 2019a. PTB-TIR: A thermal infrared pedestrian tracking benchmark. IEEE TMM.
Liu, Q.; Li, X.; He, Z.; Fan, N.; Yuan, D.; and Wang, H. 2019b. Learning deep multi-level similarity for thermal infrared object tracking. arXiv preprint arXiv:1906.03568.
Nam, H.; Baek, M.; Han, B.; et al. 2016. Modeling and propagating cnns in a tree structure for visual tracking. arXiv preprint arXiv:1608.07242.
Qi, Y.; Zhang, S.; Qin, L.; et al. 2016. Hedged deep tracking. In CVPR, 4303-4311.
Russakovsky, O.; Deng, J.; Su, H.; et al. 2015. Imagenet large scale visual recognition challenge. IJCV 115(3):211-252.
Shen, C.; Jin, Z.; Zhao, Y.; Fu, Z.; Jiang, R.; Chen, Y.; and Hua, X.-S. 2017. Deep siamese network with multi-level similarity perception for person re-identification. In MM, 1942-1950.
Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Song, Y.; Ma, C.; Gong, L.; et al. 2017. Crest: Convolutional residual learning for visual tracking. In ICCV, 2574-2583.
Song, Y.; Ma, C.; Wu, X.; Gong, L.; et al. 2018. Vital: Visual tracking via adversarial learning. In CVPR, 8990-8999.
Tang, M., and Feng, J. 2015. Multi-kernel correlation filter for visual tracking. In ICCV, 3038-3046.
Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; and Torr, P. H. 2017. End-to-end representation learning for correlation filter based tracking. In CVPR, 5000-5008.
Vedaldi, A., and Lenc, K. 2015. Matconvnet: Convolutional neural networks for matlab. In MM, 689-692.
Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; et al. 2018a. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In CVPR, 4854-4863.
Wang, Q.; Zhang, M.; Xing, J.; Gao, J.; Hu, W.; and Maybank, S. 2018b. Do not lose the details: Reinforced representation learning for high performance visual tracking. In IJCAI.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018c. Non-local neural networks. In CVPR, 7794-7803.
Zhang, Y.; Wang, L.; Qi, J.; et al. 2018. Structured siamese network for real-time visual tracking. In ECCV, 351-366.
Zhang, L.; Gonzalez-Garcia, A.; van de Weijer, J.; Danelljan, M.; and Khan, F. S. 2019. Synthetic data generation for end-to-end thermal infrared tracking. TIP 28(4):1837-1850.
Zhipeng, Z.; Houwen, P.; Qiang, W.; et al. 2019. Deeper and wider siamese networks for real-time visual tracking. In CVPR.
Zhu, Z.; Wu, W.; Zou, W.; and Yan, J. 2018. End-to-end flow correlation tracking with spatial-temporal attention. In CVPR, 548-557.