Multi-Task Driven Feature Models for Thermal Infrared Tracking
Qiao Liu,1 Xin Li,1 Zhenyu He,1,3 Nana Fan,1 Di Yuan,1 Wei Liu,2,3 Yongsheng Liang1
1Harbin Institute of Technology, Shenzhen
2Shenzhen Institute of Information Technology
3Peng Cheng Laboratory
liuqiao@stu.hit.edu.cn, {xinlihitsz, nanafanhit, dyuanhit}@gmail.com
{zhenyuhe, liangyongsheng}@hit.edu.cn, liuwei@sziit.edu.cn
Abstract
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor able to take fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing the TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing the TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects at the inter-class and intra-class levels, respectively. The two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to train the network for adapting the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against the state-of-the-art methods. Codes and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Introduction
TIR object tracking is an important task in artificial intelligence. It has been widely used in maritime rescue, video surveillance, and driver assistance at night (Gade and Moeslund 2014), as it can track objects in total darkness. Despite much progress, TIR tracking still faces several challenging problems, such as distractors, occlusion, size change, and thermal crossover (Liu et al. 2019a).
Inspired by the success of Convolutional Neural Networks (CNNs) in visual tracking, several attempts have been made to use CNNs to improve the performance of TIR trackers. These methods can be roughly divided into two categories: deep feature based TIR trackers and matching-based deep TIR trackers.

Qiao Liu and Xin Li contribute equally.
Zhenyu He is the corresponding author.
Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Deep feature based TIR trackers, e.g., DSST-
tir (Gundogdu et al. 2016), MCFTS (Liu et al. 2017), and
LMSCO (Gao et al. 2018), use a pre-trained classification
network for extracting deep features and then integrate them
into conventional trackers. Despite the demonstrated suc-
cess, their performance is limited by the pre-trained deep
features which are learned from RGB images and are less
effective in representing TIR objects. Matching-based deep
TIR tracking methods, e.g., HSSNet (Li et al. 2019a) and
MLSSNet (Liu et al. 2019b), cast tracking as a matching
problem and train a matching network off-line for online
tracking. These methods receive much attention recently be-
cause of their high efficiency and simplicity. However, they
are also limited by the weak discriminative capacity of the
learned features due to the following reasons. First, they
do not learn how to separate samples belonging to differ-
ent classes, namely, the learned features are sensitive to all
semantic objects. Second, their features are insensitive to
similar objects as they are usually learned on a global se-
mantic feature space without fine-grained information. Note that fine-grained information is crucial for distinguishing TIR objects, since intra-class TIR objects generate similar semantic patterns. Third, their features are often learned from RGB or small TIR datasets and hence do not capture the specific patterns of TIR objects.
To address the above-mentioned issues, we propose to
learn TIR-specific discriminative features and fine-grained
correlation features. Specifically, we use a classification network, which aims to distinguish TIR objects of different classes, to guide the generation of the TIR-specific discriminative features. In addition, we design a fine-grained aware network, which consists of a holistic correlation module and a pixel-level correlation module, to obtain the fine-grained correlation features. When the TIR-specific discriminative features cannot distinguish similar distractors, the fine-grained correlation features provide more detailed information for distinguishing them.
To integrate these two complementary features effectively,
we design a multi-task matching framework for learning
them simultaneously. To adapt the feature model to the
TIR domain better, we construct a large-scale TIR im-
age sequence dataset to train the proposed network. The
dataset includes 30 classes, over 1,100 image sequences,
over 450,000 frames, and over 530,000 annotated bound-
ing boxes. To the best of our knowledge, this is the largest TIR dataset to date. Extensive experimental results on the VOT-
TIR2015 (Felsberg et al. 2015), VOT-TIR2017 (Kristan et
al. 2017), and PTB-TIR (Liu et al. 2019a) benchmarks show
that the proposed method performs favorably against the
state-of-the-art methods.
In this work, we make the following contributions:
- We propose a feature model comprising TIR-specific discriminative features and fine-grained correlation features for TIR object representation. We develop a classification network and a fine-grained aware network to generate the TIR-specific discriminative features and fine-grained correlation features, respectively. Furthermore, we design a multi-task matching framework for integrating these two features effectively.
- We construct a large-scale TIR video dataset with annotations. The dataset can be easily used in TIR-based applications and we believe it will contribute to the development of the TIR vision field.
- We explore how to better use the grayscale and TIR training datasets for improving a TIR tracking framework and test several strategies.
- We conduct extensive experiments on three benchmarks and demonstrate that the proposed algorithm achieves favorable performance against the state-of-the-art methods.
Related Work
Deep feature based TIR trackers. Existing deep TIR
trackers usually use the pre-trained feature for represen-
tation and combine it with conventional frameworks for
tracking. DSST-tir (Gundogdu et al. 2016) investigates the
classification-based deep feature with Correlation Filters
(CFs) for TIR tracking and shows that the deep features
achieve better performance than the hand-crafted features.
MCFTS (Liu et al. 2017) combines the different layer fea-
tures of VGGNet (Simonyan and Zisserman 2014) to con-
struct an ensemble TIR tracker. LMSCO (Gao et al. 2018)
uses the deep appearance and motion features in a structural
support vector machine for TIR tracking. ECO-tir (Zhang
et al. 2019) trains a Siamese network on a large number of synthetic TIR images to extract deep features and then combines them with ECO (Danelljan et al. 2017) for tracking.
Different from these methods, we propose to learn the TIR-
specific discriminative feature and fine-grained correlation
feature for representing TIR objects more effectively.
Matching-based deep trackers. A key issue for matching-based deep trackers is how to strengthen their discriminative ability. Several methods address this problem from different aspects. DSiam (Guo et al. 2017) updates the Siamese network online with two linear regression models for
adapting to the variation of the object. CFNet (Valmadre et
al. 2017) updates the target template by incorporating a CF
module into the network. SA-Siam (He et al. 2018) learns
a twofold matching network by introducing complemen-
tary semantic features while FlowTrack (Zhu et al. 2018)
combines the optical flow features for matching. SiamFC-
tri (Dong and Shen 2018) learns more discriminative deep features by formulating a triplet relationship with a triplet loss. StructSiam (Zhang et al. 2018) learns the fine-
grained features for matching using a local structure de-
tector and a context relation model. RASNet (Wang et al.
2018a) introduces three kinds of attention mechanisms to
adapt the model for online matching. TADT (Li et al. 2019b)
selects target-aware features online using two auxiliary
tasks for compact matching. DWSiam (Zhipeng et al. 2019)
uses a deeper and wider backbone network on a Siamese
framework to obtain more accurate tracking results. Differ-
ent from these methods, we use multiple complementary
tasks to learn more powerful TIR features for represent-
ing TIR objects. The proposed multi-task matching network
distinguishes TIR objects based on both the inter-class and
intra-class differences.
Multi-task learning. When different tasks are sufficiently related, multi-task learning can achieve better generalization and benefit all of the tasks. This has been demonstrated in several applications, including person re-identification, image retrieval, and object tracking. MTDnet (Chen et al. 2017)
simultaneously takes a binary classification task and a rank-
ing task into account to boost the performance of person
re-identification. MSP-CNN (Shen et al. 2017) uses three
kinds of task constrains to learn more discriminative fea-
tures on a Siamese framework for person re-identification.
Cp-mtML (Bhattarai et al. 2016) simultaneously learns face
identity, age recognition, and expression recognition on het-
erogeneous datasets for face retrieval. SiamRPN (Li et al.
2018a) exploits a classification task and a regression task
on a Siamese network to boost the accuracy and efficiency
of object tracking. EDCF (Wang et al. 2018b) jointly trains
a low-level fine-grained matching and high-level semantic
matching tasks on a Siamese framework for object tracking.
Different from the above methods, we jointly train a classification task, a discriminative matching task, and a fine-grained matching task for robust TIR tracking.
TIR dataset. A TIR training dataset is crucial for training a deep TIR tracker. Most deep TIR trackers only use RGB datasets to train the model, since there is no suitable large-scale TIR dataset. This hinders the development of
CNNs-based TIR tracking. To this end, several methods
attempt to use TIR data to train a network for tracking.
DSST-tir (Gundogdu et al. 2016) uses a small TIR dataset
to train a classification network for feature extraction and
then combines it with the DSST tracker for TIR tracking.
ECO-tir (Zhang et al. 2019) explores a Generative Adver-
sarial Network (GAN) to generate synthetic TIR images and
then uses them to train a Siamese network for feature ex-
traction. The trained model using these synthetic TIR im-
ages achieves favorable results. MLSSNet (Liu et al. 2019b)
trains a multi-level similarity based Siamese network on an
RGB and TIR dataset simultaneously. Despite the promising performance these methods achieve, the TIR datasets they use are not large enough, which hinders further improvement. In this paper, we construct a larger TIR dataset
to train the proposed network for adapting the model to the
TIR domain.
Figure 1: Architecture of the proposed Multi-task Matching Network (MMNet). It comprises a shared feature extraction network, a classification branch, a discriminative matching branch, and a fine-grained matching branch. Every box denotes a network layer or a subnetwork. Conv, GAP, CF, Corr, and FANet denote convolution, global average pooling, correlation filter, cross-correlation, and the fine-grained aware network (see Fig. 2), respectively.
Multi-Task Matching Network
In this section, we show how to learn TIR-specific features
and integrate them in a multi-task matching network for TIR
tracking. First, we present the overall multi-task matching
network and introduce the TIR-specific discriminative fea-
ture module and the fine-grained correlation feature module.
Then, we introduce the constructed TIR dataset and analyze
three multi-domain aggregation learning strategies. Finally,
we give the flow of the tracking algorithm using the pro-
posed model.
Multi-task architecture
We propose a multi-task matching network to integrate the
TIR-specific discriminative features and the fine-grained
correlation features for TIR tracking. The network consists of a shared feature extraction network, a discriminative
matching branch, a classification branch, and a fine-grained
matching branch, as shown in Fig. 1. Different from existing
trackers using pre-trained features on visual images, the pro-
posed multi-task network uses both TIR-specific discrimi-
native features and fine-grained correlation features for TIR
object localization under a matching framework. In the fol-
lowing, we present the details of each component.
Discriminative matching. Considering tracking efficiency,
we use a general matching architecture which is the same as
that of CFNet (Valmadre et al. 2017) to perform tracking. As
deeper convolution layers contain more discriminative fea-
tures, we construct the discriminative matching module on
top of the last convolution layer of the shared feature extrac-
tion network. Given a target example $Z$ and a search image $Y$, the discriminative similarity $f_{dis}(Z, Y)$ can be formulated as:

$$f_{dis}(Z, Y) = g\big(\sigma(\varphi_{conv5}(Z)), \varphi_{conv5}(Y)\big), \quad (1)$$

where $\varphi_{conv5}(\cdot)$ extracts features using the last convolutional layer of the shared feature extraction network, $g(\cdot,\cdot)$ denotes the cross-correlation operator, and $\sigma(\cdot)$ is the CF block, which improves the discriminative capacity by updating the target template online. We adopt a logistic loss to train this branch:

$$L_{dis}(y, o) = \frac{1}{|D|} \sum_{u \in D} \log\big(1 + \exp(-y[u]\, o[u])\big), \quad (2)$$

where $D \in \mathbb{R}^{M \times M}$ is the similarity map generated by Eq. 1, $o[u]$ denotes the real-valued score of a single target-candidate pair, and $y[u]$ is the ground truth of this pair.
TIR-specific discriminative features. We use a classification branch as an auxiliary task to obtain the TIR-specific discriminative features and then use them in the discriminative matching branch. The classification task, which aims to distinguish TIR objects belonging to different classes, learns features that focus on class-level differences.
In the auxiliary network, we first use a global average pooling layer instead of a fully connected layer to avoid over-fitting. Then, a 1×1 convolution layer is used to match the number of classes of the training set. Finally, we use a cross-entropy loss to train it:

$$L_{cls}(y, p) = -\sum_{k=0}^{K} y_k \log p_k, \quad (3)$$

where $y$ is the ground truth, $p$ is the predicted label, and $K$ denotes the total number of classes.
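A minimal sketch of this auxiliary head (GAP followed by a 1×1 convolution, trained with cross-entropy) might look as follows; the class name and arguments are hypothetical, not taken from the paper's code.

```python
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Auxiliary classification branch: global average pooling followed by
    a 1x1 convolution whose output width matches the number of classes."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # GAP instead of a FC layer
        self.fc = nn.Conv2d(in_channels, num_classes, kernel_size=1)

    def forward(self, x):                   # x: (B, C, H, W)
        return self.fc(self.gap(x)).flatten(1)  # logits: (B, K)

# nn.CrossEntropyLoss fuses log-softmax with the negative
# log-likelihood of Eq. 3.
criterion = nn.CrossEntropyLoss()
```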
Fine-grained matching. The intra-class TIR objects of-
ten have a similar visual pattern as they do not have color
information. Coupled with the TIR-specific discriminative
branch, we construct a fine-grained matching branch to
distinguish intra-class TIR objects. We note that the fine-
grained correlation features are helpful for distinguishing
distractors. We compute the fine-grained correlation features on a shallow convolution layer, since shallow convolution features contain more detailed information. The fine-grained similarity can be formulated as:

$$f_{fin}(Z, Y) = g\big(\sigma(\omega(\varphi_{conv3}(Z))), \omega(\varphi_{conv3}(Y))\big), \quad (4)$$

where $\varphi_{conv3}(\cdot)$ extracts features using the third convolutional layer of the shared feature extraction network, and $\omega(\cdot)$ denotes the proposed fine-grained aware module. We use a logistic loss of the same form as Eq. 2 to train this branch.
Fine-grained correlation features. To get the fine-grained
correlation features, we design a fine-grained aware net-
work which consists of a holistic correlation module and a
pixel-level correlation module. Fig. 2 depicts the architec-
ture. Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, the fine-grained aware module can be formulated as:

$$\omega(X) = f_c\big(\varphi_h(X), \varphi_p(X)\big), \quad (5)$$

where $\varphi_h(\cdot)$ denotes the holistic correlation module, which models the relationships between local regions; $\varphi_p(\cdot)$ denotes the pixel-level correlation module, which models the relationships between all feature units; and $f_c(\cdot,\cdot)$ is a concatenation followed by a 1×1 convolutional layer, which integrates these two complementary correlations.
Figure 2: Architecture of the proposed Fine-grained Aware Network (FANet). It consists of (a) a holistic correlation module and (b) a pixel-level correlation module. The input and output are $H \times W \times C$ feature maps; the modules use broadcast element-wise multiplication, batch matrix multiplication, and broadcast element-wise addition.
Fig. 3 compares the TIR-specific discriminative feature and
the fine-grained correlation feature using visualizations of
the feature maps.
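Before detailing the two modules, the fusion $f_c$ of Eq. 5 can be sketched as follows; it assumes the `HolisticCorrelation` and `PixelCorrelation` classes sketched in the next two paragraphs, and all names are hypothetical rather than taken from the paper's code.

```python
import torch
import torch.nn as nn

class FANet(nn.Module):
    """Fine-grained aware network (Eq. 5): fuse the holistic and pixel-level
    correlation features by concatenation and a 1x1 convolution."""
    def __init__(self, channels):
        super().__init__()
        self.holistic = HolisticCorrelation(channels)  # phi_h, Eq. 6
        self.pixel = PixelCorrelation(channels)        # phi_p, Eqs. 7-9
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):                              # x: (B, C, H, W)
        return self.fuse(torch.cat([self.holistic(x), self.pixel(x)], dim=1))
```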
To formulate the relationships between local regions,
we use an encoder-decoder architecture based on a self-
attention mechanism. We first exploit two large convolution kernels to find discriminative local regions. Then, we use two deconvolution layers to locate them. After that, a correlation map is generated using a sigmoid activation function; the map denotes the importance of every local region. Finally, we weight the original feature map by this correlation map so that it focuses on the locally discriminative regions. The weighted feature map is computed as:

$$\varphi_h(X) = X \odot \frac{\exp(WX)}{\exp(WX) + 1}, \quad (6)$$

where $W$ denotes the transform constituted by two convolution and two deconvolution layers, and $\odot$ is broadcast element-wise multiplication.
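A possible realization of this holistic correlation module is sketched below. The paper does not specify kernel sizes or strides, so the values here are assumptions chosen so the decoder restores the input resolution (H and W divisible by 4).

```python
import torch.nn as nn

class HolisticCorrelation(nn.Module):
    """Holistic correlation module (Eq. 6): an encoder-decoder transform W
    followed by a sigmoid gives a local-region importance map that
    re-weights the input features. Assumes H, W divisible by 4."""
    def __init__(self, channels):
        super().__init__()
        self.transform = nn.Sequential(
            # two large convolution kernels find discriminative local regions
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
            # two deconvolution layers restore the spatial resolution
            nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.ConvTranspose2d(channels, channels, kernel_size=4, stride=2, padding=1),
            nn.Sigmoid(),            # exp(WX) / (exp(WX) + 1)
        )

    def forward(self, x):            # x: (B, C, H, W)
        return x * self.transform(x) # broadcast element-wise re-weighting
```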
As pixel-level context information is crucial for represent-
ing TIR objects, we exploit a pixel-level correlation mod-
ule to formulate the relationships between every feature unit
for obtaining more fine-grained correlation information. The
pixel-level correlation model is similar to the non-local net-
work (Wang et al. 2018c) which captures long-range depen-
dencies. Specifically, we first formulate the pixel-level relationships with a spatial correlation map $S \in \mathbb{R}^{HW \times HW}$, which is computed as:

$$s_{ij} = \frac{\exp\big((W_q x_i)^{\top} (W_k x_j)\big)}{\sum_{n=1}^{N} \exp\big((W_q x_i)^{\top} (W_k x_n)\big)}, \quad (7)$$
Figure 3: Visualization of the TIR-specific discriminative features and the fine-grained correlation features. The visualized feature maps are generated by summing all channels. From left to right, each column shows the original images, the TIR-specific discriminative features (Conv5), the low-level features (Conv3), and the learned fine-grained correlation features from Conv3, respectively. The TIR-specific discriminative feature is too coarse to achieve accurate localization, while the fine-grained correlation feature map focuses on local prominent regions, which contributes to accurate localization.
where $s_{ij} \in S$ denotes the relationship between the $i$-th feature unit and the $j$-th feature unit, $W_q$ and $W_k$ represent two 1×1 convolutional layers respectively, $x_i$ is the $i$-th feature unit in $X$, and $X = \{x_i\}_{i=1}^{N}$ with $N = HW$. Then, we apply this correlation map to the input feature map to obtain the pixel-level correlation feature, which can be formulated as:
$$S_p = \sum_{j=1}^{N} \sum_{i=1}^{N} s_{ij} (W_g x_j), \quad (8)$$

where $W_g$ is a transform matrix implemented with a 1×1 convolutional operator. Finally, we perform a weighted sum of the pixel-level correlation feature map and the original low-level feature map to obtain the comprehensive correlation feature map through a residual-like connection:

$$\varphi_p(X) = X + \delta S_p, \quad (9)$$

where $\delta$ is a scale factor that is learned automatically.
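Eqs. 7-9 together form a non-local-style block, which might be sketched as follows; the module and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class PixelCorrelation(nn.Module):
    """Pixel-level correlation (Eqs. 7-9): softmax over query-key
    affinities (Eq. 7), aggregation of transformed features (Eq. 8),
    and a residual connection with a learnable scale (Eq. 9)."""
    def __init__(self, channels):
        super().__init__()
        self.w_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.w_g = nn.Conv2d(channels, channels, kernel_size=1)
        self.delta = nn.Parameter(torch.zeros(1))   # learnable scale, Eq. 9

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.w_q(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        k = self.w_k(x).flatten(2)                  # (B, C, HW)
        s = torch.softmax(q @ k, dim=-1)            # (B, HW, HW), Eq. 7
        g = self.w_g(x).flatten(2).transpose(1, 2)  # (B, HW, C)
        s_p = (s @ g).transpose(1, 2).reshape(b, c, h, w)  # Eq. 8
        return x + self.delta * s_p                 # Eq. 9
```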
TIR dataset
To better adapt the proposed model to the TIR domain,
we construct a large-scale TIR dataset for training the pro-
posed network. The dataset consists of 30 classes and over
1,100 sequences. We annotate the object in every frame of each sequence with a bounding box and a class label using a semi-automatic tracking application, following the VID2015 (Russakovsky et al. 2015) style. Some examples of the annotated videos and a comparison with existing tracking datasets are shown in the supplementary material. The dataset includes more than 450,000 frames and 530,000 bounding boxes. Since most of our sequences are collected from the YouTube website, the dataset covers a wide range of shooting devices, shooting scenes, and shooting view angles, which ensures its diversity. For example, there are four kinds of shooting devices and view angles: hand-held, vehicle-mounted, surveillance (static), and drone-mounted. We store these TIR images in a white-hot style with an 8-bit depth.
Multi-domain aggregation
We find that grayscale image samples can provide rich detailed information, e.g., texture and structure, which is helpful to the TIR tracking task. As such, we explore using both the grayscale and TIR domains to boost TIR tracking performance. To find an effective way to combine them, we test three multi-domain aggregation learning strategies.
Re-training. We first train the proposed network on the VID2015 (Russakovsky et al. 2015) grayscale dataset with a multi-task loss (an illustrative code sketch of this combined objective appears at the end of this subsection):

$$L = \lambda_1 L_{dis} + \lambda_2 L_{cls} + \lambda_3 L_{fin}, \quad (10)$$

where $L_{fin}$ denotes the fine-grained similarity loss, which has the same form as $L_{dis}$. Then, we re-train the overall network on the TIR dataset.
Fine-tuning. We also use the model trained on VID2015 to initialize the parameters of the proposed network, freezing the first three layers of the shared feature extraction network and the fine-grained matching branch to retain the detailed information. Then, we use a smaller learning rate to fine-tune the network on the TIR dataset.
Mix-training. We first mix the VID2015 and TIR dataset
together and get a new mixed dataset. Then, we freeze the
classification branch and train the proposed network from
scratch on the mixed dataset.
In the Ablation studies section, we report and analyze the
results of each strategy.
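Under the assumptions of the earlier sketches (the hypothetical `logistic_loss` and `criterion` defined above), the combined objective of Eq. 10 could be computed as:

```python
def multi_task_loss(dis_response, fin_response, match_labels,
                    cls_logits, cls_labels, lambdas=(1.0, 1.0, 1.0)):
    """Eq. 10 with lambda_1 = lambda_2 = lambda_3 = 1 by default,
    as used at all training stages in the paper."""
    l_dis = logistic_loss(dis_response, match_labels)  # Eq. 2
    l_fin = logistic_loss(fin_response, match_labels)  # same form as L_dis
    l_cls = criterion(cls_logits, cls_labels)          # Eq. 3 (cross-entropy)
    return lambdas[0] * l_dis + lambdas[1] * l_cls + lambdas[2] * l_fin
```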
Tracking process
Once the multi-task matching network is learned, we prune the classification branch and use the remaining branches for online TIR tracking without updating. Fig. 1 shows the testing framework. Given a target instance $Z_{t-1}$ at the $(t-1)$-th frame and a search image $Y_t$ at the $t$-th frame, the prediction in the $t$-th frame can be computed as:

$$\hat{y}_{t,i} = \arg\max_{y_{t,i}} \; f_{dis}(Z_{t-1}, Y_t) + f_{fin}(Z_{t-1}, Y_t), \quad (11)$$

where $y_{t,i} \in Y_t$ is the $i$-th candidate in the search region $Y_t$. We use a scale-pyramid mechanism (Bertinetto et al. 2016) to estimate the size change of the object.
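A hypothetical sketch of one online tracking step implementing Eq. 11 follows; it assumes the two response maps have been resized to a common resolution, and it omits details such as the cosine window and scale penalties.

```python
import torch

def track_step(z_prev, search_crops, model):
    """One frame of Eq. 11: over a small pyramid of scaled search crops,
    sum the discriminative and fine-grained responses and pick the peak.
    model.f_dis / model.f_fin are assumed to return same-sized maps."""
    best_scale, best_score, best_pos = None, float('-inf'), None
    for s, y in enumerate(search_crops):
        response = model.f_dis(z_prev, y) + model.f_fin(z_prev, y)
        score = response.max().item()
        if score > best_score:
            # flattened index of the maximum in the response map
            best_scale, best_score, best_pos = s, score, int(torch.argmax(response))
    return best_scale, best_pos
```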
Experimental Results
Implementation details
We conduct the experiments using the MatConvNet (Vedaldi and Lenc 2015) toolbox on a PC with an i7 4.0 GHz CPU and a GTX-1080 GPU. The average speed is about 19 FPS. We remove all the paddings of AlexNet (Krizhevsky et al. 2012) and use it as the base feature extractor. We train the proposed network using Stochastic Gradient Descent (SGD) with a batch size of 8 and a momentum of 0.9. At the first stage, we train the network for 60 epochs on the VID2015 dataset with a learning rate that exponentially decays from $10^{-2}$ to $10^{-5}$. We set $\lambda_1 = \lambda_2 = \lambda_3 = 1$ in Eq. 10 at all training stages. At the re-training and fine-tuning stages, we train the network for 30 epochs on the constructed TIR dataset with a learning rate that exponentially decays from $10^{-3}$ to $10^{-5}$. In the mix-training process, we train the network for 70 epochs using the same parameters as the training on the VID2015 dataset.
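For illustration, an equivalent PyTorch-style training configuration for the first stage might look as follows (the paper's implementation is in MatConvNet; `model` is assumed to be the MMNet network defined elsewhere):

```python
import torch

# SGD with momentum 0.9; learning rate decays exponentially from
# 1e-2 to 1e-5 over the 60 first-stage epochs.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
num_epochs = 60
gamma = (1e-5 / 1e-2) ** (1.0 / (num_epochs - 1))  # per-epoch decay factor
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
# call scheduler.step() once per epoch after the training loop body
```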
Ablation studies
Datasets. The VOT-TIR2015 (Felsberg et al. 2015) and
VOT-TIR2017 (Kristan et al. 2017) benchmarks are widely
used for evaluating TIR trackers. These two datasets contain
six kinds of challenges, such as dynamics change, camera
motion, and occlusion. Each challenge has a corresponding
subset which can be used to evaluate the ability of a tracker
to handle the challenge. In addition to the VOT-TIR2015 and
VOT-TIR2017 datasets, we also use a TIR pedestrian track-
ing dataset, PTB-TIR (Liu et al. 2019a), to evaluate the pro-
posed algorithm. PTB-TIR is a recently published tracking
benchmark that contains 60 sequences with 9 different chal-
lenges, such as background clutter, occlusion, out-of-view,
and scale variation.
Evaluation criteria. VOT-TIR2015 and VOT-TIR2017 use
Accuracy (Acc) and Robustness (Rob) (Kristan et al. 2016)
to evaluate the performance of a tracker from two aspects.
Accuracy is the average overlap rate between the predicted
bounding box and the ground truth bounding box. Robust-
ness denotes the average frequency of tracking failure on
the overall dataset. In addition, Expected Average Overlap
(EAO) is often used to evaluate the overall performance of
a tracker, which is computed based on Acc and Rob. PTB-
TIR uses Precision (Pre) and Success (Suc) plots to evaluate the performance of a tracker. The precision plot measures the percentage of frames whose Center Location Error (CLE) is within a given threshold (20 pixels), while the success plot measures the percentage of frames whose Overlap Ratio (OR) is larger than a given threshold. The Area Under the Curve (AUC) of the precision and success plots is often used to rank methods.
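These two metrics can be computed directly from per-frame center location errors and overlap ratios; a minimal sketch:

```python
import numpy as np

def precision_at(center_errors, threshold=20.0):
    """Precision: fraction of frames whose Center Location Error (CLE)
    is within `threshold` pixels (20 by convention)."""
    return float(np.mean(np.asarray(center_errors) <= threshold))

def success_auc(overlap_ratios, num_thresholds=101):
    """AUC of the success plot: mean fraction of frames whose Overlap
    Ratio (OR) exceeds each threshold in [0, 1]."""
    ratios = np.asarray(overlap_ratios)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([(ratios > t).mean() for t in thresholds]))
```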
Network architecture. Table 1 shows the results of the ablation study. From the first two rows, we can see that the classi-
fication branch (Cls) improves the robustness of the tracker
with more than 2% gains of EAO score on both benchmarks.
This shows the effectiveness of the TIR-specific discrim-
inative features. From the second to fourth rows, we can
see that the fine-grained matching branch using the holis-
tic correlation module (Fine-Hc) improves the accuracy by
7% and 3% on these two benchmarks respectively, while the
fine-grained matching branch using the pixel-level correla-
tion module (Fine-Pc) improves the accuracy by 6% and 4%
on these two benchmarks respectively. The last row shows
that the fine-grained matching branch using both the holistic
and pixel-level correlation modules further improves the ac-
curacy by more than 2% on both benchmarks. We attribute
Table 1: Ablation studies of the proposed model on the VOT-TIR2015 and VOT-TIR2017 benchmarks. Dis, Cls, Fine-Hc, and Fine-Pc denote the discriminative matching branch, the classification branch, the fine-grained matching with the holistic correlation module, and the fine-grained matching with the pixel-level correlation module, respectively.

Dis | Cls | Fine-Hc | Fine-Pc | VOT-TIR2015 (EAO / Acc / Rob) | VOT-TIR2017 (EAO / Acc / Rob)
X   |     |         |         | 0.282 / 0.55 / 2.82           | 0.254 / 0.52 / 3.45
X   | X   |         |         | 0.307 / 0.51 / 2.41           | 0.274 / 0.52 / 3.20
X   | X   | X       |         | 0.322 / 0.58 / 2.14           | 0.296 / 0.55 / 2.96
X   | X   |         | X       | 0.326 / 0.57 / 2.30           | 0.279 / 0.56 / 3.24
X   | X   | X       | X       | 0.332 / 0.60 / 2.26           | 0.320 / 0.58 / 2.91
Table 2: Comparison of the different models using two single-domain learning methods and three multi-domain aggregation learning strategies on the VOT-TIR2015 and PTB-TIR benchmarks.

Strategy     | VOT-TIR2015 (EAO / Acc / Rob) | PTB-TIR (Pre / Suc)
Only-VID     | 0.332 / 0.60 / 2.26           | 0.661 / 0.502
Only-TIR     | 0.311 / 0.55 / 2.47           | 0.694 / 0.519
Re-training  | 0.300 / 0.58 / 2.37           | 0.730 / 0.521
Fine-tuning  | 0.322 / 0.58 / 2.16           | 0.729 / 0.525
Mix-training | 0.344 / 0.61 / 2.09           | 0.759 / 0.539
these gains to the fine-grained correlation features, which are effective in distinguishing similar objects, and to the complementary advantages of the holistic correlation and pixel-level correlation modules, which provide more powerful features for target localization.
Multi-domain aggregation. Table 2 shows the results of
the proposed model using different training strategies. Com-
pared with only training on the VID2015 dataset (Only-
VID), the mix-training learning strategy achieves a 1.2%
EAO score gain on VOT-TIR2015 and a 3.7% success rate
gain on PTB-TIR. Compared with only training on the TIR
dataset (Only-TIR), the mix-training strategy also improves
the EAO score by 3% on VOT-TIR2015 and the success rate
by 2% on PTB-TIR. These results demonstrate that the mix-
training can make full use of the properties of grayscale and TIR images to learn more powerful features for TIR tracking.
Compared with Only-VID, the fine-tuning strategy achieves a 2.3% gain in success rate and a 6.8% gain in precision on PTB-TIR. It also improves the robustness on VOT-TIR2015. These results demonstrate that the fine-grained features learned from the grayscale dataset are useful for TIR tracking. Re-training on the TIR dataset does not improve the performance significantly on either dataset. This is because TIR images lack the detailed features needed for precise localization.
Comparison with state-of-the-arts
Compared trackers. We compare the proposed method
with the state-of-the-art trackers including hand-crafted fea-
ture based correlation filter trackers, such as DSST (Danell-
jan et al. 2014), SRDCF (Danelljan et al. 2015), and
Staple-TIR (Felsberg et al. 2016); the deep feature based
correlation filter trackers, such as HDT (Qi et al. 2016),
Figure 4: Comparison of ten trackers on the PTB-TIR benchmark (precision and success plots of OPE).

Figure 5: EAO scores of the top ten trackers versus sequence length on two challenges (dynamics change and camera motion) of the VOT-TIR2017 benchmark.
deepMKCF (Tang and Feng 2015), CREST (Song et al.
2017), MCFTS (Liu et al. 2017), ECO-deep (Danelljan et
al. 2017), and DeepSTRCF (Li et al. 2018b); matching
based deep trackers, such as CFNet (Valmadre et al. 2017),
Siamese-FC (Bertinetto et al. 2016), SiamRPN (Li et al.
2018a), HSSNet (Li et al. 2019a), MLSSNet (Liu et al.
2019b), and TADT (Li et al. 2019b); and other deep track-
ers, such as TCNN (Nam et al. 2016), MDNet-N (Felsberg
et al. 2016) and VITAL (Song et al. 2018).
Results on PTB-TIR. Fig. 4 shows that the proposed algo-
rithm achieves the best success rate of 0.539 and precision of
0.759 on PTB-TIR. Compared with CFNet which just uses
a single matching branch, the proposed method (MMNet-
Mix) achieves a 10% relative gain of the success rate. Al-
though the proposed method (MMNet-VID) is not trained on
the TIR dataset, it also improves the success rate by 6%. This
demonstrates the effectiveness of the proposed network ar-
chitecture and the mix-training learning strategy. Compared
with the correlation filter based deep trackers, the proposed
Table 3: Comparison of our tracker and the state-of-the-art methods on VOT-TIR2017 and VOT-TIR2015. Bold and underline denote the best and the second-best scores, respectively. The notation "*" denotes that the speed is reported by the authors.

Category | Tracker | VOT-TIR2017 (EAO / Acc / Rob) | VOT-TIR2015 (EAO / Acc / Rob) | Speed (FPS)
Hand-crafted feature based CF | SRDCF (Danelljan et al. 2015) | 0.197 / 0.59 / 3.84 | 0.225 / 0.62 / 3.06 | 12.3
Hand-crafted feature based CF | Staple-TIR (Felsberg et al. 2016) | 0.264 / 0.65 / 3.31 | - / - / - | 80.0*
Deep feature based CF | MCFTS (Liu et al. 2017) | 0.193 / 0.55 / 4.72 | 0.218 / 0.59 / 4.12 | 4.7
Deep feature based CF | HDT (Qi et al. 2016) | 0.196 / 0.51 / 4.93 | 0.188 / 0.53 / 5.22 | 10.6
Deep feature based CF | deepMKCF (Tang and Feng 2015) | 0.213 / 0.61 / 3.90 | - / - / - | 5.0*
Deep feature based CF | CREST (Song et al. 2017) | 0.252 / 0.59 / 3.26 | 0.258 / 0.62 / 3.11 | 0.6
Deep feature based CF | DeepSTRCF (Li et al. 2018b) | 0.262 / 0.62 / 3.32 | 0.257 / 0.63 / 2.93 | 5.5
Deep feature based CF | ECO-deep (Danelljan et al. 2017) | 0.267 / 0.61 / 2.73 | 0.286 / 0.64 / 2.36 | 16.3
Other deep tracker | MDNet-N (Felsberg et al. 2016) | 0.243 / 0.57 / 3.33 | - / - / - | 1.0*
Other deep tracker | VITAL (Song et al. 2018) | 0.272 / 0.64 / 2.68 | 0.289 / 0.63 / 2.18 | 4.7
Other deep tracker | TCNN (Nam et al. 2016) | 0.287 / 0.62 / 2.79 | - / - / - | 1.5*
Matching based deep tracker | Siamese-FC (Bertinetto et al. 2016) | 0.225 / 0.57 / 4.29 | 0.219 / 0.60 / 4.10 | 66.9
Matching based deep tracker | SiamRPN (Li et al. 2018a) | 0.242 / 0.60 / 3.19 | 0.267 / 0.63 / 2.53 | 160.0*
Matching based deep tracker | CFNet (Valmadre et al. 2017) | 0.254 / 0.52 / 3.45 | 0.282 / 0.55 / 2.82 | 37.0
Matching based deep tracker | HSSNet (Li et al. 2019a) | 0.262 / 0.58 / 3.33 | 0.311 / 0.67 / 2.53 | 10.0*
Matching based deep tracker | TADT (Li et al. 2019b) | 0.262 / 0.60 / 3.18 | 0.234 / 0.61 / 3.33 | 42.7
Matching based deep tracker | MLSSNet (Liu et al. 2019b) | 0.278 / 0.56 / 2.95 | 0.316 / 0.57 / 2.32 | 18.0
Matching based deep tracker | MMNet (Ours) | 0.320 / 0.58 / 2.91 | 0.344 / 0.61 / 2.09 | 18.9
method obtains a better success rate. We attribute the good
performance to the specifically designed TIR feature model
and the constructed large-scale TIR dataset.
Results on VOT-TIRs. As shown in Table 3, the proposed
method (MMNet) achieves the best EAO scores of 0.320
and 0.344 on VOT-TIR2017 and VOT-TIR2015, respec-
tively. Compared with other matching based deep trackers,
the proposed multi-task matching network learns more ef-
fective TIR features for matching. Although TADT online
selects more compact and target-aware features from a pre-
trained CNN for matching, the proposed method still ob-
tains a better performance on both benchmarks. Compared
with the best correlation filter based deep tracker, ECO-
deep, which uses the classification-based pre-trained fea-
ture, the proposed method obtains better robustness on VOT-
TIR2015. This benefits from the learned fine-grained cor-
relation features which help the multi-task matching net-
work distinguish similar distractors. Compared with the best
deep tracker, TCNN, which uses multiple CNNs to repre-
sent objects, the proposed method achieves a better perfor-
mance on VOT-TIR2017 while running faster. We attribute
the good performance to the proposed TIR-specific feature model, which is more effective in representing TIR objects.
Fig. 5 shows that our method achieves the best EAO on
the dynamic change and camera motion challenges of VOT-
TIR2017. Compared with the second best matching based
tracker, CFNet, the proposed method achieves a 9.3% EAO
score gain on the dynamics change challenge. This shows
that the proposed TIR-specific feature model is more robust to appearance variations of the target. Furthermore, the
proposed method achieves a higher EAO score than the sec-
ond best method (TCNN) by 4.2% on the camera motion
challenge. Some more attribute-based results can be found
in the supplementary material. These results demonstrate
the effectiveness of the proposed algorithm.
Conclusions
In this paper, we propose to learn a TIR-specific feature
model for robust TIR tracking. The feature model includes
a TIR-specific discriminative feature module and a fine-
grained correlation feature module. To use these two feature
models simultaneously, we integrate them into a multi-task
matching framework. The TIR-specific discriminative fea-
tures, generated with an auxiliary multi-classification task,
are able to distinguish inter-class TIR objects. The fine-
grained correlation features are obtained with a fine-grained
aware network consisting of a holistic correlation module
and a pixel-level correlation module. These two kinds of fea-
tures complement each other and distinguish TIR objects in
the levels of inter-class and intra-class, respectively. In addi-
tion, we develop a large-scale TIR training dataset for adapt-
ing the model to the TIR domain; the dataset can also be easily applied to other TIR tasks. Extensive experimental results
on three benchmarks demonstrate that the proposed method
performs favorably against the state-of-the-art methods.
Acknowledgment
This work is supported by the National Natural Science Foundation of China (Grant No. 61672183), by the Natural Science Foundation of Guangdong Province (Grant No. 2015A030313544), by the Shenzhen Research Council (Grant Nos. JCYJ20170413104556946 and JCYJ20170815113552036), and by the project "The Verification Platform of Multi-tier Coverage Communication Network for Oceans (PCL2018KP002)".
References
Bertinetto, L.; Valmadre, J.; Henriques, J. F.; Vedaldi, A.; and Torr,
P. H. 2016. Fully-convolutional siamese networks for object track-
ing. In ECCV Workshops, 850–865.
Bhattarai, B.; Sharma, G.; Jurie, F.; et al. 2016. Cp-mtml: Coupled
projection multi-task metric learning for large scale face retrieval.
In CVPR, 4226–4235.
Chen, W.; Chen, X.; Zhang, J.; and Huang, K. 2017. A multi-task
deep network for person re-identification. In AAAI, 3988–3994.
Danelljan, M.; Häger, G.; Khan, F.; and Felsberg, M. 2014. Accurate scale estimation for robust visual tracking. In BMVC.
Danelljan, M.; Hager, G.; Shahbaz Khan, F.; and Felsberg, M.
2015. Learning spatially regularized correlation filters for visual
tracking. In ICCV, 4310–4318.
Danelljan, M.; Bhat, G.; Shahbaz Khan, F.; and Felsberg, M. 2017.
Eco: efficient convolution operators for tracking. In CVPR, 6638–
6646.
Dong, X., and Shen, J. 2018. Triplet loss in siamese network for
object tracking. In ECCV, 459–474.
Felsberg, M.; Berg, A.; Hager, G.; Ahlberg, J.; et al. 2015. The ther-
mal infrared visual object tracking vot-tir2015 challenge results. In
ICCV Workshops, 76–88.
Felsberg, M.; Kristan, M.; Matas, J.; Leonardis, A.; et al. 2016.
The thermal infrared visual object tracking vot-tir2016 challenge
results. In ECCV Workshops, 824–849.
Gade, R., and Moeslund, T. B. 2014. Thermal cameras and appli-
cations: a survey. Machine vision and applications 25(1):245–262.
Gao, P.; Ma, Y.; Song, K.; Li, C.; Wang, F.; and Xiao, L. 2018.
Large margin structured convolution operator for thermal infrared
object tracking. In ICPR, 2380–2385.
Gundogdu, E.; Koc, A.; Solmaz, B.; et al. 2016. Evaluation of
feature channels for correlation-filter-based visual object tracking
in infrared spectrum. In CVPR Workshops, 24–32.
Guo, Q.; Feng, W.; Zhou, C.; et al. 2017. Learning dynamic
siamese network for visual object tracking. In ICCV, 1763–1771.
He, A.; Luo, C.; Tian, X.; and Zeng, W. 2018. A twofold siamese
network for real-time object tracking. In CVPR, 4834–4843.
Kristan, M.; Matas, J.; Leonardis, A.; et al. 2016. A novel per-
formance evaluation methodology for single-target trackers. IEEE
TPAMI 38(11):2137–2155.
Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; et al. 2017. The
visual object tracking vot2017 challenge results. In ICCV Work-
shops, 1949–1972.
Krizhevsky, A.; Sutskever, I.; Hinton, G. E.; et al. 2012. Ima-
genet classification with deep convolutional neural networks. In
NeurIPS, 1097–1105.
Li, B.; Yan, J.; Wu, W.; Zhu, Z.; and Hu, X. 2018a. High perfor-
mance visual tracking with siamese region proposal network. In
CVPR, 8971–8980.
Li, F.; Tian, C.; Zuo, W.; Zhang, L.; and Yang, M.-H. 2018b. Learn-
ing spatial-temporal regularized correlation filters for visual track-
ing. In CVPR, 4904–4913.
Li, X.; Liu, Q.; Fan, N.; et al. 2019a. Hierarchical spatial-aware
siamese network for thermal infrared object tracking. Knowledge-
Based Systems 166:71–81.
Li, X.; Ma, C.; Wu, B.; He, Z.; and Yang, M.-H. 2019b. Target-
aware deep tracking. In CVPR.
Liu, Q.; Lu, X.; He, Z.; et al. 2017. Deep convolutional neural
networks for thermal infrared object tracking. Knowledge-Based
Systems 134:189–198.
Liu, Q.; He, Z.; Li, X.; and Zheng, Y. 2019a. Ptb-tir: A thermal
infrared pedestrian tracking benchmark. IEEE TMM.
Liu, Q.; Li, X.; He, Z.; Fan, N.; Yuan, D.; and Wang, H. 2019b.
Learning deep multi-level similarity for thermal infrared object
tracking. arXiv preprint arXiv:1906.03568.
Nam, H.; Baek, M.; Han, B.; et al. 2016. Modeling and propa-
gating cnns in a tree structure for visual tracking. arXiv preprint
arXiv:1608.07242.
Qi, Y.; Zhang, S.; Qin, L.; et al. 2016. Hedged deep tracking. In
CVPR, 4303–4311.
Russakovsky, O.; Deng, J.; Su, H.; et al. 2015. Imagenet large scale
visual recognition challenge. IJCV 115(3):211–252.
Shen, C.; Jin, Z.; Zhao, Y.; Fu, Z.; Jiang, R.; Chen, Y.; and Hua,
X.-S. 2017. Deep siamese network with multi-level similarity per-
ception for person re-identification. In MM, 1942–1950.
Simonyan, K., and Zisserman, A. 2014. Very deep convolu-
tional networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556.
Song, Y.; Ma, C.; Gong, L.; et al. 2017. Crest: Convolutional
residual learning for visual tracking. In ICCV, 2574–2583.
Song, Y.; Ma, C.; Wu, X.; Gong, L.; et al. 2018. Vital: Visual
tracking via adversarial learning. In CVPR, 8990–8999.
Tang, M., and Feng, J. 2015. Multi-kernel correlation filter for
visual tracking. In ICCV, 3038–3046.
Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; and Torr,
P. H. 2017. End-to-end representation learning for correlation filter
based tracking. In CVPR, 5000–5008.
Vedaldi, A., and Lenc, K. 2015. Matconvnet: Convolutional neural
networks for matlab. In MM, 689–692.
Wang, Q.; Teng, Z.; Xing, J.; Gao, J.; et al. 2018a. Learning atten-
tions: residual attentional siamese network for high performance
online visual tracking. In CVPR, 4854–4863.
Wang, Q.; Zhang, M.; Xing, J.; Gao, J.; Hu, W.; and Maybank, S.
2018b. Do not lose the details: reinforced representation learning
for high performance visual tracking. In IJCAI.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018c. Non-local
neural networks. In CVPR, 7794–7803.
Zhang, Y.; Wang, L.; Qi, J.; et al. 2018. Structured siamese network
for real-time visual tracking. In ECCV, 351–366.
Zhang, L.; Gonzalez-Garcia, A.; van de Weijer, J.; Danelljan, M.;
and Khan, F. S. 2019. Synthetic data generation for end-to-end
thermal infrared tracking. TIP 28(4):1837–1850.
Zhipeng, Z.; Houwen, P.; Qiang, W.; et al. 2019. Deeper and wider
siamese networks for real-time visual tracking. In CVPR.
Zhu, Z.; Wu, W.; Zou, W.; and Yan, J. 2018. End-to-end flow
correlation tracking with spatial-temporal attention. In CVPR, 548–
557.
... Cascading CF and DNN can achieve the robust tracking [69], [281]. ACFN adds a subset of correlation filter trackers, and designs an attention network composed of prediction and selection sub-networks, realizing the selection of trackers adaptively [281]. ...
... ACFN adds a subset of correlation filter trackers, and designs an attention network composed of prediction and selection sub-networks, realizing the selection of trackers adaptively [281]. MMNet proposes a fine-grained perception module before CF [69]. It performs a self-attention mechanism on the shallow features to obtain more fine-grained correlation information. ...
... It reduces the influence of similar object interference. In the feature extraction of object regions, attention mechanisms have been added to suppress the distractors influence, obtain finer-grained information, or emphasize the importance of different channel features [41], [69], [71], [74], [280]. The local and multiscale transformers perform well in feature information fusion, which can obtain features with sufficient spatial details [2], [15], [106], [144]. ...
Article
Full-text available
Transformer has shown excellent performance in remote sensing field with long-range modeling capabilities. Remote sensing video (RSV) moving object detection and tracking play indispensable roles in military activities as well as urban monitoring. However, transformers in these fields are still at the exploratory stage. In this survey, we comprehensively summarize the research prospects of transformers in RSV moving object detection and tracking. The core designs of remote sensing transformers and advanced transformers are first analyzed. It mainly includes the attention mechanism evolution for specific tasks, the fitting ability design of input mapping, diverse feature representation, model optimization, etc. The architectural characteristics of RSV detection and tracking are then described across two aspects. One is moving object detection for motion-based traditional background subtractions and appearance-based deep learning models. The other is object tracking for single and multi-targets. The research difficulties mainly include the blurred foreground in RSV data, the irregular objects movement in traditional background subtraction and the severe objects occlusion in object tracking. Following that, the potential significance of transformers is discussed according to some thorny problems in RSV. Finally, we summarize ten open challenges of transformers in RSV, which may be used as a reference for promoting future research.
... Under the Siamese network-based TIR tracking framework, Liu et. al. [72] developed a multi-task framework for learning specific identification features to track the target in the TIR scenario. The GFSNet [73] tracker presents an adaptive structure into the Siamese network-based TIR tracking framework and presents an effective thermal infrared tracking method. ...
... MCFTS [64], UDCT [75] and DAS [87]), ii) Siamese network-based tracking methods (e.g. MMNet [72], GFSNet [73], HSSNet [86] and SiamSAV [74]), and iii) other deep learning-based tracking methods (e.g. STAMT [41] and CMD-DiMP [82]). ...
Conference Paper
Full-text available
Thermal infrared (TIR) target tracking task is not affected by illumination changes and can be tracked at night, on rainy days, foggy days, and other extreme weather, so it is widely used in night auxiliary driving, unmanned aerial vehicle reconnaissance, video surveillance, and other scenes. Thermal infrared target tracking task still faces many challenges, such as occlusion, deformation, similarity interference, etc. To solve the challenge in the TIR target tracking scenarios, a large number of TIR target tracking methods have appeared in recent years. The purpose of this paper is to give a comprehensive review and summary of the research status of thermal infrared target tracking methods. We first introduce some basic principles and representative work of the thermal infrared target tracking methods. And then, some benchmarks for performance testing of thermal infrared target tracking methods are introduced. Subsequently, we demonstrate the tracking results of several representative tracking methods on some benchmarks. Finally, the future research direction of thermal infrared target tracking is discussed.
... Performance comparison of the proposed algorithm in scenarios with occlusions is performed with two classes of state-of-the-art trackers: discriminative correlation filters and deep Siamese networks, which have been recognized as the dominant video tracking paradigms [18]. We selected traditional discriminative correlation filters: Staple [47], KCF [19], and STRCF [48], as well as deep learning based discriminative correlation filters trained for visual object tracking: HCF [49], ECO [50], ECO-HC [50], and STRCFdeep [48], and trained for thermal object tracking: MCFTS [51], ECO-stir [52], and MMNet [53]. From the class of deep Siamese networks, trackers trained for visual object tracking were selected: SiamFC [54], DSiam [55], SiamRPN [56], SiamMASK [57], SiamCAR [58], and SiamBAN [59], as well as those trained for thermal infrared tracking: HSSNet [60] and MLSSNet [61]. ...
Article
Full-text available
Short-wave infrared (SWIR) imaging has significant advantages in challenging propagation conditions where the effectiveness of visible-light and thermal imaging is limited. Object tracking in SWIR imaging is particularly difficult due to lack of color information, but also because of occlusions and maneuvers of the tracked object. This paper proposes a new algorithm for object tracking in SWIR imaging, using a kernelized correlation filter (KCF) as a basic tracker. To overcome occlusions, the paper proposes the use of the Kalman filter as a predictor and a method to expand the object search area. Expanding the object search area helps in better re-detection of the object after occlusion, but also leads to the occasional appearance of errors in measurement data that can lead to object loss. These errors can be treated as outliers. To cope with outliers, Huber’s M-robust approach is applied, so this paper proposes robustification of the Kalman filter by introducing a nonlinear Huber’s influence function in the Kalman filter estimation step. However, robustness to outliers comes at the cost of reduced estimator efficiency. To make a balance between desired estimator efficiency and resistance to outliers, a new adaptive M-robustified Kalman filter is proposed. This is achieved by adjusting the saturation threshold of the influence function using the detection confidence information from the basic KCF tracker. Experimental results on the created dataset of SWIR video sequences indicate that the proposed algorithm achieves a better performance than state-of-the-art trackers in tracking the maneuvering object in the presence of occlusions.
... Beside being utilized for navigation, thermal sensor has been used in agriculture to monitor crops (Speth et al., 2022), infrastructure monitoring (Chokkalingham et al., 2012;Fuentes et al., 2021;Stypułkowski et al., 2021;Wu et al., 2018), objects detection and tracking (Leira et al., 2021;Liu, Li, et al., 2020;Liu et al., 2017Liu et al., , 2022. ...
Article
Full-text available
The study explores the feasibility of optical flow-based neural network from real-world thermal aerial imagery. While traditional optical flow techniques have shown adequate performance, sparse techniques do not work well during cold-soaked low-contrast conditions, and dense algorithms are more accurate in low-contrast conditions but suffer from the aperture problem in some scenes. On the other hand, optical flow from convolutional neural networks has demonstrated good performance with strong generalization from several synthetic public data set benchmarks. Ground truth was generated from real-world thermal data estimated with traditional dense optical flow techniques. The state-of-the-art Recurrent All-Pairs Field Transform for the Optical Flow model was trained with both color synthetic data and the captured real-world thermal data across various thermal contrast conditions. The results showed strong performance of the deep-learning network against established sparse and dense optical flow techniques in various environments and weather conditions, at the cost of higher computational demand. K E Y W O R D S deep learning, LWIR, navigation, optical flow, thermal imaging, UAVs
... In order to assess the tracking generalization of STFTrack to general infrared targets, we conduct a comparison study with nine other tracking methods, including GlobalTrack, ECO-deep-TIR, ECO_TIR, MDNet [39], ATOM, SiamRPN++, SiamMask [40], Siamese-FC-TIR, and MMNet [41], on the LSOTB-TIR testset. The tracking performance comparison of each tracker is illustrated in Figure 10. ...
Article
Full-text available
The rapid popularity of UAVs has encouraged the development of Anti-UAV technology. Infrared-detector-based visual tracking for UAVs provides an encouraging solution for Anti-UAVs. However, it still faces the problem of tracking instability caused by environmental thermal crossover and similar distractors. To address these issues, we propose a spatio-temporal-focused Siamese network for infrared UAV tracking, called STFTrack. This method employs a two-level target focusing strategy from global to local. First, a feature pyramid-based Siamese backbone is constructed to enhance the feature expression of infrared UAVs through cross-scale feature fusion. By combining template and motion features, we guide prior anchor boxes towards the suspicious region to enable adaptive search region selection, thus effectively suppressing background interference and generating high-quality candidates. Furthermore, we propose an instance-discriminative RCNN based on metric learning to focus on the target UAV among candidates. By measuring calculating the feature distance between the candidates and the template, it assists in discriminating the optimal target from the candidates, thus improving the discrimination of the proposed method to infrared UAV. Extensive experiments on the Anti-UAV dataset demonstrate that the proposed method achieves outstanding performance for infrared tracking, with 91.2% precision, 66.6% success rate, and 67.7% average overlap accuracy, and it exceeded the baseline algorithm by 2.3%, 2.7%, and 3.5%, respectively. The attribute-based evaluation demonstrates that the proposed method achieves robust tracking effects on challenging scenes such as fast motion, thermal crossover, and similar distractors. Evaluation on the LSOTB-TIR dataset shows that the proposed method reaches a precision of 77.2% and a success rate of 63.4%, outperforming other advanced trackers.
Article
Full-text available
In the field of drone-based object tracking, utilization of the infrared modality can improve the robustness of the tracker in scenes with severe illumination change and occlusions and expand the applicable scene of the drone object tracking task. Inspired by the great achievements of Transformer structure in the field of RGB object tracking, we design a dual-modality object tracking network based on Transformer. To better address the problem of visible-infrared information fusion, we propose a Dual-Feature Aggregation Network that utilizes attention mechanisms in both spatial and channel dimensions to aggregate heterogeneous modality feature information. The proposed algorithm has achieved better performance by comparing with the mainstream algorithms in the drone-based dual-modality object tracking dataset VTUAV. Additionally, the algorithm is lightweight and can be easily deployed and executed on a drone edge computing platform. In summary, the proposed algorithm is mainly applicable to the field of drone dual-modality object tracking and the algorithm is optimized so that it can be deployed on the drone edge computing platform. The effectiveness of the algorithm is proved by experiments and the scope of drone object tracking is extended effectively.
Article
We address the problem of multi-modal object tracking in video and explore various options available for fusing the complementary information conveyed by the visible (RGB) and thermal infrared (TIR) modalities, including pixel-level, feature-level and decision-level fusion. Specifically, in contrast to the existing approaches, we propose and develop the paradigm for combining multi-modal information for image fusion at pixel level. At the feature level, two different kinds of fusion strategies are investigated for completeness, i.e., the attention-based online fusion strategy and the offline-trained fusion block. At the decision level, a novel fusion strategy is put forward, inspired by the success of the simple averaging configuration which has shown so much promise. The effectiveness of the proposed decision-level fusion strategy owes to a number of innovative contributions, including a dynamic weighting of the RGB and TIR contributions and a linear template update operation. A variant of the proposed decision fusion method produced the winning tracker at the Visual Object Tracking Challenge 2020 (VOT-RGBT2020). A comprehensive comparison of the innovative pixel and feature-level fusion strategies with the proposed decision-level fusion method highlights the advantages fusing multimodal information at the decision score level. Extensive experimental results on five challenging datasets, i.e., GTOT, VOT-RGBT2019, RGBT234, LasHeR and VOT-RGBT2020, demonstrate the effectiveness and robustness of the proposed method, compared to the state-of-the-art approaches. The Code is available at https://github.com/Zhangyong-Tang/DFAT.
Article
Salient object detection (SOD) is an important task in computer vision that aims to identify visually conspicuous regions in images. RGB-Thermal SOD combines the two spectra to achieve better segmentation results. However, most existing methods for RGB-T SOD use boundary maps to learn sharp boundaries, which leads to sub-optimal performance because the interactions between isolated boundary pixels and other confident pixels are ignored. To address this issue, we propose a novel position-aware relation learning network (PRLNet) for RGB-T SOD. PRLNet explores the distance and direction relationships between pixels by designing an auxiliary task and optimizing the feature structure to strengthen intra-class compactness and inter-class separation. Our method consists of two main components: a signed distance map auxiliary module (SDMAM) and a feature refinement approach with direction field (FRDF). SDMAM improves the encoder feature representation by considering the distance relationship between foreground-background pixels and boundaries, which increases the inter-class separation between foreground and background features. FRDF rectifies the features of boundary neighborhoods by exploiting the features inside salient objects, using the direction relationships of object pixels to enhance the intra-class compactness of salient features. In addition, we construct a transformer-based decoder to decode the multispectral feature representation. Experimental results on three public RGB-T SOD datasets demonstrate that the proposed method not only outperforms state-of-the-art methods but can also be integrated with different backbone networks in a plug-and-play manner. Ablation studies and visualizations further demonstrate the validity and interpretability of our method.
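One plausible construction of the signed distance map auxiliary target is the standard signed Euclidean distance transform of the ground-truth mask, sketched below; the paper's exact normalization and sign convention are unknown and assumed here.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask):
    """Signed distance map from a binary foreground mask.

    Positive inside the salient object, negative outside, zero on
    the boundary -- an assumed convention for an SDM-style target.
    """
    mask = mask.astype(bool)
    dist_in = distance_transform_edt(mask)    # fg pixels: distance to bg
    dist_out = distance_transform_edt(~mask)  # bg pixels: distance to fg
    return dist_in - dist_out
```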
Article
Full-text available
Existing deep Thermal InfraRed (TIR) trackers use only semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is trained only on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One branch focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptively learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
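The adaptive ensemble of the two similarity maps can be illustrated with a minimal stand-in: two learnable logits, softmax-normalized into fusion weights and trained jointly with the rest of the network. The relative-entropy regularizer itself is omitted here, so this is a simplified assumption about the subnetwork, not its exact form.

```python
import torch
import torch.nn as nn

class SimilarityEnsemble(nn.Module):
    """Adaptively weight semantic and structural similarity maps.

    Minimal sketch: the actual relative-entropy-based ensemble may
    use a different parameterization and an additional KL term.
    """
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))  # learned at training time

    def forward(self, sim_semantic, sim_structural):
        w = torch.softmax(self.logits, dim=0)       # weights sum to 1
        return w[0] * sim_semantic + w[1] * sim_structural
```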
Article
Full-text available
Thermal infrared (TIR) pedestrian tracking is an important component of numerous computer vision applications and has a major advantage: it can track pedestrians in total darkness. The ability to evaluate TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field; however, no such benchmark dataset previously existed. In this paper, we develop a TIR pedestrian tracking dataset for TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations, and each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. Furthermore, to gain more insight into the TIR pedestrian tracker, we divided its functions into three components: the feature extractor, the motion model, and the observation model. We then conducted three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide guidelines for future research.
Conference Paper
Full-text available
Compared with visible object tracking, thermal infrared (TIR) object tracking can track an arbitrary target in total darkness since it is not influenced by illumination variations. However, many unwanted attributes constrain the potential of TIR tracking, such as the absence of visual color patterns and low resolution. Recently, the structured output support vector machine (SOSVM) and the discriminative correlation filter (DCF) have each been successfully applied to visible object tracking. Motivated by these, in this paper we propose a large margin structured convolution operator (LMSCO) to achieve efficient TIR object tracking. To improve tracking performance, we employ spatial regularization and implicit interpolation to obtain continuous deep feature maps, including deep appearance features and deep motion features, of the TIR targets. Finally, a collaborative optimization strategy is exploited to update the operators efficiently. Our approach not only inherits the strong discriminative capability of SOSVM but also achieves accurate and robust tracking with higher-dimensional features and denser samples. To the best of our knowledge, we are the first to combine the advantages of DCF and SOSVM for TIR object tracking. Comprehensive evaluations on two thermal infrared tracking benchmarks, i.e., VOT-TIR2015 and VOT-TIR2016, clearly demonstrate that our LMSCO tracker achieves impressive results and outperforms most state-of-the-art trackers in terms of accuracy and robustness at a sufficient frame rate.
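For readers unfamiliar with the DCF side of this combination, the textbook single-channel correlation filter has a closed form in the Fourier domain, sketched below. This grounds the discussion only; LMSCO adds structured-output learning, spatial regularization, and continuous interpolation on top of this basic formulation.

```python
import numpy as np

def train_dcf(x, y, lam=1e-2):
    """Closed-form single-channel correlation filter (MOSSE-style).

    x: training patch, y: desired Gaussian-shaped response,
    lam: ridge regularizer. All arrays share the same 2-D shape.
    """
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def detect(filt, z):
    """Correlate the filter with a search patch; the response peak
    gives the estimated target location."""
    resp = np.real(np.fft.ifft2(filt * np.fft.fft2(z)))
    return np.unravel_index(resp.argmax(), resp.shape)
```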
Article
The use of both off-the-shelf and end-to-end trained deep networks has significantly improved the performance of visual tracking on RGB videos. However, the lack of large labeled datasets hampers the use of convolutional neural networks for tracking in thermal infrared (TIR) images; therefore, most state-of-the-art methods for TIR tracking are still based on hand-crafted features. To address this problem, we propose to use image-to-image translation models, which allow us to translate the abundantly available labeled RGB data into synthetic TIR data. We explore both paired and unpaired image translation models for this purpose. These methods provide us with a large labeled dataset of synthetic TIR sequences, on which we can train end-to-end optimal features for tracking. To the best of our knowledge, we are the first to train end-to-end features for TIR tracking. We perform extensive experiments on the VOT-TIR2017 dataset and show that a network trained on a large dataset of synthetic TIR data obtains better performance than one trained on the available real TIR data, while combining both data sources leads to further improvement. In addition, when we combine the network with motion features, we outperform the state of the art with a relative gain of over 10%, clearly showing the efficiency of using synthetic data to train end-to-end TIR trackers.
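The data-generation step amounts to running every labeled RGB frame through a trained translation generator; the box annotations transfer unchanged because the translation preserves geometry. The sketch below assumes a hypothetical pix2pix/CycleGAN-style `generator` module, not the paper's released model.

```python
import torch

@torch.no_grad()
def rgb_to_synthetic_tir(generator, rgb_frames):
    """Translate labeled RGB frames into synthetic TIR frames.

    generator: hypothetical image-to-image translation network
    mapping a (1, 3, H, W) RGB tensor to a (1, 1, H, W) TIR tensor.
    rgb_frames: iterable of (3, H, W) tensors.
    """
    generator.eval()
    return [generator(f.unsqueeze(0)).squeeze(0) for f in rgb_frames]
```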
Conference Paper
This work presents a novel end-to-end trainable CNN model for high-performance visual object tracking. It learns both low-level fine-grained representations and a high-level semantic embedding space in a mutually reinforced way, and a multi-task learning strategy is proposed to perform correlation analysis on the representations from both levels. In particular, a fully convolutional encoder-decoder network is designed to reconstruct the original visual features from the semantic projections, preserving all the geometric information. Moreover, the correlation filter layer working on the fine-grained representations leverages a global context constraint for accurate object appearance modeling; the correlation filter in this layer is updated online efficiently without network fine-tuning. The proposed tracker therefore benefits from two complementary effects: the adaptability of the fine-grained correlation analysis and the generalization capability of the semantic embedding.. Extensive experimental evaluations on four popular benchmarks demonstrate its state-of-the-art performance.
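The multi-task strategy described above typically reduces, at training time, to a weighted sum of per-level objectives. The sketch below shows that structure with illustrative placeholder weights; the paper's actual loss terms and coefficients are not specified here.

```python
import torch

def multi_task_loss(loss_semantic, loss_cf, loss_recon,
                    w_sem=1.0, w_cf=1.0, w_rec=0.1):
    """Weighted sum of three assumed objectives: a semantic-embedding
    loss, a correlation-filter response loss, and an encoder-decoder
    reconstruction loss, optimized jointly. Weights are placeholders."""
    return w_sem * loss_semantic + w_cf * loss_cf + w_rec * loss_recon
```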