Article

Learning Dual-Level Deep Representation for Thermal Infrared Tracking

Abstract

The feature models used by existing Thermal InfraRed (TIR) tracking methods are usually learned from RGB images due to the lack of a large-scale TIR image training dataset. However, these feature models are less effective at representing TIR objects, and they struggle to distinguish distractors because they do not contain fine-grained discriminative information. To this end, we propose a dual-level feature model containing a TIR-specific discriminative feature and a fine-grained correlation feature for robust TIR object tracking. Specifically, to distinguish inter-class TIR objects, we first design an auxiliary multi-classification network to learn the TIR-specific discriminative feature. Then, to recognize intra-class TIR objects, we propose a fine-grained aware module to learn the fine-grained correlation feature. These two kinds of features complement each other and represent TIR objects at the inter-class and intra-class levels, respectively. The two feature models are constructed within a multi-task matching framework and are jointly optimized on the TIR object tracking task. In addition, we develop a large-scale TIR image dataset to train the network for learning TIR-specific feature patterns. To the best of our knowledge, this is the largest TIR tracking training dataset, with the richest set of object classes and scenarios. To verify the effectiveness of the proposed dual-level feature model, we propose an offline TIR tracker (MMNet) and an online TIR tracker (ECO-MM) based on the feature model and evaluate them on three TIR tracking benchmarks. Extensive experimental results on these benchmarks demonstrate that the proposed algorithms perform favorably against state-of-the-art methods.
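As a rough illustration of the idea described above, the sketch below shows how a dual-level representation could be assembled from a shared backbone feature: a discriminative head supervised by an auxiliary multi-class classifier and a fine-grained head, whose outputs are concatenated for matching. All module names, channel sizes, and the class count are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a dual-level TIR feature model (names and sizes are assumptions,
# not the authors' released code).
import torch
import torch.nn as nn

class DualLevelFeature(nn.Module):
    def __init__(self, in_ch=256, num_classes=30):
        super().__init__()
        # TIR-specific discriminative branch, supervised by an auxiliary classifier.
        self.disc_head = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True))
        self.aux_classifier = nn.Linear(256, num_classes)
        # Fine-grained correlation branch for intra-class detail.
        self.fine_head = nn.Sequential(
            nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        disc = self.disc_head(x)   # inter-class level feature
        fine = self.fine_head(x)   # intra-class level feature
        logits = self.aux_classifier(disc.mean(dim=(2, 3)))  # auxiliary multi-class output
        return torch.cat([disc, fine], dim=1), logits

feat_model = DualLevelFeature()
feat, cls_logits = feat_model(torch.randn(1, 256, 22, 22))
print(feat.shape, cls_logits.shape)  # torch.Size([1, 512, 22, 22]) torch.Size([1, 30])
```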
... In addition, for tracking tasks, it is desirable to use a feature model that can both distinguish between different classes of targets and identify differences between targets of the same class. However, due to the unfavorable properties of thermal infrared images, similar thermal infrared targets usually have a high degree of visual and semantic similarity [6], [7], [8]. Existing feature models are not designed or optimized for this problem, so they struggle to distinguish such similar targets, which causes tracking algorithms to drift easily toward similar distractors during tracking [9], [10], [11], [12]. ...
Article
Thermal infrared (TIR) target tracking is susceptible to occlusion and interference from similar objects, which noticeably degrades tracking results. To address this problem, we develop an Aligned Spatial-Temporal Memory network-based Tracking method (ASTMT) for the TIR target tracking task. Specifically, we model the scene information in the TIR tracking scenario with a spatial-temporal memory network, which effectively stores scene information and reduces similarity interference, benefiting target estimation. In addition, we use an aligned matching module to correct the parameters of the spatial-temporal memory network, which effectively alleviates the impact of occlusion on target estimation and further boosts tracking accuracy. Ablation studies demonstrate that both the spatial-temporal memory network and the aligned matching module in the proposed ASTMT tracker are highly effective. Our ASTMT tracking method performs well on the PTB-TIR and LSOTB-TIR benchmarks compared with other tracking methods.
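A minimal sketch of the memory-read idea (attention over stored keys to retrieve scene features); the shapes and scoring below are assumptions for illustration, not the ASTMT implementation.

```python
# Hypothetical memory-read sketch: attention over stored spatial-temporal memory keys
# (illustrative only; not the ASTMT implementation).
import torch
import torch.nn.functional as F

memory_keys = torch.randn(10, 64)     # 10 stored frames, 64-dim keys
memory_values = torch.randn(10, 256)  # associated scene features
query = torch.randn(1, 64)            # current-frame query

weights = F.softmax(query @ memory_keys.t() / 64 ** 0.5, dim=-1)  # (1, 10) attention weights
readout = weights @ memory_values                                  # (1, 256) retrieved scene feature
print(readout.shape)
```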
... Liu et al. 86 introduced a dual-level feature framework that includes thermal infrared-specific discriminative features and fine-grained correlation features. These two feature models are jointly optimized for thermal infrared tracking with the help of a multi-task matching framework. ...
... Bertinetto et al. [12] first developed a two-branch fully-convolutional Siamese network for tracking (SiamFC), which compares the similarity between the target template and the search region and locates the target via a score map. Liu et al. [22] proposed a multi-level similarity network within the Siamese framework for thermal infrared visual tracking (MLSSNet). To handle target scale variation effectively, Li et al. [11] introduced a region proposal network (RPN) into the SiamFC tracking framework (SiamRPN) and trained the classification and regression branches simultaneously. ...
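The SiamFC-style matching mentioned in this context boils down to cross-correlating template features with search-region features to obtain a score map; a minimal sketch with assumed feature shapes, not the original implementation:

```python
# Minimal sketch of SiamFC-style cross-correlation (illustrative, not the original code).
import torch
import torch.nn.functional as F

def xcorr(template_feat, search_feat):
    """Slide the template feature over the search feature to produce a score map."""
    b, c, h, w = template_feat.shape
    # Treat each template in the batch as a convolution kernel over its own search region.
    search = search_feat.view(1, b * c, *search_feat.shape[2:])
    kernel = template_feat.view(b * c, 1, h, w)
    score = F.conv2d(search, kernel, groups=b * c)            # per-channel correlation
    return score.view(b, c, *score.shape[2:]).sum(dim=1, keepdim=True)  # sum over channels

z = torch.randn(2, 256, 6, 6)    # template features
x = torch.randn(2, 256, 22, 22)  # search-region features
print(xcorr(z, x).shape)         # torch.Size([2, 1, 17, 17])
```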
Article
Recently, deep learning (DL) based trackers have attracted tremendous interest for their high performance. Despite this remarkable success, most trackers using deep convolutional features neglect tracking speed, which is crucial for aerial tracking on mobile devices. In this paper, we propose an efficient and effective transformer-based aerial tracker within a Siamese framework, inheriting the merits of both transformer and Siamese architectures. Specifically, the outputs of multiple convolution layers are fed into a transformer to construct robust features of the template patch and the search patch, respectively. Consequently, the interdependencies between low-level and semantic information are interactively fused to improve the encoding of target appearance. Finally, traditional depth-wise cross-correlation is used to generate a similarity map for object localization and bounding-box regression. Extensive experimental results on three popular benchmarks (DTB70, UAV123@10fps, and UAV20L) demonstrate that our proposed tracker outperforms 12 other state-of-the-art trackers and achieves a real-time tracking speed of 71.3 frames per second (FPS) on a GPU, making it applicable to mobile platforms.
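A rough sketch of the feature-fusion step described above, in which multi-layer CNN outputs are flattened into tokens and refined by a standard transformer encoder; the layer choices and shapes are assumptions, not the paper's implementation.

```python
# Rough sketch: fusing multi-layer CNN features with a transformer encoder
# (shapes and layer choices are assumptions, not the paper's implementation).
import torch
import torch.nn as nn

feats = [torch.randn(1, 256, 16, 16), torch.randn(1, 256, 8, 8)]  # outputs of two conv stages
tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in feats], dim=1)  # (1, 256 + 64, 256)

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2)
fused = encoder(tokens)   # low-level and semantic tokens interact via self-attention
print(fused.shape)        # torch.Size([1, 320, 256])
```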
Article
Siamese-based trackers have achieved excellent performance and attracted extensive attention; they regard the tracking task as similarity learning between the target template and search regions. However, most Siamese-based trackers do not effectively exploit correlations between spatial and channel-wise information to represent targets. Meanwhile, cross-correlation is a linear matching method that neglects structured, part-level information. In this paper, we propose a novel tracking algorithm for feature extraction of target templates and search-region images. Based on convolutional neural networks and shuffle attention, the tracking algorithm computes the similarity between the template and a search region through graph attention matching. The proposed tracking algorithm exploits the correlations between spatial and channel-wise information to highlight the target region. Moreover, the graph matching greatly alleviates the influence of appearance variations such as partial occlusion. Extensive experiments demonstrate that the proposed tracking algorithm achieves excellent results on multiple challenging benchmarks and performs favorably against other state-of-the-art methods.
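A simplified stand-in for the graph attention matching described above, propagating template-node information to each search-region node via attention weights; the node counts and dimensions are assumed for illustration only.

```python
# Illustrative sketch of graph-attention matching between template and search-region nodes
# (a simplified stand-in for the paper's graph attention module).
import torch
import torch.nn.functional as F

template_nodes = torch.randn(36, 256)   # 6x6 template pixels treated as graph nodes
search_nodes = torch.randn(484, 256)    # 22x22 search-region pixels treated as graph nodes

attn = F.softmax(search_nodes @ template_nodes.t() / 256 ** 0.5, dim=-1)  # (484, 36)
aggregated = attn @ template_nodes      # template information propagated to each search node
print(aggregated.shape)                 # torch.Size([484, 256])
```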
Article
Tracking-by-regression is a new paradigm for online Multi-Object Tracking (MOT). It unifies detection and tracking into a single network by associating targets through regression, significantly reducing the complexity of data association. However, owing to noisy features from nearby occlusions and distractors, the regression is vulnerable and unaware of inter-object occlusions and intra-class distractors, so the regressed bounding boxes can be wrongly suppressed or easily drift. Meanwhile, the commonly used bounding-box-based post-processing cannot remedy the false negatives and false assignments caused by regression. To address these challenges, we propose to leverage regression tubes as input for the regression-based tracker, which provide spatial-temporal information to enhance tracking performance. Specifically, we propose a novel tube re-localization strategy that obtains robust regressions and recovers missed targets. A tube-based NMS (T-NMS) strategy that manages regressions at the tube level is also proposed, including a tube IoU (T-IoU) scheme for measuring positional relations and tube re-scoring (T-RS) for evaluating the quality of candidate tubes. Finally, a tube re-assignment strategy is further employed for robust cost measurement and to revise false assignments using motion cues. We evaluate our method on benchmarks including MOT16, MOT17, and MOT20. The results show that our method significantly improves the baseline, mitigates the challenges of the regression-based tracker, and achieves very competitive tracking performance.
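A tube IoU of the kind mentioned above can be sketched as the mean per-frame IoU between two frame-aligned box sequences; this is a hypothetical formulation for illustration, and the paper's exact T-IoU definition may differ.

```python
# Hypothetical tube-IoU sketch: mean per-frame IoU over two aligned box tubes
# (the paper's exact T-IoU definition may differ).
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def tube_iou(tube_a, tube_b):
    """Mean IoU over the frame-aligned boxes of two tubes."""
    return float(np.mean([box_iou(a, b) for a, b in zip(tube_a, tube_b)]))

tube1 = [(10, 10, 50, 50), (12, 11, 52, 51)]
tube2 = [(11, 10, 51, 50), (30, 30, 70, 70)]
print(round(tube_iou(tube1, tube2), 3))
```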
Article
Infrared object tracking is an important research field in computer vision. Existing tracking-by-detection algorithms usually use appearance information to distinguish the target from the surrounding environment. However, the complex environment and the colorless, often textureless infrared target make object tracking difficult, causing existing algorithms to lose the target. In this paper, to mine the correlation between video frames, we propose a simple and effective way to exploit motion information through frame differences. Specifically, using adjacent-frame differencing, we propose a dynamic infrared object tracking framework to assist existing semantic-aware tracking algorithms. Experiments on the Anti-UAV infrared, VOT-TIR2015, VOT-TIR2016, PTB-IR, and LSOTB-IR datasets show its robustness to the different challenges of infrared scenes with high efficiency.
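The frame-difference motion cue can be illustrated in a few lines; this is a simplified sketch and omits how the cue is combined with the semantic-aware tracker.

```python
# Simplified frame-difference motion cue (illustrative sketch only).
import numpy as np

def motion_map(prev_frame, curr_frame, thresh=15):
    """Absolute difference of consecutive grayscale frames, thresholded to a motion mask."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > thresh).astype(np.uint8)

prev_frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
curr_frame = np.random.randint(0, 256, (240, 320), dtype=np.uint8)
mask = motion_map(prev_frame, curr_frame)
print(mask.shape, mask.mean())  # fraction of "moving" pixels
```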
Article
Most existing Siamese tracking methods follow the overall framework of SiamRPN, adopting its general network architecture and its local, linear cross-correlation operation to integrate search and template features, which restricts the introduction of more sophisticated structures for expressive appearance representation as well as further improvements in tracking performance. Motivated by recent progress in vision Transformers and MLPs, we first explore a global, nonlinear, and scale-invariant similarity measuring manner called Dynamic Cross-Attention (DCA). Specifically, template features are first decomposed along the spatial and channel dimensions, and Transformer encoders are then applied to adaptively excavate long-range feature interdependencies, producing reinforced kernels. As the kernels are successively multiplied with the search feature map, similarity scores between all pixels on the feature maps are estimated at once while the spatial scale of the search features remains constant. Furthermore, we redesign each part of our Siamese network to further remedy the framework limitation with the assistance of DCA. Comprehensive experimental results on large-scale benchmarks indicate that our Siamese method realizes efficient feature extraction, aggregation, refinement, and interaction, outperforming state-of-the-art trackers.
Article
The Siamese tracking paradigm has achieved great success, providing effective appearance discrimination and size estimation through classification and regression. However, such a paradigm typically optimizes the classification and regression independently, leading to task misalignment (accurate prediction boxes may not have high target confidence scores). In this paper, to alleviate this misalignment, we propose a novel tracking paradigm called SiamLA. Within this paradigm, a series of simple yet effective localization-aware components are introduced to generate localization-aware target confidence scores. Specifically, with the proposed localization-aware dynamic label (LADL) loss and localization-aware label smoothing (LALS) strategy, collaborative optimization between the classification and regression is achieved, enabling classification scores to be aware of the location state rather than just appearance similarity. Besides, we propose a separate localization-aware quality prediction (LAQP) branch to produce location quality scores that further modify the classification scores. To guide a more reliable modification, a novel localization-aware feature aggregation (LAFA) module is designed and embedded into this branch. Consequently, the resulting target confidence scores are more discriminative with respect to the location state, so that accurate prediction boxes tend to receive high scores. Extensive experiments are conducted on six challenging benchmarks, including GOT-10k, TrackingNet, LaSOT, TNL2K, OTB100, and VOT2018. Our SiamLA achieves competitive performance in terms of both accuracy and efficiency. Furthermore, a stability analysis reveals that our tracking paradigm is relatively stable, implying that it has potential for real-world applications.
Article
The development of a real-time and robust RGB-T tracker is an extremely challenging task because the tracked object may suffer from shared and modality-specific challenges in the RGB and thermal (T) modalities. In this work, we observe that implicit attribute information can boost model discriminability, and we propose a novel attribute-driven representation network to improve RGB-T tracking performance. First, according to appearance changes in RGB-T tracking scenarios, we divide the major and special challenges into four typical attributes: extreme illumination, occlusion, motion blur, and thermal crossover. Second, we design an attribute-driven residual branch for each heterogeneous attribute to mine the attribute-specific property and thereby build a powerful residual representation for object modeling. Furthermore, we aggregate these representations at the channel and pixel levels using the proposed attribute ensemble network (AENet) to adaptively fit the attribute-agnostic tracking process. The AENet can effectively perceive appearance changes while suppressing distractors. Finally, we conduct numerous experiments on three RGB-T tracking benchmarks to compare the proposed trackers with other state-of-the-art methods. Experimental results show that our tracker achieves very competitive results at a real-time tracking speed. Code will be available at https://github.com/zhang-pengyu/ADRNet.
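A loose sketch of the attribute-driven residual idea, with one residual branch per attribute added to a shared feature; the branch structure and sizes are assumptions, not the ADRNet code.

```python
# Illustrative sketch of attribute-driven residual branches (names and sizes assumed).
import torch
import torch.nn as nn

attributes = ["extreme_illumination", "occlusion", "motion_blur", "thermal_crossover"]
branches = nn.ModuleDict({a: nn.Conv2d(256, 256, 3, padding=1) for a in attributes})

x = torch.randn(1, 256, 22, 22)   # shared RGB-T feature
residuals = torch.stack([branches[a](x) for a in attributes]).sum(dim=0)
out = x + residuals               # attribute-aware residual representation
print(out.shape)
```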
Article
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is trained only on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptively learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
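A simplified sketch of combining semantic and structural similarity maps with learned weights; the paper uses a relative-entropy-based ensemble subnetwork, so this is only an illustrative stand-in.

```python
# Simplified sketch: combining semantic and structural similarity maps with learned weights
# (the paper's relative-entropy-based ensemble subnetwork is not reproduced here).
import torch
import torch.nn as nn

class SimilarityEnsemble(nn.Module):
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))  # learnable weights for the two similarities

    def forward(self, semantic_map, structural_map):
        w = torch.softmax(self.logits, dim=0)
        return w[0] * semantic_map + w[1] * structural_map

ensemble = SimilarityEnsemble()
fused = ensemble(torch.randn(1, 1, 17, 17), torch.randn(1, 1, 17, 17))
print(fused.shape)
```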
Article
In this study, we propose a novel RGB-T tracking framework by jointly modeling both appearance and motion cues. First, to obtain a robust appearance model, we develop a novel late fusion method to infer the fusion weight maps of both RGB and thermal (T) modalities. The fusion weights are determined by using offline-trained global and local multimodal fusion networks, and then adopted to linearly combine the response maps of RGB and T modalities. Second, when the appearance cue is unreliable, we comprehensively take motion cues, i.e., target and camera motions, into account to make the tracker robust. We further propose a tracker switcher to switch the appearance and motion trackers flexibly. Numerous results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms.
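The late-fusion step described above can be illustrated as a pixel-wise weighted blend of the two response maps; how the weight maps are predicted by the offline-trained fusion networks is omitted, and the weights below are placeholders.

```python
# Illustrative late-fusion sketch: pixel-wise weight maps blending RGB and thermal response maps.
import numpy as np

rgb_response = np.random.rand(17, 17)
tir_response = np.random.rand(17, 17)
w_rgb = np.random.rand(17, 17)   # placeholder for the fusion network's predicted weight map
w_tir = 1.0 - w_rgb

fused_response = w_rgb * rgb_response + w_tir * tir_response
peak = np.unravel_index(np.argmax(fused_response), fused_response.shape)
print(peak)                      # location of the fused response peak
```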
Article
Recent deep trackers have shown superior performance in visual tracking. In this article, we propose a cascaded correlation refinement approach to improve the robustness of deep tracking. The core idea is to address accurate target localization and reliable model update in a collaborative way. To this end, our approach cascades multiple stages of correlation refinement to progressively refine target localization. The localized object can then be used to learn an accurate on-the-fly model, improving the reliability of model update. Meanwhile, we introduce an explicit measure to identify tracking failure and then leverage a simple yet effective look-back scheme to adaptively incorporate the initial model and the on-the-fly model when updating the tracking model. As a result, the tracking model can localize the target more accurately. Extensive experiments on OTB2013, OTB2015, VOT2016, VOT2018, UAV123, and GOT-10k demonstrate that the proposed tracker achieves superior robustness compared with the state of the art.