Article

# Structural target-aware model for thermal infrared tracking

Authors:
• Harbin Institute of Technology (Shenzhen)

## Abstract

Thermal InfraRed (TIR) target trackers are easily distracted by similar objects and are susceptible to target occlusion. To address these problems, we propose a structural target-aware model (STAMT) for thermal infrared target tracking tasks. Specifically, the proposed STAMT tracker learns a target-aware model that pays more attention to the target area, allowing it to accurately distinguish the target from similar objects. In addition, considering that the target may be partially occluded during tracking, a structural weight model is proposed to locate the target through its unoccluded, reliable parts. Ablation studies show the effectiveness of each component in the proposed tracker. Without bells and whistles, the experimental results demonstrate that our STAMT tracker performs favorably against state-of-the-art trackers on the PTB-TIR and LSOTB-TIR datasets.

## Citations

... C. State-of-the-art comparison. PTB-TIR [33]: This is a thermal infrared pedestrian tracking benchmark containing 60 testing sequences. We first present the comparative experimental results of our ASTMT and the MCFTS [15], HSSNet [16], SRDCF [35], MMNet [36], STAMT [37], TADT [38], MLSSNet [39], CREST [40], UDT [41], and SiamTri [42] trackers on the PTB-TIR [33] benchmark in Fig. 4. The results show that our ASTMT tracker obtains the best score on the success plots. Although our ASTMT tracker scores lower than the SRDCF [35], MMNet [36], and STAMT [37] trackers on the precision plots, it outperforms these trackers in success. This article has been accepted for publication in IEEE Transactions on Circuits and Systems--II: Express Briefs. ...
... LSOTB-TIR [34]: As a widely used TIR tracking benchmark, LSOTB-TIR [34] contains 120 testing video sequences. Fig. 5 illustrates the experimental comparison of our ASTMT and the MCFTS [15], HSSNet [16], SRDCF [35], STAMT [37], TADT [38], MLSSNet [39], CREST [40], UDT [41], SiamTri [42], ATOM [43], SiamRPN++ [44], and SiamMask [45] trackers on the LSOTB-TIR benchmark. Fig. 5 shows that our tracker obtains the highest scores on the normalized precision and success plots. ...
Article
Thermal infrared (TIR) target tracking is susceptible to occlusion and similarity interference, both of which noticeably affect tracking results. To resolve this problem, we develop an Aligned Spatial-Temporal Memory network-based Tracking method (ASTMT) for the TIR target tracking task. Specifically, we model the scene information in the TIR tracking scenario using a spatial-temporal memory network, which effectively stores scene information and suppresses interference from objects similar to the target. In addition, we use an aligned matching module to correct the parameters of the spatial-temporal memory network model, which effectively alleviates the impact of occlusion on target estimation and thus further boosts tracking accuracy. Ablation study experiments demonstrate that the spatial-temporal memory network and the aligned matching module in the proposed ASTMT tracker are highly effective. Our ASTMT tracking method performs well on the PTB-TIR and LSOTB-TIR benchmarks compared with other tracking methods.
... Further tailoring of the acquisition and sensor degradation models supports online calibration using a range of input variables (sensor temperature, ambient temperature, degradation indicators and others), i.e., enables robust shutter-less cameras. The proposed modeling approach can be combined with modern AI-based methods [23], e.g., object tracking for infrared images [24], as well as to improve quality assurance and inspection applications such as infrared thermography [25], drones for fire detection and prevention [26], material defect detection [27] and even wind turbine erosion detection [28] while incorporating physical models of different materials. ...
Article
Full-text available
A key challenge in further improving infrared (IR) sensor capabilities is the development of efficient data pre-processing algorithms. This paper addresses this challenge by providing a mathematical model and synthetic data generation framework for an uncooled IR sensor. The developed model is capable of generating synthetic data for the design of data pre-processing algorithms of uncooled IR sensors. The mathematical model accounts for the physical characteristics of the focal plane array, bolometer readout, optics and the environment. The framework permits the sensor simulation with a range of sensor configurations, pixel defectiveness, non-uniformity and noise parameters.
... Dong et al. [45] introduce a triplet loss to extract expressive deep features for visual tracking by adding it into the Siamese network framework in place of the pairwise loss for model training. In [48], a structured target-aware model is proposed to improve target tracking performance in TIR scenarios. ...
Article
Full-text available
Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its popularity in a broad range of real-world scenarios, a number of large-scale benchmark datasets have been established, on which numerous methods have been developed and demonstrated with significant progress in recent years - predominantly by recent deep learning (DL)-based methods. This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics. It also extensively evaluates and analyzes the leading visual tracking methods. First, the fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from nine key aspects: network architecture, network exploitation, network training for visual tracking, network objective, network output, exploitation of correlation filter advantages, aerial-view tracking, long-term tracking, and online tracking. Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized. Third, the state-of-the-art DL-based methods are comprehensively examined on a set of well-established benchmarks: OTB2013, OTB2015, VOT2018, LaSOT, UAV123, UAVDT, and VisDrone2019. Finally, by conducting critical analyses of these state-of-the-art trackers quantitatively and qualitatively, their pros and cons under various common scenarios are investigated. The survey may serve as a practical guide for practitioners weighing when and under what conditions to choose which method(s). It also facilitates a discussion on ongoing issues and sheds light on promising research directions.
Article
Full-text available
The training of a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning the feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and utilize the proposed multi-cycle consistency loss to learn a feature extraction network. Furthermore, we propose a similarity dropout strategy that drops low-quality training sample pairs, and adopt a cycle trajectory consistency loss in each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and utilize a Siamese correlation tracking framework to locate the target using forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance compared with state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.
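The forward-backward consistency that self-SDCT exploits can be illustrated with a toy round-trip check (the tracker stub and function names here are illustrative, not from the paper; the real method applies this idea to deep-feature correlation tracking to generate pseudo-labels):

```python
def forward_backward_error(track, start_box, n_frames):
    """Track forward n_frames, then backward to the start frame; the distance
    between the original box and the round-trip box measures consistency."""
    box = start_box
    for _ in range(n_frames):           # forward pass
        box = track(box, direction=+1)
    for _ in range(n_frames):           # backward pass
        box = track(box, direction=-1)
    # Euclidean distance between original and round-trip box centres
    return ((box[0] - start_box[0]) ** 2 + (box[1] - start_box[1]) ** 2) ** 0.5

# A perfectly consistent (invertible) toy tracker: shifts 2 px per frame
perfect = lambda box, direction: (box[0] + 2 * direction, box[1])
assert forward_backward_error(perfect, (10, 10), 5) == 0.0

# A tracker that drifts 1 px on every backward step accumulates error
drift = lambda box, direction: (box[0] + 2 * direction + (1 if direction < 0 else 0), box[1])
assert forward_backward_error(drift, (10, 10), 5) == 5.0
```

A low round-trip error suggests a reliable trajectory, which is why such pairs can serve as pseudo-labels, while high-error pairs are candidates for the similarity dropout described in the abstract.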
Article
Full-text available
Correlation filter-based trackers (CFTs) have recently shown remarkable performance in the field of visual object tracking. The advantage of these trackers originates from their ability to convert time-domain calculations into frequency-domain calculations. However, a significant problem of these CFTs is that the model is insufficiently robust when the tracking scenario is too complicated, so ideal tracking performance cannot be achieved. Recent work has attempted to resolve this problem by reducing boundary effects through effectively modeling the foreground and background of the target (e.g., CFLB, BACF, and CACF). Although these methods have demonstrated reasonable performance, they are often affected by occlusion, deformation, scale variation, and other challenging scenes. In this study, considering the relationship between the current frame and the previous frame of a moving target in a time series, we propose a temporal regularization strategy to improve the BACF tracker (denoted as TRBACF), a typical representative of the aforementioned trackers. The TRBACF tracker can efficiently adapt the model to changes in the tracking scene, thereby enhancing its robustness and accuracy. Moreover, the objective function of our TRBACF tracker can be solved by an improved alternating direction method of multipliers, which speeds up the calculation in the Fourier domain. Extensive experimental results demonstrate that the proposed TRBACF tracker achieves competitive tracking performance compared with state-of-the-art trackers.
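The temporal-regularization idea can be sketched as an extra penalty tying the current filter to the previous frame's filter (a scalar toy objective solved in closed form; this is a simplification for intuition, not BACF's full ADMM solver):

```python
def solve_temporally_regularized(x, y, w_prev, lam=0.1, mu=1.0):
    """Minimize ||x*w - y||^2 + lam*w^2 + mu*(w - w_prev)^2 for scalar w.
    The mu term penalizes deviation from the previous frame's filter w_prev.
    Setting the derivative 2x(xw - y) + 2*lam*w + 2*mu*(w - w_prev) to zero gives:"""
    return (x * y + mu * w_prev) / (x * x + lam + mu)

# With a strong temporal term, the new filter stays near the old one rather
# than jumping to the raw least-squares fit (y/x = 5 here).
w_old = 1.0
w_new = solve_temporally_regularized(x=1.0, y=5.0, w_prev=w_old, lam=0.0, mu=100.0)
assert abs(w_new - w_old) < abs(5.0 - w_old)   # pulled towards w_old, not the raw fit
```

This is the same mechanism that lets TRBACF smooth abrupt appearance changes between consecutive frames.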
Conference Paper
Full-text available
In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over 730K bounding boxes in total. To the best of our knowledge, LSOTB-TIR is the largest and most diverse TIR object tracking benchmark to date. To evaluate a tracker on different attributes, we define 4 scenario attributes and 12 challenge attributes in the evaluation dataset. By releasing LSOTB-TIR, we encourage the community to develop deep learning based TIR trackers and evaluate them fairly and comprehensively. We evaluate and analyze more than 30 trackers on LSOTB-TIR to provide a series of baselines, and the results show that deep trackers achieve promising performance. Furthermore, we retrain several representative deep trackers on LSOTB-TIR, and their results demonstrate that the proposed training dataset significantly improves the performance of deep TIR trackers. Codes and dataset are available at https://github.com/QiaoLiuHit/LSOTB-TIR.
Article
Full-text available
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is only trained on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork adaptively learns the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and the largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
Conference Paper
Full-text available
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor taking fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing the TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing the TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects in the levels of inter-class and intra-class respectively. These two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to train the network for adapting the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against the state-of-the-art methods. Codes and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Article
Full-text available
Thermal infrared (TIR) pedestrian tracking is one of the important components among the numerous applications of computer vision, and has a major advantage: it can track pedestrians in total darkness. The ability to evaluate a TIR pedestrian tracker fairly, on a benchmark dataset, is significant for the development of this field. However, no such benchmark dataset previously existed. In this paper, we develop a TIR pedestrian tracking dataset for TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations. Each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark dataset using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. Furthermore, to gain more insight into TIR pedestrian trackers, we divided their functions into three components: feature extractor, motion model, and observation model. We then conducted three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
Chapter
Full-text available
Object tracking is still a critical and challenging problem with many applications in computer vision. For this challenge, more and more researchers are applying deep learning to obtain powerful features for better tracking accuracy. In this paper, a novel triplet loss is proposed to extract expressive deep features for object tracking by adding it into the Siamese network framework in place of the pairwise loss for training. Without adding any inputs, our approach is able to utilize more elements for training to achieve more powerful features via combinations of the original samples. Furthermore, we provide a theoretical analysis, combining a comparison of gradients with back-propagation, to prove the effectiveness of our method. In experiments, we apply the proposed triplet loss to three real-time trackers based on Siamese networks. Results on several popular tracking benchmarks show that our variants operate at almost the same frame rate as the baseline trackers, achieve superior tracking performance, and reach accuracy comparable to recent state-of-the-art real-time trackers.
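The margin-based triplet loss described above can be sketched in a few lines (a generic hinge-style formulation on embedding vectors, shown here for intuition; the paper's exact loss and its integration into the Siamese framework may differ):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive embedding towards the
    anchor and push the negative at least `margin` further away."""
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])   # near the anchor
n = np.array([2.0, 0.0])   # far from the anchor
assert triplet_loss(a, p, n) == 0.0   # constraint already satisfied, zero loss
assert triplet_loss(a, n, p) > 0.0    # a violated triplet incurs a positive loss
```

Compared with a pairwise loss, each (anchor, positive, negative) combination contributes a training signal, which is the "more elements for training" effect the abstract mentions.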
Article
Full-text available
Robust and accurate visual tracking is a challenging problem in computer vision. In this paper, we exploit spatial and semantic convolutional features extracted from convolutional neural networks for continuous object tracking. The spatial features retain higher resolution for precise localization, while the semantic features capture more semantic information and fewer fine-grained spatial details. We therefore localize the target by fusing these different features, which improves tracking accuracy. Besides, we construct a multi-scale pyramid correlation filter of the target and extract its spatial features. This filter determines the scale level effectively and tackles target scale estimation. Finally, we present a novel model-updating strategy that exploits the peak-to-sidelobe ratio (PSR) and skewness to measure the overall fluctuation of the response map for efficient tracking. Each contribution above is validated on the 50 image sequences of the tracking benchmark OTB-2013. The experimental comparison shows that our algorithm performs favorably against 12 state-of-the-art trackers.
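The PSR used in such model-update decisions can be computed as below (a standard PSR definition over a response map; the exclusion-window size is an assumption, and the skewness combination from the abstract is not reproduced):

```python
import numpy as np

def psr(response, exclude=5):
    """Peak-to-sidelobe ratio: (peak - sidelobe mean) / sidelobe std.
    A small window around the peak is excluded from the sidelobe statistics."""
    peak_idx = np.unravel_index(response.argmax(), response.shape)
    peak = response[peak_idx]
    mask = np.ones_like(response, dtype=bool)
    r0 = max(0, peak_idx[0] - exclude); r1 = peak_idx[0] + exclude + 1
    c0 = max(0, peak_idx[1] - exclude); c1 = peak_idx[1] + exclude + 1
    mask[r0:r1, c0:c1] = False          # exclude the peak neighbourhood
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-12)

# A sharp, confident response yields a much higher PSR than a flat, noisy one,
# so the tracker can skip model updates when the PSR drops (e.g. occlusion).
rng = np.random.default_rng(0)
noise = rng.standard_normal((64, 64)) * 0.01
sharp = noise.copy(); sharp[32, 32] = 1.0
assert psr(sharp) > psr(noise)
```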
Conference Paper
Full-text available
Compared with visible object tracking, thermal infrared (TIR) object tracking can track an arbitrary target in total darkness since it is not influenced by illumination variations. However, there are many unwanted attributes that constrain the potential of TIR tracking, such as the absence of visual color patterns and low resolution. Recently, structured output support vector machines (SOSVM) and discriminative correlation filters (DCF) have both been successfully applied to visible object tracking. Motivated by these, in this paper, we propose a large margin structured convolution operator (LMSCO) to achieve efficient TIR object tracking. To improve the tracking performance, we employ spatial regularization and implicit interpolation to obtain continuous deep feature maps, including deep appearance features and deep motion features, of the TIR targets. Finally, a collaborative optimization strategy is exploited to significantly update the operators. Our approach not only inherits the strong discriminative capability of SOSVM but also achieves accurate and robust tracking with higher-dimensional features and denser samples. To the best of our knowledge, we are the first to combine the advantages of DCF and SOSVM for TIR object tracking. Comprehensive evaluations on two thermal infrared tracking benchmarks, i.e. VOT-TIR2015 and VOT-TIR2016, clearly demonstrate that our LMSCO tracker achieves impressive results and outperforms most state-of-the-art trackers in terms of accuracy and robustness at a sufficient frame rate.
Article
Full-text available
Discriminative correlation filters (DCFs) have been shown to perform superiorly in visual tracking. They only need a small set of training samples from the initial frame to generate an appearance model. However, existing DCFs learn the filters separately from feature extraction, and update these filters using a moving average operation with an empirical weight. These DCF trackers hardly benefit from end-to-end training. In this paper, we propose the CREST algorithm to reformulate DCFs as a one-layer convolutional neural network. Our method integrates feature extraction, response map generation, and model update into the neural network for end-to-end training. To reduce model degradation during online update, we apply residual learning to take appearance changes into account. Extensive experiments on the benchmark datasets demonstrate that our CREST tracker performs favorably against state-of-the-art trackers.
Article
Full-text available
Unlike visible object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as in rescue and video surveillance at night. However, there are few studies in this field, mainly because thermal infrared images have several unwanted attributes that make it difficult to obtain discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer convolutional neural networks pre-trained on visible images to thermal infrared tracking. We observe that features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while features from the convolution layers are. Besides, features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). Firstly, we use pre-trained convolutional neural networks to extract features from multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give response maps of the target's location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
Article
Full-text available
The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame. Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task. This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter. Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.
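The closed-form Fourier-domain solution this abstract refers to can be sketched as a minimal single-channel correlation filter (variable names and the ridge-regression form are illustrative assumptions, not the paper's exact multi-channel formulation):

```python
import numpy as np

def train_correlation_filter(x, y, lam=1e-2):
    """Solve for W minimizing ||w * x - y||^2 + lam*||w||^2 in the Fourier
    domain. x: training image patch; y: desired response map (same shape)."""
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    # Element-wise closed form: W = conj(X) .* Y / (conj(X) .* X + lam)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def detect(W, z):
    """Apply the learned filter to a new patch z; return the response map."""
    Z = np.fft.fft2(z)
    return np.real(np.fft.ifft2(W * Z))

# Train on a random patch whose desired response is an impulse at the origin;
# the response on the training patch should then peak at that location.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
y = np.zeros((32, 32)); y[0, 0] = 1.0
W = train_correlation_filter(x, y)
resp = detect(W, x)
assert np.unravel_index(resp.argmax(), resp.shape) == (0, 0)
```

Because every operation is element-wise in the Fourier domain, re-training once per frame is cheap, which is the practical property the abstract highlights; interpreting this closed form as a differentiable layer is what lets CFNet backpropagate through it.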
Article
RGBT tracking receives a surge of interest in the computer vision community, but this research field lacks a large-scale and high-diversity benchmark dataset, which is essential both for the training of deep RGBT trackers and for the comprehensive evaluation of RGBT tracking methods. To this end, we present a Large-scale High-diversity benchmark for short-term RGBT tracking (LasHeR) in this work. LasHeR consists of 1224 visible and thermal infrared video pairs with more than 730K frame pairs in total. Each frame pair is spatially aligned and manually annotated with a bounding box, making the dataset well and densely annotated. LasHeR is highly diverse, captured from a broad range of object categories, camera viewpoints, scene complexities and environmental factors across seasons, weathers, day and night. We conduct a comprehensive performance evaluation of 12 RGBT tracking algorithms on the LasHeR dataset and present detailed analysis. In addition, we release the unaligned version of LasHeR to attract research interest in alignment-free RGBT tracking, which is a more practical task in real-world applications. The datasets and evaluation protocols are available at: https://github.com/mmic-lcl/Datasets-and-benchmark-code.
Article
Classifying hard samples in the course of RGBT tracking is a quite challenging problem. Existing methods only focus on enlarging the boundary between positive and negative samples, but ignore the relations of multilevel hard samples, which are crucial for the robustness of hard sample classification. To handle this problem, we propose a novel Multi-Modal Multi-Margin Metric Learning framework named M⁵L for RGBT tracking. In particular, we divide all samples into four parts, including normal positive, normal negative, hard positive and hard negative ones, and aim to leverage their relations to improve the robustness of feature embeddings; e.g., normal positive samples are closer to the ground truth than hard positive ones. To this end, we design a multi-modal multi-margin structural loss to preserve the relations of multilevel hard samples in the training stage. In addition, we introduce an attention-based fusion module to achieve quality-aware integration of different source data. Extensive experiments on large-scale datasets testify that our framework clearly improves the tracking performance and performs favorably against state-of-the-art RGBT trackers.
Article
Tracking in the unmanned aerial vehicle (UAV) scenarios is one of the main components of target tracking tasks. Different from the target tracking task in the general scenarios, the target tracking task in the UAV scenarios is very challenging because of factors such as small scale and aerial view. Although the DCFs-based tracker has achieved good results in tracking tasks in general scenarios, the boundary effect caused by the dense sampling method will reduce the tracking accuracy, especially in UAV tracking scenarios. In this work, we propose learning an adaptive spatial-temporal context-aware (ASTCA) model in the DCFs-based tracking framework to improve the tracking accuracy and reduce the influence of boundary effect, thereby enabling our tracker to more appropriately handle UAV tracking tasks. Specifically, our ASTCA model can learn a spatial-temporal context weight, which can precisely distinguish the target and background in the UAV tracking scenarios. Besides, considering the small target scale and the aerial view in UAV tracking scenarios, our ASTCA model incorporates spatial context information within the DCFs-based tracker, which could effectively alleviate background interference. Extensive experiments demonstrate that our ASTCA method performs favorably against state-of-the-art tracking methods on some standard UAV datasets.
Article
Accurate segmentation is difficult for liver computed tomography (CT) images, since liver CT images do not always have obvious and smooth boundaries, the location of the tumor is not fixed, and its intensity is similar to that of the liver. Although manual and automatic segmentation methods, both traditional and deep learning models, currently exist, none can be specifically and effectively applied to segment liver CT images. In this paper, we propose a new model based on a level set framework for liver CT images in which the energy functional contains three terms: the data fitting term, the length term and the bound term. We then apply the split Bregman method to minimize the energy functional, which makes it converge faster. The proposed model is robust to initial contours and can segment liver CT images with intensity inhomogeneity and unclear boundaries. In the bound term, we use the U-Net to obtain constraint information, which has a considerable influence on effective and accurate segmentation. We also extend our model to a multi-phase level set to obtain contours of the tumor and liver at the same time. Finally, a parallel algorithm is proposed to improve segmentation efficiency. Experimental results and comparisons demonstrate the merits of the proposed model, including robustness, accuracy, efficiency and intelligence.
Article
Medical image segmentation is highly challenging due to intensity inhomogeneity and the similarity between the background and the object. To meet this challenge, we propose an improved active contour model that combines the level set method and the split Bregman method, and we provide two-phase, multi-phase and 3D formulations. The proposed model is presented in a level set framework that includes neighbor-region information for segmenting medical images, with an energy functional containing a data fitting term and a length term. The neighbor region and the local intensity variances in the data fitting term are designed to optimize the minimization process. To minimize the energy functional, we apply the split Bregman method, which yields faster convergence. Besides, we extend our model to a multi-phase segmentation model and a 3D segmentation model for cardiac MR images, both of which achieve good results. Experimental results show that the new model not only has strong robustness to other cardiac tissue effects and image intensity inhomogeneity, but also better facilitates the extraction of effective tissues. As expected, our model achieves higher segmentation accuracy and efficiency for medical image segmentation.
Article
Many RGBT trackers utilize an adaptive weighting mechanism to treat the two modalities differently and obtain more robust feature representations for tracking. Although these trackers work well under certain conditions, they ignore the information interactions in feature learning, which might limit tracking performance. In this paper, we propose a novel cross-modality message passing model to interactively learn robust deep representations of the two modalities for RGBT tracking. Specifically, we extract features of both modalities with a backbone network and take each channel of these features as a node of a graph. All channels of the two modalities can therefore explicitly communicate with each other through graph learning, and the output features are thus more diverse and discriminative. Moreover, we introduce a gate mechanism to control the propagation of the information flow and achieve more intelligent fusion. The features generated by the interactive cross-modality message passing model are passed selectively through the gate layer and concatenated with the original features as the final representation. We extend the ATOM tracker into a dual-modality version and combine it with our proposed module for final tracking. Extensive experiments on two RGBT benchmark datasets validate the effectiveness and efficiency of the proposed algorithm.
Article
Single-object tracking is regarded as a challenging task in computer vision, especially in complex spatio-temporal contexts, where changes in the environment and object deformation make tracking difficult. Over the last ten years, the application of correlation filters and deep learning has greatly enhanced the performance of trackers. This paper surveys single-object tracking algorithms based on correlation filters and deep learning. First, we define single-object tracking and analyze the components of general object tracking algorithms. Second, the single-object tracking algorithms proposed in the past decade are summarized by category. Finally, the paper summarizes the achievements and open problems of existing algorithms by analyzing experimental results, and discusses development trends.
Article
Long-wave infrared (thermal) images distinguish the target from the background according to differences in thermal radiation. They are insensitive to lighting conditions, but cannot present the details obtained from reflected light. By contrast, visible images have high spatial resolution and texture details, but are easily affected by occlusion and lighting conditions. Combining the advantages of the two sources can generate a new image with clear targets and high resolution, which satisfies the requirements of all-weather and all-day/night conditions. Most existing methods cannot fully capture the underlying characteristics of the infrared and visible images, and ignore the complementary information between the sources. In this paper, we propose an end-to-end model (TSFNet) for infrared and visible image fusion that is able to handle both sources simultaneously. In addition, it adopts an adaptive weight allocation strategy to capture informative global features. Experiments on public datasets demonstrate that the proposed fusion method achieves state-of-the-art performance in both global visual quality and quantitative comparison.
Article
Infrared object tracking is a key technology for infrared imaging guidance. Blurred imaging, strong ego-motion and frequent occlusion make it difficult to maintain robust tracking. We observe that features trained on ImageNet are not suitable for aircraft tracking with infrared imagery. In addition, for deep feature-based tracking, the main computational burden comes from the feedforward pass through the pretrained deep network. To this end, we present an airborne infrared target tracking algorithm that employs feature embedding learning and correlation filters to obtain improved performance. We develop a shallow network and a contrastive center loss function to learn a prototypical representation of the aircraft in the embedding space. The feature embedding module is lightweight and is integrated into the efficient convolution operator framework for aircraft tracking. Finally, to demonstrate the effectiveness of our tracking algorithm, we conduct extensive experiments on airborne infrared imagery against benchmark trackers.
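As a rough illustration of the embedding objective described above, the sketch below implements a generic contrastive-center-style loss; the function name, the NumPy formulation and the `delta` constant are assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def contrastive_center_loss(features, labels, centers, delta=1.0):
    # For each embedding, penalize the squared distance to its own class
    # center relative to the distances to all other centers, encouraging
    # compact, well-separated clusters in the embedding space.
    loss = 0.0
    for f, y in zip(features, labels):
        intra = np.sum((f - centers[y]) ** 2)
        inter = sum(np.sum((f - c) ** 2)
                    for k, c in enumerate(centers) if k != y)
        loss += intra / (inter + delta)
    return loss / len(features)
```

Minimizing such a ratio pulls each embedding toward its class prototype while pushing it away from the other prototypes, which is the intuition behind learning a prototypical representation of the aircraft.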
Article
Most of the best-performing methods for visual object tracking are based on RGB videos. With the popularity of depth cameras, research on RGB-D (RGB + depth) tracking has gradually gained extensive attention. The depth map provides additional information for dealing with intractable tracking problems, and how to make full use of depth maps to construct a better tracker is the foremost problem to be settled. The fully-convolutional Siamese network shows excellent performance in 2D tracking, but still cannot achieve satisfying performance in complex scenarios. We therefore propose an RGB-D tracker that integrates a single-scale Siamese network with adaptive bounding boxes, achieving stable tracking performance under challenges such as occlusion, scale variation and background clutter. Our adaptive strategy enables the bounding box to adjust automatically when the target appearance changes during tracking, instead of feeding multi-scale inputs to the Siamese network. We design an effective algorithm to quickly obtain the target depth and construct a 3D local visual field to eliminate interference from the background and similar objects. In addition, a total-occlusion handling approach combining RGB and depth information achieves more reliable occlusion detection and target recovery. Our object tracker, which includes the 3D local visual field, adaptive bounding boxes and occlusion handling strategies, has been evaluated on two widely used RGB-D tracking benchmarks and achieves superior performance, especially in situations of occlusion and pedestrian detection.
Article
Correlation filter (CF) trackers have achieved impressive performance at high frame rates. However, only limited information in the spatial and temporal domains is used in the learning of correlation filters, which might limit tracking performance. To handle this problem, we propose a novel spatio-temporal correlation filter approach for visual tracking that employs both spatial and temporal cues in the learning. In particular, we explore spatial contexts from the background whose contents are easily confused with the target, and integrate them into the correlation filter model for more discriminative learning. Moreover, to capture appearance variations in the temporal domain, we also compute a set of target templates and incorporate them into our model. At the same time, the solution of the proposed spatio-temporal correlation filter is closed-form, so tracking efficiency is guaranteed. Experiments on benchmark datasets demonstrate the effectiveness of the proposed tracker against several CF trackers.
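The closed-form property mentioned above is what keeps CF trackers fast. In the single-channel ridge-regression case (a simplified MOSSE-style baseline, not the paper's full spatio-temporal model), both training and detection reduce to elementwise operations in the Fourier domain:

```python
import numpy as np

def train_filter(x, y, lam=1e-3):
    """Closed-form ridge regression in the Fourier domain.

    x   : training image patch, shape (H, W)
    y   : desired Gaussian-shaped response, shape (H, W)
    lam : regularization weight
    """
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    # Elementwise division replaces a large linear solve: this is why
    # CF trackers run at high frame rates.
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def detect(H, z):
    """Correlate the learned filter with a search patch z; the peak of
    the real-valued response map indicates the target location."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(z)))
```

The paper's model adds background context and temporal templates on top of such a baseline while keeping the solution closed-form.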
Chapter
RGB and thermal source data suffer from both shared and specific challenges, and how to explore and exploit them plays a critical role in representing the target appearance in RGBT tracking. In this paper, we propose a novel challenge-aware neural network to handle the modality-shared challenges (e.g., fast motion, scale variation and occlusion) and the modality-specific ones (e.g., illumination variation and thermal crossover) for RGBT tracking. In particular, we design several parameter-shared branches in each layer to model the target appearance under the modality-shared challenges, and several parameter-independent branches under the modality-specific ones. Based on the observation that the modality-specific cues of different modalities usually contain complementary advantages, we propose a guidance module to transfer discriminative features from one modality to another, which can enhance the discriminative ability of the weaker modality. Moreover, all branches are aggregated together in an adaptive manner and embedded in parallel in the backbone network to efficiently form more discriminative target representations. These challenge-aware branches are able to model the target appearance under certain challenges, so that the target representations can be learnt with few parameters even when training data are insufficient. Experimental results show that our method operates at real-time speed while performing well against state-of-the-art methods on three benchmark datasets.
Article
To overcome the low signal-to-noise ratio and limited information of infrared images, as well as the challenges of fast camera motion and partial occlusion, a robust tracker combining a correlation filter and a particle filter is proposed for infrared targets. First, to exploit the strength of particle-filter-based trackers, an Lp-norm-based low-rank sparse tracker is proposed. Then, a robust tracker is built by complementing the advantages of correlation-filter-based and particle-filter-based trackers, which can not only handle the camera motion challenge but also improve tracking accuracy and robustness. Finally, to address the tracking drift problem and deal with the partial occlusion challenge, an effective template update approach is designed according to the different characteristics of correlation-filter-based and particle-filter-based trackers. Experimental results on the VOT-TIR2015 benchmark demonstrate that the proposed tracker not only outperforms several state-of-the-art trackers in terms of both accuracy and robustness, but also effectively handles challenges such as camera motion, partial occlusion, size change and motion change.
Article
With the development of deep learning, the performance of many computer vision tasks has been greatly improved. For visual tracking, deep learning methods mainly focus on extracting better features or designing end-to-end trackers. However, most existing deep-learning-based trackers are insufficiently discriminative when tracking specific targets, and are time-consuming. In this paper, a cascade-based tracking algorithm is proposed to improve the robustness of the tracker and reduce time consumption. First, we propose a novel deep network for feature extraction. Since pruning strategies are applied, the feature extraction stage can run at more than 50 frames per second. Then, a cascade tracker named DCCT is presented, which improves performance and enhances robustness by utilizing both texture and semantic features. Similar to a cascade classifier, the proposed DCCT tracker consists of several weak trackers. Each weak tracker rejects some false candidates of the tracked object, and the final tracking results are obtained by synthesizing the weak trackers. Extensive experiments on public datasets demonstrate the effectiveness of the proposed framework.
Article
Discriminative correlation filters (DCFs) have been widely used in the visual tracking community in recent years. DCF-based trackers determine the target location through a response map generated by the correlation filters and determine the target scale by a fixed scale factor. However, the response map is vulnerable to noise interference and a fixed scale factor cannot reflect the real scale change of the target, both of which can noticeably reduce tracking performance. In this paper, to overcome these drawbacks, we propose to learn a metric learning model in the correlation filter framework for visual tracking (called CFML). This model uses a metric learning function to solve the target scale problem. In particular, we adopt a hard negative mining strategy to alleviate the influence of noise on the response map, which effectively improves tracking accuracy. Extensive experimental results demonstrate that the proposed CFML tracker achieves competitive performance compared with state-of-the-art trackers.
Article
Convolutional neural networks (CNNs) have been demonstrated to achieve state-of-the-art performance in visual object tracking. However, existing CNN-based trackers usually use holistic target samples to train their networks; once the target undergoes complicated situations (e.g., occlusion, background clutter, and deformation), the tracking performance degrades badly. In this paper, we propose an adaptive structural convolutional filter model to enhance the robustness of deep regression trackers (named ASCT). Specifically, we first design a mask set to generate local filters that capture local structures of the target. Meanwhile, we adopt an adaptive weighted fusion strategy for these local filters to adapt to changes in the target appearance, which effectively enhances the robustness of the tracker. Besides, we develop an end-to-end trainable network comprising feature extraction, decision making, and model updating modules for effective training. Extensive experimental results on large benchmark datasets demonstrate that the proposed ASCT tracker performs favorably against state-of-the-art trackers.
Article
Discriminative correlation filters (DCFs) have recently been widely used in the tracking community. DCF-based trackers train a ridge regression model with samples generated by circularly shifting an image patch, and estimate the target location using a response map generated by the correlation filters. However, the generated samples produce some negative effects, and the response map is vulnerable to noise interference, which degrades tracking performance. In this paper, to overcome these drawbacks, we propose a target-focusing convolutional regression (CR) model for visual object tracking (called TFCR). This model uses a target-focusing loss function to alleviate the influence of background noise on the response map of the current frame, which effectively improves tracking accuracy. In particular, it can effectively balance the disequilibrium of positive and negative samples by reducing some effects of the negative samples on the object appearance model. Extensive experimental results illustrate that our TFCR tracker achieves competitive performance compared with state-of-the-art trackers.
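One plausible reading of a "target-focusing" regression loss is a spatially weighted squared error that emphasizes the neighborhood of the labelled peak and down-weights the background. The sketch below illustrates that idea only; the Gaussian weighting, the function name and `sigma` are assumptions, not TFCR's actual formulation:

```python
import numpy as np

def target_focusing_loss(response, label, sigma=2.0):
    # Weight the squared regression error by a Gaussian centered on the
    # labelled target peak, so background errors contribute little and
    # positive/negative sample imbalance is reduced.
    cy, cx = np.unravel_index(np.argmax(label), label.shape)
    yy, xx = np.mgrid[0:label.shape[0], 0:label.shape[1]]
    w = np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))
    return np.sum(w * (response - label) ** 2)
```

Under such a weighting, an error of the same magnitude costs far more near the target than far from it, which is the qualitative behavior the abstract describes.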
Article
Thermal infrared (TIR) object tracking is one of the most challenging tasks in computer vision. This paper proposes a robust TIR tracker based on continuous correlation filters and adaptive feature fusion (RCCF-TIR). First, the Efficient Convolution Operators (ECO) framework is selected to build the new tracker. Second, a feature set optimized for TIR tracking is adopted in the framework. Finally, a new feature fusion strategy based on the average peak-to-correlation energy (APCE) is employed. Experiments on the VOT-TIR2016 (Visual Object Tracking TIR 2016) and PTB-TIR (Thermal Infrared Pedestrian Tracking Benchmark) datasets indicate that the proposed RCCF-TIR tracker combines good accuracy and robustness, performs better than state-of-the-art trackers, and is able to handle various challenges.
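The APCE criterion used for fusion above is commonly defined as the squared peak-to-minimum gap of the response map divided by its mean squared fluctuation; a sharp, clean peak scores high, a noisy or multi-modal map scores low. A minimal implementation:

```python
import numpy as np

def apce(response):
    # Average peak-to-correlation energy of a CF response map.
    # High APCE -> a single sharp peak (reliable detection);
    # low APCE  -> a flat or multi-modal map (unreliable), so APCE can
    # be used to weight features or gate model updates during fusion.
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / np.mean((response - f_min) ** 2)
```

In an adaptive fusion scheme, each feature's response map can be scored this way and the per-feature weights set in proportion to the scores.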
Article
This paper studies how to perform RGB-T object tracking in the correlation filter framework. Given the input RGB and thermal videos, we utilize a correlation filter for each modality due to its high performance in both accuracy and speed. To exploit the interdependency between the RGB and thermal modalities, we introduce a low-rank constraint to learn the filters collaboratively, based on the observation that the features of different modalities should have similar filters so that they localize the target object consistently. For optimization, we design an efficient ADMM (Alternating Direction Method of Multipliers) algorithm to solve the proposed model. Experimental results on benchmark datasets (i.e., GTOT, RGBT210 and OSU-CT) suggest that the proposed approach performs favorably in both accuracy and efficiency against state-of-the-art RGB-T methods.
Article
The use of both off-the-shelf and end-to-end trained deep networks has significantly improved the performance of visual tracking on RGB videos. However, the lack of large labeled datasets hampers the use of convolutional neural networks for tracking in thermal infrared (TIR) images. Therefore, most state-of-the-art methods for TIR tracking are still based on hand-crafted features. To address this problem, we propose to use image-to-image translation models, which allow us to translate the abundantly available labeled RGB data into synthetic TIR data. We explore both paired and unpaired image translation models for this purpose. These methods provide us with a large labeled dataset of synthetic TIR sequences, on which we can train end-to-end optimal features for tracking. To the best of our knowledge, we are the first to train end-to-end features for TIR tracking. We perform extensive experiments on the VOT-TIR2017 dataset and show that a network trained on a large dataset of synthetic TIR data obtains better performance than one trained on the available real TIR data; combining both data sources leads to further improvement. In addition, when we combine the network with motion features, we outperform the state of the art with a relative gain of over 10%, clearly showing the efficiency of using synthetic data to train end-to-end TIR trackers.
Article
Most thermal infrared (TIR) tracking methods are discriminative, treating the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation): the classification task focuses on the between-class difference of arbitrary objects, while the tracking task mainly deals with the within-class difference of the same object. In this paper, we cast the TIR tracking problem as a similarity verification task, which is well coupled to the objective of the tracking task. We propose a TIR tracker based on a hierarchical Siamese convolutional neural network (CNN), named HSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN that coalesces multiple hierarchical convolutional layers. We train this network end to end on a large visible-video detection dataset to learn the similarity between paired objects before transferring the network to the TIR domain. The pre-trained Siamese network is then used to evaluate the similarity between the target template and the target candidates, and the most similar candidate is located as the tracked target. Extensive experimental results on the VOT-TIR2015 and VOT-TIR2016 benchmarks show that our proposed method achieves favorable performance against state-of-the-art methods.
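The final matching step described above amounts to cross-correlating the template's feature map with the search region's feature map and taking the highest-scoring location. The naive sketch below (plain NumPy, a single feature channel, illustrative only, not HSNet's learned features) shows the idea:

```python
import numpy as np

def xcorr_score_map(template, search):
    # Slide the template feature map over the search feature map and take
    # the inner product at every offset (SiamFC-style similarity map).
    th, tw = template.shape
    sh, sw = search.shape
    scores = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(scores.shape[0]):
        for j in range(scores.shape[1]):
            scores[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return scores
```

In a real Siamese tracker both inputs are multi-channel embeddings from the shared CNN, and the argmax of the score map gives the predicted target location.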
Article
Recently, deep learning has achieved great success in visual tracking. The goal of this paper is to review the state-of-the-art tracking methods based on deep learning. First, we introduce the background of deep visual tracking, including the fundamental concepts of visual tracking and related deep learning algorithms. Second, we categorize existing deep-learning-based trackers into three classes according to network structure, network function and network training, and analyze the papers in each category from the network perspective. Then, we conduct extensive experiments to compare representative methods on the popular OTB-100, TC-128 and VOT2015 benchmarks. Based on our observations, we conclude that: (1) using a convolutional neural network (CNN) model can significantly improve tracking performance; (2) trackers that use a CNN model to distinguish the tracked object from its surrounding background obtain more accurate results, while those that use the CNN model for template matching are usually faster; (3) trackers with deep features perform much better than those with low-level hand-crafted features; (4) deep features from different convolutional layers have different characteristics, and their effective combination usually results in a more robust tracker; (5) deep visual trackers using end-to-end networks usually perform better than trackers that merely use feature extraction networks; (6) for visual tracking, the most suitable network training method is to pre-train networks with video information and fine-tune them online with subsequent observations. Finally, we summarize our findings, highlight our insights, and point out future trends for deep visual tracking.
Conference Paper
How to effectively learn temporal variations of the target appearance and exclude interference from cluttered backgrounds, while maintaining real-time response, is an essential problem in visual object tracking. Recently, Siamese networks have shown the great potential of matching-based trackers for achieving balanced accuracy and beyond-real-time speed. However, they still lag far behind classification-and-update-based trackers in tolerating temporal changes of objects and imaging conditions. In this paper, we propose a dynamic Siamese network with a fast transformation learning model that enables effective online learning of target appearance variation and background suppression from previous frames. We then present elementwise multi-layer fusion to adaptively integrate the network outputs using multi-level deep features. Unlike state-of-the-art trackers, our approach allows the use of any feasible generally or particularly trained features, such as SiamFC and VGG. More importantly, the proposed dynamic Siamese network can be jointly trained as a whole directly on labeled video sequences, and can thus take full advantage of the rich spatio-temporal information of moving objects. As a result, our approach achieves state-of-the-art performance on the OTB-2013 and VOT-2015 benchmarks, while exhibiting superiorly balanced accuracy and real-time response compared with state-of-the-art competitors.
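The multi-layer fusion idea can be illustrated with a much-simplified stand-in: per-layer response maps combined elementwise under normalized weights. The real model learns this combination online; the fixed weight vector here is an assumption for illustration:

```python
import numpy as np

def fuse_responses(responses, weights):
    # Elementwise weighted fusion of per-layer response maps: normalize
    # the weights, then sum the maps so that layers judged more reliable
    # contribute more to the final localization map.
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return sum(wi * r for wi, r in zip(w, responses))
```

Deeper layers typically contribute semantic robustness and shallower layers spatial precision, which is why fusing multi-level features tends to outperform any single layer.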
Article
Designing an efficient tracker for thermal infrared imagery is one of the most challenging tasks in computer vision. Although much progress has been achieved on RGB videos over the decades, the textureless and colorless properties of objects in thermal imagery pose hard constraints on the design of an efficient tracker. Tracking an object using a single feature or technique often fails to achieve high accuracy. Here, we propose an effective method to track an object in infrared imagery based on a combination of discriminative and generative approaches. The discriminative technique makes use of two complementary methods, a kernelized correlation filter with spatial features and an AdaBoost classifier with pixel intensity features, operating in parallel. After obtaining optimized locations through the discriminative approaches, the generative technique is applied to determine the best target location using a linear search method. Unlike the baseline algorithms, the proposed method estimates the scale of the target by Lucas-Kanade homography estimation. To evaluate the proposed method, extensive experiments are conducted on 17 challenging infrared image sequences from the LTIR dataset, and a significant improvement in mean distance precision and mean overlap precision is accomplished compared with existing trackers. Further, a quantitative and qualitative assessment of the proposed approach against state-of-the-art trackers clearly demonstrates an overall increase in performance.
Article
Correlation filters (CFs) have recently demonstrated excellent performance in rapidly tracking objects under challenging photometric and geometric variations. The strength of the approach comes from its ability to efficiently learn, "on the fly", how the object changes over time. A fundamental drawback of CFs, however, is that the background of the object is not modelled over time, which can lead to suboptimal results. In this paper we propose a Background-Aware CF that can model how both the foreground and the background of the object vary over time. Our approach, like conventional CFs, is extremely computationally efficient, and extensive experiments on multiple tracking benchmarks demonstrate the superior accuracy and real-time performance of our method compared with state-of-the-art trackers, including those based on deep learning.