Article

Structural target-aware model for thermal infrared tracking


Abstract

Thermal InfraRed (TIR) target trackers are easily disturbed by similar objects and are susceptible to target occlusion. To address these problems, we propose a structural target-aware model (STAMT) for thermal infrared target tracking. Specifically, the proposed STAMT tracker learns a target-aware model that places more attention on the target area so as to accurately distinguish the target from similar objects. In addition, considering that the target may be partially occluded during tracking, a structural weight model is proposed to locate the target through its unoccluded, reliable parts. Ablation studies show the effectiveness of each component of the proposed tracker. Without bells and whistles, the experimental results demonstrate that our STAMT tracker performs favorably against state-of-the-art trackers on the PTB-TIR and LSOTB-TIR datasets.
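
Neither the target-aware model nor the structural weight model is specified in detail in this abstract; the following is only a minimal sketch of the structural-weighting idea, assuming a correlation-filter-style response map per target part and a fixed part grid (the function names, the reliability measure, and the offsets are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def part_reliability(response, eps=1e-8):
    """Peak-to-mean ratio of a part's response map, used as a crude occlusion cue."""
    return response.max() / (response.mean() + eps)

def structural_localization(part_responses, part_offsets):
    """
    Locate the target from per-part response maps.

    part_responses : list of 2-D arrays, one response map per target part
    part_offsets   : list of (dy, dx) offsets from each part centre to the
                     target centre (illustrative fixed grid)
    Returns the estimated target centre (row, col).
    """
    weights, centres = [], []
    for resp, (dy, dx) in zip(part_responses, part_offsets):
        w = part_reliability(resp)
        py, px = np.unravel_index(np.argmax(resp), resp.shape)
        weights.append(w)
        centres.append((py + dy, px + dx))
    weights = np.asarray(weights)
    weights = weights / weights.sum()          # down-weight unreliable (occluded) parts
    centres = np.asarray(centres, dtype=float)
    return tuple(weights @ centres)            # weighted vote of the reliable parts
```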

... In [20], Algabri and Choi propose a robust recognizer combined with deep learning techniques to adapt to changes in scene ambient lighting, and employ an enhanced online update strategy for the character recognition model to mitigate drift caused by target appearance changes during tracking. Inspired by the success of deep learning-based RGB target tracking methods, some TIR target tracking methods try to improve their performance by leveraging an off-the-shelf network to extract target features [21][22][23]. The STAMT [21] tracking method leverages a pre-trained convolutional feature extraction network to extract deep features for a better target representation. ...
... Inspired by the success of deep learning-based RGB target tracking methods, some TIR target tracking methods try to improve their performance by leveraging an off-the-shelf network to extract target features [21][22][23]. The STAMT [21] tracking method leverages a pre-trained convolutional feature extraction network to extract deep features for a better target representation. The HSSNet [22] tracker adopts the Siamese architecture [24][25][26][27], which views the tracking task as similarity template matching, and uses a Siamese convolutional neural network with multiple convolutional layers to obtain the semantic and spatial features of the tracked TIR target. ...
... The HSSNet [22] tracker adopts the Siamese architecture [24][25][26][27], which views the tracking task as similarity template matching, and uses a Siamese convolutional neural network with multiple convolutional layers to obtain the semantic and spatial features of the tracked TIR target. These TIR target trackers utilize off-the-shelf networks for feature extraction and achieve some performance improvements; however, their tracking performance is still significantly limited because these networks are not trained for the TIR target tracking task [21,28,29]. To make the feature extraction network adapt to the TIR tracking task, it is necessary to train its convolutional feature extraction network on TIR training samples [30][31][32]. ...
Article
Full-text available
The limited availability of thermal infrared (TIR) training samples leads to suboptimal target representation by convolutional feature extraction networks, which adversely impacts the accuracy of TIR target tracking methods. To address this issue, we propose an unsupervised cross-domain model (UCDT) for TIR tracking. Our approach leverages labeled training samples from the RGB domain (source domain) to train a general feature extraction network. We then employ a cross-domain model to adapt this network for effective target feature extraction in the TIR domain (target domain). This cross-domain strategy addresses the challenge of limited TIR training samples effectively. Additionally, we utilize an unsupervised learning technique to generate pseudo-labels for unlabeled training samples in the source domain, which helps overcome the limitations imposed by the scarcity of annotated training data. Extensive experiments demonstrate that our UCDT tracking method outperforms existing tracking approaches on the PTB-TIR and LSOTB-TIR benchmarks.
... B. Quantitative Analysis 1) Experiments on PTB-TIR Benchmark [45]: Fig. 6 shows the experimental results of our Siamese network with a hierarchical attention mechanism (SiamHAN) tracker and other state-of-the-art tracking methods, including MDNet [57], SiamRPN++ [16], AMFT [53], SRDCF [51], DSiam [55], TADT [47], MMNet [35], VITAL [58], MCCT [49], GFSDCF [54], UDT [56], CREST [52], MLSSNet [48], MCFTS [32], SiamFC [14], HSSNet [18], SiamTri [34], CFNet [36], and STAMT [46], on the PTB-TIR [45] benchmark. Compared to these deep DCF-based tracking methods [46], [52], [54], our SiamHAN tracking method uses a hierarchical attention Siamese network for feature extraction, which improves tracking accuracy. ...
... B. Quantitative Analysis 1) Experiments on PTB-TIR Benchmark [45]: Fig. 6 shows the experimental results of our Siamese network with a hierarchical attention mechanism (SiamHAN) tracker and other state-of-the-art tracking methods, including MDNet [57], SiamRPN++ [16], AMFT [53], SRDCF [51], DSiam [55], TADT [47], MMNet [35], VITAL [58], MCCT [49], GFSDCF [54], UDT [56], CREST [52], MLSSNet [48], MCFTS [32], SiamFC [14], HSSNet [18], SiamTri [34], CFNet [36], and STAMT [46], on the PTB-TIR [45] benchmark. Compared to these deep DCF-based tracking methods [46], [52], [54], our SiamHAN tracking method uses a hierarchical attention Siamese network for feature extraction, which improves tracking accuracy. Compared to AMFT [53], our SiamHAN improves precision by 2.5% and success by 4.6%. ...
... 2) Experiments on LSOTB-TIR Benchmark [44]: To further evaluate our SiamHAN tracker, we compare it with other trackers, including GFSDCF [54], AMFT [53], ATOM [3], DSiam [55], STAMT [46], SiamRPN++ [16], TADT [47], SiamMask [30], SiamFC [14], SiamTri [34], BACF [50], SRDCF [51], MCCT [49], UDT [56], CREST [52], MLSSNet [48], CFNet [36], HSS-Net [18], and MCFTS [32], on the LSOTB-TIR [44] benchmark. Fig. 8 shows the comparison results of these tracking methods. ...
Article
Full-text available
Thermal infrared (TIR) target tracking is an important topic in the computer vision area. TIR images are not affected by ambient light and have strong environmental adaptability, making them widely used in battlefield perception, video surveillance, assisted driving, etc. However, TIR target tracking faces problems such as relatively insufficient information and a lack of target texture information, which significantly affect the tracking accuracy of TIR tracking methods. To solve these problems, we propose a TIR target tracking method based on a Siamese network with a hierarchical attention mechanism (called SiamHAN). Specifically, the CIoU loss is introduced to make full use of the regression box information and compute the loss function more accurately. The GCNet attention mechanism is introduced to reconstruct the feature extraction structure so as to capture the fine-grained information of thermal infrared images. Meanwhile, the ECANet attention mechanism is used for hierarchical fusion of the features from the backbone of the Siamese network, so that the feature information of the multi-layer backbone network can be fully utilized to represent the target. On LSOTB-TIR, the hierarchical attention Siamese network achieves a 2.9% increase in success rate and a 4.3% increase in precision relative to the baseline tracker. Experiments show that the proposed SiamHAN method achieves competitive tracking results on the thermal infrared testing datasets.
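
The CIoU loss referred to above is a published, standard bounding-box regression loss; a compact PyTorch-style sketch of it (the (x1, y1, x2, y2) box format and variable names are assumptions, not taken from the SiamHAN code) is:

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) boxes in (x1, y1, x2, y2) format; returns mean 1 - CIoU."""
    # intersection area
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)

    # union and IoU
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # squared distance between box centres
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (cxp - cxt) ** 2 + (cyp - cyt) ** 2

    # squared diagonal of the smallest enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # aspect-ratio consistency term
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)

    return (1 - (iou - rho2 / c2 - alpha * v)).mean()
```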
... This task can be described as follows: given the initial state of the target in the first frame of the tracking sequence, the tracking method needs to predict the target state in the other frames [16], [17], [18], [19], [20]. Many machine learning methods are widely used in TIR tracking tasks, such as mean-shift [21], [22], [23], sparse representation [24], [25], [26], particle filters [27], [28], multi-modal methods [29], [30], [31], [32], [33], [34], correlation filters [35], [36], [37], [38], Siamese networks [39], [40], [41], [42], [43], convolutional neural networks [44], [45], [46], [47], and so on. TIR trackers based on the traditional methods (such as mean-shift, sparse representation, and particle filters) often have limited tracking performance due to their simple tracking models and use of manually designed features. ...
... The former can capture the local information of the TIR target, and the latter can capture the global information of the TIR target, which enhances the discriminative ability of the network in dealing with interference. Yuan et al. [47] found that the performance of TIR trackers declines in the case of interference and target occlusion. They propose a structural target-aware model (STAMT), which pays more attention to the target region so as to accurately identify the target among similar objects. ...
... 3) Results on VOT-TIR17 Dataset: the compared methods fall into 1) correlation filter-based tracking methods (such as [72], Staple-TIR [165], MCFTS [121], HDT [156], and so on); 2) Siamese network-based tracking methods (such as MMNet [138], GFSNet [139], HSSNet [135], SiamSAV [140], CFNet [136], MLSSNet [137], and so on); 3) transformer-based tracking methods (such as TransT [75] and DFG [147]); and 4) tracking methods based on other deep learning models (STAMT [47], ATOM [148], DiMP [58], TANT [162], VITAL [163], Ocean [164], and so on). The top three in the EAO ranking are ECO-TIR [152], UDCT [141], and DiMP [58]; ECO-TIR [152] belongs to the correlation filter methods, while UDCT [141] and DiMP [58] belong to the deep learning methods, which indicates that convolutional neural networks have strong feature representation capabilities and that the corresponding trackers achieve excellent average performance across the video sequences. ...
Article
Full-text available
The thermal infrared (TIR) target tracking task is not affected by illumination changes and can be performed at night and in rainy, foggy, and other extreme weather, so it is widely used in assisted driving, unmanned aerial vehicle reconnaissance, video surveillance, and other scenarios. However, the TIR target tracking task also presents some challenges, such as intensity change, occlusion, deformation, and similarity interference. These challenges significantly affect the performance of TIR target tracking methods. To address these challenges in TIR target tracking scenarios, numerous tracking methods have appeared in recent years. The purpose of this paper is to give a comprehensive review and summary of the research status of TIR target tracking methods. We first classify the thermal infrared target tracking methods according to their frameworks and briefly summarize the advantages and disadvantages of the different tracking methods, which helps in understanding the current research progress of TIR target tracking. Next, the public datasets/benchmarks for performance testing of TIR target tracking methods are introduced. Subsequently, we present the tracking results of several representative tracking methods on some datasets/benchmarks to show more intuitively the progress made in current research on TIR target tracking. Finally, we discuss future research directions in an attempt to promote the better development of the TIR target tracking task.
... This task can be described as follows: given the initial state of the target in the first frame of the tracking sequence, the tracking method needs to predict the target state in the other frames [10]-[14]. Many machine learning methods are widely used in TIR tracking tasks, such as mean-shift [15]-[17], sparse representation [18]-[20], particle filters [21], [22], multi-modal methods [23]-[28], correlation filters [29]-[32], Siamese networks [33]-[37], convolutional neural networks [38]-[41], etc. The most popular thermal infrared target tracking frameworks in the last decade are the correlation filter-based framework and the Siamese network-based framework. ...
... The TCNN [39] tracker proposes a multi-model fully convolutional neural network to select the optimal target tracking result depending on a comprehensive score of appearance similarity, predicted position, and model reliability. The STAMT [41] tracker obtains accurate TIR tracking results by locating the target with a structural weight model. To effectively distinguish the interference factors in thermal infrared target tracking, Liu et al. ...
... MMNet [72], GFSNet [73], HSSNet [86] and SiamSAV [74]), and iii) other deep learning-based tracking methods (e.g. STAMT [41] and CMD-DiMP [82]). VOT-TIR15 [52] and VOT-TIR17 [54] mainly adopt three evaluation indexes: robustness, accuracy, and EAO. ...
Conference Paper
Full-text available
The thermal infrared (TIR) target tracking task is not affected by illumination changes and can be performed at night and in rainy, foggy, and other extreme weather, so it is widely used in night-time assisted driving, unmanned aerial vehicle reconnaissance, video surveillance, and other scenarios. The thermal infrared target tracking task still faces many challenges, such as occlusion, deformation, and similarity interference. To address these challenges in TIR target tracking scenarios, a large number of TIR target tracking methods have appeared in recent years. The purpose of this paper is to give a comprehensive review and summary of the research status of thermal infrared target tracking methods. We first introduce some basic principles and representative work of thermal infrared target tracking methods. Then, some benchmarks for performance testing of thermal infrared target tracking methods are introduced. Subsequently, we present the tracking results of several representative tracking methods on some benchmarks. Finally, future research directions of thermal infrared target tracking are discussed.
... C. State-of-the-art comparison PTB-TIR [33]: This is a thermal infrared pedestrian tracking benchmark containing 60 testing sequences. We first present the comparative experimental results of our ASTMT and the MCFTS [15], HSSNet [16], SRDCF [35], MMNet [36], STAMT [37], TADT [38], MLSSNet [39], CREST [40], UDT [41], and SiamTri [42] trackers on the PTB-TIR [33] benchmark in Fig. 4. From the experimental results, we can see that our ASTMT tracker obtains the best score on the success plots. Although our ASTMT tracker scores lower than the SRDCF [35], MMNet [36], and STAMT [37] trackers on the precision plots, it outperforms these trackers in success. ...
... We first present the comparative experimental results of our ASTMT and the MCFTS [15], HSSNet [16], SRDCF [35], MMNet [36], STAMT [37], TADT [38], MLSSNet [39], CREST [40], UDT [41], and SiamTri [42] trackers on the PTB-TIR [33] benchmark in Fig. 4. From the experimental results, we can see that our ASTMT tracker obtains the best score on the success plots. Although our ASTMT tracker scores lower than the SRDCF [35], MMNet [36], and STAMT [37] trackers on the precision plots, it outperforms these trackers in success. ...
... LSOTB-TIR [34]: As a commonly used TIR tracking test benchmark, LSOTB-TIR [34] contains 120 testing video sequences. Fig. 5 illustrates the experimental comparison of our ASTMT and the MCFTS [15], HSSNet [16], SRDCF [35], STAMT [37], TADT [38], MLSSNet [39], CREST [40], UDT [41], SiamTri [42], ATOM [43], SiamRPN++ [44], and SiamMask [45] trackers on this LSOTB-TIR benchmark. Fig. 5 shows that our tracker obtains the highest scores on the normalized precision and success plots. ...
Article
Thermal infrared (TIR) target tracking is susceptible to occlusion and similarity interference, which obviously affects the tracking results. To resolve this problem, we develop an Aligned Spatial-Temporal Memory network-based Tracking method (ASTMT) for the TIR target tracking task. Specifically, we model the scene information in the TIR target tracking scenario using a spatial-temporal memory network, which can effectively store the scene information and decrease similarity interference, which is beneficial to target estimation. In addition, we use an aligned matching module to correct the parameters of the spatial-temporal memory network model, which can effectively alleviate the impact of occlusion on target estimation, further boosting tracking accuracy. Through ablation experiments, we demonstrate that the spatial-temporal memory network and the aligned matching module in the proposed ASTMT tracker are highly effective. Our ASTMT tracking method performs well on the PTB-TIR and LSOTB-TIR benchmarks compared with other tracking methods.
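
The abstract describes the spatial-temporal memory only at a high level; as a rough, hypothetical illustration of how an attention-style memory read over stored frames is commonly implemented (tensor shapes and names are assumptions, not the ASTMT architecture):

```python
import torch
import torch.nn.functional as F

def memory_read(query, mem_keys, mem_values):
    """
    Attention-style read from a spatial-temporal memory.

    query      : (C, H, W)      features of the current search region
    mem_keys   : (T, C, H, W)   keys of T stored frames
    mem_values : (T, Cv, H, W)  values of the same frames
    Returns a (Cv, H, W) tensor of retrieved scene context.
    """
    C = query.shape[0]
    q = query.flatten(1)                                              # (C, HW)
    k = mem_keys.flatten(2).permute(1, 0, 2).reshape(C, -1)           # (C, T*HW)
    v = mem_values.flatten(2).permute(1, 0, 2).reshape(mem_values.shape[1], -1)  # (Cv, T*HW)

    attn = F.softmax(k.t() @ q / C ** 0.5, dim=0)                     # (T*HW, HW)
    read = v @ attn                                                   # (Cv, HW)
    return read.view(-1, *query.shape[1:])
```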
... Deep trackers can capture the general characteristics of an object across different video frames and are robust to external interference such as illumination changes, occlusion, and motion blur. One of the most popular types of deep trackers is the Siamese-network-based tracker [18][19][20][21][22]. Recently, Liu et al. [19] proposed a video prediction network to update the TIR pedestrian tracking template of SiamRPN [18]. ...
... Zheng et al. [20] proposed a real-time TIR pedestrian tracker that combines a CNN-based prediction model with SiamRPN [18] to improve tracking performance. Though these Siamese-network-based trackers [18][19][20][21][22] achieve better tracking performance than traditional ones, they are trained end-to-end and thus consume a large amount of computing resources. To solve such problems, Correlation Filters (CFs) have been introduced into deep trackers. ...
Article
Thermal infrared pedestrian tracking is a challenging task due to factors such as energy attenuation, sensor noise, occlusion, and complex backgrounds. In this paper, we design a two-level cascade model that tracks pedestrians in a thermal infrared video by the coarse-to-fine strategy to improve the tracking accuracy and success rate. The base tracker in the first level of our model is initialized and fine-tuned to get the first representation of a target which is then used to locate the target roughly. Aiming at finely locating a target, the second level consists of modality-specific part correlation filters that can capture patterns of thermal infrared pedestrians. The outputs of part correlation filters are aggregated together by normalized joint confidence that can effectively suppress low confidence predictions to make a final decision. We adaptively update each part filter by a weighted learning rate and accurately estimate pedestrian scale by a scale filter to improve tracking performance. The experimental results on the PTB-TIR benchmark show that the proposed cascade tracker further emphasizes crucial thermal infrared features. Thus it can effectively relieve the problem of object occlusion. Our experimental results show the superiority of the proposed tracker over the state-of-the-art trackers, including SRDCF, GFS-DCF, MCFTS, HDT, HCF, MLSSNet, HSSNet, SiamFC_tir, SVM, and L1APG.
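
As a hedged sketch of the two ideas named above, aggregating part-filter responses with normalized joint confidence and updating each part with a confidence-weighted learning rate, one might write something like the following (the confidence measure and update rule are illustrative, not the paper's exact formulation):

```python
import numpy as np

def aggregate_part_responses(responses, eps=1e-8):
    """
    Fuse per-part correlation-filter response maps with normalized confidences
    so that low-confidence (e.g. occluded) parts are suppressed.
    responses : list of 2-D response maps, all the same shape.
    """
    conf = np.array([r.max() for r in responses])   # raw peak confidences
    conf = conf / (conf.sum() + eps)                # normalized joint confidence
    fused = sum(w * r for w, r in zip(conf, responses))
    return fused, conf

def update_part_filter(old_filter, new_filter, confidence, base_lr=0.02):
    """Linear model update with a confidence-weighted learning rate."""
    lr = base_lr * confidence                       # low-confidence parts update more slowly
    return (1 - lr) * old_filter + lr * new_filter

# the target is then located at the peak of the fused map:
# y, x = np.unravel_index(np.argmax(fused), fused.shape)
```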
... ASTMT [43] constructs a novel spatial-temporal memory architecture to store scene information and decrease similarity interference in blurred scenes. STAMT [44] introduces a target-aware network and improves the discriminative ability of trackers under motion blur. Nevertheless, both RGB trackers and TIR trackers have limitations when it comes to handling motion blur in isolation [25,26]. ...
Article
Full-text available
RGBT tracking combines visible and thermal infrared images to achieve tracking and faces challenges due to motion blur caused by camera and target movement. In this study, we observe that tracking under motion blur is significantly affected by both frequency and spatial aspects. Blurred targets still exhibit sharp texture details that are represented as high-frequency information, but existing trackers capture low-frequency components while ignoring high-frequency information. To enhance the representation of sharp information in blurred scenes, we introduce multi-frequency and multi-spatial information into the network, called FSBNet. First, we construct a modality-specific unsymmetrical architecture and integrate an adaptive soft threshold mechanism into a DCT-based multi-frequency channel attention adapter (DFDA). DFDA adaptively integrates rich multi-frequency information. Second, we propose a masked frequency-based translation adapter (MFTA) to refine drifting failure boxes caused by camera motion. Moreover, we find that small targets are more affected by motion blur than larger targets, and we mitigate this issue by designing a cross-scale mutual conversion adapter (CFCA) between the frequency and spatial domains. Extensive experiments on the GTOT, RGBT234 and LasHeR benchmarks demonstrate the promising performance of our method in the presence of motion blur.
... It has the advantages of strong anti-interference ability and the capability to distinguish between targets and backgrounds. Therefore, super-resolution reconstruction (SR) has been widely applied in fields such as remote sensing imaging [2][3][4], target tracking [5][6][7], and autonomous driving [8,9], etc. However, compared with visible light imaging, infrared imaging equipment usually has limited spatial resolution, resulting in lower imaging quality. ...
Article
Full-text available
When traditional super-resolution reconstruction methods are applied to infrared thermal images, they often ignore the poor image quality caused by the imaging mechanism, which makes it difficult to obtain high-quality reconstruction results even when training on simulated inverse degradation processes. To address these issues, we propose a thermal infrared image super-resolution reconstruction method based on multimodal sensor fusion, which aims to enhance the resolution of thermal infrared images and rely on multimodal sensor information to reconstruct high-frequency details in the images, thereby overcoming the limitations of the imaging mechanism. First, we designed a novel super-resolution reconstruction network consisting of primary feature encoding, super-resolution reconstruction, and high-frequency detail fusion subnetworks. We designed hierarchical dilated distillation modules and a cross-attention transformation module to extract and transmit image features, enhancing the network's ability to express complex patterns. Then, we proposed a hybrid loss function to guide the network in extracting salient features from thermal infrared images and reference images while maintaining accurate thermal information. Finally, we proposed a learning strategy to ensure high-quality super-resolution reconstruction performance even in the absence of reference images. Extensive experimental results show that the proposed method exhibits superior reconstruction quality compared with other methods, demonstrating its effectiveness.
... Further tailoring of the acquisition and sensor degradation models supports online calibration using a range of input variables (sensor temperature, ambient temperature, degradation indicators and others), i.e., it enables robust shutter-less cameras. The proposed modeling approach can be combined with modern AI-based methods [23], e.g., object tracking in infrared images [24], and can improve quality assurance and inspection applications such as infrared thermography [25], drones for fire detection and prevention [26], material defect detection [27] and even wind turbine erosion detection [28], while incorporating physical models of different materials. ...
Article
Full-text available
A key challenge in further improving infrared (IR) sensor capabilities is the development of efficient data pre-processing algorithms. This paper addresses this challenge by providing a mathematical model and synthetic data generation framework for an uncooled IR sensor. The developed model is capable of generating synthetic data for the design of data pre-processing algorithms of uncooled IR sensors. The mathematical model accounts for the physical characteristics of the focal plane array, bolometer readout, optics and the environment. The framework permits the sensor simulation with a range of sensor configurations, pixel defectiveness, non-uniformity and noise parameters.
... Dong et al. [45] introduce a triplet loss to extract expressive deep features for visual tracking by adding it into the Siamese network framework instead of the pairwise loss used for model training. In [48], a structural target-aware model is proposed to improve target tracking performance in TIR scenarios. ...
Article
Full-text available
When dealing with complex thermal infrared (TIR) tracking scenarios, a single category of features is not sufficient to portray the appearance of the target, which drastically affects the accuracy of TIR target tracking methods. To address these problems, we propose an adaptive multi-feature fusion model (AMFT) for the TIR tracking task. Specifically, our AMFT tracking method adaptively integrates hand-crafted features and deep convolutional neural network (CNN) features, taking advantage of their complementarity to accurately locate the target. Additionally, the model is updated using a simple but effective model update strategy to adapt to changes in the target during tracking. We show through ablation studies that the adaptive multi-feature fusion model in our AMFT tracking method is very effective. Our AMFT tracker performs favorably on the PTB-TIR and LSOTB-TIR benchmarks compared with state-of-the-art trackers.
Article
The Infrared Object Tracking (IOT) task aims to locate objects in infrared sequences. Since color and texture information is unavailable in infrared modality, most existing infrared trackers merely rely on capturing spatial contexts from the image to enhance feature representation, where other complementary information is rarely deployed. To fill this gap, we in this paper propose a novel Asymmetric Deformable Spatio-Temporal Framework (ADSF) to fully exploit collaborative shape and temporal clues in terms of the objects. Firstly, an asymmetric deformable cross-attention module is designed to extract shape information, which attends to the deformable correlations between distinct frames in an asymmetric manner. Secondly, a spatio-temporal tracking framework is coined to learn the temporal variance trend of the object during the training process and store the template information closest to the tracking frame when testing. Comprehensive experiments demonstrate that ADSF outperforms state-of-the-art methods on three public datasets. Extensive ablation experiments further confirm the effectiveness of each component in ADSF. Furthermore, we conduct generalization validation to demonstrate that the proposed method also achieves performance gains in RGB-based tracking scenarios.
Article
Full-text available
UAV tracking plays a crucial role in computer vision by enabling real-time monitoring of UAVs, enhancing safety and operational capabilities while expanding the potential applications of drone technology. Off-the-shelf deep learning-based trackers have not been able to effectively address challenges such as occlusion, complex motion, and background clutter for UAV objects in the infrared modality. To overcome these limitations, we propose a novel tracker for UAV object tracking, named MAMC. To be specific, the proposed method first employs a data augmentation strategy to enhance the training dataset. We then introduce a candidate target association matching method to deal with the interference caused by the presence of a large number of similar targets in the infrared modality. Next, it leverages a motion estimation algorithm with window jitter compensation to address the tracking instability caused by background clutter and occlusion. In addition, a simple yet effective object re-search and update strategy is used to address the complex motion and localization problem of UAV objects. Experimental results demonstrate that the proposed tracker achieves state-of-the-art performance on the Anti-UAV and LSOTB-TIR datasets.
Article
Similar target distractor and background occlusion in the complex ground environment can result in infrared target tracking drift or even failure. To solve this problem, this study proposes an infrared ground target tracker based on a two-stage spatio-temporal feature correlation network. Firstly, a spatio-temporal context fusion feature correlation network is proposed, which fuses appearance features and spatio-temporal context information, and improves the stable tracking ability of the tracker under similar target distractor conditions. Secondly, a uni-directional trajectory feature correlation network is proposed, which ensures the accurate prediction of ground target trajectories by effectively using the temporal context information and optimizing training and application methods. Finally, a two-stage anti-occlusion strategy of “occlusion-prediction-recapture” is proposed, which improves the anti-long-term occlusion performance of the tracker. Qualitative and quantitative experiments on image sequences under similar target distractor and background occlusion conditions verify the effectiveness of the proposed tracker.
Article
Full-text available
Tracking small objects in infrared videos is challenging due to complex backgrounds, weak target information and target mobility. To deal with these difficulties, an infrared small target tracking algorithm is proposed, which utilizes the Spatial-Temporal Regularized Correlation Filter (STRCF) as its backbone. First, the local image patch covering the target and its neighborhood background is fed into the STRCF. Then, a guided local contrast mechanism is designed to eliminate noise and distinguish the target from the background. Furthermore, the robustness of the tracking is improved by using a detection model with an adaptive search-range factor, which helps to alleviate the problem of tracking drift. Experimental results on public near-infrared videos show the superior performance of the proposed algorithm (named STRCFD) compared with several related algorithms in both visual effects and objective evaluation. Notably, the proposed STRCFD achieves an overall precision of 81.1%.
Article
RGBT tracking is gaining popularity due to its ability to provide effective tracking results in a variety of weather conditions. However, feature specificity and complementarity have not been fully exploited in existing models that directly fuse the correlation filter responses, which leads to poor tracker performance. In this paper, we propose correlation filters with adaptive modality weights and cross-modality learning (AWCM) to solve multi-modality tracking tasks. First, we use weighted activation to fuse the thermal infrared and visible modalities, and the fused modality is used as an auxiliary modality to suppress noise and increase the learning ability of shared modal features. Second, we design modality weights based on the average peak-to-correlation energy (APCE) coefficient to improve model reliability. Third, we use the fused modality as an intermediate variable for joint consistency learning, thereby increasing tracker robustness via interactive cross-modal learning. Finally, we use the alternating direction method of multipliers (ADMM) algorithm to obtain a closed-form solution and conduct extensive experiments on the RGBT234, VOT-TIR2019, and GTOT tracking benchmark datasets to demonstrate the superior performance of the proposed AWCM compared with existing tracking algorithms. The code developed in this study is available at the following website.
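
The APCE coefficient mentioned above has a standard definition; a small sketch of computing it and turning the per-modality scores into fusion weights (how AWCM maps APCE to weights is not stated here, so the normalization below is an assumption) could look like:

```python
import numpy as np

def apce(response, eps=1e-8):
    """Average peak-to-correlation energy of a correlation response map."""
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / (np.mean((response - f_min) ** 2) + eps)

def modality_weights(resp_rgb, resp_tir):
    """Normalize APCE scores of the two modalities into fusion weights."""
    scores = np.array([apce(resp_rgb), apce(resp_tir)])
    return scores / scores.sum()

# example: w_rgb, w_tir = modality_weights(resp_rgb, resp_tir)
#          fused = w_rgb * resp_rgb + w_tir * resp_tir
```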
Article
Infrared (IR) ship tracking is becoming increasingly important in various applications. However, it remains a challenging task as the information that can be obtained from IR images is limited. Aiming at enhancing IR ship tracking accuracy, we propose an innovative approach by presenting feature integration module (FIM) and backup matching module (BMM). FIM takes appearance feature, complete intersection over union (CIoU), and motion direction metrics (MDMs) into account. Regarding appearance feature extraction, an end-to-end characteristic learning strategy with a cross-guided multigranularity fusion network is proposed to obtain more integral appearance features and enhance reidentification (re-ID) accuracy, which helps to distinguish individual IR ship targets better. Besides, a backup matching strategy is then used to match the unmatched tracks and detections after cascaded matching. Virtual trajectories are generated for the matched tracks to optimize parameters by parameter optimization module (POM). The accumulation of errors caused by the lack of observations in the Kalman filter (KF) is reduced. Thus, the position of IR ships can be estimated more accurately, and more robust IR ship tracking can be achieved. In addition, we present a sequential frame IR ship tracking (SFIST) dataset, providing the first public benchmark for testing IR ship tracking performance. Experimental results indicate that the multiple object tracking accuracy (MOTA), multiple objects tracking precision (MOTP), and identity switch (IDs) of the proposed method are 73.441, 80.826, and 32, respectively, outperforming other state-of-the-art methods. This demonstrates the superior robustness of the proposed method, particularly when the IR ships are occluded or the target texture information is lacking. Our dataset is available at https://github.com/echo-sky/SFIST .
Article
Full-text available
Visual target tracking is one of the most sought-after yet challenging research topics in computer vision. Given the ill-posed nature of the problem and its popularity in a broad range of real-world scenarios, a number of large-scale benchmark datasets have been established, on which considerable methods have been developed and demonstrated with significant progress in recent years - predominantly by recent deep learning (DL)-based methods. This survey aims to systematically investigate the current DL-based visual tracking methods, benchmark datasets, and evaluation metrics. It also extensively evaluates and analyzes the leading visual tracking methods. First, the fundamental characteristics, primary motivations, and contributions of DL-based methods are summarized from nine key aspects of: network architecture, network exploitation, network training for visual tracking, network objective, network output, exploitation of correlation filter advantages, aerial-view tracking, long-term tracking, and online tracking. Second, popular visual tracking benchmarks and their respective properties are compared, and their evaluation metrics are summarized. Third, the state-of-the-art DL-based methods are comprehensively examined on a set of well-established benchmarks of OTB2013, OTB2015, VOT2018, LaSOT, UAV123, UAVDT, and VisDrone2019. Finally, by conducting critical analyses of these state-of-the-art trackers quantitatively and qualitatively, their pros and cons under various common scenarios are investigated. It may serve as a gentle use guide for practitioners to weigh when and under what conditions to choose which method(s). It also facilitates a discussion on ongoing issues and sheds light on promising research directions.
Article
Full-text available
The training of a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning a feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and utilize the proposed multi-cycle consistency loss to learn a feature extraction network. Furthermore, we propose a similarity dropout strategy that drops low-quality training sample pairs, and we adopt a cycle trajectory consistency loss in each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and utilize a Siamese correlation tracking framework to locate the target using forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance compared with state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.
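
A schematic sketch of the forward-backward cycle-consistency idea follows (the per-step tracking functions are placeholders, and the loss form is a generic choice rather than the paper's exact multi-cycle loss):

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(track_forward, track_backward, frames, init_box):
    """
    Track the box forward through `frames`, then backward to the first frame;
    the discrepancy with the initial box serves as a self-supervised signal.

    track_forward / track_backward : callables (frame_a, frame_b, box) -> box
    frames   : list of image tensors
    init_box : (4,) tensor, box in the first frame (acts as the pseudo-label)
    """
    box = init_box
    for a, b in zip(frames[:-1], frames[1:]):                         # forward pass
        box = track_forward(a, b, box)
    for a, b in zip(reversed(frames[1:]), reversed(frames[:-1])):     # backward pass
        box = track_backward(a, b, box)
    return F.smooth_l1_loss(box, init_box)                            # consistency penalty
```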
Article
Full-text available
Correlation filter-based trackers (CFTs) have recently shown remarkable performance in the field of visual object tracking. The advantage of these trackers originates from their ability to convert time-domain calculations into frequency domain calculations. However, a significant problem of these CFTs is that the model is insufficiently robust when the tracking scenarios are too complicated, meaning that the ideal tracking performance cannot be acquired. Recent work has attempted to resolve this problem by reducing the boundary effects from modeling the foreground and background of the object target effectively (e.g., CFLB, BACF, and CACF). Although these methods have demonstrated reasonable performance, they are often affected by occlusion, deformation, scale variation, and other challenging scenes. In this study, considering the relationship between the current frame and the previous frame of a moving object target in a time series, we propose a temporal regularization strategy to improve the BACF tracker (denoted as TRBACF), a typical representative of the aforementioned trackers. The TRBACF tracker can efficiently adjust the model to adapt the change of the tracking scenes, thereby enhancing its robustness and accuracy. Moreover, the objective function of our TRBACF tracker can be solved by an improved alternating direction method of multipliers, which can speed up the calculation in the Fourier domain. Extensive experimental results demonstrate that the proposed TRBACF tracker achieves competitive tracking performance compared with state-of-the-art trackers.
Conference Paper
Full-text available
In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over 730K bounding boxes in total. To the best of our knowledge, LSOTB-TIR is the largest and most diverse TIR object tracking benchmark to date. To evaluate a tracker on different attributes, we define 4 scenario attributes and 12 challenge attributes in the evaluation dataset. By releasing LSOTB-TIR, we encourage the community to develop deep learning based TIR trackers and evaluate them fairly and comprehensively. We evaluate and analyze more than 30 trackers on LSOTB-TIR to provide a series of baselines, and the results show that deep trackers achieve promising performance. Furthermore, we retrain several representative deep trackers on LSOTB-TIR, and their results demonstrate that the proposed training dataset significantly improves the performance of deep TIR trackers. Codes and dataset are available at https://github.com/QiaoLiuHit/LSOTB-TIR.
Article
Full-text available
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is only trained on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One of them focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative entropy based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptively learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and the largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits the training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
Conference Paper
Full-text available
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor taking fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing the TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing the TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects in the levels of inter-class and intra-class respectively. These two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to train the network for adapting the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against the state-of-the-art methods. Codes and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Article
Full-text available
Thermal infrared (TIR) pedestrian tracking is one of the important components among the numerous applications of computer vision, and it has a major advantage: it can track pedestrians in total darkness. The ability to evaluate TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field. However, such a benchmark dataset has been lacking. In this paper, we develop a TIR pedestrian tracking dataset for TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations. Each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark dataset using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. Furthermore, in order to gain more insight into TIR pedestrian trackers, we divided a tracker's functions into three components: feature extractor, motion model, and observation model. We then conducted three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
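
PTB-TIR follows the common OTB-style evaluation protocol, so the precision and success metrics used throughout the excerpts above can be sketched as follows (boxes assumed in (x, y, w, h) format; the threshold choices are the conventional ones):

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centres; boxes are (N, 4) arrays in (x, y, w, h)."""
    cp = pred[:, :2] + pred[:, 2:] / 2
    cg = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(cp - cg, axis=1)

def overlap(pred, gt):
    """Intersection-over-union of (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-8)

def precision_score(pred, gt, thresh=20):
    """Fraction of frames whose centre error is below `thresh` pixels."""
    return float(np.mean(center_error(pred, gt) <= thresh))

def success_auc(pred, gt):
    """Area under the success curve over IoU thresholds in [0, 1]."""
    iou = overlap(pred, gt)
    thresholds = np.linspace(0, 1, 21)
    return float(np.mean([np.mean(iou >= t) for t in thresholds]))
```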
Chapter
Full-text available
Object tracking is still a critical and challenging problem with many applications in computer vision. For this challenge, more and more researchers pay attention to applying deep learning to obtain powerful features for better tracking accuracy. In this paper, a novel triplet loss is proposed to extract expressive deep features for object tracking by adding it into the Siamese network framework instead of the pairwise loss for training. Without adding any inputs, our approach is able to utilize more elements for training to achieve more powerful features via the combination of original samples. Furthermore, we provide a theoretical analysis combining a comparison of gradients with back-propagation to prove the effectiveness of our method. In experiments, we apply the proposed triplet loss to three real-time trackers based on the Siamese network. The results on several popular tracking benchmarks show that our variants operate at almost the same frame-rate as the baseline trackers and achieve superior tracking performance, as well as accuracy comparable to recent state-of-the-art real-time trackers.
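
For reference, the generic margin-based triplet loss looks as follows; the cited chapter actually derives its triplet loss from matching scores inside the Siamese framework, so this is only the textbook form, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    """
    anchor, positive, negative : (N, D) embedding batches.
    Penalizes triplets where the anchor-positive distance is not smaller
    than the anchor-negative distance by at least `margin`.
    """
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.relu(d_ap - d_an + margin).mean()
```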
Article
Full-text available
Robust and accurate visual tracking is a challenging problem in computer vision. In this paper, we exploit spatial and semantic convolutional features extracted from convolutional neural networks in continuous object tracking. The spatial features retain higher resolution for precise localization and semantic features capture more semantic information and less fine-grained spatial details. Therefore, we localize the target by fusing these different features, which improves the tracking accuracy. Besides, we construct the multi-scale pyramid correlation filter of the target and extract its spatial features. This filter determines the scale level effectively and tackles target scale estimation. Finally, we further present a novel model updating strategy, and exploit peak sidelobe ratio (PSR) and skewness to measure the comprehensive fluctuation of response map for efficient tracking performance. Each contribution above is validated on 50 image sequences of tracking benchmark OTB-2013. The experimental comparison shows that our algorithm performs favorably against 12 state-of-the-art trackers.
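
The peak-to-sidelobe ratio (PSR) used above to measure response-map fluctuation has a standard definition; a small sketch (the size of the window excluded around the peak is a common choice, not necessarily the paper's) is:

```python
import numpy as np

def psr(response, exclude=11, eps=1e-8):
    """
    Peak-to-sidelobe ratio of a correlation response map.
    `exclude` is the side length of the window around the peak that is
    removed before computing the sidelobe statistics.
    """
    peak = response.max()
    py, px = np.unravel_index(np.argmax(response), response.shape)
    mask = np.ones_like(response, dtype=bool)
    h = exclude // 2
    mask[max(0, py - h):py + h + 1, max(0, px - h):px + h + 1] = False
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + eps)

# a low PSR typically signals an unreliable response, so the model update can be skipped
```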
Conference Paper
Full-text available
Compared with visible object tracking, thermal infrared (TIR) object tracking can track an arbitrary target in total darkness since it cannot be influenced by illumination variations. However, there are many unwanted attributes that constrain the potentials of TIR tracking, such as the absence of visual color patterns and low resolutions. Recently, structured output support vector machine (SOSVM) and discriminative correlation filter (DCF) have been successfully applied to visible object tracking, respectively. Motivated by these, in this paper, we propose a large margin structured convolution operator (LMSCO) to achieve efficient TIR object tracking. To improve the tracking performance, we employ the spatial regularization and implicit interpolation to obtain continuous deep feature maps, including deep appearance features and deep motion features, of the TIR targets. Finally, a collaborative optimization strategy is exploited to significantly update the operators. Our approach not only inherits the advantage of the strong discriminative capability of SOSVM but also achieves accurate and robust tracking with higher-dimensional features and more dense samples. To the best of our knowledge, we are the first to incorporate the advantages of DCF and SOSVM for TIR object tracking. Comprehensive evaluations on two thermal infrared tracking benchmarks, i.e. VOT-TIR2015 and VOT-TIR2016, clearly demonstrate that our LMSCO tracker achieves impressive results and outperforms most state-of-the-art trackers in terms of accuracy and robustness with sufficient frame rate.
Article
Full-text available
Discriminative correlation filters (DCFs) have been shown to perform superiorly in visual tracking. They only need a small set of training samples from the initial frame to generate an appearance model. However, existing DCFs learn the filters separately from feature extraction, and update these filters using a moving average operation with an empirical weight. These DCF trackers hardly benefit from the end-to-end training. In this paper, we propose the CREST algorithm to reformulate DCFs as a one-layer convolutional neural network. Our method integrates feature extraction, response map generation as well as model update into the neural networks for an end-to-end training. To reduce model degradation during online update, we apply residual learning to take appearance changes into account. Extensive experiments on the benchmark datasets demonstrate that our CREST tracker performs favorably against state-of-the-art trackers.
Article
Full-text available
Unlike the visual object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as in rescue and video surveillance at night. However, there are few studies in this field mainly because thermal infrared images have several unwanted attributes, which make it difficult to obtain the discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer the pre-trained convolutional neural networks based on visible images to thermal infrared tracking. We observe that the features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while the features from the convolution layers are. Besides, the features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). Firstly, we use pre-trained convolutional neural networks to extract the features of the multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give the response maps of the target’s location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
Article
Full-text available
The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame. Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task. This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter. Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.
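
For context, the closed-form Fourier-domain solution that makes the Correlation Filter fast can be sketched for the single-channel case as below; CFNet itself embeds a multi-channel, windowed version of this solver as a differentiable network layer, so this is only the basic ridge-regression form:

```python
import numpy as np

def train_filter(x, y, lam=1e-2):
    """
    Closed-form single-channel correlation filter (ridge regression in the
    Fourier domain). x: training patch, y: desired Gaussian-shaped response.
    Returns the filter in the frequency domain.
    """
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def detect(H, z):
    """Apply the frequency-domain filter H to a search patch z; peak gives the target shift."""
    Z = np.fft.fft2(z)
    return np.real(np.fft.ifft2(H * Z))
```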
Article
RGBT tracking receives a surge of interest in the computer vision community, but this research field lacks a large-scale and high-diversity benchmark dataset, which is essential both for the training of deep RGBT trackers and for the comprehensive evaluation of RGBT tracking methods. To this end, we present a Large-scale High-diversity benchmark for short-term RGBT tracking (LasHeR) in this work. LasHeR consists of 1224 visible and thermal infrared video pairs with more than 730K frame pairs in total. Each frame pair is spatially aligned and manually annotated with a bounding box, making the dataset well and densely annotated. LasHeR is highly diverse, capturing a broad range of object categories, camera viewpoints, scene complexities and environmental factors across seasons, weather, day and night. We conduct a comprehensive performance evaluation of 12 RGBT tracking algorithms on the LasHeR dataset and present detailed analysis. In addition, we release the unaligned version of LasHeR to attract research interest in alignment-free RGBT tracking, which is a more practical task in real-world applications. The datasets and evaluation protocols are available at: https://github.com/mmic-lcl/Datasets-and-benchmark-code .
Article
Classifying hard samples in the course of RGBT tracking is a quite challenging problem. Existing methods only focus on enlarging the boundary between positive and negative samples, but ignore the relations among multilevel hard samples, which are crucial for the robustness of hard sample classification. To handle this problem, we propose a novel Multi-Modal Multi-Margin Metric Learning framework, named M5L, for RGBT tracking. In particular, we divide all samples into four parts, including normal positive, normal negative, hard positive and hard negative ones, and aim to leverage their relations to improve the robustness of feature embeddings; e.g., normal positive samples are closer to the ground truth than hard positive ones. To this end, we design a multi-modal multi-margin structural loss to preserve the relations of multilevel hard samples in the training stage. In addition, we introduce an attention-based fusion module to achieve quality-aware integration of different source data. Extensive experiments on large-scale datasets testify that our framework clearly improves the tracking performance and performs favorably against the state-of-the-art RGBT trackers.
Article
Tracking in the unmanned aerial vehicle (UAV) scenarios is one of the main components of target tracking tasks. Different from the target tracking task in the general scenarios, the target tracking task in the UAV scenarios is very challenging because of factors such as small scale and aerial view. Although the DCFs-based tracker has achieved good results in tracking tasks in general scenarios, the boundary effect caused by the dense sampling method will reduce the tracking accuracy, especially in UAV tracking scenarios. In this work, we propose learning an adaptive spatial-temporal context-aware (ASTCA) model in the DCFs-based tracking framework to improve the tracking accuracy and reduce the influence of boundary effect, thereby enabling our tracker to more appropriately handle UAV tracking tasks. Specifically, our ASTCA model can learn a spatial-temporal context weight, which can precisely distinguish the target and background in the UAV tracking scenarios. Besides, considering the small target scale and the aerial view in UAV tracking scenarios, our ASTCA model incorporates spatial context information within the DCFs-based tracker, which could effectively alleviate background interference. Extensive experiments demonstrate that our ASTCA method performs favorably against state-of-the-art tracking methods on some standard UAV datasets.
Article
Accurate segmentation is difficult for liver computed tomography (CT) images, since liver CT images do not always have obvious and smooth boundaries, the location of the tumor is not specified, and the tumor intensity is similar to that of the liver. Although manual and automatic segmentation methods based on both traditional and deep learning models currently exist, none can be specifically and effectively applied to segment liver CT images. In this paper, we propose a new model based on a level set framework for liver CT images, in which the energy functional contains three terms: the data fitting term, the length term and the bound term. We then apply the split Bregman method to minimize the energy functional, which makes it converge faster. The proposed model is robust to initial contours and can segment liver CT images with intensity inhomogeneity and unclear boundaries. In the bound term, we use the U-Net to obtain constraint information, which has a considerable influence on effective and accurate segmentation. We further extend our model to a multi-phase level set formulation to obtain the contours of the tumor and the liver at the same time. Finally, a parallel algorithm is proposed to improve segmentation efficiency. Experimental results and comparisons demonstrate the merits of the proposed model, including robustness, accuracy, efficiency and intelligence.
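For orientation, a generic three-term level-set energy of the kind described above can be written as follows; the concrete data-fitting term and the form of the U-Net bound term are placeholders, not necessarily the authors' exact functional.

```latex
% Generic three-term level-set energy (illustrative placeholder terms)
E(\phi) =
  \underbrace{\lambda \int_{\Omega} e_{\mathrm{fit}}\bigl(I(x),\phi(x)\bigr)\,dx}_{\text{data fitting}}
+ \underbrace{\mu \int_{\Omega} \bigl|\nabla H\bigl(\phi(x)\bigr)\bigr|\,dx}_{\text{length}}
+ \underbrace{\nu \int_{\Omega} \bigl(H\bigl(\phi(x)\bigr) - B(x)\bigr)^{2}\,dx}_{\text{bound, with U-Net prior } B}
```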
Article
Medical image segmentation is highly challenging due to intensity inhomogeneity and the similarity between the background and the object. To meet this challenge, we propose an improved active contour model, in which we combine the level set method and the split Bregman method, and provide the two-phase formulation, the multi-phase formulation and the 3D formulation. In this paper, the proposed model is presented in a level set framework that includes neighbor region information for segmenting medical images, where the energy functional contains the data fitting term and the length term. The neighbor region and the local intensity variances in the data fitting term are designed to optimize the minimization process. We then apply the split Bregman method to minimize the energy functional, which yields faster convergence. Besides, we extend our model to a multi-phase segmentation model and a 3D segmentation model for cardiac MR images, both of which achieve good results. Experimental results show that the new model not only has strong robustness to other cardiac tissue effects and image intensity inhomogeneity, but also better facilitates the extraction of effective tissues. As expected, our model achieves higher segmentation accuracy and efficiency for medical image segmentation.
Article
Many RGBT trackers utilize an adaptive weighting mechanism to treat the dual modalities differently and obtain more robust feature representations for tracking. Although these trackers work well under certain conditions, they ignore the information interactions in feature learning, which might limit tracking performance. In this paper, we propose a novel cross-modality message passing model to interactively learn robust deep representations of the dual modalities for RGBT tracking. Specifically, we extract features of the dual modalities with a backbone network and take each channel of these features as a node of a graph. Therefore, all channels of the dual modalities can explicitly communicate with each other through graph learning, and the output features are thus more diverse and discriminative. Moreover, we introduce a gate mechanism to control the propagation of the information flow and achieve more intelligent fusion. The features generated by the interactive cross-modality message passing model are passed selectively through the gate layer and concatenated with the original features as the final representation. We extend the ATOM tracker into its dual-modality version and combine it with our proposed module for final tracking. Extensive experiments on two RGBT benchmark datasets validate the effectiveness and efficiency of our proposed algorithm.
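The gate-controlled part of such a fusion can be sketched in PyTorch as below; the graph-based message passing itself is abstracted into a precomputed `feat_interacted` tensor, and the module layout is an assumption for illustration only.

```python
# Hedged sketch of gated fusion: a per-channel, per-location gate decides how
# much of the interacted (message-passed) features to keep before they are
# concatenated with the original RGB and thermal features.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_rgb, feat_tir, feat_interacted):
        g = self.gate(torch.cat([feat_rgb, feat_tir], dim=1))
        selected = g * feat_interacted                       # gate controls the information flow
        return torch.cat([feat_rgb, feat_tir, selected], dim=1)
```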
Article
RGBT tracking has attracted increasing attention since RGB and thermal infrared data have strong complementary advantages, which could make trackers work all day and in all weather. Existing works usually focus on extracting modality-shared or modality-specific information, but the potential of these two cues is not well explored and exploited in RGBT tracking. In this paper, we propose a novel multi-adapter network to jointly perform modality-shared, modality-specific and instance-aware target representation learning for RGBT tracking. To this end, we design three kinds of adapters within an end-to-end deep learning framework. Specifically, we use the modified VGG-M as the generality adapter to extract modality-shared target representations. To extract modality-specific features while reducing the computational complexity, we design a modality adapter, which adds a small block to the generality adapter in each layer and each modality in a parallel manner. Such a design could learn multilevel modality-specific representations with a modest number of parameters, as the vast majority of parameters are shared with the generality adapter. We also design an instance adapter to capture the appearance properties and temporal variations of a certain target. Moreover, to enhance the shared and specific features, we employ the multiple kernel maximum mean discrepancy loss to measure the distribution divergence of different modal features and integrate it into each layer for more robust representation learning. Extensive experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker against the state-of-the-art methods.
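The "small parallel block" idea behind the modality adapter can be sketched as follows; the kernel sizes and channel numbers are illustrative assumptions rather than the paper's configuration.

```python
# Hedged sketch: a lightweight modality-specific branch added in parallel to a
# shared (generality-adapter) convolution layer.
import torch.nn as nn

class AdaptedConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.shared = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)  # generality adapter (shared)
        self.modality = nn.Conv2d(in_ch, out_ch, kernel_size=1)           # modality adapter (few parameters)

    def forward(self, x):
        return self.shared(x) + self.modality(x)
```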
Article
Single-object tracking is regarded as a challenging task in computer vision, especially in complex spatio-temporal contexts. The changes in the environment and object deformation make it difficult to track. In the last 10 years, the application of correlation filters and deep learning enhances the performance of trackers to a large extent. This paper summarizes single-object tracking algorithms based on correlation filters and deep learning. Firstly, we explain the definition of single-object tracking and analyze the components of general object tracking algorithms. Secondly, the single-object tracking algorithms proposed in the past decade are summarized according to different categories. Finally, this paper summarizes the achievements and problems of existing algorithms by analyzing experimental results and discusses the development trends.
Article
Long-wave infrared (thermal) images distinguish the target and background according to differences in thermal radiation. They are insensitive to light conditions, but cannot present details obtained from reflected light. By contrast, visible images have high spatial resolution and texture details, but they are easily affected by occlusion and light conditions. Combining the advantages of the two sources may generate a new image with clear targets and high resolution, which satisfies requirements in all-weather and all-day/night conditions. Most existing methods cannot fully capture the underlying characteristics of the infrared and visible images and ignore the complementary information between the sources. In this paper, we propose an end-to-end model (TSFNet) for infrared and visible image fusion, which is able to handle both sources simultaneously. In addition, it adopts an adaptive weight allocation strategy to capture the informative global features. Experiments on public datasets demonstrate that the proposed fusion method achieves state-of-the-art performance in both global visual quality and quantitative comparison.
Article
Infrared object tracking is a key technology for infrared imaging guidance. Blurred imaging, strong ego-motion and frequent occlusion make it difficult to maintain robust tracking. We observe that the features trained on ImageNet are not suitable for aircraft tracking with infrared imagery. In addition, for deep feature-based tracking, the main computational burden comes from the feedforward pass through the pretrained deep network. To this end, we present an airborne infrared target tracking algorithm that employs feature embedding learning and correlation filters to obtain improved performance. We develop a shallow network and a contrastive center loss function to learn the prototypical representation of the aircraft in the embedding space. The feature embedding module is lightweight and integrated into the efficient convolution operator framework for aircraft tracking. Finally, to demonstrate the effectiveness of our tracking algorithm, we conduct extensive experiments on airborne infrared imagery and benchmark trackers.
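A rough PyTorch sketch of a contrastive-center-style objective, pulling embeddings toward their own class center while pushing them away from the other centers, is given below; it is not claimed to be the authors' exact loss.

```python
# Hedged sketch of a contrastive-center-style loss over learnable class centers.
import torch

def contrastive_center_loss(embeddings, labels, centers, delta=1e-6):
    """embeddings: NxD, labels: N (long), centers: CxD learnable class centers."""
    own = centers[labels]                                  # each sample's own center, NxD
    intra = ((embeddings - own) ** 2).sum(dim=1)           # squared distance to own center
    all_d = torch.cdist(embeddings, centers) ** 2          # NxC squared distances to every center
    inter = all_d.sum(dim=1) - intra                       # distances to the other centers
    return (intra / (inter + delta)).mean()
```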
Article
Most of the excellent methods for visual object tracking are based on RGB videos. With the popularity of depth cameras, research on RGB-D (RGB + depth) tracking has gradually gained extensive attention. The depth map provides additional information for dealing with intractable tracking problems. How to make full use of depth maps to construct a better tracker is the foremost problem to be settled. The fully-convolutional Siamese network shows excellent performance in 2D tracking, but still cannot achieve satisfactory tracking performance in complex scenarios. Therefore, we propose an RGB-D tracker that integrates a single-scale Siamese network with adaptive bounding boxes, which achieves stable tracking performance under challenges such as occlusion, scale variation and background clutter. Our proposed adaptive strategy enables the bounding box to adjust automatically when the target appearance changes during tracking, instead of relying on multi-scale inputs to the Siamese network. We design an effective algorithm to quickly obtain the target depth and construct a 3D local visual field to eliminate interference from the background and similar objects. In addition, a total occlusion handling approach that combines RGB and depth information achieves more reliable occlusion detection and target recovery. Our object tracker, including the strategies of the 3D local visual field, adaptive bounding boxes and occlusion handling, has been evaluated on two widely used RGB-D tracking benchmarks and achieves superior performance, especially for situations of occlusion and pedestrian detection.
Article
Correlation filter (CF) trackers have achieved impressive performance at high frame rates. However, only limited information in the spatial and temporal domains is used in learning the correlation filters, which might limit tracking performance. To handle this problem, we propose a novel spatio-temporal correlation filter approach for visual tracking, which employs both spatial and temporal cues in the learning. In particular, we explore spatial contexts from the background whose contents are ambiguous with respect to the target and integrate them into the correlation filter model for more discriminative learning. Moreover, to capture appearance variations in the temporal domain, we also compute a set of target templates and incorporate them into our model. At the same time, the solution of the proposed spatio-temporal correlation filter is closed-form, and the tracking efficiency is thus guaranteed. Experiments on benchmark datasets demonstrate the effectiveness of the proposed tracker against several CF trackers.
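A hedged single-channel sketch of how background context patches can enter a closed-form correlation filter is shown below, in the style of context-aware CFs; the temporal-template part of the model and the multi-channel case are omitted.

```python
# Hedged sketch: context patches add an extra regularising term to the
# denominator of the closed-form Fourier-domain filter.
import numpy as np

def context_aware_cf(target_patch, context_patches, y, lam1=1e-3, lam2=0.5):
    """target_patch, y: HxW arrays; context_patches: list of HxW background patches."""
    X0, Y = np.fft.fft2(target_patch), np.fft.fft2(y)
    denom = np.conj(X0) * X0 + lam1
    for c in context_patches:                 # suppress responses on ambiguous background
        Xc = np.fft.fft2(c)
        denom = denom + lam2 * np.conj(Xc) * Xc
    return np.conj(X0) * Y / denom            # filter in the Fourier domain
```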
Chapter
RGB and thermal source data suffer from both shared and specific challenges, and how to explore and exploit them plays a critical role in representing the target appearance in RGBT tracking. In this paper, we propose a novel challenge-aware neural network to handle the modality-shared challenges (e.g., fast motion, scale variation and occlusion) and the modality-specific ones (e.g., illumination variation and thermal crossover) for RGBT tracking. In particular, we design several parameter-shared branches in each layer to model the target appearance under the modality-shared challenges, and several parameter-independent branches under the modality-specific ones. Based on the observation that the modality-specific cues of different modalities usually contain complementary advantages, we propose a guidance module to transfer discriminative features from one modality to another, which could enhance the discriminative ability of the weaker modality. Moreover, all branches are aggregated together in an adaptive manner and embedded in parallel into the backbone network to efficiently form more discriminative target representations. These challenge-aware branches are able to model the target appearance under certain challenges, so that the target representations can be learnt with a few parameters even in the situation of insufficient training data. Experimental results show that our method operates at real-time speed while performing well against state-of-the-art methods on three benchmark datasets.
Article
To overcome the shortcomings of the low signal-to-noise ratio and limited available information of infrared images, as well as the challenges of fast camera motion and partial occlusion, a robust tracker via correlation filter and particle filter is proposed for infrared targets. Firstly, to explore the strength of the particle-filter-based tracker, an Lp-norm-based low-rank sparse tracker is proposed. Then, a robust tracker is proposed by complementing the advantages of both the correlation-filter-based and particle-filter-based trackers, which can not only handle the camera motion challenge, but also improve tracking accuracy and robustness. Finally, to address the tracking drift problem and deal with the partial occlusion challenge, an effective template update approach is designed according to the different characteristics of correlation-filter-based and particle-filter-based trackers. Experimental results on the VOT-TIR2015 benchmark demonstrate that the proposed tracker can not only outperform several state-of-the-art trackers in terms of both accuracy and robustness, but also effectively handle challenges such as camera motion, partial occlusion, size change and motion change.
Article
With the development of deep learning, the performance of many computer vision tasks has been greatly improved. For visual tracking, deep learning methods mainly focus on extracting better features or designing end-to-end trackers. However, when tracking specific targets, most existing deep-learning-based trackers are less discriminative and time-consuming. In this paper, a cascade-based tracking algorithm is proposed to promote the robustness of the tracker and reduce time consumption. First, we propose a novel deep network for feature extraction. Since some pruning strategies are applied, the speed of the feature extraction stage can exceed 50 frames per second. Then, a cascade tracker named DCCT is presented to improve the performance and enhance the robustness by utilizing both texture and semantic features. Similar to the cascade classifier, the proposed DCCT tracker consists of several weak trackers. Each weak tracker rejects some false candidates of the tracked object, and the final tracking results are obtained by synthesizing these weak trackers. Intensive experiments conducted on several public datasets demonstrate the effectiveness of the proposed framework.
Article
Discriminative correlation filters (DCFs) have been widely used in the visual tracking community in recent years. DCFs-based trackers determine the target location through a response map generated by the correlation filters and determine the target scale by a fixed scale factor. However, the response map is vulnerable to noise interference and the fixed scale factor cannot reflect the real scale change of the target, which obviously reduces tracking performance. In this paper, to solve the aforementioned drawbacks, we propose to learn a metric learning model in the correlation filter framework for visual tracking (called CFML). This model uses a metric learning function to solve the target scale problem. In particular, we adopt a hard negative mining strategy to alleviate the influence of noise on the response map, which effectively improves tracking accuracy. Extensive experimental results demonstrate that the proposed CFML tracker achieves competitive performance compared with state-of-the-art trackers.
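The hard negative mining step can be sketched very simply: among candidate background samples, keep the ones the current model scores highest, since these are the negatives most easily confused with the target (a generic sketch, not the paper's exact procedure).

```python
# Hedged sketch of hard negative mining over candidate background samples.
import torch

def mine_hard_negatives(neg_scores, k=32):
    """neg_scores: 1-D tensor of model responses on negative candidates."""
    k = min(k, neg_scores.numel())
    return torch.topk(neg_scores, k).indices   # highest-scoring negatives are the hardest
```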
Article
Convolutional Neural Networks (CNNs) have been demonstrated to achieve state-of-the-art performance in the visual object tracking task. However, existing CNN-based trackers usually use holistic target samples to train their networks. Once the target undergoes complicated situations (e.g., occlusion, background clutter, and deformation), the tracking performance degrades badly. In this paper, we propose an adaptive structural convolutional filter model to enhance the robustness of deep regression trackers (named ASCT). Specifically, we first design a mask set to generate local filters that capture local structures of the target. Meanwhile, we adopt an adaptive weighting fusion strategy for these local filters to adapt to changes in the target appearance, which can enhance the robustness of the tracker effectively. Besides, we develop an end-to-end trainable network comprising feature extraction, decision making, and model updating modules for effective training. Extensive experimental results on large benchmark datasets demonstrate that the proposed ASCT tracker performs favorably against state-of-the-art trackers.
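The local-filter idea can be sketched as masking one regression filter into parts and fusing the part responses with adaptive weights; the mask layout and the peak-based weighting below are illustrative assumptions.

```python
# Hedged sketch: masked local filters and adaptive weighted fusion of their responses.
import torch
import torch.nn.functional as F

def structural_response(feat, filt, masks):
    """feat: 1xCxHxW search features, filt: 1xCxkxk filter, masks: list of kxk 0/1 masks."""
    responses, weights = [], []
    for m in masks:
        local = filt * m                                        # keep one structural part of the filter
        r = F.conv2d(feat, local, padding=filt.shape[-1] // 2)
        responses.append(r)
        weights.append(r.max())                                 # trust parts that respond sharply
    w = torch.softmax(torch.stack(weights), dim=0)
    return sum(wi * ri for wi, ri in zip(w, responses))
```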
Article
Discriminative correlation filters (DCFs) have been widely used in the tracking community recently. DCFs-based trackers utilize samples generated by circularly shifting an image patch to train a ridge regression model, and estimate the target location using a response map generated by the correlation filters. However, the generated samples produce some negative effects and the response map is vulnerable to noise interference, which degrades tracking performance. In this paper, to solve the aforementioned drawbacks, we propose a target-focusing convolutional regression (CR) model for visual object tracking (called TFCR). This model uses a target-focusing loss function to alleviate the influence of background noise on the response map of the current tracking frame, which effectively improves tracking accuracy. In particular, it can effectively balance the disequilibrium of positive and negative samples by reducing some effects of the negative samples on the object appearance model. Extensive experimental results illustrate that our TFCR tracker achieves competitive performance compared with state-of-the-art trackers.
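One plausible reading of a "target-focusing" regression loss is to re-weight the squared error so that pixels near the target contribute more than background pixels; the weighting below is an assumption for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a target-focused weighting of the regression loss.
import torch

def target_focusing_loss(pred, label, alpha=4.0):
    """pred, label: HxW response maps; label is a Gaussian centred on the target."""
    weight = 1.0 + alpha * label          # emphasise the target region, de-emphasise background
    return (weight * (pred - label) ** 2).mean()
```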
Article
Thermal infrared (TIR) object tracking is one of the most challenging tasks in computer vision. This paper proposes a robust TIR tracker based on continuous correlation filters and adaptive feature fusion (RCCF-TIR). Firstly, the Efficient Convolution Operators (ECO) framework is selected to build the new tracker. Secondly, an optimized feature set for TIR tracking is adopted in the framework. Finally, a new feature fusion strategy based on the average peak-to-correlation energy (APCE) is employed. Experiments on the VOT-TIR2016 (Visual Object Tracking-TIR2016) and PTB-TIR (A Thermal Infrared Pedestrian Tracking Benchmark) datasets are carried out, and the results indicate that the proposed RCCF-TIR tracker combines good accuracy and robustness, performs better than state-of-the-art trackers, and is able to handle various challenges.
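The average peak-to-correlation energy used by the fusion strategy is a standard confidence measure of a response map and can be computed as follows.

```python
# APCE of a correlation response map: large values indicate a sharp, reliable peak.
import numpy as np

def apce(response):
    f_max, f_min = response.max(), response.min()
    return (f_max - f_min) ** 2 / (np.mean((response - f_min) ** 2) + 1e-12)
```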
Article
This paper studies how to perform RGB-T object tracking in the correlation filter framework. Given the input RGB and thermal videos, we utilize a correlation filter for each modality due to its high performance in both accuracy and speed. To exploit the interdependency between the RGB and thermal modalities, we introduce a low-rank constraint to learn the filters collaboratively, based on the observation that different modality features should have similar filters so that they produce consistent localization of the target object. For optimization, we design an efficient ADMM (Alternating Direction Method of Multipliers) algorithm to solve the proposed model. Experimental results on the benchmark datasets (i.e., GTOT, RGBT210 and OSU-CT) suggest that the proposed approach performs favorably in both accuracy and efficiency against state-of-the-art RGB-T methods.
Article
The usage of both off-the-shelf and end-to-end trained deep networks has significantly improved the performance of visual tracking on RGB videos. However, the lack of large labeled datasets hampers the usage of convolutional neural networks for tracking in thermal infrared (TIR) images. Therefore, most state-of-the-art methods for tracking on TIR data are still based on hand-crafted features. To address this problem, we propose to use image-to-image translation models. These models allow us to translate the abundantly available labeled RGB data into synthetic TIR data. We explore the usage of both paired and unpaired image translation models for this purpose. These methods provide us with a large labeled dataset of synthetic TIR sequences, on which we can train end-to-end optimal features for tracking. To the best of our knowledge, we are the first to train end-to-end features for TIR tracking. We perform extensive experiments on the VOT-TIR2017 dataset. We show that a network trained on a large dataset of synthetic TIR data obtains better performance than one trained on the available real TIR data. Combining both data sources leads to further improvement. In addition, when we combine the network with motion features, we outperform the state of the art with a relative gain of over 10%, clearly showing the efficiency of using synthetic data to train end-to-end TIR trackers.
Article
Most thermal infrared (TIR) tracking methods are discriminative, which treat the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class difference of the arbitrary objects, while the tracking task mainly deals with the within-class difference of the same objects. In this paper, we cast the TIR tracking problem as a similarity verification task, which is well coupled to the objective of tracking task. We propose a TIR tracker via a hierarchical Siamese convolutional neural network (CNN), named HSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN coalescing the multiple hierarchical convolutional layers. Then, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before we transfer the network into the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the most similar one as the tracked target. Extensive experimental results on the benchmarks: VOT-TIR 2015 and VOT-TIR 2016, show that our proposed method achieves favorable performance against the state-of-the-art methods.
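At its core, such a similarity-verification tracker cross-correlates the template embedding with the search-region embedding and takes the peak of the score map as the most similar candidate; a minimal PyTorch sketch of this step (omitting the hierarchical layer coalescing) is shown below.

```python
# Hedged sketch of the Siamese matching step: template features act as a
# correlation kernel over the search-region features.
import torch.nn.functional as F

def siamese_score_map(template_feat, search_feat):
    """template_feat: 1xCxhxw, search_feat: 1xCxHxW (outputs of the shared CNN)."""
    return F.conv2d(search_feat, template_feat)   # 1x1x(H-h+1)x(W-w+1) similarity map
```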
Article
Recently, deep learning has achieved great success in visual tracking. The goal of this paper is to review the state-of-the-art tracking methods based on deep learning. First, we introduce the background of deep visual tracking, including the fundamental concepts of visual tracking and related deep learning algorithms. Second, we categorize the existing deep-learning-based trackers into three classes according to network structure, network function and network training. For each category, we explain the analysis from the network perspective and discuss the papers within it. Then, we conduct extensive experiments to compare the representative methods on the popular OTB-100, TC-128 and VOT2015 benchmarks. Based on our observations, we conclude that: (1) The usage of a convolutional neural network (CNN) model could significantly improve the tracking performance. (2) Trackers that use the CNN model to distinguish the tracked object from its surrounding background obtain more accurate results, while those that use the CNN model for template matching are usually faster. (3) Trackers with deep features perform much better than those with low-level hand-crafted features. (4) Deep features from different convolutional layers have different characteristics, and an effective combination of them usually results in a more robust tracker. (5) Deep visual trackers using end-to-end networks usually perform better than trackers that merely use feature extraction networks. (6) For visual tracking, the most suitable network training method is to pre-train networks with video information and fine-tune them online with subsequent observations. Finally, we summarize our manuscript, highlight our insights, and point out further trends for deep visual tracking.
Conference Paper
How to effectively learn temporal variation of target appearance and exclude the interference of cluttered background, while maintaining real-time response, is an essential problem of visual object tracking. Recently, Siamese networks have shown the great potential of matching-based trackers to achieve balanced accuracy and beyond real-time speed. However, they still have a big gap to classification-and-updating-based trackers in tolerating the temporal changes of objects and imaging conditions. In this paper, we propose a dynamic Siamese network, via a fast transformation learning model that enables effective online learning of target appearance variation and background suppression from previous frames. We then present elementwise multi-layer fusion to adaptively integrate the network outputs using multi-level deep features. Unlike state-of-the-art trackers, our approach allows the usage of any feasible generally- or particularly-trained features, such as SiamFC and VGG. More importantly, the proposed dynamic Siamese network can be jointly trained as a whole directly on labeled video sequences, and thus can take full advantage of the rich spatio-temporal information of moving objects. As a result, our approach achieves state-of-the-art performance on the OTB-2013 and VOT-2015 benchmarks, while exhibiting a superior balance of accuracy and real-time response over state-of-the-art competitors.
Article
The process of designing an efficient tracker for thermal infrared imagery is one of the most challenging tasks in computer vision. Although a lot of advancement has been achieved for RGB videos over the decades, the textureless and colorless properties of objects in thermal imagery pose hard constraints on the design of an efficient tracker. Tracking an object using a single feature or technique often fails to achieve high accuracy. Here, we propose an effective method to track an object in infrared imagery based on a combination of discriminative and generative approaches. The discriminative technique makes use of two complementary methods, a kernelized correlation filter with spatial features and an AdaBoost classifier with pixel intensity features, operating in parallel. After obtaining optimized locations through the discriminative approaches, the generative technique is applied to determine the best target location using a linear search method. Unlike the baseline algorithms, the proposed method estimates the scale of the target by Lucas-Kanade homography estimation. To evaluate the proposed method, extensive experiments are conducted on 17 challenging infrared image sequences obtained from the LTIR dataset, and a significant improvement in mean distance precision and mean overlap precision is achieved compared with existing trackers. Further, a quantitative and qualitative assessment of the proposed approach against state-of-the-art trackers clearly demonstrates an overall increase in performance.
Article
Correlation Filters (CFs) have recently demonstrated excellent performance in rapidly tracking objects under challenging photometric and geometric variations. The strength of the approach comes from its ability to efficiently learn, on the fly, how the object is changing over time. A fundamental drawback of CFs, however, is that the background of the object is not modelled over time, which can result in suboptimal performance. In this paper we propose a Background-Aware CF that can model how both the foreground and background of the object vary over time. Our approach, like conventional CFs, is extremely computationally efficient, and extensive experiments over multiple tracking benchmarks demonstrate the superior accuracy and real-time performance of our method compared to state-of-the-art trackers, including those based on a deep learning paradigm.