
Hierarchical Spatial-aware Siamese Network for Thermal Infrared Object Tracking

... Similar to VOT, the goal of TIR tracking is to track the object in adjacent TIR images. Recently, several TIR trackers [6][7][8] have transplanted Siamese-based trackers for TIR tracking and achieved promising performance. However, it is time-consuming in a system without a high-performance graphics processing unit (GPU). ...
... Compared trackers: We select several TIR and visual trackers for the experiment. The TIR trackers are ECO-stir [20], MLSSNet [7], MMNet [6], ECO_LS (ours), ECOHG_LS (ours), and HSSNet [8]. The visual trackers are ECO [1], DeepSTRCF [25], MDNet [26], SRDCF [17], VITAL [27], ECO-HC [1], MCCT [2], UDT [28], and SiamFC-tri [29]. ...
... For CF-D, there are DeepSTRCF [25], MCCT [2], ECO [1], and ECO_LS(ours). Trackers in S-T contain HSSNet [8], MMNet [6], and MLSSNet [7]. ...
Article
Existing thermal infrared (TIR) trackers based on correlation filters cannot adapt to the abrupt scale variation of nonrigid objects. This deficiency could even lead to tracking failure. To address this issue, we propose a TIR tracker, called ECO_LS, which improves the performance of efficient convolution operators (ECO) via the level set method. We first utilize the level set to segment the local region estimated by the ECO tracker to gain a more accurate size of the bounding box when the object changes its scale suddenly. Then, to accelerate the convergence speed of the level set contour, we leverage its historical information and continuously encode it to effectively decrease the number of iterations. In addition, our variant, ECOHG_LS, also achieves better performance via concatenating histogram of oriented gradient (HOG) and gray features to represent the object. Furthermore, experimental results on three infrared object tracking benchmarks show that the proposed approach performs better than other competing trackers. ECO_LS improves the EAO by 20.97% and 30.59% over the baseline ECO on VOT-TIR2016 and VOT-TIR2015, respectively.
... It has been widely used in video surveillance, driver assistance, and action recognition because of its good properties in dark environments [1,2,3,4,5]. Since the thermal infrared image has no color information, the target in this kind of scenario is easily disturbed by similar targets and susceptible to the influence of occlusion and other challenging factors [6,7,8,9]. Although some trackers are trying to improve the performance of the thermal infrared target tracking task, it is still a thorny problem. ...
... The deep learning-based trackers use convolutional neural networks (CNNs) to extract features to improve the tracking performance [17,20]. Inspired by the success of CNNs in the general target tracking task, some works attempt to improve the performance of thermal infrared target tracking tasks by using CNNs [3,9,24]. The MCFTS [3] tracker adopts a pre-trained network to extract deep features and integrates them into a conventional tracking framework. ...
... Similar to the Siamese network-based tracking framework [22,25,26], the HSSNet [9] tracker treats the tracking task as a matching problem and trains a similarity verification network offline for online tracking. To make the tracker more suitable for the thermal infrared tracking task, the MMNet [27] tracker proposes to learn the thermal infrared specific discriminative features and fine-grained correlation features. ...
Article
Thermal InfraRed (TIR) target trackers are easily disturbed by similar objects and are susceptible to target occlusion. To solve these problems, we propose a structural target-aware model (STAMT) for thermal infrared target tracking tasks. Specifically, the proposed STAMT tracker learns a target-aware model that pays more attention to the target area in order to accurately distinguish the target from similar objects. In addition, to handle partial occlusion of the target during tracking, a structural weight model is proposed to locate the target through its unoccluded, reliable parts. Ablation studies show the effectiveness of each component in the proposed tracker. Without bells and whistles, the experimental results demonstrate that our STAMT tracker performs favorably against state-of-the-art trackers on the PTB-TIR and LSOTB-TIR datasets.
... Despite the demonstrated success, their performance is limited by the pre-trained deep features which are learned from RGB images and are less effective in representing TIR objects. Matching-based deep TIR object tracking methods, e.g., HSSNet [20] and MLSS-Net [21], cast tracking as a matching problem and train a matching network offline for online tracking. These methods receive much attention recently because of their high efficiency and simplicity. ...
... DWSiam [38] uses a deeper and wider backbone network on a Siamese framework to obtain more accurate tracking results. HSSNet [20] integrates multiple convolution features and a spatial-variation-aware module in a Siamese framework for TIR object tracking. MLSSNet [21] proposes a local structural feature and a global semantic feature to distinguish TIR objects from the local and global levels. ...
... C. Comparison with state-of-the-arts. Compared trackers. We compare the proposed offline tracker (MMNet) and the online tracker (ECO-MM) with the state-of-the-art trackers, including hand-crafted feature based correlation filter trackers, such as TBOOST [64], SRDCF [60], and Staple-TIR [52]; deep feature based correlation filter trackers, such as deepMKCF [65], HDT [8], CREST [66], MCFTS [13], DSST-TIR [12], ECO-deep [29], and DeepSTRCF [67]; matching based deep trackers, such as CFNet [31], Siamese-FC [59], SiamRPN [40], SiamRPN++ [75], HSSNet [20], MLSS-Net [21], and TADT [37]; and other more recent state-of-the-art deep trackers, such as TCNN [69], MDNet [9], VITAL [68], ATOM [71], DiMP [74], KYS [72], Ocean [73], and TransT [70]. ...
Article
The feature models used by existing Thermal InfraRed (TIR) tracking methods are usually learned from RGB images due to the lack of a large-scale TIR image training dataset. However, these feature models are less effective in representing TIR objects and they are difficult to effectively distinguish distractors because they do not contain fine-grained discriminative information. To this end, we propose a dual-level feature model containing the TIR-specific discriminative feature and fine-grained correlation feature for robust TIR object tracking. Specifically, to distinguish inter-class TIR objects, we first design an auxiliary multi-classification network to learn the TIR-specific discriminative feature. Then, to recognize intra-class TIR objects, we propose a fine-grained aware module to learn the fine-grained correlation feature. These two kinds of features complement each other and represent TIR objects in the levels of inter-class and intra-class respectively. These two feature models are constructed using a multi-task matching framework and are jointly optimized on the TIR object tracking task. In addition, we develop a large-scale TIR image dataset to train the network for learning TIR-specific feature patterns. To the best of our knowledge, this is the largest TIR tracking training dataset with the richest object class and scenario. To verify the effectiveness of the proposed dual-level feature model, we propose an offline TIR tracker (MMNet) and an online TIR tracker (ECO-MM) based on the feature model and evaluate them on three TIR tracking benchmarks. Extensive experimental results on these benchmarks demonstrate that the proposed algorithms perform favorably against the state-of-the-art methods.
... This may weaken the power of CNNs since there is a significant difference between classifying an object and predicting its location in the image [19]. The objective of the classifier is not coupled to the objective of the tracker [59]. The trackers may then suffer from inconsistency problems caused by task differences. ...
... Thus, similarity interference is just reinforced. For enhanced feature fusion, Li et al. [59] proposed their Hierarchical spatial-aware Siamese convolutional neural network (HSSNet) for TIR object tracking. Their method is based on coalescing multiple hierarchical convolutional layers in conjunction with a spatial-aware network. ...
... Li et al. [59] also remarked that different feature channels should not contribute equally to the tracking. However, the multi-level feature fusion procedure developed for their HSSNet tracker (described in section IV-C2) made them do so. ...
Article
Object tracking is an active research area in computer vision. We are interested in matching-based trackers that exploit deep machine learning, known as Siamese trackers. Their powerful capabilities stem from similarity learning. This tracking paradigm is promising due to its inherent balance between performance and efficiency, so trackers of this type are suitable for real-time generic object tracking. There is an upsurge in research interest in Siamese trackers, yet specialized surveys in this category are lacking. In this survey, we aim to identify and elaborate on the most significant challenges that Siamese trackers face. Our goal is to answer what design decisions the authors made and what problems they attempted to solve in the first place. We thus perform an in-depth analysis of the core principles on which Siamese trackers operate, with a discussion of the incentives behind them. In addition, we provide an up-to-date qualitative and quantitative comparison of the prominent Siamese trackers on established benchmarks. Among other things, we discuss current trends in developing Siamese trackers. Our survey could help readers absorb the details of the underlying principles of Siamese trackers and the challenges they face.
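To make the similarity-learning idea concrete, the following is a minimal sketch of the matching step that Siamese trackers are commonly built around: a template is slid over a search region and the position with the highest cross-correlation response is taken as the target location. It is an illustration only, not the method of any surveyed tracker. The function names `cross_correlate` and `argmax_2d` are hypothetical, and raw pixel values stand in for the learned embeddings a real tracker would use.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def cross_correlate(search: Matrix, template: Matrix) -> Matrix:
    """Slide `template` over `search` and return the dense response map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for y in range(sh - th + 1):
        row = []
        for x in range(sw - tw + 1):
            # Inner product between the template and the window at (y, x).
            score = sum(
                search[y + i][x + j] * template[i][j]
                for i in range(th)
                for j in range(tw)
            )
            row.append(score)
        response.append(row)
    return response


def argmax_2d(response: Matrix) -> Tuple[int, int]:
    """Return the (row, col) position of the highest response."""
    best = max(
        ((v, y, x) for y, r in enumerate(response) for x, v in enumerate(r)),
        key=lambda t: t[0],
    )
    return best[1], best[2]


# Toy data (assumption): a 2x2 diagonal pattern hidden in a 4x4 search region.
template = [[1.0, 0.0], [0.0, 1.0]]
search = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
response = cross_correlate(search, template)
peak = argmax_2d(response)
print(peak)  # position of the best match in the response map
```

In a real Siamese tracker both inputs would first pass through the same embedding network, so that the correlation is computed between learned features rather than pixels.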
... TIR tracking is also a hot topic in visual object tracking [30,31]. Different from Siamese-based RGB trackers, most TIR trackers model the target with only handcrafted features rather than deep networks. ...
Preprint
We address the problem of multi-modal object tracking in video and explore various options for fusing the complementary information conveyed by the visible (RGB) and thermal infrared (TIR) modalities, including pixel-level, feature-level and decision-level fusion. Specifically, unlike existing methods, we follow the paradigm of the image fusion task for fusion at the pixel level. Feature-level fusion is performed by an attention mechanism with optionally excited channels. At the decision level, a novel fusion strategy is put forward, since even a simple averaging configuration has shown its superiority. The effectiveness of the proposed decision-level fusion strategy is owed to a number of innovative contributions, including a dynamic weighting of the RGB and TIR contributions and a linear template update operation. A variant of it produced the winning tracker at the Visual Object Tracking Challenge 2020 (VOT-RGBT2020). The concurrent exploration of innovative pixel- and feature-level fusion strategies highlights the advantages of the proposed decision-level fusion method. Extensive experimental results on three challenging datasets, i.e., GTOT, VOT-RGBT2019, and VOT-RGBT2020, demonstrate the effectiveness and robustness of the proposed method compared to the state-of-the-art approaches. Code will be shared at https://github.com/Zhangyong-Tang/DFAT.
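The two decision-level ingredients named in this abstract, dynamic weighting of the RGB and TIR contributions and a linear template update, can be sketched as follows. This is a hedged illustration, not the authors' implementation: the abstract does not say how the weights are computed, so weighting each modality by the peak of its own response map is an assumption, and all function names are hypothetical.

```python
from typing import List


def linear_template_update(old: List[float], new: List[float], rate: float) -> List[float]:
    """Blend the stored template with the current one: (1 - rate) * old + rate * new."""
    return [(1.0 - rate) * o + rate * n for o, n in zip(old, new)]


def peak_weight(response: List[float]) -> float:
    """Assumed confidence measure: the peak value of a response map."""
    return max(response)


def fuse_responses(rgb: List[float], tir: List[float]) -> List[float]:
    """Weighted average of two flattened response maps.

    Assumes both maps contain positive peaks, so the weights sum to a
    positive number.
    """
    w_rgb, w_tir = peak_weight(rgb), peak_weight(tir)
    total = w_rgb + w_tir
    return [(w_rgb * a + w_tir * b) / total for a, b in zip(rgb, tir)]


# Toy data (assumption): the TIR response is far more confident than RGB.
rgb_response = [0.1, 0.2, 0.1]
tir_response = [0.1, 0.3, 0.9]
fused = fuse_responses(rgb_response, tir_response)
best_index = fused.index(max(fused))

# Linear update with a hypothetical learning rate of 0.25.
template = linear_template_update([1.0, 0.0], [0.0, 1.0], rate=0.25)
print(best_index, template)
```

With equal weights the fusion reduces to the plain averaging configuration that the abstract mentions as a strong baseline.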
... We compare our tracker with ten state-of-the-art trackers on these 60 thermal infrared pedestrian tracking video sequences. The evaluated trackers include TADT [88], MLSSNet [100], MCCT [26], CREST [101], UDT [89], MCFTS [96], HSSNet [102], CFNet [91], SiamFC-tri [103], and DSiam [104]. ...
Article
Integrating multiple features, whether multi-layer features from a convolutional network or multiple hand-crafted features, has proved to be an effective way to improve tracking performance. In this work, we investigate how to integrate multi-layer convolutional features with hand-crafted features. Specifically, an adaptive multi-feature fusion strategy is proposed based on convolutional features from ResNet-101 and hand-crafted HOG and grayscale features in a spatially regularized correlation filter framework. We fully consider the complementary advantages of multi-layer convolutional features and hand-crafted features to construct a robust and reliable appearance representation of the target. Comprehensive experimental results on benchmark datasets demonstrate that our tracker achieves significant performance improvements in various challenging environments. Most importantly, our tracker obtains more competitive tracking performance than trackers based only on multi-layer convolutional features or only on fused hand-crafted features. The open-source code of our tracker is available at https://github.com/binger1225/HCDC-SRCF.
... Tracking in complex visual scenes, including rain, smoke, or night, is one of the most difficult computer vision tasks [1,2], especially for visible-image-based tracking [2,3]. However, infrared sensors can work around the clock, and infrared has a strong ability to penetrate smoke, which can compensate for the deficiencies of visible images in bad visual conditions [4,5,6,7]. Therefore, RGBT tracking has attracted more and more attention. ...
Preprint
Because visible and infrared images each have their own advantages and disadvantages, RGBT tracking has attracted more and more attention. The key points of RGBT tracking lie in feature extraction and feature fusion of visible and infrared images. Current RGBT tracking methods mostly pay attention to both individual features (features extracted from images of a single camera) and common features (features extracted and fused from an RGB camera and a thermal camera), while paying less attention to the different and dynamic contributions of individual features and common features for different sequences of registered image pairs. This paper proposes a novel RGBT tracking method, called Dynamic Fusion Network (DFNet), which adopts a two-stream structure in which two non-shared convolution kernels are employed in each layer to extract individual features. In addition, DFNet has shared convolution kernels in each layer to extract common features. Non-shared and shared convolution kernels are adaptively weighted and summed according to different image pairs, so that DFNet can deal with different contributions for different sequences. DFNet runs at 28.658 FPS. The experimental results show that, while DFNet increases the Mult-Adds by only 0.02% over the non-shared-convolution-kernel-based fusion method, its Precision Rate (PR) and Success Rate (SR) reach 88.1% and 71.9%, respectively.
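The statement that non-shared and shared kernels are "adaptively weighted and summed" can be illustrated with a small sketch. Because convolution is linear, summing weighted kernels and then convolving once gives the same output as convolving with each kernel and summing the weighted outputs, which is consistent with the small Mult-Adds overhead reported. Everything here is an assumption for illustration: the 1-D signal, the kernel values, the fixed weights (in DFNet they would be predicted per image pair), and the function names are hypothetical, not taken from the paper.

```python
from typing import List


def combine_kernels(individual: List[float], shared: List[float],
                    w_individual: float, w_shared: float) -> List[float]:
    """Weighted sum of a non-shared (individual) kernel and a shared kernel."""
    return [w_individual * a + w_shared * b for a, b in zip(individual, shared)]


def conv1d_valid(signal: List[float], kernel: List[float]) -> List[float]:
    """CNN-style 1-D convolution (no kernel flip), 'valid' positions only."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Toy values (assumption).
signal = [1.0, 2.0, 3.0, 4.0]
individual_kernel = [1.0, -1.0]
shared_kernel = [0.5, 0.5]
w_individual, w_shared = 0.25, 0.75

# Path A: fuse the kernels first, then convolve once.
fused_kernel = combine_kernels(individual_kernel, shared_kernel, w_individual, w_shared)
out_fused = conv1d_valid(signal, fused_kernel)

# Path B: convolve with each kernel, then fuse the outputs.
out_separate = [
    w_individual * a + w_shared * b
    for a, b in zip(conv1d_valid(signal, individual_kernel),
                    conv1d_valid(signal, shared_kernel))
]
print(out_fused, out_separate)  # the two paths agree up to rounding
```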
Article
The lack of large labeled training datasets hinders the usage of deep neural network for Thermal Infrared (TIR) tracking. Regular practice is to train a tracking network with large-scale RGB datasets and then retrain it to the TIR domain with limited TIR data. However, we observe that existing Siamese-based trackers can hardly generalize to TIR images though they achieve outstanding performance on RGB tracking. Therefore, the main challenge is the generalization problem: How to design a generalization-friendly Siamese tracking network and what affects the network generalization. To tackle this problem, we introduce the self-adaption structure into Siamese network and propose an effective TIR tracking model, GFSNet. GFSNet is successfully generalized to different TIR tracking tasks, including ground target, aircraft and high-diversity object tracking tasks. To estimate generalization ability, we present a notion of Growth Rate, the improvement of overall performance after retraining. Experimental results show that the Growth Rates of GFSNet exceed state-of-the-art SiamRPN++ by more than 7 times, which indicates the great power of GFSNet in generalization. In addition to experimental validations, we provide the theoretical analysis of network generalization from a novel perspective, model sensitivity. By performing some tests to analyze the sensitivity, we conclude that the self-adaption structure helps GFSNet converge to a more sensitive minimum with better generalization to new tasks. Furthermore, when compared with popular tracking methods, GFSNet maintains comparable accuracy while achieving real-time tracking with the speed of 112 FPS, 5 times faster than other TIR trackers.
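As a reading aid for the Growth Rate notion, here is a minimal sketch that takes the abstract's wording literally and treats Growth Rate as the difference in an overall performance score before and after retraining. The exact definition used in the paper is not given in this abstract, so this formula is an assumption, and the scores below are hypothetical numbers chosen only to show how a "7 times" comparison between two trackers would be computed.

```python
def growth_rate(score_before: float, score_after: float) -> float:
    """Assumed definition: improvement of the overall score after retraining."""
    return score_after - score_before


# Hypothetical scores, not results from the paper.
tracker_a = growth_rate(score_before=0.40, score_after=0.54)
tracker_b = growth_rate(score_before=0.50, score_after=0.52)

# How many times larger tracker A's growth rate is than tracker B's.
ratio = tracker_a / tracker_b
print(round(ratio, 2))
```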
Article
Discriminative methods have been widely applied to construct the appearance model for visual tracking. Most existing methods incorporate an online updating strategy to adapt to the appearance variations of targets. The focus of online updating for discriminative methods is to select the positive samples that emerged in past frames to represent the appearances. However, the appearances of positive samples might be very dissimilar to each other; traditional online updating strategies easily overfit on some appearances and neglect the others. To address this problem, we propose an effective method to learn a discriminative template, which maintains information on the multiple appearances of targets over long-term variations. Our method is based on the observation that the target appearance varies very little over a certain number of successive video frames. Therefore, we can use a few instances to represent the appearances within those successive video frames. We propose exclusive group sparsity to describe the observation and provide a novel algorithm, called coefficients constrained exclusive group LASSO, to solve it in a single objective function. The experimental results on the CVPR2013 benchmark datasets demonstrate that our approach achieves promising performance.
Article
In this paper, a novel circular and structural operator tracker (CSOT) is proposed for high-performance visual tracking. It not only possesses the powerful discriminative capability of SOSVM but also inherits the superior computational efficiency of DCF. Based on the proposed circular and structural operators, a set of primal confidence score maps can be obtained by circularly correlating feature maps with their corresponding structural correlation filters. Furthermore, an implicit interpolation is applied to convert the multi-resolution feature maps to the continuous domain and make all primal confidence score maps have the same spatial resolution. Then, we exploit an efficient ensemble post-processor based on relative entropy, which can coalesce primal confidence score maps and create an optimal confidence score map for more accurate localization. The target is localized at the peak of the optimal confidence score map. In addition, we introduce a collaborative optimization strategy to update circular and structural operators by iteratively training structural correlation filters, which significantly reduces computational complexity and improves robustness. Experimental results demonstrate that our approach achieves state-of-the-art performance, with mean AUC scores of 71.5% and 69.4% on the OTB-2013 and OTB-2015 benchmarks respectively, and obtains the third-best expected average overlap (EAO) score of 29.8% on the VOT-2017 benchmark.
Article
Conventional graph-based clustering methods treat all features equally in the graph learning stage, even if they are redundant or noisy, which is obviously unreasonable. In this paper, we propose a novel graph learning method named adaptive weighted nonnegative low-rank representation (AWNLRR) for data clustering. Based on the observation that noise and outliers usually cannot be represented well and suffer from larger reconstruction errors than the important features (clean features) in low-rank or sparse representation, we impose an adaptive weighted matrix on the data reconstruction errors to reinforce the role of the important features in the joint representation, so that a robust graph can be obtained. In addition, a locality constraint, i.e., a distance regularization term, is introduced to capture the local structure of the data and make the obtained graph sparser. These appealing properties allow AWNLRR to capture the intrinsic structure of the data well, and thus AWNLRR has the potential to achieve better clustering performance than other methods. Experimental results on synthetic and real databases show that the proposed method obtains better clustering performance than some state-of-the-art methods.
Article
Linear discriminant analysis (LDA) is a very popular supervised feature extraction method and has been extended to different variants. However, classical LDA has the following problems: 1) The obtained discriminant projection does not have good interpretability for features. 2) LDA is sensitive to noise. 3) LDA is sensitive to the selection of the number of projection directions. In this paper, a novel feature extraction method called robust sparse linear discriminant analysis (RSLDA) is proposed to solve the above problems. Specifically, RSLDA adaptively selects the most discriminative features for discriminant analysis by introducing the ℓ2,1 norm. An orthogonal matrix and a sparse matrix are also simultaneously introduced to guarantee that the extracted features retain the main energy of the original data and to enhance the robustness to noise, and thus RSLDA has the potential to perform better than other discriminant methods. Extensive experiments on six databases demonstrate that the proposed method achieves competitive performance compared with other state-of-the-art feature extraction methods. Moreover, the proposed method is robust to noisy data.
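For readers unfamiliar with the ℓ2,1 norm mentioned above, the short sketch below computes it as the sum of the Euclidean norms of a matrix's rows. Penalizing this quantity pushes entire rows toward zero, which is why it is commonly used for feature selection when each row corresponds to one input feature. Whether RSLDA applies it in exactly this row-wise orientation is an assumption here, and `l21_norm` is a hypothetical helper name.

```python
import math
from typing import List


def l21_norm(matrix: List[List[float]]) -> float:
    """Sum over rows of each row's Euclidean (l2) norm."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


# Toy projection matrix (assumption): the all-zero middle row contributes
# nothing, mimicking a feature that has been discarded.
projection = [
    [3.0, 4.0],
    [0.0, 0.0],
    [5.0, 12.0],
]
value = l21_norm(projection)
print(value)  # 5 + 0 + 13
```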
Article
Unlike visual object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as in rescue and video surveillance at night. However, there are few studies in this field, mainly because thermal infrared images have several unwanted attributes that make it difficult to obtain discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer convolutional neural networks pre-trained on visible images to thermal infrared tracking. We observe that the features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while the features from the convolution layers are. In addition, the features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). Firstly, we use pre-trained convolutional neural networks to extract the features of multiple convolution layers for the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give the response maps of the target’s location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
Article
The convolutional network-based tracking (CNT) algorithm trains its network with warped target regions from the first frame instead of large auxiliary datasets, which solves the problem that convolutional neural network (CNN)-based tracking requires a very long training time and a large number of auxiliary training samples. However, the two-layer CNT uses only gray features, which makes it sensitive to appearance variations. In addition, samples carrying useless information should be removed to avoid drifting problems. For these reasons, a multi-layer convolutional network-based visual tracking algorithm via important region selection (IRST) is proposed in this paper. The proposed important region selection model is built via high entropy selection and background discrimination, which makes the training samples informative enough to provide stable information and discriminative enough to resist distractors. The feature maps are also obtained by weighting the template filters with cluster weights. Instead of simple gray features, IRST adds a Gabor layer to explore the texture features of the target, which is effective in coping with illumination and rotation variations. Extensive experiments show that the proposed algorithm achieves superior performance in many challenging visual tracking tasks.
Article
To address the object loss problem caused by illumination, occlusion, pose variation, and motion blur during tracking, this paper proposes a tracking method based on dual fuzzy low-rank approximation in a particle filter framework. Firstly, multiple constraint regions are built to filter out insignificant samples, and more distinguished candidate samples are selected. Secondly, a dual fuzzy observation function for each candidate sample is created based on the designed low-rank approximation representations of the object and the background. Then the generalized tracking results are obtained by computing the membership degrees of the dual fuzzy observation functions. Finally, based on the spatial coherency principle, the final tracking result is determined from the generalized results by measuring the similarities of consecutive objects. The proposed method shows good performance compared with several state-of-the-art trackers on challenging benchmark sequences.
Article
Unmanned aerial vehicles (UAVs) equipped with high definition (HD) cameras can obtain a large number of detailed inspection images. The insulator is an indispensable component of transmission lines. Detecting insulators in images and video quickly and accurately can provide a reliable basis for ranging and obstacle avoidance when a UAV flies close to towers and transmission lines. At the same time, insulators are prone to multiple kinds of faults, which seriously threaten the safety of the power grid, so computer technology should be fully utilized to diagnose these faults. Insulators are detected in images with complex aerial backgrounds by constructing a convolutional neural network (CNN) with a classic architecture of five convolution and pooling modules and two fully connected modules. In this paper, we propose a recognition algorithm for explosion faults based on saliency detection, which uses the trained network model to extract the features. Then, we feed the saliency maps into a self-organizing feature map (SOM) network and build the mathematical model via superpixel segmentation, contour detection and other image processing methods. The test shows that the algorithm can reduce the errors that may be caused by manual analysis. It also demonstrates that insulator detection and explosion fault recognition can effectively improve the efficiency and intelligence level of inspection.
Article
This paper investigates how to integrate the complementary information from RGB and thermal (RGB-T) sources for object tracking. We propose a novel Convolutional Neural Network (ConvNet) architecture, including a two-stream ConvNet and a FusionNet, to achieve adaptive fusion of different source data for robust RGB-T tracking. Both the RGB and thermal streams extract generic semantic information of the target object. In particular, the thermal stream is pre-trained on the ImageNet dataset to encode rich semantic information, and then fine-tuned using thermal images to capture the specific properties of thermal information. For adaptive fusion of different modalities while avoiding redundant noise, the FusionNet is employed to select the most discriminative feature maps from the outputs of the two-stream ConvNet, and is updated online to adapt to appearance variations of the target object. Finally, the object locations are efficiently predicted by applying a multi-channel correlation filter to the fused feature maps. Extensive experiments on the recently released public benchmark GTOT verify the effectiveness of the proposed approach against other state-of-the-art RGB-T trackers.
Article
Infrared object tracking is a key technology in many surveillance applications. General visual tracking algorithms designed for color images cannot handle infrared targets very well due to their relatively low resolution and blurred edges. This paper presents a new tracking-by-detection method based on online structural learning. We show how to train the classifier efficiently with dense samples through Fourier techniques and careful implementation. Furthermore, we introduce an effective feature representation for infrared objects. Finally, we demonstrate the performance of the proposed tracker on public infrared sequences with top accuracy and robustness. Meanwhile, our single-threaded C++ implementation of the algorithm achieves an average tracking speed of 215 FPS on a modern CPU.