Deep Convolutional Neural Networks for Thermal Infrared Object Tracking

Abstract

Unlike visual object tracking, thermal infrared object tracking can follow a target object in total darkness, which gives it broad applications such as rescue operations and night-time video surveillance. However, the field has received little study, mainly because thermal infrared images have several unwanted attributes that make it difficult to obtain discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer convolutional neural networks pre-trained on visible images to thermal infrared tracking. We observe that features from the fully-connected layer are not suitable for thermal infrared tracking because they lack spatial information about the target, whereas features from the convolution layers are suitable. Moreover, features from a single convolution layer are not robust to the various tracking challenges. Based on these observations, we propose a correlation-filter-based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). First, we use pre-trained convolutional neural networks to extract features of the thermal infrared target from multiple convolution layers. Then, a correlation filter is used to construct multiple weak trackers, each with the features of its corresponding convolution layer. Each weak tracker produces a response map of the target's location. Finally, we propose an ensemble method that coalesces these response maps into a stronger one. Furthermore, a simple but effective scale estimation strategy is used to boost tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
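The ensemble step described in the abstract, in which several weak response maps are combined into a stronger one, can be illustrated with a minimal sketch. It assumes each response map is first normalized into a probability distribution and that the fused map is their arithmetic mean, which is the distribution minimizing the summed Kullback-Leibler divergence from each map to the fused map. The function name `fuse_response_maps` and the toy maps are hypothetical, and the abstract does not confirm that the published tracker uses exactly this rule.

```python
import numpy as np

def fuse_response_maps(response_maps):
    """Fuse weak-tracker response maps; return the fused map and its peak (row, col)."""
    normalized = []
    for r in response_maps:
        r = np.asarray(r, dtype=float)
        r = r - r.min()                      # shift so every value is non-negative
        total = r.sum()
        if total == 0:                       # flat map: fall back to a uniform distribution
            r = np.full(r.shape, 1.0 / r.size)
        else:
            r = r / total                    # normalize to a probability distribution
        normalized.append(r)
    fused = np.mean(normalized, axis=0)      # assumption: arithmetic-mean fusion
    peak = np.unravel_index(np.argmax(fused), fused.shape)
    return fused, tuple(int(i) for i in peak)

# Toy example: two weak trackers agree on (2, 3), one disagrees.
maps = [np.zeros((5, 5)) for _ in range(3)]
maps[0][2, 3] = 1.0
maps[1][2, 3] = 0.8
maps[1][0, 0] = 0.2
maps[2][4, 4] = 1.0
fused, peak = fuse_response_maps(maps)
print(peak)
```

In this toy case the location supported by two of the three maps wins, which is the intuition behind combining weak trackers built on different layers.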
... Considering the powerful representation ability of the Convolutional Neural Network (CNN), some works [19,20] introduce CNN features into TIR tracking. Unfortunately, these trackers have not made great progress for several reasons. ...
... These trackers have received more attention in TIR object tracking. To deal with various challenges, a variety of classification-based TIR trackers have been presented, based on sparse representation [16,18,22,23], multiple instance learning [24], kernel density estimation [25], low-rank sparse learning [26,27], structural support vector machine [28], correlation filter [17,29,30], and deep learning [19,20,31]. For instance, to address the occlusion problem, He et al. [26] propose a robust low-rank sparse tracker that uses low-rank constraints to capture the underlying structure of the TIR object. ...
... In [17], the authors propose an ensemble correlation filter TIR tracker which can handle the variation in appearance due to the proposed switch mechanism. In order to extract more discriminative features of the TIR object, Liu et al. [20] investigate the deep convolutional feature for TIR object tracking. They propose a Kullback-Leibler divergence based fusion tracker by exploiting multiple convolutional layer features. ...
Preprint
Most thermal infrared (TIR) tracking methods are discriminative, treating the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class difference of the arbitrary objects, while the tracking task mainly deals with the within-class difference of the same objects. In this paper, we cast the TIR tracking problem as a similarity verification task, which is coupled well to the objective of the tracking task. We propose a TIR tracker via a Hierarchical Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN that coalesces the multiple hierarchical convolutional layers. Then, we propose a spatial-aware network to enhance the discriminative ability of the coalesced hierarchical feature. Subsequently, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before we transfer the network into the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the candidate that is most similar to the tracked target. Extensive experimental results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed method achieves favourable performance compared to the state-of-the-art methods.
... In recent years, some works have attempted to utilize the powerful representation capabilities of deep features to improve TIR tracking. For example, MCFTS [9] and LMSCO [10] use pre-trained VGGNet to extract deep features of TIR targets. MLSSNet [11] and GFSNet [12] train a deep matching network on a large-scale RGB dataset and then use it for TIR tracking directly. ...
... Pre-trained-based deep TIR trackers usually use a pre-trained feature network learned from RGB datasets to extract the deep features of TIR targets and then combine them with an existing tracking framework. For example, MCFTS [9] uses pre-trained VGGNet [15] to obtain multiple convolution features of the TIR target and then combines them with a correlation filter to form an ensemble-based TIR tracker. This method utilizes the powerful ability of deep learning in feature extraction and combines the tracking efficiency of traditional correlation filters, thereby improving tracking accuracy while maintaining real-time performance. ...
Article
Due to the lack of large-scale labeled Thermal InfraRed (TIR) training datasets, most existing TIR trackers are trained directly on RGB datasets. However, tracking methods trained on RGB datasets suffer a significant performance drop on TIR data due to domain shift. To address this issue, we propose a Progressive Domain Adaptation framework for TIR tracking (PDAT), which can effectively transfer knowledge from labeled RGB datasets to TIR tracking without requiring a large amount of labeled TIR data. The framework consists of an Adversarial-based Global Domain Adaptation module and a Clustering-based Subdomain Adaptation module to gradually align feature distributions between RGB and TIR domains. Additionally, we collected a large-scale TIR dataset with over 1.48 million unlabeled TIR images for training the proposed domain adaptation framework. Our experimental results on five TIR tracking benchmarks show that the proposed method improves the baseline tracking performance without compromising the tracking speed. Notably, on LSOTB-TIR100, LSOTB-TIR120, and PTB-TIR, the Success rates were approximately 6 percentage points higher than the baseline, demonstrating its effectiveness.
... Recently, discriminative correlation filters (DCF) based trackers [17,18,19,20,21] have achieved excellent results in terms of accuracy, robustness and speed [15,22,23], because they treat tracking tasks as similarity learning problems. Since DCF exploits all circular shifts of training samples to solve a ridge regression in the Fourier frequency domain [9], it avoids time-consuming correlation operations. ...
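The excerpt above notes that a DCF solves a ridge regression in the Fourier domain by exploiting all circular shifts of a training sample. A minimal single-channel, linear-kernel sketch of that idea follows. The function names, the Gaussian target, and the regularization value `lam` are illustrative assumptions, not taken from any of the cited trackers.

```python
import numpy as np

def gaussian_target(shape, sigma=2.0):
    """Desired response: a Gaussian peaked at index (0, 0), wrapping circularly."""
    h, w = shape
    ys = np.minimum(np.arange(h), h - np.arange(h))[:, None]
    xs = np.minimum(np.arange(w), w - np.arange(w))[None, :]
    return np.exp(-(ys ** 2 + xs ** 2) / (2.0 * sigma ** 2))

def train_filter(patch, target, lam=1e-2):
    """Closed-form ridge regression over all circular shifts, solved per frequency."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)
    return G * np.conj(F) / (F * np.conj(F) + lam)

def detect(filter_hat, patch):
    """Correlate a new patch with the filter; return the circular shift (dy, dx)."""
    response = np.real(np.fft.ifft2(filter_hat * np.fft.fft2(patch)))
    dy, dx = np.unravel_index(np.argmax(response), response.shape)
    return int(dy), int(dx)

# Usage: train on a random patch, then locate a circularly shifted copy of it.
rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
filt = train_filter(patch, gaussian_target(patch.shape))
shifted = np.roll(patch, shift=(3, 5), axis=(0, 1))
print(detect(filt, shifted))
```

Because the element-wise division replaces an explicit solve over every shifted sample, training and detection cost only a few FFTs, which is the efficiency argument made in the excerpt.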
... We exploit the Lagrange multiplier method [20] to solve this problem. The optimal confidence score map can be calculated as ...
Preprint
In this paper, a novel circular and structural operator tracker (CSOT) is proposed for high-performance visual tracking. It not only possesses the powerful discriminative capability of SOSVM but also inherits the superior computational efficiency of DCF. Based on the proposed circular and structural operators, a set of primal confidence score maps can be obtained by circularly correlating feature maps with their corresponding structural correlation filters. Furthermore, an implicit interpolation is applied to convert the multi-resolution feature maps to the continuous domain so that all primal confidence score maps have the same spatial resolution. Then, we exploit an efficient ensemble post-processor based on relative entropy, which coalesces the primal confidence score maps into an optimal confidence score map for more accurate localization. The target is localized at the peak of the optimal confidence score map. In addition, we introduce a collaborative optimization strategy to update the circular and structural operators by iteratively training structural correlation filters, which significantly reduces computational complexity and improves robustness. Experimental results demonstrate that our approach achieves state-of-the-art performance, with mean AUC scores of 71.5% and 69.4% on the OTB-2013 and OTB-2015 benchmarks respectively, and obtains the third-best expected average overlap (EAO) score of 29.8% on the VOT-2017 benchmark.
... Therefore, trackers for TIR tracking tasks are investigated. MCFTS [60] employs a correlation filter-based ensemble tracker with multilayer convolutional features. Nevertheless, TIR tracking is still hampered by the scarcity of benchmarks. ...
Article
With the development of unmanned aerial vehicle (UAV) technology, the threat of UAV intrusion is no longer negligible. Therefore, drone perception, especially anti-UAV tracking technology, has gathered considerable attention. However, both traditional Siamese and transformer-based trackers struggle in anti-UAV tasks due to the small target size, cluttered backgrounds and model degradation. To alleviate these challenges, a novel contrastive-augmented memory network (CAMTracker) is proposed for anti-UAV tracking tasks in thermal infrared (TIR) videos. The proposed CAMTracker conducts tracking through a two-stage scheme, searching for possible candidates in the first stage and matching the candidates with the template for the final prediction in the second. In the first stage, an instance-guided region proposal network (IG-RPN) is employed to calculate the correlation features between the templates and the search images and to generate candidate proposals. In the second stage, a contrastive-augmented matching module (CAM), along with a refined contrastive loss function, is designed to enhance the discrimination ability of the tracker under the guidance of a contrastive learning strategy. Moreover, to avoid model degradation, an adaptive dynamic memory module (ADM) is proposed to maintain a dynamic template that copes with the feature variation of the target in long sequences. Comprehensive experiments have been conducted on the Anti-UAV410 dataset, where the proposed CAMTracker achieves the best performance compared to advanced tracking algorithms, with significant advantages on all the evaluation metrics, including at least 2.40%, 4.12%, 5.43% and 5.48% on precision, success rate, success AUC and state accuracy, respectively.
... However, the landscape changed in 2012 with the introduction of AlexNet, which achieved a groundbreaking performance in the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [109]. This breakthrough highlighted the potential of deep neural networks and paved the way for the widespread adoption of deep network architectures in various fields, including image processing [112,113,114], video analysis [115,116], natural language processing [117], speech processing [118], and particularly low-level computer vision tasks [119,120]. Following AlexNet's success, other deep network architectures such as Visual Geometry Group (VGG) [121] and GoogLeNet [122] gained significant popularity and were widely applied in various domains, leveraging their impressive performance and capability to extract meaningful features from input data. ...
Thesis
Medical ultrasound is a type of imaging that uses high-frequency sound waves to create images of body parts. A transducer, which creates high-frequency sound waves that flow through the bodily tissues, measures the dimensions, shape, and consistency of soft tissues and organs. The sound waves are reflected from the body tissues to vibrate the transducer, which contains piezoelectric material. The piezoelectric material converts the sound waves to electrical pulses that travel to the ultrasonic scanner, where the electrical signal is amplified and processed to form a digital image in real time. These images can be useful in detecting and treating a wide range of diseases and ailments. The main disadvantage of ultrasound imaging is the addition of noise during the signal processing stage, which can result in images that are difficult to interpret. Different kinds of noise are introduced during the acquisition or transmission of the image. Ultrasound images are mostly affected by speckle noise. Speckle noise is a random creation of multiple tiny dots in an image and is produced when sound waves randomly interfere with tiny particles on a scale equal to the sound wavelength. The quality of the image is reduced by speckle noise, which limits the ability of human observers to form judgments based on the diagnostic examination. Speckle noise reduces contrast in an image, making it challenging to execute subsequent image processing operations like segmentation and edge detection. This research dissertation provides a comprehensive study on the removal of speckle noise in ultrasound images. Different techniques to remove speckle noise in ultrasound images are discussed. Multiple experiments are done using single filters, hybrid filters and deep learning algorithms. The experimental results lead to the conclusion that deep learning algorithms excelled, particularly when confronted with high speckle variances.
Specifically, among the deep learning algorithms, DnCNNL5 exhibited the most favorable performance under high speckle variances. However, it is important to acknowledge that deep learning algorithms are disadvantaged by longer processing times. Hybrid filters emerged as the second-best performers in terms of speckle noise removal, accompanied by the second-best processing time. These hybrid filters showed effective noise reduction capabilities due to their combination of the best performing individual filters. This result was as expected, considering that hybrid filters, constructed by amalgamating the best individual filters, were anticipated to perform better than single filters in terms of denoising efficacy. On the other hand, single filter algorithms demonstrated the shortest processing time but ranked last in terms of removing speckle noise.
... We categorize the TIR trackers in Fig. 7 and introduce the corresponding methods in the following. Besides, we report the results of TIR-based methods in Table 4. [287] proposed an ensemble method that merged the features of the multiple convolution layers by multiple weak trackers constructed by a correlation filter. LMSCO [302] integrated DCF with structured SVM and employed spatial regularization and implicit interpolation to obtain continuous deep feature maps. ...
Preprint
Visual Object Tracking (VOT) is an attractive and significant research area in computer vision, which aims to recognize and track specific targets in video sequences where the target objects are arbitrary and class-agnostic. The VOT technology could be applied in various scenarios, processing data of diverse modalities such as RGB, thermal infrared and point cloud. Besides, since no one sensor could handle all the dynamic and varying environments, multi-modal VOT is also investigated. This paper presents a comprehensive survey of the recent progress of both single-modal and multi-modal VOT, especially the deep learning methods. Specifically, we first review three types of mainstream single-modal VOT, including RGB, thermal infrared and point cloud tracking. In particular, we conclude four widely-used single-modal frameworks, abstracting their schemas and categorizing the existing inheritors. Then we summarize four kinds of multi-modal VOT, including RGB-Depth, RGB-Thermal, RGB-LiDAR and RGB-Language. Moreover, the comparison results in plenty of VOT benchmarks of the discussed modalities are presented. Finally, we provide recommendations and insightful observations, inspiring the future development of this fast-growing literature.
... Last year, DMSRDCF [13] showed excellent tracking performance by fusing handcrafted features, deep appearance features and deep motion features. Moreover, MCFT [29] observes that CNNs trained on visible images can also represent targets well for thermal infrared tracking. All the above-mentioned tracking approaches motivate us to investigate the incorporation of DCF and SOSVM frameworks, and to transfer the pre-trained appearance network [30] and optical flow network [12] to thermal infrared tracking. ...
Preprint
Compared with visible object tracking, thermal infrared (TIR) object tracking can track an arbitrary target in total darkness since it is not affected by illumination variations. However, many unwanted attributes limit the potential of TIR tracking, such as the absence of visual color patterns and low resolution. Recently, the structured output support vector machine (SOSVM) and the discriminative correlation filter (DCF) have each been successfully applied to visible object tracking. Motivated by these, in this paper, we propose a large margin structured convolution operator (LMSCO) to achieve efficient TIR object tracking. To improve the tracking performance, we employ spatial regularization and implicit interpolation to obtain continuous deep feature maps, including deep appearance features and deep motion features, of the TIR targets. Finally, a collaborative optimization strategy is exploited to update the operators. Our approach not only inherits the strong discriminative capability of SOSVM but also achieves accurate and robust tracking with higher-dimensional features and denser samples. To the best of our knowledge, we are the first to incorporate the advantages of DCF and SOSVM for TIR object tracking. Comprehensive evaluations on two thermal infrared tracking benchmarks, i.e. VOT-TIR2015 and VOT-TIR2016, clearly demonstrate that our LMSCO tracker achieves impressive results and outperforms most state-of-the-art trackers in terms of accuracy and robustness at a sufficient frame rate.
Chapter
Integrating advanced technologies such as artificial intelligence (AI) has become crucial for enhancing urban management and security as urban centers evolve into smart cities. This study conducts a bibliometric analysis to explore smart cities' security and surveillance dimensions, focusing on data protection and ethical AI use. By reviewing 745 articles indexed in Scopus from 1977 to 2023, we employ co-citation analysis, co-occurrence analysis, and bibliographic coupling to identify key thematic clusters and influential publications. The findings reveal the complex interplay between technological advancements and privacy concerns, highlighting the importance of a balanced approach to AI integration. This research provides valuable insights for urban planners and policymakers to develop strategies that enhance urban safety and efficiency while respecting individual privacy and fostering public trust. The study underscores the necessity of ethical standards and robust governance frameworks in deploying AI-powered security systems in smart cities.
Article
Our cardiovascular system weakens and is more prone to arrhythmia as we age. An arrhythmia is an abnormal heartbeat rhythm which can be life-threatening. Atrial fibrillation (Afib), atrial flutter (Afl), and ventricular fibrillation (Vfib) are the recurring life-threatening arrhythmias that affect the elderly population. An electrocardiogram (ECG) is the principal diagnostic tool employed to record and interpret ECG signals. These signals contain information about the different types of arrhythmias. However, due to the complexity and non-linearity of ECG signals, it is difficult to manually analyze these signals. Moreover, the interpretation of ECG signals is subjective and might vary between the experts. Hence, a computer-aided diagnosis (CAD) system is proposed. The CAD system will ensure that the assessment of ECG signals is objective and accurate. In this work, we present a convolutional neural network (CNN) technique to automatically detect the different ECG segments. Our algorithm consists of an eleven-layer deep CNN with the output layer of four neurons, each representing the normal (Nsr), Afib, Afl, and Vfib ECG class. In this work, we have used ECG signals of two seconds and five seconds’ durations without QRS detection. We achieved an accuracy, sensitivity, and specificity of 92.50%, 98.09%, and 93.13% respectively for two seconds of ECG segments. We obtained an accuracy of 94.90%, the sensitivity of 99.13%, and specificity of 81.44% for five seconds of ECG duration. This proposed algorithm can serve as an adjunct tool to assist clinicians in confirming their diagnosis.
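The accuracy, sensitivity, and specificity figures reported above follow the standard confusion-matrix definitions. The sketch below computes them for a binary split. The counts are invented purely for illustration, and the abstract does not say how its four classes were grouped to obtain these metrics, so the binary framing is an assumption.

```python
def binary_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all segments classified correctly
    sensitivity = tp / (tp + fn)                 # true-positive rate: positives that were detected
    specificity = tn / (tn + fp)                 # true-negative rate: negatives that were recognized
    return accuracy, sensitivity, specificity

# Hypothetical counts, not data from the study.
acc, sens, spec = binary_metrics(tp=90, tn=80, fp=20, fn=10)
print(f"accuracy={acc:.2%} sensitivity={sens:.2%} specificity={spec:.2%}")
```

Reporting all three together matters because a detector can reach high sensitivity simply by labeling most segments abnormal, which would show up as low specificity.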
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
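The core per-cell computation behind a HOG descriptor, gradient orientations voted into a histogram weighted by gradient magnitude, can be sketched as follows. This is a simplified illustration that assumes unsigned orientations over 0 to 180 degrees, nine bins, and hard binning. It omits the vote interpolation and the overlapping block normalization that the abstract identifies as important, and the function name `cell_histogram` is hypothetical.

```python
import numpy as np

def cell_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    cell = np.asarray(cell, dtype=float)
    gy, gx = np.gradient(cell)                        # derivatives along rows (y) and columns (x)
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0    # fold into [0, 180): unsigned orientation
    bin_width = 180.0 / n_bins
    bins = np.minimum((angle / bin_width).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

# A horizontal intensity ramp has gradients pointing along x, so all votes land in bin 0.
ramp = np.tile(np.arange(8.0), (8, 1))
print(cell_histogram(ramp))
```

A full descriptor would concatenate such histograms over a grid of cells after normalizing them within overlapping blocks.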
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
Article
Principal component analysis (PCA) is widely used in dimensionality reduction. Many variants of PCA have been proposed to improve the robustness of the algorithm. However, the existing methods either cannot select useful features consistently or remain sensitive to outliers, which degrades their classification accuracy. In this paper, a novel approach called joint sparse principal component analysis (JSPCA) is proposed to jointly select useful features and enhance robustness to outliers. In detail, JSPCA relaxes the orthogonal constraint on the transformation matrix, giving it more freedom to jointly select useful features for the low-dimensional representation. JSPCA imposes joint sparse constraints on its objective function, i.e., the ℓ2,1-norm is imposed on both the loss term and the regularization term, to improve the algorithmic robustness. A simple yet effective optimization solution is presented and the theoretical analyses of JSPCA are provided. The experimental results on eight data sets demonstrate that the proposed approach is feasible and effective.
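The ℓ2,1-norm mentioned above can be illustrated with a short sketch. It assumes the common row-wise convention, in which the norm is the sum of the Euclidean norms of the rows of a matrix, so rows that are entirely zero contribute nothing. The abstract does not state which convention JSPCA uses, and the function name `l21_norm` is hypothetical.

```python
import numpy as np

def l21_norm(matrix):
    """Sum of the Euclidean (l2) norms of the rows of a matrix."""
    m = np.asarray(matrix, dtype=float)
    return float(np.sum(np.sqrt(np.sum(m ** 2, axis=1))))

# Rows have l2 norms 5, 0 and 13, so the l2,1-norm is their sum.
W = [[3, 4], [0, 0], [5, 12]]
print(l21_norm(W))
```

Penalizing this quantity encourages whole rows to shrink to zero, which is why it is used for joint feature selection, and it grows linearly rather than quadratically with a row's residual, which reduces the influence of outliers.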
Article
Nonlocal image representation methods, including group-based sparse coding and BM3D, have shown their great performance in application to low-level tasks. The nonlocal prior is extracted from each group consisting of patches with similar intensities. Grouping patches based on intensity similarity, however, gives rise to disturbance and inaccuracy in estimation of the true images. To address this problem, we propose a structure-based low-rank model with graph nuclear norm regularization. We exploit the local manifold structure inside a patch and group the patches by the distance metric of manifold structure. With the manifold structure information, a graph nuclear norm regularization is established and incorporated into a low-rank approximation model. We then prove that the graph-based regularization is equivalent to a weighted nuclear norm and the proposed model can be solved by a weighted singular-value thresholding algorithm. Extensive experiments on additive white Gaussian noise removal and mixed noise removal demonstrate that the proposed method achieves better performance than several state-of-the-art algorithms.
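The weighted singular-value thresholding step named above can be sketched in a few lines: each singular value is shrunk by its own weight and clipped at zero before the matrix is rebuilt. The weights in this example are arbitrary placeholders. In the paper they would come from the graph-based regularization, which the abstract does not specify.

```python
import numpy as np

def weighted_svt(matrix, weights):
    """Shrink each singular value by its weight (clipped at zero) and reconstruct."""
    U, s, Vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    s_shrunk = np.maximum(s - np.asarray(weights, dtype=float), 0.0)
    return (U * s_shrunk) @ Vt          # scales column i of U by the i-th shrunk singular value

# Singular values 5, 3, 1 with weights 1, 2, 3 become 4, 1, 0.
A = np.diag([5.0, 3.0, 1.0])
print(np.round(weighted_svt(A, [1.0, 2.0, 3.0]), 6))
```

Giving larger weights to smaller singular values, as in this example, suppresses the components most likely to carry noise while preserving the dominant structure.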