Article · PDF available

Deep Convolutional Neural Networks for Thermal Infrared Object Tracking


Abstract and Figures

Unlike visual object tracking, thermal infrared object tracking can track a target object in total darkness. It therefore has broad applications, such as nighttime rescue and video surveillance. However, there are few studies in this field, mainly because thermal infrared images have several unfavorable attributes that make it difficult to obtain discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer convolutional neural networks pre-trained on visible images to thermal infrared tracking. We observe that features from the fully-connected layer are not suitable for thermal infrared tracking because they lack spatial information about the target, while features from the convolution layers are. Moreover, features from a single convolution layer are not robust to various challenges. Based on these observations, we propose a correlation-filter-based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). First, we use pre-trained convolutional neural networks to extract features of the thermal infrared target from multiple convolution layers. Then, a correlation filter is used to construct a weak tracker from each layer's features, and each weak tracker produces a response map of the target's location. Finally, we propose an ensemble method that fuses these response maps into a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
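The pipeline the abstract describes — one correlation-filter weak tracker per convolution layer, each producing a response map, fused into a stronger one — can be sketched as follows. This is a minimal illustration, not the paper's exact method: the normalized weighted sum stands in for MCFTS's ensemble rule, and random arrays stand in for real convolutional features.

```python
import numpy as np

def correlation_response(feat, filt):
    """Response map of one weak tracker: circular correlation via the FFT."""
    return np.real(np.fft.ifft2(np.fft.fft2(feat) * np.conj(np.fft.fft2(filt))))

def ensemble_response(responses, weights):
    """Fuse per-layer response maps into a stronger one (weighted sum)."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                          # normalise ensemble weights
    return sum(wi * r for wi, r in zip(w, responses))

# Toy example: three "conv layers", each yielding a 32x32 response map.
rng = np.random.default_rng(0)
responses = [correlation_response(rng.standard_normal((32, 32)),
                                  rng.standard_normal((32, 32)))
             for _ in range(3)]
fused = ensemble_response(responses, [0.3, 0.3, 0.4])
peak = np.unravel_index(np.argmax(fused), fused.shape)  # predicted location
```

The argmax of the fused map is what a tracker of this family reads off as the target's new position.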
... ECO-TIR [13] introduces both deep features and motion features to the ECO [14] framework, resulting in significant improvement in the performance of the base tracker. Similarly, MCFTS [15] employs VGGNet [16] to extract deep features, which are subsequently integrated into a correlation filter framework. However, these methods are limited in their ability to handle the challenge of distractors, as pre-trained feature models often fail to extract the strong and discriminative features of TIR targets. ...
... Acharya et al. [29] propose a TIR tracker that integrates multi-channel deep features for TIR tracking. MCFTS [15] utilizes various CNN features to construct a correlation-based TIR tracker, thereby achieving promising performance gains. Based on the Siamese framework, HSSNet [30] proposes a spatial-variation-aware module for TIR object tracking. ...
... Compared trackers: To evaluate our method, we compared it with deep-feature-based CF trackers, including SRDCF [51], MCFTS [15], HDT [52], HSSNet [30], CREST [53], DeepSTRCF [54], and ECO-deep [14]; other feature-based trackers, such as ECO-MM [18], VITAL [55], ECO-stir [13], ATOM [56], KYS [57], Siamese-FC [58], SiamRPN [59], CFNet [60], TADT [61], MLSSNet [62], and SiamRPN++ [63]; and the transformer-based deep tracker TransT [1]. ...
... Target detection and false alarm rates are improved using LWIR sensors in the work presented in [7]. The effectiveness of combining a CNN with background modeling for human detection is proposed and demonstrated by Shahid et al. [8]: an improved Gaussian average is used for background modeling, and CNN-based human classification is performed only for foreground objects in real time. A technique for detecting humans in real time in thermal images, built from background modeling and a CNN, is explained in [8]. Object detection using RetinaNet gives markedly better results compared with the other existing methods. ...
Article
Full-text available
Thermal imaging is a cutting-edge technology capable of detecting objects in adverse environmental conditions such as smoke, fog, and smog. The technology is especially important at nighttime, since it does not require light to detect objects. Its applications span various sectors, most importantly border security, where it can detect incoming hazards. Object detection and classification are generally difficult with thermal imaging. In this paper, a one-stage deep convolutional network for object detection and classification, called RetinaNet, is introduced. Existing surveys are based on object detection using infrared information obtained from the objects. This research focuses on detecting and identifying objects from thermal images and surveillance data.
... Fig. 2 shows the general overview of the correlation-filter-based tracking framework [63]. This framework uses the target given in the initial frame and the target's expected label to train the filter model [42], [64]. The filter model can be written as: ...
... These tracking methods can be divided into three categories: i) correlation filters-based tracking methods (e.g. MCFTS [64], UDCT [75] and DAS [87]), ii) Siamese network-based tracking methods (e.g. MMNet [72], GFSNet [73], HSSNet [86] and SiamSAV [74]), and iii) other deep learning-based tracking methods (e.g. ...
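The filter-training step referenced in the snippets above has a well-known closed form in the Fourier domain. The sketch below is a MOSSE-style version rather than the exact model of any cited tracker: the Gaussian label and the regularizer `lam` are standard choices assumed for illustration.

```python
import numpy as np

def gaussian_label(shape, sigma=2.0):
    """Desired response: a Gaussian, shifted so its peak sits at (0, 0)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))
    return np.roll(g, (-(h // 2), -(w // 2)), axis=(0, 1))

def train_filter(patch, label, lam=1e-2):
    """Closed-form correlation filter in the Fourier domain (MOSSE-style):
    H = (G * conj(F)) / (F * conj(F) + lam)."""
    F, G = np.fft.fft2(patch), np.fft.fft2(label)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def detect(patch, H):
    """Response map on a new patch; its peak gives the target's shift."""
    return np.real(np.fft.ifft2(np.fft.fft2(patch) * H))

# Training and evaluating on the same patch should peak at zero shift.
rng = np.random.default_rng(1)
patch = rng.standard_normal((16, 16))
H = train_filter(patch, gaussian_label((16, 16)))
peak = np.unravel_index(np.argmax(detect(patch, H)), (16, 16))
```

In a real tracker, `detect` is applied to the search patch of the next frame, and the peak offset gives the target's motion.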
Conference Paper
Full-text available
The thermal infrared (TIR) target tracking task is not affected by illumination changes, so targets can be tracked at night, on rainy or foggy days, and in other extreme weather; it is therefore widely used in nighttime assisted driving, unmanned aerial vehicle reconnaissance, video surveillance, and other scenes. TIR target tracking still faces many challenges, such as occlusion, deformation, and interference from similar objects. To address these challenges, a large number of TIR target tracking methods have appeared in recent years. The purpose of this paper is to give a comprehensive review and summary of the research status of thermal infrared target tracking methods. We first introduce basic principles and representative work on thermal infrared target tracking. Then, benchmarks for performance testing of thermal infrared target tracking methods are introduced. Subsequently, we demonstrate the tracking results of several representative methods on these benchmarks. Finally, future research directions for thermal infrared target tracking are discussed.
... Bhat et al. [20] innovatively designed a discriminative model to accomplish object tracking. Liu et al. [21] constructed a dense convolutional network to absorb the rich features of the object and achieve robust tracking. Li et al. [22] designed a regression network that improves performance by evaluating the confidence of the candidate bounding boxes. ...
Article
Full-text available
RGB and Thermal (RGBT) tracking is an important supplement to visual object tracking for its unique practical and research value. However, due to the limitations of the RGBT camera, extra interference is introduced into the data. Removing or alleviating this interference is a crucial direction for improving the performance of the RGBT tracking task. For this problem, we propose the multi-stage matching guidance and context integration network (M2GCI) for RGBT tracking. M2GCI reorganizes the feature-encoding pipeline into two context-integrating stages that process primary and senior information respectively. First, the primary features encoded by the primary encoders are integrated by a spatially adaptive fusion strategy. Next, the senior encoders, cooperating with an axial external-attention redistribution strategy, extract senior features that are more reliable for target prediction. Extensive experiments on RGBT234 and GTOT show that the proposed M2GCI achieves more precise and robust tracking in difficult scenarios compared with excellent existing methods.
... Performance comparison of the proposed algorithm in scenarios with occlusions is performed with two classes of state-of-the-art trackers: discriminative correlation filters and deep Siamese networks, which have been recognized as the dominant video tracking paradigms [18]. We selected traditional discriminative correlation filters: Staple [47], KCF [19], and STRCF [48], as well as deep learning based discriminative correlation filters trained for visual object tracking: HCF [49], ECO [50], ECO-HC [50], and STRCFdeep [48], and trained for thermal object tracking: MCFTS [51], ECO-stir [52], and MMNet [53]. From the class of deep Siamese networks, trackers trained for visual object tracking were selected: SiamFC [54], DSiam [55], SiamRPN [56], SiamMASK [57], SiamCAR [58], and SiamBAN [59], as well as those trained for thermal infrared tracking: HSSNet [60] and MLSSNet [61]. ...
Article
Full-text available
Short-wave infrared (SWIR) imaging has significant advantages in challenging propagation conditions where the effectiveness of visible-light and thermal imaging is limited. Object tracking in SWIR imaging is particularly difficult due to lack of color information, but also because of occlusions and maneuvers of the tracked object. This paper proposes a new algorithm for object tracking in SWIR imaging, using a kernelized correlation filter (KCF) as a basic tracker. To overcome occlusions, the paper proposes the use of the Kalman filter as a predictor and a method to expand the object search area. Expanding the object search area helps in better re-detection of the object after occlusion, but also leads to the occasional appearance of errors in measurement data that can lead to object loss. These errors can be treated as outliers. To cope with outliers, Huber’s M-robust approach is applied, so this paper proposes robustification of the Kalman filter by introducing a nonlinear Huber’s influence function in the Kalman filter estimation step. However, robustness to outliers comes at the cost of reduced estimator efficiency. To make a balance between desired estimator efficiency and resistance to outliers, a new adaptive M-robustified Kalman filter is proposed. This is achieved by adjusting the saturation threshold of the influence function using the detection confidence information from the basic KCF tracker. Experimental results on the created dataset of SWIR video sequences indicate that the proposed algorithm achieves a better performance than state-of-the-art trackers in tracking the maneuvering object in the presence of occlusions.
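The core robustification step the abstract describes — passing the Kalman innovation through Huber's influence function so outlier measurements cannot drag the state estimate — can be sketched in a few lines. This is a generic illustration: the paper's adaptive part (scaling the saturation threshold `k` by the KCF detection confidence) is omitted, and `k` is held fixed.

```python
import numpy as np

def huber_psi(residual, k):
    """Huber's influence function: identity inside [-k, k], saturated outside."""
    return np.clip(residual, -k, k)

def robust_kalman_update(x_pred, P_pred, z, H, R, k):
    """Kalman measurement update with the innovation passed through Huber's
    psi, so an outlying measurement pulls the estimate by at most K * k."""
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)       # Kalman gain
    x = x_pred + K @ huber_psi(z - H @ x_pred, k)
    P = (np.eye(len(x_pred)) - K @ H) @ P_pred
    return x, P

# 1-D position model hit by an outlier measurement (z = 100):
x, P = robust_kalman_update(np.zeros(1), np.eye(1), np.array([100.0]),
                            np.eye(1), np.eye(1), k=1.0)
# x[0] is 0.5; a standard (non-robust) update would have jumped to 50.
```

Raising `k` recovers the standard Kalman update, which is exactly the efficiency-versus-robustness trade-off the adaptive threshold in the paper is meant to balance.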
... Besides being utilized for navigation, thermal sensors have been used in agriculture to monitor crops (Speth et al., 2022), in infrastructure monitoring (Chokkalingham et al., 2012; Fuentes et al., 2021; Stypułkowski et al., 2021; Wu et al., 2018), and in object detection and tracking (Leira et al., 2021; Liu, Li, et al., 2020; Liu et al., 2017, 2022). ...
Article
Full-text available
The study explores the feasibility of optical-flow-based neural networks on real-world thermal aerial imagery. While traditional optical flow techniques have shown adequate performance, sparse techniques do not work well in cold-soaked low-contrast conditions, and dense algorithms are more accurate in low contrast but suffer from the aperture problem in some scenes. Optical flow from convolutional neural networks, on the other hand, has demonstrated good performance with strong generalization on several synthetic public benchmarks. Ground truth was generated from real-world thermal data estimated with traditional dense optical flow techniques. The state-of-the-art Recurrent All-Pairs Field Transforms (RAFT) optical flow model was trained on both color synthetic data and the captured real-world thermal data across various thermal contrast conditions. The results showed strong performance of the deep-learning network against established sparse and dense optical flow techniques in various environments and weather conditions, at the cost of higher computational demand. Keywords: deep learning, LWIR, navigation, optical flow, thermal imaging, UAVs
Article
Full-text available
Our cardiovascular system weakens and becomes more prone to arrhythmia as we age. An arrhythmia is an abnormal heartbeat rhythm that can be life-threatening. Atrial fibrillation (Afib), atrial flutter (Afl), and ventricular fibrillation (Vfib) are recurring life-threatening arrhythmias that affect the elderly population. The electrocardiogram (ECG) is the principal diagnostic tool employed to record and interpret these signals, which contain information about the different types of arrhythmias. However, due to the complexity and non-linearity of ECG signals, it is difficult to analyze them manually. Moreover, the interpretation of ECG signals is subjective and may vary between experts. Hence, a computer-aided diagnosis (CAD) system is proposed to ensure that the assessment of ECG signals is objective and accurate. In this work, we present a convolutional neural network (CNN) technique to automatically detect the different ECG segments. Our algorithm is an eleven-layer deep CNN whose output layer has four neurons, each representing one of the normal sinus rhythm (Nsr), Afib, Afl, and Vfib ECG classes. We use ECG segments of two and five seconds' duration without QRS detection. We achieved an accuracy, sensitivity, and specificity of 92.50%, 98.09%, and 93.13%, respectively, for two-second ECG segments, and an accuracy of 94.90%, a sensitivity of 99.13%, and a specificity of 81.44% for five-second segments. The proposed algorithm can serve as an adjunct tool to assist clinicians in confirming their diagnosis.
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
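The core of the HOG descriptor the abstract evaluates — fine orientation binning of gradients, accumulated over coarse spatial cells with magnitude weighting — can be sketched minimally. This sketch deliberately omits two pieces the abstract reports as important for accuracy: interpolated voting and contrast normalization over overlapping blocks.

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by
    gradient magnitude (no block normalisation, no vote interpolation)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0       # unsigned orientation
    idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    H, W = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((H, W, bins))
    for i in range(H):
        for j in range(W):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            b = idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            for k in range(bins):
                hist[i, j, k] = m[b == k].sum()
    return hist

# A pure horizontal ramp has gradient orientation 0 everywhere:
ramp = np.tile(np.arange(16.0), (16, 1))
h = hog_cells(ramp)
```

On the ramp image, all the histogram mass lands in the first orientation bin of every cell, as expected for a uniform horizontal gradient.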
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
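The abstract does not spell it out, but both the proposal-labelling step and the mAP figures it reports rest on the standard intersection-over-union overlap measure between boxes; a minimal sketch of that criterion, with boxes assumed to be `(x1, y1, x2, y2)` corner tuples:

```python
def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2): the overlap
    criterion used to match detections to ground truth for mAP scoring
    and to label region proposals as positives or negatives."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)
```

A detection is conventionally counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5.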
Conference Paper
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters to enable vision researchers to be able to conduct experimentation with deep representations across a range of visual concept learning paradigms.
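The recipe the abstract evaluates — freeze the deep activations as a fixed feature and train only a shallow model on the new task — can be sketched with a ridge-regression classifier. This is an assumed illustration: the toy vectors stand in for real network activations, and ridge regression is one common shallow model, not necessarily the one the paper uses.

```python
import numpy as np

def fit_linear_head(feats, labels, lam=0.1):
    """Ridge-regression classifier on frozen features, one-hot targets:
    W = (X^T X + lam I)^-1 X^T Y."""
    Y = np.eye(int(labels.max()) + 1)[labels]
    return np.linalg.solve(feats.T @ feats + lam * np.eye(feats.shape[1]),
                           feats.T @ Y)

def predict(feats, W):
    """Class with the highest linear score."""
    return (feats @ W).argmax(axis=1)

# Toy 'activations' from two well-separated classes:
feats = np.array([[5.0, 0.0], [4.0, 1.0], [0.0, 5.0], [1.0, 4.0]])
labels = np.array([0, 0, 1, 1])
W = fit_linear_head(feats, labels)
pred = predict(feats, W)
```

Comparing such heads trained on activations from different network levels is exactly the kind of experiment the abstract describes.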
Article
Principal component analysis (PCA) is widely used for dimensionality reduction. Many variants of PCA have been proposed to improve the robustness of the algorithm. However, the existing methods either cannot select useful features consistently or remain sensitive to outliers, which depresses their classification accuracy. In this paper, a novel approach called joint sparse principal component analysis (JSPCA) is proposed to jointly select useful features and enhance robustness to outliers. In detail, JSPCA relaxes the orthogonality constraint on the transformation matrix, giving it more freedom to jointly select useful features for the low-dimensional representation. JSPCA imposes joint sparsity constraints on its objective function, i.e., the ℓ2,1-norm is imposed on both the loss term and the regularization term, to improve algorithmic robustness. A simple yet effective optimization solution is presented, and theoretical analyses of JSPCA are provided. Experimental results on eight data sets demonstrate that the proposed approach is feasible and effective.
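The ℓ2,1-norm the abstract imposes, and the row-sparsity it induces, can be shown directly. The proximal operator below is a standard tool for such penalties, given here as an illustration rather than the exact solver JSPCA uses:

```python
import numpy as np

def l21_norm(W):
    """l2,1-norm: sum of row-wise l2 norms. Penalising it drives whole rows
    of W to zero, i.e. it discards entire features jointly."""
    return np.linalg.norm(W, axis=1).sum()

def prox_l21(W, t):
    """Proximal operator of t * l2,1-norm: shrink each row's norm by t and
    zero out rows whose norm falls below t (joint feature selection)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))

W = np.array([[3.0, 4.0],    # strong feature, row norm 5
              [0.1, 0.0]])   # weak feature, row norm 0.1
W_sparse = prox_l21(W, 1.0)  # weak row is zeroed, strong row is shrunk
```

The zeroed row corresponds to a feature that is dropped for every low-dimensional component at once, which is the "joint" selection the abstract contrasts with entry-wise sparsity.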
Article
Nonlocal image representation methods, including group-based sparse coding and BM3D, have shown great performance in low-level vision tasks. The nonlocal prior is extracted from each group of patches with similar intensities. Grouping patches by intensity similarity, however, introduces disturbance and inaccuracy into the estimation of the true image. To address this problem, we propose a structure-based low-rank model with graph nuclear norm regularization. We exploit the local manifold structure inside a patch and group patches by a manifold-structure distance metric. With this manifold structure information, a graph nuclear norm regularization is established and incorporated into a low-rank approximation model. We then prove that the graph-based regularization is equivalent to a weighted nuclear norm and that the proposed model can be solved by a weighted singular-value thresholding algorithm. Extensive experiments on additive white Gaussian noise removal and mixed noise removal demonstrate that the proposed method achieves better performance than several state-of-the-art algorithms.
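The solver the abstract reduces its model to — weighted singular-value thresholding — is a one-liner once the SVD is available; a minimal sketch, with the per-singular-value weights chosen arbitrarily for the example:

```python
import numpy as np

def weighted_svt(M, weights):
    """Weighted singular-value thresholding: soft-threshold each singular
    value by its own weight, the proximal step for a weighted nuclear
    norm penalty."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - weights, 0.0)) @ Vt

# Diagonal example: singular values 5 and 1, both thresholded by 2;
# the small one is annihilated, lowering the rank of the result.
D = weighted_svt(np.diag([5.0, 1.0]), np.array([2.0, 2.0]))
```

Unequal weights let the penalty shrink small (noise-dominated) singular values harder than large (structure-carrying) ones, which is the point of the weighted nuclear norm.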