Article
PDF available

Deep Convolutional Neural Networks for Thermal Infrared Object Tracking


Abstract

Unlike visual object tracking, thermal infrared object tracking can follow a target in total darkness, so it has broad applications such as rescue and night-time video surveillance. However, there are few studies in this field, mainly because thermal infrared images have several unwanted attributes that make it difficult to obtain discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their success in visual tracking, we transfer convolutional neural networks pre-trained on visible images to thermal infrared tracking. We observe that features from the fully-connected layer are not suitable for thermal infrared tracking because they lack spatial information about the target, whereas features from the convolution layers are. Moreover, features from a single convolution layer are not robust to the various challenges. Based on these observations, we propose a correlation-filter-based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). First, we use a pre-trained convolutional neural network to extract features from multiple convolution layers of the thermal infrared target. A correlation filter then constructs a weak tracker from each layer's features, and each weak tracker produces a response map of the target's location. Finally, we propose an ensemble method that fuses these response maps into a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks, VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
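To make the described pipeline concrete, below is a minimal sketch (not the authors' released code) of the ensemble idea: each convolution layer's feature map drives one correlation-filter weak tracker, and the per-layer response maps are fused by a weighted sum. The feature extraction step, the Gaussian label, and the layer weights are assumed inputs.

```python
import numpy as np

def train_filter(feat, gauss_label, lam=1e-2):
    """Closed-form correlation filter in the Fourier domain (MOSSE-style weak tracker)."""
    F = np.fft.fft2(feat)
    G = np.fft.fft2(gauss_label)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def response_map(feat, filt):
    """Correlate a feature map with its learned filter to get a response map."""
    return np.real(np.fft.ifft2(np.fft.fft2(feat) * filt))

def ensemble_locate(layer_feats, filters, weights):
    """Weighted fusion of per-layer response maps; the peak gives the target position."""
    fused = sum(w * response_map(f, h) for f, h, w in zip(layer_feats, filters, weights))
    return np.unravel_index(np.argmax(fused), fused.shape)
```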
... Since deeper networks generally deliver greater representational power, much recent work has turned to deep neural architectures for thermal infrared detection and tracking. MSOT [56] combines deep feature extraction (HRP-T) with a correlation filter for thermal infrared tracking. MNGCO [57] combines appearance and motion features to construct a TIR target detector and tracker. ...
... Copying and pasting regions of images and video, whether whole frames or parts of them, has been used successfully for augmentation and validation across image and video classification, drone tracking and detection, and related tasks. Since existing datasets already cover 2D and 3D human keypoint annotations, some authors [56] propose combining this information, synthetically and manually extending the coverage and generating unbiased random images. The method of [89] is trained and evaluated on an existing human keypoint dataset and achieves improved state-of-the-art results. ...
Article
Full-text available
Unmanned aerial vehicles (UAVs) have many commercial and recreational applications, so perceiving and visualizing their state is of prime importance. In this paper, the authors address the detection and tracking of drones in order to derive their position and coordinates. The wide diffusion of drones increases the risk of their misuse in illegitimate activities such as drug smuggling and terrorism, so automated drone surveillance and detection are crucial for protecting restricted areas or special zones from illegal drone entry. Under low-illumination conditions, however, visible-light detectors may fail to capture enough information and produce inaccurate results. To alleviate this, some works use infrared (IR) videos and images for object detection and tracking, but IR images generally have low resolution and therefore provide insufficient information for trackers. Given this analysis, fusing RGB (visible) data with infrared data is essential for drone detection and tracking: it exploits more than one modality and supports more precise drone detectors. This paper introduces an automated video- and image-based drone detection and tracking system that uses an advanced deep-learning detector, You Only Look Once (YOLOv5), to protect restricted areas from unlawful drone entry. YOLOv5, a single-stage detector, offers one of the best balances between accuracy and speed by collecting in-depth, high-level features. Building on YOLOv5, the paper improves it to detect and track UAVs more accurately; it is one of the first YOLOv5-based algorithms for anti-UAV detection and tracking. It adopts four scales of feature maps instead of the previous three to predict bounding boxes, which delivers more of the texture and contour information needed to detect small objects. At the same time, to reduce computation, the size of the UAV in the four feature scales is calculated from the input data and the number of anchor boxes is adjusted accordingly, so the proposed technology can be applied in the anti-UAV field. In addition, an effective double-training strategy is developed for drone detection.
Trained on class and instance segmentation across consecutive frames and image series, the detector also learns accurate segment information and derives distinct instance-level and class-level characteristics.
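One common way to realise the anchor adjustment described above is to cluster the width/height pairs of the labelled drone boxes and spread the resulting anchors over the four detection scales; the sketch below assumes `box_wh` holds those box sizes in pixels and is only an illustration, not the paper's implementation.

```python
import numpy as np

def kmeans_anchors(box_wh, k=12, iters=50):
    """IoU-based k-means over (w, h) box sizes; k=12 gives three anchors for
    each of four detection scales."""
    rng = np.random.default_rng(0)
    anchors = box_wh[rng.choice(len(box_wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # IoU between every box and every anchor (both anchored at the origin)
        inter = np.minimum(box_wh[:, None, :], anchors[None, :, :]).prod(-1)
        union = box_wh.prod(-1)[:, None] + anchors.prod(-1)[None, :] - inter
        assign = np.argmax(inter / union, axis=1)          # nearest anchor by IoU
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = box_wh[assign == j].mean(axis=0)
    return anchors[np.argsort(anchors.prod(-1))]            # sorted by area
```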
... With the development of deep learning in computer vision, several studies have applied deep learning to infrared small-target detection [37][38][39][40]. Such approaches provide competitive performance but require training the model on a large amount of data in advance. ...
Article
Full-text available
In infrared small target detection, infrared patch image (IPI)-model-based methods produce better results than other popular approaches (such as max-mean, top-hat, and the human visual system), but in some extreme cases they suffer from long processing times and inconsistent performance. To overcome these issues, we propose a novel approach that divides the traditional target detection process into two steps: suppression of background noise and elimination of clutter. The workflow consists of four steps: after importing the images, the second step applies the alternating direction method of multipliers to preliminarily remove the background. Compared with the IPI model, this step does not require sliding patches, resulting in a significant reduction in processing time. To eliminate residual noise and clutter, the interim results are then processed in step three by morphological filtering with an improved new top-hat transformation that uses a threefold structuring element. The final step is threshold segmentation using an adaptive threshold algorithm. Compared with the IPI and new top-hat methods, as well as several other widely used methods, our approach detects infrared targets more efficiently (90% less computational time) and more consistently (no sudden performance drops).
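As a point of reference for the morphological stage, the snippet below shows a plain white top-hat followed by an adaptive mean-plus-k-sigma threshold; it is a baseline illustration with a single square structuring element, not the paper's improved threefold variant.

```python
import cv2
import numpy as np

def tophat_detect(ir_img, kernel_size=9, k=3.0):
    """Suppress background with a white top-hat, then keep pixels above mean + k*std.
    `ir_img` is a single-channel (grayscale) infrared image."""
    se = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    tophat = cv2.morphologyEx(ir_img, cv2.MORPH_TOPHAT, se)
    thr = tophat.mean() + k * tophat.std()
    return (tophat > thr).astype(np.uint8)  # binary small-target mask
```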
... This research idea also belongs to the multi-feature fusion family, which selects an expressive subset of features from a large set and, on that basis, designs an adaptive fusion algorithm that exploits the differing representation ability of features for different targets. On the other hand, Liu et al. [33] apply a CNN model trained on a visible-light dataset to the thermal infrared tracking task via transfer learning. To improve discriminative capacity, a multi-level similarity model under a Siamese framework [32] and a multi-task framework [31] have been proposed to learn TIR-specific discriminative features; moreover, Liu proposed a framework named self-SDCT [54], which alleviates the demand for large numbers of annotated training samples. Subsequently, Martin Danelljan and colleagues proposed a series of high-performance tracking algorithms, such as DeepSRDCF [13], C-COT [14], and ECO [9]. ...
Article
Full-text available
Convolutional neural network (CNN) features have been widely used in visual tracking because of their powerful representation. As an important component of a CNN, the pooling layer plays a critical role, but the max/average/min operation only exploits first-order information, which limits the discriminative ability of CNN features in some complex situations. In this paper, a high-order pooling layer is integrated into the VGG16 network for visual tracking. In detail, a high-order covariance pooling layer replaces the last max-pooling layer to learn discriminative features and is trained on the ImageNet and CUB200-2011 datasets. In the tracking stage, multiple levels of feature maps are extracted as the appearance representation of the target. The extracted CNN features are then integrated into a correlation filter framework for online tracking. The experimental results show that the proposed algorithm achieves excellent performance in both success rate and tracking accuracy.
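The covariance (second-order) pooling mentioned above can be sketched as follows: instead of taking a channel-wise max, the spatial positions of a feature map are used to estimate a channel-by-channel covariance matrix. This is a hedged illustration of the operation, not the paper's trained layer.

```python
import torch

def covariance_pool(feat, eps=1e-5):
    """Second-order pooling: (B, C, H, W) feature map -> (B, C, C) channel covariance."""
    b, c, h, w = feat.shape
    x = feat.flatten(2)                        # (B, C, H*W)
    x = x - x.mean(dim=2, keepdim=True)        # centre over spatial positions
    cov = x @ x.transpose(1, 2) / (h * w - 1)  # sample covariance across locations
    return cov + eps * torch.eye(c, device=feat.device)  # small ridge for stability
```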
... However, the camera sensor is easily affected by sand and dust in the environment. In recent years, thermal imaging cameras have become a trend in multi-target detection and tracking [15][16][17]; they are not easily affected by dust and have the potential to be applied to environmental perception in open-pit mines. ...
Article
Full-text available
There exist many difficulties in environmental perception in transportation at open-pit mines, such as unpaved roads, dusty environments, and high requirements for the detection and tracking stability of small irregular obstacles. In order to solve the above problems, a new multi-target detection and tracking method is proposed based on the fusion of Lidar and millimeter-wave radar. It advances a secondary segmentation algorithm suitable for open-pit mine production scenarios to improve the detection distance and accuracy of small irregular obstacles on unpaved roads. In addition, the paper also proposes an adaptive heterogeneous multi-source fusion strategy of filtering dust, which can significantly improve the detection and tracking ability of the perception system for various targets in the dust environment by adaptively adjusting the confidence of the output target. Finally, the test results in the open-pit mine show that the method can stably detect obstacles with a size of 30–40 cm at 60 m in front of the mining truck, and effectively filter out false alarms of concentration dust, which proves the reliability of the method.
... Currently, some works focus on TIR tracking using correlation filters and achieve promising results. For example, MCFTS [19] transfers the VGG-Net to TIR tracking to extract multilayer CNN features and integrates it into KCF to build an ensemble tracker. To address the lack of large TIR datasets, Li et al. [20] utilize an image conversion model to generate a synthetic TIR dataset by RGB images to train an end-to-end CNN model for tracking. ...
Article
Full-text available
Existing thermal infrared (TIR) trackers based on correlation filters cannot adapt to the abrupt scale variation of nonrigid objects. This deficiency could even lead to tracking failure. To address this issue, we propose a TIR tracker, called ECO_LS, which improves the performance of efficient convolution operators (ECO) via the level set method. We first utilize the level set to segment the local region estimated by the ECO tracker to gain a more accurate size of the bounding box when the object changes its scale suddenly. Then, to accelerate the convergence speed of the level set contour, we leverage its historical information and continuously encode it to effectively decrease the number of iterations. In addition, our variant, ECOHG_LS, also achieves better performance via concatenating histogram of oriented gradient (HOG) and gray features to represent the object. Furthermore, experimental results on three infrared object tracking benchmarks show that the proposed approach performs better than other competing trackers. ECO_LS improves the EAO by 20.97% and 30.59% over the baseline ECO on VOT-TIR2016 and VOT-TIR2015, respectively.
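For illustration, the box refinement step can be approximated with an off-the-shelf morphological Chan-Vese level set applied to the patch predicted by the base tracker; this is a stand-in sketch under that assumption, not the authors' accelerated scheme with historical contour encoding.

```python
import numpy as np
from skimage.segmentation import morphological_chan_vese

def refine_box_with_level_set(ir_frame, box, iters=50):
    """Segment the tracker's local patch with a level set and shrink/grow the box to fit.
    `box` is (x, y, w, h) from the base tracker; returns a refined (x, y, w, h)."""
    x, y, w, h = box
    patch = ir_frame[y:y + h, x:x + w].astype(float)
    mask = morphological_chan_vese(patch, iters)    # binary foreground mask
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return box                                  # fall back to the original box
    return (x + xs.min(), y + ys.min(), xs.ptp() + 1, ys.ptp() + 1)
```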
Article
The necessity of constructing integrated navigation complexes (INC) is substantiated. It is proposed to include the following navigation systems in the complex: inertial, satellite, and visual. This helps to increase the accuracy of determining the coordinates of unmanned aerial vehicles. It is shown that operation becomes difficult in unfavourable cases: jamming of the satellite navigation system by external noise, growth of the errors of the inertial navigation system (INS), including errors due to accelerometers and gyroscopes manufactured with MEMS technology, and bad weather conditions that complicate the work of the visual navigation system. To ensure the operation of the navigation complex, such interference (noise) must be suppressed. To improve the accuracy of the INS within the INC, it is proposed to extract the noise from the raw INS signal, predict it using neural networks, and suppress it. Two approaches are proposed: the first is based on a multi-row GMDH algorithm and single-layer networks with sigm_piecewise neurons, and the second on hybrid recurrent neural networks that combine long short-term memory (LSTM) and gated recurrent unit (GRU) layers. Various types of noise inherent in video images in visual navigation systems are considered: Gaussian noise, salt-and-pepper noise, Poisson noise, fractional noise, and blind noise, with particular attention paid to blind noise. To improve the accuracy of the visual navigation system, hybrid convolutional neural networks are proposed.
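As a sketch of the recurrent noise predictor described above (an assumption-level illustration, not the paper's exact architecture), a small LSTM can be trained to predict the next INS noise sample from a window of past samples; the prediction is then subtracted from the raw signal.

```python
import torch
import torch.nn as nn

class NoisePredictor(nn.Module):
    """Tiny LSTM that maps a window of past noise samples to the next one."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                  # x: (batch, window, 1)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # predicted next noise value
```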
Article
Many trackers use attention mechanisms to enhance the details of feature maps. However, most attention mechanisms are designed for RGB images and cannot be effectively adapted to infrared images, whose features are weak and make the attention mechanism difficult to learn. Most Siamese-network-based thermal infrared trackers use traditional cross-correlation, which ignores the correlation between local parts. To address these problems, this paper proposes a Siamese multigroup spatial shift (SiamMSS) network for thermal infrared tracking. The SiamMSS network uses a spatial shift model to enhance the details of feature maps. First, the feature map is divided into four groups along the channel dimension, and each group is shifted by one unit in one of four directions along the height and width dimensions. Next, the sample and search image features are cross-correlated using a graph-attention-module cross-correlation method. Finally, split attention is used to fuse the multiple response maps. Results on challenging benchmarks, including VOT-TIR2015, PTB-TIR, and LSOTB-TIR, demonstrate that the proposed SiamMSS outperforms state-of-the-art trackers. The code is available at lvlanbing/SiamMSS (github.com).
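The four-group spatial shift can be written down directly; the sketch below (an illustration, not the released SiamMSS code) splits the channels into four groups and shifts each by one unit along one of the four spatial directions, leaving the vacated border row or column unchanged.

```python
import torch

def spatial_shift(feat):
    """feat: (B, C, H, W); channels are split into 4 groups shifted right/left/down/up."""
    b, c, h, w = feat.shape
    g = c // 4
    out = feat.clone()                                   # border positions keep original values
    out[:, 0*g:1*g, :, 1:] = feat[:, 0*g:1*g, :, :-1]    # shift right along width
    out[:, 1*g:2*g, :, :-1] = feat[:, 1*g:2*g, :, 1:]    # shift left along width
    out[:, 2*g:3*g, 1:, :] = feat[:, 2*g:3*g, :-1, :]    # shift down along height
    out[:, 3*g:4*g, :-1, :] = feat[:, 3*g:4*g, 1:, :]    # shift up along height
    return out
```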
Article
The lack of large labeled training datasets hinders the use of deep neural networks for thermal infrared (TIR) tracking. The usual practice is to train a tracking network on large-scale RGB datasets and then retrain it on the TIR domain with limited TIR data. However, we observe that existing Siamese-based trackers hardly generalize to TIR images even though they achieve outstanding performance on RGB tracking. The main challenge is therefore generalization: how to design a generalization-friendly Siamese tracking network, and what affects network generalization. To tackle this problem, we introduce a self-adaption structure into the Siamese network and propose an effective TIR tracking model, GFSNet. GFSNet generalizes successfully to different TIR tracking tasks, including ground target, aircraft, and high-diversity object tracking. To estimate generalization ability, we introduce the notion of Growth Rate, the improvement of overall performance after retraining. Experimental results show that the Growth Rates of GFSNet exceed the state-of-the-art SiamRPN++ by more than 7 times, which indicates the strong generalization power of GFSNet. In addition to experimental validation, we provide a theoretical analysis of network generalization from a novel perspective, model sensitivity. By analyzing the sensitivity, we conclude that the self-adaption structure helps GFSNet converge to a more sensitive minimum with better generalization to new tasks. Furthermore, compared with popular tracking methods, GFSNet maintains comparable accuracy while achieving real-time tracking at 112 FPS, five times faster than other TIR trackers.
Article
Modern low-altitude unmanned aircraft (UA) detection and surveillance systems mostly adopt multi-sensor fusion schemes combining radar, visible light, infrared, acoustic, and radio detection. This paper first summarises the latest research progress in radar-based UA and bird target detection and recognition, identifying effective routes to detection and recognition in terms of echo modelling and micro-motion characteristic cognition, manoeuvre feature enhancement and extraction, motion trajectory differences, and deep-learning-based intelligent classification. It then analyses the target feature extraction and recognition algorithms, represented by deep learning, for the other kinds of sensor data. Finally, after comparing the detection abilities of the various technologies, a technical scheme for a low-altitude UA surveillance system based on four types of sensors is proposed, with a detailed description of its main performance indicators.
Article
Full-text available
Our cardiovascular system weakens and becomes more prone to arrhythmia as we age. An arrhythmia is an abnormal heartbeat rhythm which can be life-threatening. Atrial fibrillation (Afib), atrial flutter (Afl), and ventricular fibrillation (Vfib) are recurring life-threatening arrhythmias that affect the elderly population. The electrocardiogram (ECG) is the principal diagnostic tool employed to record and interpret the heart's electrical activity, and ECG signals contain information about the different types of arrhythmias. However, due to the complexity and non-linearity of ECG signals, it is difficult to analyze them manually. Moreover, the interpretation of ECG signals is subjective and might vary between experts. Hence, a computer-aided diagnosis (CAD) system is proposed to ensure that the assessment of ECG signals is objective and accurate. In this work, we present a convolutional neural network (CNN) technique to automatically identify the different ECG segment classes. Our algorithm consists of an eleven-layer deep CNN with an output layer of four neurons, each representing the normal sinus rhythm (Nsr), Afib, Afl, and Vfib ECG classes. We use ECG segments of two-second and five-second durations without QRS detection. We achieved an accuracy, sensitivity, and specificity of 92.50%, 98.09%, and 93.13%, respectively, for two-second ECG segments, and an accuracy of 94.90%, a sensitivity of 99.13%, and a specificity of 81.44% for five-second segments. The proposed algorithm can serve as an adjunct tool to assist clinicians in confirming their diagnosis.
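The classifier described above can be sketched as a 1-D CNN over raw ECG samples with four output neurons; the layer counts and sizes below are assumptions for illustration, not the paper's exact eleven-layer design.

```python
import torch.nn as nn

class ECGNet(nn.Module):
    """Small 1-D CNN for four-class ECG segment classification (Nsr/Afib/Afl/Vfib)."""
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(8, 16, kernel_size=5), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(32, n_classes),          # one neuron per ECG class
        )

    def forward(self, x):                      # x: (batch, 1, samples) raw ECG segment
        return self.classifier(self.features(x))
```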
Conference Paper
Full-text available
The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object’s appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
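The core of a fully-convolutional Siamese tracker reduces to embedding both images with the same network and sliding the exemplar embedding over the search embedding; a minimal sketch, where `embed` stands for any convolutional backbone (an assumption, not the paper's exact network):

```python
import torch.nn.functional as F

def siamese_response(embed, exemplar_img, search_img):
    """Cross-correlate exemplar features over search features to get a response map."""
    z = embed(exemplar_img)        # (1, C, h, w)  exemplar (template) features
    x = embed(search_img)          # (1, C, H, W)  search-region features
    return F.conv2d(x, z)          # (1, 1, H-h+1, W-w+1); PyTorch conv2d is cross-correlation
```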
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
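A HOG descriptor with the settings the study favours (fine orientation binning, overlapping blocks with local contrast normalization) can be computed with standard tooling, for example:

```python
from skimage import color, data
from skimage.feature import hog

# Example image stands in for a pedestrian detection window.
img = color.rgb2gray(data.astronaut())
descriptor = hog(img, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2), block_norm='L2-Hys')
print(descriptor.shape)  # flattened HOG feature vector fed to a linear SVM
```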
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Conference Paper
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled data to conventionally train or adapt a deep architecture to the new tasks. We investigate and visualize the semantic clustering of deep convolutional features with respect to a variety of such tasks, including scene recognition, domain adaptation, and fine-grained recognition challenges. We compare the efficacy of relying on various network levels to define a fixed feature, and report novel results that significantly outperform the state-of-the-art on several important vision challenges. We are releasing DeCAF, an open-source implementation of these deep convolutional activation features, along with all associated network parameters, to enable vision researchers to conduct experiments with deep representations across a range of visual concept learning paradigms.
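The same re-purposing idea can be reproduced with any modern framework: freeze a network pre-trained on ImageNet and read out an intermediate activation as a fixed feature. The sketch below uses a torchvision VGG16 purely as a stand-in for the original Caffe-based DeCAF model.

```python
import torch
import torch.nn as nn
import torchvision.models as models

vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).eval()
head = nn.Sequential(*list(vgg.classifier.children())[:-1])   # drop the 1000-way output layer

with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)           # stand-in for a preprocessed input image
    conv = vgg.avgpool(vgg.features(img)).flatten(1)
    feature = head(conv)                        # 4096-D activation used as a generic feature
print(feature.shape)
```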
Article
Principal component analysis (PCA) is widely used in dimensionality reduction. A lot of variants of PCA have been proposed to improve the robustness of the algorithm. However, the existing methods either cannot select the useful features consistently or is still sensitive to outliers, which will depress their performance of classification accuracy. In this paper, a novel approach called joint sparse principal component analysis (JSPCA) is proposed to jointly select useful features and enhance robustness to outliers. In detail, JSPCA relaxes the orthogonal constraint of transformation matrix to make it have more freedom to jointly select useful features for low-dimensional representation. JSPCA imposes joint sparse constraints on its objective function, i.e., ℓ2,1-norm is imposed on both the loss term and the regularization term, to improve the algorithmic robustness. A simple yet effective optimization solution is presented and the theoretical analyses of JSPCA are provided. The experimental results on eight data sets demonstrate that the proposed approach is feasible and effective.
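For reference, the ℓ2,1-norm that JSPCA places on both the loss and the regularizer is the sum of the row-wise ℓ2-norms, which encourages entire rows (features) of the transformation matrix to vanish jointly; a one-line illustration:

```python
import numpy as np

def l21_norm(M):
    """Sum of the l2-norms of the rows of M (the joint row-sparsity penalty)."""
    return float(np.sum(np.linalg.norm(M, axis=1)))
```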
Article
Nonlocal image representation methods, including group-based sparse coding and BM3D, have shown great performance in low-level vision tasks. The nonlocal prior is extracted from each group consisting of patches with similar intensities. Grouping patches based on intensity similarity, however, gives rise to disturbance and inaccuracy in estimating the true images. To address this problem, we propose a structure-based low-rank model with graph nuclear norm regularization. We exploit the local manifold structure inside a patch and group the patches by the distance metric of manifold structure. With the manifold structure information, a graph nuclear norm regularization is established and incorporated into a low-rank approximation model. We then prove that the graph-based regularization is equivalent to a weighted nuclear norm and that the proposed model can be solved by a weighted singular-value thresholding algorithm. Extensive experiments on additive white Gaussian noise removal and mixed noise removal demonstrate that the proposed method achieves better performance than several state-of-the-art algorithms.
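Since the model is shown to reduce to a weighted nuclear norm, its proximal step is a weighted singular-value thresholding; a minimal sketch of that operator (the per-value weights are assumed given):

```python
import numpy as np

def weighted_svt(Y, weights):
    """Shrink each singular value of Y by its weight and reconstruct the low-rank estimate."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_thr = np.maximum(s - weights, 0.0)
    return (U * s_thr) @ Vt
```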