Article

Learning Deep Multi-Level Similarity for Thermal Infrared Object Tracking

Abstract and Figures

Existing deep Thermal InfraRed (TIR) trackers use only semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is trained only on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network: one focuses on the global semantic similarity, while the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities; this subnetwork adaptively learns the weights of the two similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
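As a rough illustration of the ensemble idea, the sketch below fuses a global semantic similarity map with a local structural one using softmax-normalized weights. The function and weight names are hypothetical; the two scalar weights stand in for what the relative-entropy-based subnetwork would learn during training.

```python
import math

def fuse_similarities(semantic, structural, w_sem, w_str):
    """Fuse two similarity maps with softmax-normalized weights.

    `semantic`/`structural` are equal-sized 2-D response maps (lists of
    lists); `w_sem`/`w_str` stand in for the weights the ensemble
    subnetwork would learn at the training stage.
    """
    # Softmax-normalize the two scalar weights so they sum to 1.
    e_sem, e_str = math.exp(w_sem), math.exp(w_str)
    total = e_sem + e_str
    a, b = e_sem / total, e_str / total
    # Per-position weighted sum of the two response maps.
    return [[a * s + b * t for s, t in zip(row_s, row_t)]
            for row_s, row_t in zip(semantic, structural)]

sem = [[0.9, 0.2], [0.1, 0.8]]    # global semantic similarity map
stru = [[0.7, 0.4], [0.3, 0.6]]   # local structural similarity map
fused = fuse_similarities(sem, stru, 1.0, 1.0)  # equal weights -> average
```

Equal raw weights reduce to plain averaging; unequal learned weights would tilt the fused map toward whichever similarity is more reliable for the current object.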
... Similar to VOT, the goal of TIR tracking is to track the object across adjacent TIR images. Recently, several TIR trackers [6][7][8] have adapted Siamese-based trackers to TIR tracking and achieved promising performance. However, they are time-consuming on systems without a high-performance graphics processing unit (GPU). ...
... Compared trackers: We select several TIR and visual trackers for the experiment. The TIR trackers comprise ECO-stir [20], MLSSNet [7], MMNet [6], ECO_LS (ours), ECOHG_LS (ours), and HSSNet [8], while the visual trackers include ECO [1], DeepSTRCF [25], MDNet [26], SRDCF [17], VITAL [27], ECO-HC [1], MCCT [2], UDT [28], and SiamFC-tri [29]. ...
... For CF-D, there are DeepSTRCF [25], MCCT [2], ECO [1], and ECO_LS(ours). Trackers in S-T contain HSSNet [8], MMNet [6], and MLSSNet [7]. ...
Article
Full-text available
Existing thermal infrared (TIR) trackers based on correlation filters cannot adapt to the abrupt scale variation of nonrigid objects. This deficiency could even lead to tracking failure. To address this issue, we propose a TIR tracker, called ECO_LS, which improves the performance of efficient convolution operators (ECO) via the level set method. We first utilize the level set to segment the local region estimated by the ECO tracker to gain a more accurate size of the bounding box when the object changes its scale suddenly. Then, to accelerate the convergence speed of the level set contour, we leverage its historical information and continuously encode it to effectively decrease the number of iterations. In addition, our variant, ECOHG_LS, also achieves better performance via concatenating histogram of oriented gradient (HOG) and gray features to represent the object. Furthermore, experimental results on three infrared object tracking benchmarks show that the proposed approach performs better than other competing trackers. ECO_LS improves the EAO by 20.97% and 30.59% over the baseline ECO on VOT-TIR2016 and VOT-TIR2015, respectively.
... This research idea also belongs to the multi-feature fusion methods, which mainly select expressive subset features from a large set and, on this basis, design adaptive fusion algorithms that exploit the differing representation abilities of features for different targets. On the other hand, Liu et al. [33] apply a CNN model trained on visible-light datasets to the thermal infrared tracking task via transfer learning. To improve the discriminative capacity, a multi-level similarity model under a Siamese framework [32] and a multi-task framework [31] have been proposed to learn TIR-specific discriminative features; moreover, Liu proposed a framework named self-SDCT [54], which alleviates the demand for large annotated training samples. Subsequently, Martin Danelljan and others proposed a series of high-performance tracking algorithms, such as DeepSRDCF [13], C-COT [14], and ECO [9]. ...
Article
Full-text available
Convolutional Neural Network (CNN) features have been widely used in visual tracking due to their powerful representation. As an important component of CNNs, the pooling layer plays a critical role, but the max/average/min operation only explores first-order information, which limits the discrimination ability of CNN features in some complex situations. In this paper, a high-order pooling layer is integrated into the VGG16 network for visual tracking. In detail, a high-order covariance pooling layer replaces the last max-pooling layer to learn discriminative features and is trained on the ImageNet and CUB200-2011 data sets. In the tracking stage, multiple levels of feature maps are extracted as the appearance representation of the target. The extracted CNN features are then integrated into the correlation filter framework while tracking is on the fly. The experimental results show that the proposed algorithm achieves excellent performance in both success rate and tracking accuracy.
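The covariance (second-order) pooling the abstract describes can be sketched as follows. This is a minimal illustration of the statistic itself, not the paper's trained layer, and the function name is made up.

```python
def covariance_pooling(features):
    """Second-order pooling: channel-wise covariance of a feature map.

    `features` is a list of per-position channel vectors (N spatial
    positions, C channels). Returns the C x C covariance matrix, which
    replaces the first-order statistic a max/average pool would emit.
    """
    n = len(features)
    c = len(features[0])
    # Per-channel mean over all spatial positions.
    mean = [sum(f[j] for f in features) / n for j in range(c)]
    # Accumulate outer products of the centered vectors.
    cov = [[0.0] * c for _ in range(c)]
    for f in features:
        d = [f[j] - mean[j] for j in range(c)]
        for i in range(c):
            for j in range(c):
                cov[i][j] += d[i] * d[j] / n
    return cov

feats = [[1.0, 2.0], [3.0, 4.0]]  # 2 positions, 2 channels
cov = covariance_pooling(feats)
```

The off-diagonal entries capture channel co-activations, which is exactly the information a first-order max/average/min operation discards.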
... However, camera sensors are easily affected by sand and dust in the environment. In recent years, thermal imaging cameras have become a trend in multi-target detection and tracking [15][16][17]; they are not easily affected by dust and have the potential to be applied to environmental perception in open-pit mines. ...
Article
Full-text available
There exist many difficulties in environmental perception for transportation at open-pit mines, such as unpaved roads, dusty environments, and high requirements for the detection and tracking stability of small irregular obstacles. To solve these problems, a new multi-target detection and tracking method is proposed based on the fusion of lidar and millimeter-wave radar. It introduces a secondary segmentation algorithm suited to open-pit mine production scenarios to improve the detection distance and accuracy for small irregular obstacles on unpaved roads. In addition, the paper proposes an adaptive heterogeneous multi-source fusion strategy for filtering dust, which can significantly improve the perception system's ability to detect and track various targets in dusty environments by adaptively adjusting the confidence of the output target. Finally, test results in an open-pit mine show that the method can stably detect obstacles 30–40 cm in size at 60 m in front of a mining truck and effectively filter out false alarms caused by concentrated dust, which proves the reliability of the method.
... MNGCO [57] combines appearance and motion features to construct a TIR target detector and tracker. MLSSNet [58] strengthens discriminative capacity with a multi-level similarity model built into its architecture. For infrared detection and tracking, the architecture in [59] is likewise designed to reorganize prior samples so that valuable data can be discovered. ...
Article
Full-text available
Unmanned aerial vehicles (UAVs) have many commercial and recreational applications, so perceiving and visualizing their state is of prime importance. In this paper, the authors address the detection and tracking of drones in order to derive valuable position and coordinate data. The wide diffusion of drones increases the risk of their misuse in illegitimate activities such as drug smuggling and terrorism, so automated drone surveillance and detection are crucial for protecting restricted areas or special zones from illegal drone entry. Under low-illumination conditions, however, visible-light detectors may fail to recover useful information, leading to inaccurate results. To alleviate this, some works use infrared (IR) videos and images for object detection and tracking. The crucial drawback of infrared images is their generally low resolution, which provides inadequate information for trackers. Given this analysis, fusing RGB (visible) data with infrared image data is essential for detecting and tracking drones: it leverages multimodal data, which is useful for learning precise and reliable drone detectors.
This paper introduces an automated video- and image-based drone tracking and detection system that uses an advanced deep-learning-based object detection and tracking method, You Only Look Once (YOLOv5), to protect restricted areas or special zones from unlawful drone entry. YOLOv5, a single-stage detector, offers one of the best balances between detection accuracy and speed by collecting in-depth, high-level features. Building on YOLOv5, this paper improves it to detect and track UAVs more accurately and precisely; it is one of the first YOLOv5-based algorithms introduced for anti-UAV object tracking and detection. It adopts the last four scales of feature maps, instead of the usual three, to predict bounding boxes, which delivers more texture and contour information needed to detect small objects. At the same time, to reduce computation, the size of the UAV in the four feature-map scales is calculated from the input data, and the number of anchor boxes is adjusted accordingly. The proposed UAV tracking and detection technology can therefore be applied in the anti-UAV field. In addition, an effective method named the double-training strategy is developed for drone detection: trained on class and instance segmentation across moving frames and image series, the detector learns accurate segment information and derives distinct instance- and class-level characteristics.
... The USV perception system consists of a perception computer, a marine radar [7], and a photoelectric device. The perception computer includes object detection [8][9][10][11], object tracking [12][13][14][15], communication, and decision-making modules. The azimuth accuracy of the marine radar is 0.2 degrees, and the accuracy of the laser ranging is 5 m. ...
Article
Full-text available
Unmanned surface vehicles frequently encounter foggy weather when performing surface object tracking tasks, resulting in low optical image quality and low object recognition accuracy. Traditional defogging algorithms are time-consuming and do not meet real-time requirements; in addition, they suffer from oversaturated colors, low brightness, and overexposed sky regions. To solve these problems, this paper proposes a defogging algorithm for the first frame image of unmanned surface vehicles based on a radar-photoelectric system. The algorithm involves the following steps. The first is a fog detection algorithm for sea surface images, which determines the presence of fog. The second is a sea-sky line extraction algorithm, which extracts the sea-sky line from the first frame. The third is an object detection algorithm based on the sea-sky line, which extracts the target area near the sea-sky line. The fourth is a local defogging algorithm, which defogs the extracted area to obtain higher-quality images. The paper effectively solves the problems above in sea tests and reduces the calculation time of the defogging algorithm by 86.7% compared with the dark channel prior algorithm.
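The fourth step, defogging only the band of rows around the extracted sea-sky line, might look roughly like this. The min-max contrast stretch below is only a stand-in for the paper's actual local defogging operator, and all names are hypothetical; the point is that rows outside the band are left untouched, which is what saves computation.

```python
def defog_band(image, horizon_row, half_height):
    """Contrast-stretch only the rows around the sea-sky line.

    `image` is a list of rows of grayscale values. The stretch is a
    stand-in for a real local defogging step; rows outside the band
    are returned unchanged.
    """
    top = max(0, horizon_row - half_height)
    bottom = min(len(image), horizon_row + half_height + 1)
    # Intensity range inside the band only.
    band = [v for row in image[top:bottom] for v in row]
    lo, hi = min(band), max(band)
    scale = 255.0 / (hi - lo) if hi > lo else 1.0
    out = []
    for i, row in enumerate(image):
        if top <= i < bottom:
            out.append([(v - lo) * scale for v in row])  # stretched band
        else:
            out.append(list(row))                        # untouched rows
    return out

image = [[10.0, 10.0], [100.0, 150.0], [10.0, 10.0]]
out = defog_band(image, 1, 0)  # only the horizon row is processed
```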
Article
Full-text available
The Siamese tracking paradigm has achieved great success, providing effective appearance discrimination and size estimation through classification and regression. However, such a paradigm typically optimizes the classification and regression independently, leading to task misalignment (accurate prediction boxes may not have high target confidence scores). In this paper, to alleviate this misalignment, we propose a novel tracking paradigm, called SiamLA. Within this paradigm, a series of simple yet effective localization-aware components are introduced to generate localization-aware target confidence scores. Specifically, with the proposed localization-aware dynamic label (LADL) loss and localization-aware label smoothing (LALS) strategy, collaborative optimization between the classification and regression is achieved, enabling classification scores to be aware of the location state, not just appearance similarity. Besides, we propose a separate localization-aware quality prediction (LAQP) branch to produce location quality scores that further modify the classification scores. To guide a more reliable modification, a novel localization-aware feature aggregation (LAFA) module is designed and embedded into this branch. Consequently, the resulting target confidence scores are more discriminative of the location state, allowing accurate prediction boxes to be assigned high scores. Extensive experiments are conducted on six challenging benchmarks, including GOT-10k, TrackingNet, LaSOT, TNL2K, OTB100 and VOT2018. Our SiamLA achieves competitive performance in terms of both accuracy and efficiency. Furthermore, a stability analysis reveals that our tracking paradigm is relatively stable, implying that it has potential for real-world applications.
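The misalignment fix boils down to modulating appearance scores by a predicted location-quality score, so a box that looks right but is poorly localized no longer wins. A minimal sketch under that assumption (names hypothetical, not the paper's actual modules):

```python
def localization_aware_scores(cls_scores, quality_scores):
    """Modulate appearance-similarity scores by localization quality.

    Both inputs are per-candidate scores in [0, 1]; the elementwise
    product is the final target-confidence score, so a candidate with a
    confident class score but poor predicted overlap is suppressed.
    """
    return [c * q for c, q in zip(cls_scores, quality_scores)]

cls = [0.9, 0.8]       # appearance similarity per candidate box
quality = [0.3, 0.95]  # predicted localization quality (e.g. IoU-like)
final = localization_aware_scores(cls, quality)
# The better-localized second box wins despite its lower class score.
best = max(range(len(final)), key=final.__getitem__)
```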
Article
Many trackers use attention mechanisms to enhance the details of feature maps. However, most attention mechanisms are designed for RGB images and thus cannot be effectively adapted to infrared images, whose features are weak and make the attention mechanism difficult to learn. Most thermal infrared trackers based on Siamese networks use traditional cross-correlation techniques, which ignore the correlation between local parts. To address these problems, this paper proposes a Siamese multigroup spatial shift (SiamMSS) network for thermal infrared tracking. The SiamMSS network uses a spatial shift model to enhance the details of feature maps. First, the feature map is divided into four groups along the channel dimension, and each group is moved one unit in one of four directions along the height and width dimensions. Next, the sample and search image features are cross-correlated using the graph attention module cross-correlation method. Finally, split attention is used to fuse the multiple response maps. Results of experiments on challenging benchmarks, including VOT-TIR2015, PTB-TIR, and LSOTB-TIR, demonstrate that the proposed SiamMSS outperforms state-of-the-art trackers. The code is available at lvlanbing/SiamMSS (github.com).
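The four-group spatial shift can be illustrated directly. The sketch below assumes zero padding at the vacated border, which the abstract does not specify, and uses plain nested lists instead of tensors.

```python
def spatial_shift(feature_map):
    """Shift four channel groups one pixel in four directions.

    `feature_map` is [C][H][W] nested lists with C divisible by 4.
    Group 0 shifts up, group 1 down, group 2 left, group 3 right,
    padding vacated positions with zeros. This mixes information
    between neighbouring spatial locations with no learned parameters.
    """
    c = len(feature_map)
    w = len(feature_map[0][0])
    zeros_row = [0.0] * w
    out = []
    for idx, ch in enumerate(feature_map):
        group = idx * 4 // c  # which quarter of the channels idx is in
        if group == 0:        # shift up
            out.append(ch[1:] + [list(zeros_row)])
        elif group == 1:      # shift down
            out.append([list(zeros_row)] + ch[:-1])
        elif group == 2:      # shift left
            out.append([row[1:] + [0.0] for row in ch])
        else:                 # shift right
            out.append([[0.0] + row[:-1] for row in ch])
    return out

# Four identical 2x2 channels make each direction easy to see.
fmap = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]
out = spatial_shift(fmap)
```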
Article
Currently, commonly used vision-based tracking systems face two major challenges: (1) the trade-off between the speed and accuracy of the tracker; and (2) the robustness of the tracking servo control. In this paper, we propose a new framework composed of a transformer attached to Siamese-type feature extraction networks, called the Siamese Transformer Network (SiamTrans), to balance speed and accuracy, avoiding the complicated hand-designed components and tedious post-processing of most existing Siamese-type trackers with pre-defined anchor boxes or anchor-free schemes. SiamTrans forces the final set of predictions via bipartite matching, significantly reducing the hyper-parameters associated with candidate boxes. Moreover, to enhance the robustness of the servo control, the high-level control part is redesigned by fusing all the bounding box information with the Tracking Drift Suppression Strategy (TDSS). The TDSS is mainly used to judge whether the target is lost. If the target is lost, it feeds back the previous information to reinitialize the tracker and update the template patch of SiamTrans, making the whole system more robust. Extensive experiments on visual tracking benchmarks, including GOT-10K and UAV123, demonstrate that SiamTrans achieves competitive performance and runs at 50 FPS. Specifically, SiamTrans outperforms the leading anchor-based tracker SiamRPN++ on the GOT-10K benchmark, confirming its effectiveness and efficiency. Furthermore, SiamTrans is deployed on an embedded device on which the algorithm runs at 30 FPS, or 54 FPS with TensorRT, meeting real-time requirements. In addition, we design a complete tracking system demo that can accurately track targets of multiple categories. The actual experimental results also show that the whole system is efficient and robust. The demo video link is as follows: https://youtu.be/UK37Q-M9ET4.
Article
Existing Siamese trackers have achieved increasingly strong results in visual tracking. However, the contextual association between the template and the search region has not been fully studied in previous Siamese network-based methods; meanwhile, the feature information of the cross-correlation layer is insufficiently investigated. In this paper, we propose a new context-related Siamese network called SiamBC to address these issues. By introducing a cooperative attention mechanism based on block deformable convolution sampling features, the tracker can pre-match and enhance similar features to improve accuracy and robustness once the context embeddings of the template and search region fully interact. In addition, we design a cascade cross-correlation module: the cross-correlation layers of the stacked structure gradually refine the deep information of the mined features and further improve accuracy. Extensive experiments demonstrate the effectiveness of our tracker on six tracking benchmarks, including OTB100, VOT2019, GOT10k, LaSOT, TrackingNet and UAV123. The code will be available at https://github.com/Soarkey/SiamBC.
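A cascade of cross-correlation stages can be sketched in one dimension: the first correlation produces a coarse response map, and a second correlation over that map refines it. The refinement kernel here is a hypothetical stand-in for the learned features of the stacked structure.

```python
def xcorr1d(search, kernel):
    """Valid-mode 1-D cross-correlation (the Siamese matching primitive)."""
    k = len(kernel)
    return [sum(kernel[j] * search[i + j] for j in range(k))
            for i in range(len(search) - k + 1)]

# Stage 1: correlate the search signal with the template features.
search = [0.0, 1.0, 2.0, 1.0, 0.0]
template = [1.0, 2.0, 1.0]
response1 = xcorr1d(search, template)          # coarse similarity map

# Stage 2: a second correlation refines the first response map.
refine_kernel = [0.25, 0.5, 0.25]              # hypothetical learned kernel
response2 = xcorr1d(response1, refine_kernel)  # refined similarity
```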
Article
Full-text available
Learning a powerful feature representation is critical for constructing a robust Siamese tracker. However, most existing Siamese trackers learn the global appearance features of the entire object, which usually suffer from drift caused by partial occlusion or non-rigid appearance deformation. In this paper, we propose a new Local Semantic Siamese (LSSiam) network to extract more robust features for solving these drift problems, since local semantic features contain more fine-grained and partial information. We learn the semantic features during offline training by adding a classification branch to the classical Siamese framework. To further enhance the feature representation, we design a general focal logistic loss to mine hard negative samples. During online tracking, we remove the classification branch and propose an efficient template updating strategy to avoid a heavy computational load. Thus, the proposed tracker can run at a high speed of 100 frames per second (FPS), far beyond the real-time requirement. Extensive experiments on popular benchmarks demonstrate that the proposed LSSiam tracker achieves state-of-the-art performance at high speed. Our source code is available at https://github.com/shenjianbing/LSSiam.
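Assuming the focal logistic loss follows the standard focal-loss recipe of a modulating factor on the logistic loss (this exact formula is our assumption, not taken from the paper), a sketch of how it emphasizes hard negatives:

```python
import math

def focal_logistic_loss(score, label, gamma=2.0):
    """Logistic loss with a focal modulating factor.

    `score` is a raw similarity logit, `label` is +1 or -1. The
    (1 - p)**gamma factor shrinks the loss of well-classified (easy)
    samples so training focuses on hard examples; gamma is a
    hypothetical focusing parameter.
    """
    p = 1.0 / (1.0 + math.exp(-label * score))  # prob of the true class
    return -((1.0 - p) ** gamma) * math.log(p)

easy = focal_logistic_loss(4.0, 1)   # confidently correct -> tiny loss
hard = focal_logistic_loss(-4.0, 1)  # confidently wrong -> large loss
```

With gamma = 0 this reduces to the plain logistic loss; larger gamma suppresses easy samples more aggressively.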
Article
Full-text available
Hyperparameters are numerical pre-sets whose values are assigned prior to the commencement of a learning process. Selecting appropriate hyperparameters is often critical for achieving satisfactory performance in many vision problems such as deep learning-based visual object tracking. Yet it is difficult to determine their optimal values, in particular, adaptive ones for each specific video input. Most hyperparameter optimization algorithms depend on searching a generic range and they are imposed blindly on all sequences. In this paper, we propose a novel dynamical hyperparameter optimization method that adaptively optimizes hyperparameters for a given sequence using an action-prediction network leveraged on continuous deep Q-learning. Since the observation space for visual object tracking is significantly more complex than those in traditional control problems, existing continuous deep Q-learning algorithms cannot be directly applied. To overcome this challenge, we introduce an efficient heuristic strategy to handle high dimensional state space and meanwhile accelerate the convergence behavior. The proposed algorithm is applied to improve two representative trackers, a Siamese-based one and a correlation-filter-based one, to evaluate its generality. Their superior performances on several popular benchmarks are clearly demonstrated.
Article
Full-text available
Visual tracking addresses the problem of localizing an arbitrary target in video according to the annotated bounding box. In this article, we present a novel tracking method that introduces the attention mechanism into the Siamese network to increase its matching discrimination. We propose a new way to compute attention weights to improve matching performance via a sub-Siamese network [Attention Net (A-Net)], which locates attentive parts for solving the search problem. In addition, features in higher layers preserve more semantic information, while features in lower layers preserve more location information. Thus, to address tracking failures caused by relying only on higher-layer features, we fully utilize location and semantic information through multilevel features and propose a new way to fuse the multiscale response maps from each layer to obtain a more accurate position estimate of the object. We further propose a hierarchical attention Siamese network that combines the attention weights and multilayer integration for tracking. Our method is implemented with a pretrained network and can outperform most well-trained Siamese trackers even without any fine-tuning or online updating. Comparison results with state-of-the-art methods on popular tracking benchmarks show that our method achieves better performance. Our source code and results will be available at https://github.com/shenjianbing/HASN.
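The multiscale response-map fusion can be sketched as a weighted sum followed by a peak search. The weights here are hypothetical placeholders for the learned attention weights, and the maps are toy examples.

```python
def fuse_response_maps(maps, weights):
    """Weighted sum of per-layer response maps, then peak location.

    `maps` is a list of equal-sized 2-D response maps (deeper layers
    carry semantics, shallower layers carry location detail);
    `weights` stands in for the learned attention weights. Returns the
    (row, col) of the fused peak, i.e. the estimated target position,
    plus the fused map itself.
    """
    h, w = len(maps[0]), len(maps[0][0])
    fused = [[sum(wt * m[i][j] for wt, m in zip(weights, maps))
              for j in range(w)] for i in range(h)]
    best = max(((i, j) for i in range(h) for j in range(w)),
               key=lambda ij: fused[ij[0]][ij[1]])
    return best, fused

deep = [[0.2, 0.9], [0.1, 0.1]]     # semantic layer: strong peak
shallow = [[0.1, 0.8], [0.1, 0.2]]  # shallow layer: precise location
pos, fused = fuse_response_maps([deep, shallow], [0.6, 0.4])
```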
Article
Deep convolutional neural networks (CNNs) have attracted considerable interest in low-level computer vision. Research is usually devoted to improving performance via very deep CNNs. However, as the depth increases, the influence of shallow layers on deep layers weakens. Inspired by this fact, we propose an attention-guided denoising convolutional neural network (ADNet), mainly comprising a sparse block (SB), a feature enhancement block (FEB), an attention block (AB) and a reconstruction block (RB) for image denoising. Specifically, the SB makes a trade-off between performance and efficiency by using dilated and common convolutions to remove noise. The FEB integrates global and local feature information via a long path to enhance the expressive ability of the denoising model. The AB is used to finely extract the noise information hidden in complex backgrounds, which is very effective for complex noisy images, especially real noisy images and blind denoising. The FEB is also integrated with the AB to improve efficiency and reduce the complexity of training a denoising model. Finally, the RB constructs the clean image from the obtained noise mapping and the given noisy image. Comprehensive experiments show that the proposed ADNet performs very well in three tasks (i.e., synthetic noisy images, real noisy images, and blind denoising) in terms of both quantitative and qualitative evaluations. The code of ADNet is accessible at http://www.yongxu.org/lunwen.html.
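The dilated convolutions in the SB enlarge the receptive field without adding weights; a 1-D sketch makes the effect concrete (ADNet itself uses 2-D convolutions, so this is only an illustration of the principle).

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1-D convolution with a dilation factor.

    With dilation d, a kernel of size k covers a span of d*(k-1)+1
    input samples, enlarging the receptive field without adding
    weights -- the efficiency trick a sparse block relies on.
    """
    k = len(kernel)
    span = dilation * (k - 1) + 1
    return [sum(kernel[j] * signal[i + j * dilation] for j in range(k))
            for i in range(len(signal) - span + 1)]

signal = [1.0, 2.0, 3.0, 4.0, 5.0]
dense = dilated_conv1d(signal, [1.0, 1.0, 1.0], 1)   # span 3
sparse = dilated_conv1d(signal, [1.0, 1.0, 1.0], 2)  # span 5, same 3 weights
```

The dilated variant sees five input samples with the same three weights, which is why mixing dilated and common convolutions trades depth for receptive field cheaply.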