Article

Aligned Spatial-Temporal Memory Network for Thermal Infrared Target Tracking


Abstract

Thermal infrared (TIR) target tracking is susceptible to occlusion and similarity interference, both of which noticeably degrade tracking results. To address this problem, we develop an Aligned Spatial-Temporal Memory network-based Tracking method (ASTMT) for the TIR target tracking task. Specifically, we model the scene information in the TIR tracking scenario with a spatial-temporal memory network, which effectively stores scene information and reduces similarity interference, benefiting target localization. In addition, we use an aligned matching module to correct the parameters of the spatial-temporal memory network model, which effectively alleviates the impact of occlusion on target estimation and thus further boosts tracking accuracy. Ablation studies demonstrate that the spatial-temporal memory network and the aligned matching module in the proposed ASTMT tracker are highly effective. Our ASTMT tracking method performs favorably on the PTB-TIR and LSOTB-TIR benchmarks compared with other tracking methods.
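The abstract does not give implementation details, but the core read operation of a spatial-temporal memory can be sketched as attention over stored scene features: the current query feature is compared against stored memory keys, and a similarity-weighted sum of the stored values is returned. The function names and plain-Python feature vectors below are illustrative assumptions, not the paper's actual architecture.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def memory_read(query, memory_keys, memory_values):
    """Read from a key-value memory: softmax over the query's similarity
    to each stored key, then a weighted sum of the stored values."""
    sims = [cosine(query, k) for k in memory_keys]
    m = max(sims)                               # stabilize the softmax
    weights = [math.exp(s - m) for s in sims]
    z = sum(weights)
    weights = [w / z for w in weights]
    dim = len(memory_values[0])
    return [sum(w * v[i] for w, v in zip(weights, memory_values))
            for i in range(dim)]
```

A query resembling an earlier stored scene feature thus retrieves mostly that feature, which is how a memory of this kind can suppress similar distractors that were never stored as the target.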


... Li et al. [42] propose a hierarchical spatial-aware Siamese network for TIR tracking, which uses hierarchical convolutional features to acquire a richer spatial and semantic feature representation of TIR objects. Yuan et al. [43] present a spatial-temporal memory network to address occlusion and similar-target interference in thermal infrared tracking tasks. Liu et al. [44] propose a Siamese TIR tracking framework with a multi-level similarity computation structure, where one level computes the global semantic similarity and the other computes the local structural similarity of the TIR objects. ...
Article
Full-text available
With the growing threat of unmanned aerial vehicle (UAV) intrusions, anti-UAV tracking has received widespread attention from the community. Traditional Siamese trackers struggle with small UAV targets and are plagued by model degradation. To mitigate this, we propose a novel Searching Region-free and Template-free Siamese network (SiamSRT) to track UAV targets in thermal infrared (TIR) videos. The proposed tracker builds a two-stage Siamese architecture: the former stage detects against the first-frame ground truth using a cross-correlated region proposal network (C-C RPN), and the latter detects against previous-frame predictions via a similarity-learning region convolutional neural network (S-L RCNN). In both stages, global proposals are acquired by an ROI alignment operation to remove the search-region limitation. A spatial location consistency function is then introduced to suppress background thermal distractors, and a temporal memory bank is utilized to avoid the template-update degradation problem. Further, a single-category foreground detector (SCFD) is designed to independently predict the position of the UAV target. SCFD can re-initialize the tracker without the target being given in the first frame, which helps recover from tracking failures. Comprehensive experiments demonstrate that SiamSRT achieves the best performance compared to the most advanced algorithms in anti-UAV tracking missions.
... The Aligned spatial-temporal memory network (ASTMN) [33] and the Efficient Attention Network (EAN) [34] improve the accuracy and efficiency of object detection and tracking by fusing features and using attention to emphasize the relevant feature information. ...
Article
Full-text available
The improved YOLOv8 model (DCN_C2f+SC_SA+YOLOv8, hereinafter referred to as DS-YOLOv8) is proposed to address object detection challenges in complex remote sensing image tasks. It aims to overcome limitations such as the restricted receptive field caused by fixed convolutional kernels in the YOLO backbone network and the inadequate multi-scale feature learning capabilities resulting from the spatial and channel attention fusion mechanism’s inability to adapt to the input data’s feature distribution. The DS-YOLOv8 model introduces the Deformable Convolution C2f (DCN_C2f) module in the backbone network to enable adaptive adjustment of the network’s receptive field. Additionally, a lightweight Self-Calibrating Shuffle Attention (SC_SA) module is designed for spatial and channel attention mechanisms. This design choice allows for adaptive encoding of contextual information, preventing the loss of feature details caused by convolution iterations and improving the representation capability of multi-scale, occluded, and small object features. Moreover, the DS-YOLOv8 model incorporates the dynamic non-monotonic focus mechanism of Wise-IoU and employs a position regression loss function to further enhance its performance. Experimental results demonstrate the excellent performance of the DS-YOLOv8 model on various public datasets, including RSOD, NWPU VHR-10, DIOR, and VEDAI. The average mAP@0.5 values achieved are 97.7%, 92.9%, 89.7%, and 78.9%, respectively. Similarly, the average mAP@0.5:0.95 values are observed to be 74.0%, 64.3%, 70.7%, and 51.1%. Importantly, the model maintains real-time inference capabilities. In comparison to the YOLOv8 series models, the DS-YOLOv8 model demonstrates significant performance improvements and outperforms other mainstream models in terms of detection accuracy.
... Thermal infrared detection is the method of taking thermal images of solar panels using infrared cameras and then detecting different types of failures based on the characteristics of the images [10]. Electrical characteristic analysis is the method of collecting the output characteristics of solar panel components and then using these characteristics to detect failures [11]. ...
Article
Full-text available
The rapid development of solar energy technology has led to significant progress in recent years, but the daily maintenance of solar panels faces significant challenges. Diagnosing solar panel failures with infrared detection devices can improve the efficiency of maintenance personnel. Currently, due to the scarcity of infrared solar panel failure samples and the problem of unclear effective image features, traditional deep neural network models easily suffer from overfitting and poor generalization under small-sample conditions. To address these problems, this paper proposes a solar panel failure diagnosis method based on an improved Siamese network. Firstly, paired solar panel samples of the same category are constructed. Secondly, the sample images are input into a feature model combining convolution, adaptive coordinate attention (ACA), and a feature fusion module (FFM) to extract features, learning the similarities between different types of solar panel samples. Finally, the trained model is used to determine the similarity of the input solar image, yielding the failure diagnosis result. Here, adaptive coordinate attention can effectively capture the feature information of interest, and the feature fusion module can integrate the different pieces of effective information obtained, further enriching the feature representation. The ACA-FFM Siamese network method can alleviate the problem of insufficient sample quantity and effectively improve classification accuracy, achieving a classification accuracy of 83.9% on an open-access infrared failure dataset with high inter-class similarity.
... AMFT [42] integrates multiple types of features, including hand-crafted and deep features, to better model the target appearance under blur, which is beneficial for tracking in the presence of motion blur. ASTMT [43] constructs a novel spatial-temporal memory architecture to store scene information and reduce similarity interference in blurred scenes. STAMT [44] introduces a target-aware network and improves the discriminative ability of trackers under motion blur. ...
Article
Full-text available
RGBT tracking combines visible and thermal infrared images to achieve tracking, and it faces challenges from motion blur caused by camera and target movement. In this study, we observe that tracking under motion blur is significantly affected by both the frequency and spatial aspects. Blurred targets still exhibit sharp texture details, which are represented as high-frequency information, yet existing trackers capture low-frequency components while ignoring high-frequency information. To enhance the representation of sharp information in blurred scenes, we introduce multi-frequency and multi-spatial information into the network, called FSBNet. First, we construct a modality-specific unsymmetrical architecture and integrate an adaptive soft-threshold mechanism into a DCT-based multi-frequency channel attention adapter (DFDA). DFDA adaptively integrates rich multi-frequency information. Second, we propose a masked frequency-based translation adapter (MFTA) to refine drifting failure boxes caused by camera motion. Moreover, we find that small targets are more affected by motion blur than larger targets, and we mitigate this issue by designing a cross-scale mutual conversion adapter (CFCA) between the frequency and spatial domains. Extensive experiments on the GTOT, RGBT234 and LasHeR benchmarks demonstrate the promising performance of our method in the presence of motion blur.
... Thermal infrared target tracking has received wide attention and study in recent years [5]-[9]. The task can be described as follows: given the initial state of the target in the first frame of a tracking sequence, the tracking method must predict the target state in the remaining frames [10]-[14]. ...
Conference Paper
Full-text available
The thermal infrared (TIR) target tracking task is not affected by illumination changes and allows tracking at night and in rain, fog, and other extreme weather, so it is widely used in night-time assisted driving, unmanned aerial vehicle reconnaissance, video surveillance, and other scenarios. TIR target tracking still faces many challenges, such as occlusion, deformation, and similarity interference. To address these challenges, a large number of TIR target tracking methods have appeared in recent years. The purpose of this paper is to give a comprehensive review and summary of the research status of thermal infrared target tracking methods. We first introduce some basic principles and representative work on thermal infrared target tracking methods. Then, benchmarks for performance testing of thermal infrared target tracking methods are introduced. Subsequently, we present the tracking results of several representative tracking methods on these benchmarks. Finally, future research directions for thermal infrared target tracking are discussed.
... The deep learning algorithms have been successfully used in medical image analysis (Ker, Wang, Rao, & Lim, 2017;Litjens et al., 2017), such as segmentation (Shu, Yang, Liu, Chang, & Wu, 2023), and target tracking tasks (Yuan, Chang, Li, & He, 2022;Yuan, Shu, Liu, & He, 2022), advancing clinical practice (Rasheed et al., 2020;Truong et al., 2018). Recently, Sip et al. (2023) have introduced a method using variational autoencoders (VAEs) for nonlinear dynamical system identification to infer both the neural mass model and the region-and subject-specific parameters from the functional data while respecting the known network structure. ...
Article
Full-text available
Whole-brain modeling of epilepsy combines personalized anatomical data with dynamical models of abnormal activities to generate spatio-temporal seizure patterns as observed in brain imaging data. Such a parametric simulator is equipped with a stochastic generative process, which itself provides the basis for inference and prediction of the local and global brain dynamics affected by disorders. However, the calculation of likelihood function at whole-brain scale is often intractable. Thus, likelihood-free algorithms are required to efficiently estimate the parameters pertaining to the hypothetical areas, ideally including the uncertainty. In this study, we introduce the simulation-based inference for the virtual epileptic patient model (SBI-VEP), enabling us to amortize the approximate posterior of the generative process from a low-dimensional representation of whole-brain epileptic patterns. The state-of-the-art deep learning algorithms for conditional density estimation are used to readily retrieve the statistical relationships between parameters and observations through a sequence of invertible transformations. We show that the SBI-VEP is able to efficiently estimate the posterior distribution of parameters linked to the extent of the epileptogenic and propagation zones from sparse intracranial electroencephalography recordings. The presented Bayesian methodology can deal with non-linear latent dynamics and parameter degeneracy, paving the way for fast and reliable inference on brain disorders from neuroimaging modalities.
Article
Full-text available
The increasing use of electronic devices and the high rate of data-stream production, such as video, reveal the importance of analyzing the content of such data. Content analysis of video data for human activity recognition (HAR) has significant applications in machine vision. Many studies have been conducted on HAR, and despite the many challenges in video-content analysis, previous researchers have proposed effective methods for human activity recognition. However, the literature lacks a proper context for identifying, analyzing, and evaluating HAR methods and challenges in a coherent and uniform form that achieves a macro vision of the HAR subject. Hence, it seems necessary to present a comprehensive and comparative analytical review of HAR on video data, organized around methods and challenges. The novelty of this research is a comparative analytical framework called HAR-CO, which provides a macro vision, a coherent structure, and a deeper understanding of HAR. HAR-CO consists of three main parts: first, categorizing HAR methods in a coherent and structured way based on the data-collection hardware; second, categorizing HAR challenges systematically based on sensor attachment; and third, a comparative analytical evaluation of each class of HAR approaches with respect to the challenges facing researchers. We think the HAR-CO framework can serve as a road map and guide for selecting more appropriate HAR methods and can provide new research directions for researchers.
Article
Full-text available
This paper discusses a new class of Polak–Ribière–Polyak (PRP) conjugate gradient methods for an impulse noise removal problem, which is transformed into an unconstrained optimization problem with a smooth objective function. The new class contains four improved conjugate gradient directions, three of which are regularized versions of PRP conjugate gradient directions and the last of which is a combination of the Fletcher–Reeves and PRP conjugate gradient directions. It is shown on several known images that our new methods are more robust and efficient than other known methods for impulse noise removal.
Article
RGBT object tracking is an important research topic due to the complementary features of visible and thermal infrared images. However, established fast-tracking Siamese RGBT trackers do not fully exploit the predicted results and global information, which leads to poor tracking performance in challenging scenarios such as target deformation and fast movement. To address these challenges, we propose a dynamic feature-memory transformer RGBT tracking framework in this article. Specifically, we extract the features of visible and infrared image pairs separately using convolutional networks. We then fuse the complementary semantic information using our proposed complementary semantic fusion module, which consists of a group convolution, a channel attention structure, a flattening operation, and channel reduction. Next, we concatenate the dynamic memory feature extracted from the predicted frame pair, the template pair feature, and the search-region pair feature, and feed the concatenated features into transformer encoder-decoder modules to acquire global information and further fuse the RGBT features. Finally, we predict the object state using the tracking head while deciding whether to update the dynamic memory feature using the reliability estimator. We conducted extensive experiments on the widely used RGBT234 and LasHeR datasets to evaluate the proposed framework's performance. The experimental results show that our method outperforms state-of-the-art RGBT trackers, demonstrating the effectiveness of our approach. Overall, our framework leverages visible and infrared images and incorporates global information and dynamic memory features, offering a promising solution to the challenges in RGBT object tracking. Code will be open-sourced at https://github.com/ELOESZHANG/DFMTNet.
Article
Infrared (IR) ship tracking is becoming increasingly important in various applications. However, it remains a challenging task, as the information that can be obtained from IR images is limited. Aiming to enhance IR ship tracking accuracy, we propose an innovative approach built on a feature integration module (FIM) and a backup matching module (BMM). FIM takes appearance features, complete intersection over union (CIoU), and motion direction metrics (MDMs) into account. For appearance feature extraction, an end-to-end characteristic learning strategy with a cross-guided multigranularity fusion network is proposed to obtain more complete appearance features and enhance re-identification (re-ID) accuracy, which helps distinguish individual IR ship targets. A backup matching strategy is then used to match the unmatched tracks and detections after cascaded matching. Virtual trajectories are generated for the matched tracks to optimize parameters via a parameter optimization module (POM). The accumulation of errors caused by the lack of observations in the Kalman filter (KF) is thereby reduced, the position of IR ships can be estimated more accurately, and more robust IR ship tracking is achieved. In addition, we present a sequential-frame IR ship tracking (SFIST) dataset, providing the first public benchmark for testing IR ship tracking performance. Experimental results indicate that the multiple object tracking accuracy (MOTA), multiple object tracking precision (MOTP), and identity switches (IDs) of the proposed method are 73.441, 80.826, and 32, respectively, outperforming other state-of-the-art methods. This demonstrates the superior robustness of the proposed method, particularly when IR ships are occluded or target texture information is lacking. Our dataset is available at https://github.com/echo-sky/SFIST .
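CIoU, one of the metrics the FIM combines, is a standard quantity: IoU minus a center-distance penalty (normalized by the enclosing-box diagonal) and an aspect-ratio consistency penalty. A minimal plain-Python version, assuming boxes as `(x1, y1, x2, y2)` corner tuples, looks like this:

```python
import math

def iou(a, b):
    """Plain IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def ciou(a, b):
    """Complete IoU: IoU - rho^2/c^2 - alpha * v."""
    i = iou(a, b)
    # squared distance between the two box centers
    cxa, cya = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cxb, cyb = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    rho2 = (cxa - cxb) ** 2 + (cya - cyb) ** 2
    # squared diagonal of the smallest enclosing box
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # aspect-ratio consistency term
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1 - i + v) if (1 - i + v) > 0 else 0.0
    return i - rho2 / c2 - alpha * v
```

Unlike plain IoU, CIoU still distinguishes between candidate boxes that overlap the track equally but sit at different distances or aspect ratios, which is what makes it useful as a matching cost.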
Article
Full-text available
Visual perception plays an important role in the industrial information field, especially in robotic grasping applications. To detect the object to be grasped quickly and accurately, salient object detection (SOD) is applied to this task. Although existing SOD methods have achieved impressive performance, they still have limitations in the complex interference environments of practical applications. To better handle such environments, a novel triple-modal image fusion strategy is proposed to implement SOD for robotic visual perception, namely visible-depth-thermal (VDT) SOD. Meanwhile, we build an image acquisition system under variable lighting scenes and construct a novel benchmark dataset for VDT SOD (the VDT-2048 dataset). The multiple modal images assist each other in highlighting the salient regions, but interference is inevitably introduced as well. To achieve effective cross-modal feature fusion while suppressing information interference, a hierarchical weighted suppress interference (HWSI) method is proposed. Comprehensive experimental results prove that our method achieves better performance than the state-of-the-art methods.
Article
Full-text available
When dealing with complex thermal infrared (TIR) tracking scenarios, a single category of features is not sufficient to portray the appearance of the target, which drastically affects the accuracy of TIR target tracking methods. To address this problem, we propose an adaptively multi-feature fusion model (AMFT) for the TIR tracking task. Specifically, our AMFT tracking method adaptively integrates hand-crafted features and deep convolutional neural network (CNN) features, taking advantage of the complementarity between different features to accurately locate the target. Additionally, the model is updated using a simple but effective model update strategy to adapt to changes in the target during tracking. We have shown through ablation studies that the adaptively multi-feature fusion model in our AMFT tracking method is very effective. Our AMFT tracker performs favorably on the PTB-TIR and LSOTB-TIR benchmarks compared with state-of-the-art trackers.
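The abstract does not spell out AMFT's fusion rule. As a hedged illustration of the general idea of adaptively fusing two feature channels, the sketch below weights two correlation response maps by their peak confidence before locating the target; the peak-based weighting and the function names are assumptions, not the paper's actual scheme.

```python
def fuse_responses(resp_hc, resp_deep, eps=1e-8):
    """Fuse a hand-crafted-feature response map with a deep-feature
    response map, weighting each by its peak confidence (an assumed
    rule for illustration), then return the fused map and the argmax
    location as the predicted target position."""
    peak_hc = max(max(row) for row in resp_hc)
    peak_deep = max(max(row) for row in resp_deep)
    w_hc = peak_hc / (peak_hc + peak_deep + eps)
    w_deep = 1.0 - w_hc
    fused = [[w_hc * a + w_deep * b for a, b in zip(ra, rb)]
             for ra, rb in zip(resp_hc, resp_deep)]
    # predicted target location = argmax of the fused map
    best = max((v, r, c) for r, row in enumerate(fused)
               for c, v in enumerate(row))
    return fused, (best[1], best[2])
```

The point of any such scheme is that whichever feature is more confident in the current frame dominates the fused response, so the two feature types cover each other's failure cases.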
Article
Full-text available
The feature models used by existing Thermal InfraRed (TIR) tracking methods are usually learned from RGB images due to the lack of a large-scale TIR image training dataset. However, these feature models are less effective at representing TIR objects, and they struggle to distinguish distractors effectively because they do not contain fine-grained discriminative information. To this end, we propose a dual-level feature model containing a TIR-specific discriminative feature and a fine-grained correlation feature for robust TIR object tracking. Specifically, to distinguish inter-class TIR objects, we first design an auxiliary multi-classification network to learn the TIR-specific discriminative feature. Then, to recognize intra-class TIR objects, we propose a fine-grained aware module to learn the fine-grained correlation feature. These two kinds of features complement each other and represent TIR objects at the inter-class and intra-class levels, respectively. The two feature models are constructed using a multi-task matching framework and are jointly optimized on the TIR object tracking task. In addition, we develop a large-scale TIR image dataset to train the network for learning TIR-specific feature patterns. To the best of our knowledge, this is the largest TIR tracking training dataset, with the richest object classes and scenarios. To verify the effectiveness of the proposed dual-level feature model, we propose an offline TIR tracker (MMNet) and an online TIR tracker (ECO-MM) based on the feature model and evaluate them on three TIR tracking benchmarks. Extensive experimental results on these benchmarks demonstrate that the proposed algorithms perform favorably against the state-of-the-art methods.
Article
Full-text available
RGB salient object detection (SOD) has made great progress. However, the performance of single-modal salient object detection decreases significantly in challenging scenes such as low light or darkness. To deal with these challenges, the thermal infrared (T) image is introduced into salient object detection; this fused modality is called RGB-T salient object detection. To achieve deep mining of the unique characteristics of each single modality and full integration of cross-modality information, a novel Cross-Guided Fusion Network (CGFNet) for RGB-T salient object detection is proposed. Specifically, a Cross-Scale Alternate Guiding Fusion (CSAGF) module is proposed to mine high-level semantic information and provide global context support. Subsequently, we design a Guidance Fusion Module (GFM) to achieve sufficient cross-modality fusion by using one modality as the main guidance and the other as auxiliary. Finally, the Cross-Guided Fusion Module (CGFM) is presented and serves as the main decoding block. Each decoding block consists of two parts, with each modality's information serving in turn as the main guidance: the cross-shared Cross-Level Enhancement (CLE) and the Global Auxiliary Enhancement (GAE). The main difference between the two parts is that the GFM uses different modalities as the main guide. Comprehensive experimental results prove that our method achieves better performance than the state-of-the-art salient detection methods. The source code has been released at: https://github.com/wangjie0825/CGFNet.git .
Article
Full-text available
Training a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning the feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and utilize the proposed multi-cycle consistency loss to learn the feature extraction network. Furthermore, we propose a similarity dropout strategy that drops low-quality training sample pairs, and we adopt a cycle trajectory consistency loss for each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and utilize a Siamese correlation tracking framework to locate the target using forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance compared with state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.
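The forward-backward idea can be illustrated with a toy loss: track forward through a clip, track backward from the last prediction, and penalize any mismatch between the two trajectories, since a robust tracker should return to where it started. The 2-D point trajectories below are simplified stand-ins for the box trajectories the paper actually supervises with.

```python
def cycle_consistency_loss(forward_track, backward_track):
    """Mean squared distance between the forward trajectory and the
    reversed backward trajectory. Zero when tracking is perfectly
    cycle-consistent; positive when the backward pass drifts."""
    paired = zip(forward_track, reversed(backward_track))
    sq = [(fx - bx) ** 2 + (fy - by) ** 2
          for (fx, fy), (bx, by) in paired]
    return sum(sq) / len(sq)
```

Because the loss needs no ground-truth boxes beyond the starting point, it can serve as a self-supervised training signal, which is the core of the self-SDCT setup.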
Article
Full-text available
Real-time object tracking is demanded in versatile embedded computer vision applications. To this end, a low-cost high-speed VLSI system is proposed for object tracking, based on robust and computationally simple unified textural and dynamic compressive sensing features and simple elliptic matching with fast online template updating capability. The system introduces a memory-centric architectural paradigm, multiple-level pipelines and parallel processing circuits to achieve high frame rate while consuming as few hardware resources as possible. We implemented an FPGA prototype of the proposed VLSI tracking system. Under a 100 MHz clock frequency, the prototype achieves over 600 frame/s processing speed under 320 × 240 image resolution and obtains robust tracking results.
Conference Paper
Full-text available
In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over 730K bounding boxes in total. To the best of our knowledge, LSOTB-TIR is the largest and most diverse TIR object tracking benchmark to date. To evaluate a tracker on different attributes, we define 4 scenario attributes and 12 challenge attributes in the evaluation dataset. By releasing LSOTB-TIR, we encourage the community to develop deep learning based TIR trackers and evaluate them fairly and comprehensively. We evaluate and analyze more than 30 trackers on LSOTB-TIR to provide a series of baselines, and the results show that deep trackers achieve promising performance. Furthermore, we retrain several representative deep trackers on LSOTB-TIR, and their results demonstrate that the proposed training dataset significantly improves the performance of deep TIR trackers. Codes and dataset are available at https://github.com/QiaoLiuHit/LSOTB-TIR.
Article
Full-text available
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is trained only on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network: one level focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities, which can adaptively learn the weights of the two similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
Conference Paper
Full-text available
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor taking fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing the TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing the TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects in the levels of inter-class and intra-class respectively. These two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to train the network for adapting the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against the state-of-the-art methods. Codes and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Article
Full-text available
Hyperparameters are numerical pre-sets whose values are assigned prior to the commencement of a learning process. Selecting appropriate hyperparameters is often critical for achieving satisfactory performance in many vision problems such as deep learning-based visual object tracking. Yet it is difficult to determine their optimal values, in particular, adaptive ones for each specific video input. Most hyperparameter optimization algorithms depend on searching a generic range and they are imposed blindly on all sequences. In this paper, we propose a novel dynamical hyperparameter optimization method that adaptively optimizes hyperparameters for a given sequence using an action-prediction network leveraged on continuous deep Q-learning. Since the observation space for visual object tracking is significantly more complex than those in traditional control problems, existing continuous deep Q-learning algorithms cannot be directly applied. To overcome this challenge, we introduce an efficient heuristic strategy to handle high dimensional state space and meanwhile accelerate the convergence behavior. The proposed algorithm is applied to improve two representative trackers, a Siamese-based one and a correlation-filter-based one, to evaluate its generality. Their superior performances on several popular benchmarks are clearly demonstrated.
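The paper uses continuous deep Q-learning with an action-prediction network, which is too heavyweight to reproduce here. As a drastically simplified illustration of the underlying update rule, here is a single tabular Q-learning step over hypothetical discretized states and hyperparameter-adjustment actions; the state/action names are placeholders, not the paper's.

```python
def q_update(Q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Q is a dict keyed by (state, action); unseen pairs default to 0."""
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)
    old = Q.get((state, action), 0.0)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return Q[(state, action)]
```

In the hyperparameter-tuning setting, a "state" would summarize the current tracking observation, an "action" would nudge a hyperparameter up or down, and the reward would reflect the resulting tracking quality; the deep variant replaces the table with a network over the continuous observation space.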
Article
Full-text available
Thermal infrared (TIR) target tracking is a challenging task, as it entails learning an effective model to identify the target under poor target visibility and cluttered backgrounds. Sparse representation, a typical appearance modeling approach, has been successfully exploited in TIR target tracking. However, the discriminative information of the target and its surrounding background is usually neglected in the sparse coding process. To address this issue, we propose a mask sparse representation (MaskSR) model, which combines sparse coding with high-level semantic features for TIR target tracking. We first obtain the pixel-wise labeling results of the target and its surrounding background in the last frame, and then use these results to train target-specific deep networks in a supervised manner. From the output features of the deep networks, a high-level pixel-wise discriminative map of the target area is obtained. We introduce the binarized discriminative map as a mask template for the sparse representation and develop a novel algorithm to collaboratively represent the reliable and unreliable target parts partitioned by the mask template, which explicitly indicates their different discriminative capabilities with labels 1 and 0. The proposed MaskSR model controls the dominance of the reliable target part in the reconstruction process via a weighted scheme. We solve this multi-parameter constrained problem with a customized alternating direction method of multipliers (ADMM). The model is applied to TIR target tracking within the particle filter framework. To improve sampling effectiveness while decreasing computation cost, a discriminative particle selection strategy based on a kernelized correlation filter replaces the previous random sampling for searching useful candidates. The proposed tracking method was tested on the VOT-TIR2016 benchmark. The experimental results show that it performs significantly better than various state-of-the-art methods in TIR target tracking.
Article
Full-text available
Thermal infrared (TIR) pedestrian tracking is an important component of numerous computer vision applications and has a major advantage: it can track pedestrians in total darkness. The ability to evaluate TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field; however, no such benchmark dataset has been available. In this paper, we develop a TIR pedestrian tracking dataset for evaluating TIR pedestrian trackers. The dataset includes 60 thermal sequences with manual annotations, and each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. Furthermore, to gain more insight into TIR pedestrian trackers, we divided a tracker's functions into three components: feature extractor, motion model, and observation model. We then conducted three comparison experiments on our benchmark to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
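Benchmark evaluations like the one described above commonly summarize a tracker with a center-location-error precision score. A minimal sketch of that metric (the function name and the conventional 20-pixel threshold are our illustration, not taken from the paper):

```python
import numpy as np

def center_error_precision(pred_centers, gt_centers, threshold=20.0):
    """Fraction of frames whose predicted target center lies within
    `threshold` pixels of the ground-truth center."""
    pred = np.asarray(pred_centers, dtype=float)
    gt = np.asarray(gt_centers, dtype=float)
    errors = np.linalg.norm(pred - gt, axis=1)  # per-frame Euclidean center error
    return float(np.mean(errors <= threshold))
```

Sweeping the threshold over a range of pixel values yields the precision plot used for ranking trackers.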
Chapter
Full-text available
Object tracking remains a critical and challenging problem with many applications in computer vision. To address this challenge, more and more researchers are applying deep learning to obtain powerful features for better tracking accuracy. In this paper, a novel triplet loss is proposed to extract expressive deep features for object tracking by adding it to the Siamese network framework in place of the pairwise loss used for training. Without adding any inputs, our approach is able to utilize more elements for training and achieve more powerful features via combinations of the original samples. Furthermore, we provide a theoretical analysis, combining a comparison of gradients with back-propagation, to prove the effectiveness of our method. In experiments, we apply the proposed triplet loss to three real-time trackers based on Siamese networks. The results on several popular tracking benchmarks show that our variants operate at almost the same frame rate as the baseline trackers while achieving superior tracking performance, as well as accuracy comparable to recent state-of-the-art real-time trackers.
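The triplet loss mentioned above has the standard hinge-with-margin form; a minimal sketch using squared Euclidean distances (the margin value here is illustrative, not the paper's setting):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss pushing d(anchor, positive) below
    d(anchor, negative) by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)  # squared distance to the positive
    d_neg = np.sum((anchor - negative) ** 2)  # squared distance to the negative
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is already `margin` farther from the anchor than the positive, so only hard triplets contribute gradients.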
Article
Full-text available
Discriminative correlation filters (DCFs) have been shown to perform superbly in visual tracking. They need only a small set of training samples from the initial frame to generate an appearance model. However, existing DCFs learn the filters separately from feature extraction and update these filters using a moving-average operation with an empirical weight, so these DCF trackers hardly benefit from end-to-end training. In this paper, we propose the CREST algorithm to reformulate DCFs as a one-layer convolutional neural network. Our method integrates feature extraction, response map generation, and model update into the neural network for end-to-end training. To reduce model degradation during online updates, we apply residual learning to take appearance changes into account. Extensive experiments on the benchmark datasets demonstrate that our CREST tracker performs favorably against state-of-the-art trackers.
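The ridge-regression DCF that CREST reformulates as a convolutional layer has a classical closed-form solution in the Fourier domain. A single-channel numpy sketch (variable names are ours; real trackers use multi-channel features, cosine windows, and online updates):

```python
import numpy as np

def train_dcf(patch, label, lam=1e-2):
    """Closed-form correlation filter: W = X* . Y / (X* . X + lam),
    computed element-wise in the Fourier domain."""
    X = np.fft.fft2(patch)
    Y = np.fft.fft2(label)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def response(W, patch):
    """Correlation response map for a search patch; its peak
    gives the estimated target location."""
    return np.real(np.fft.ifft2(W * np.fft.fft2(patch)))
```

Evaluating the filter on the training patch reproduces (approximately) the Gaussian label, with the peak at the labeled target center.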
Article
Full-text available
Unlike visual object tracking, thermal infrared object tracking can track a target object in total darkness. It therefore has broad applications, such as rescue and video surveillance at night. However, there are few studies in this field, mainly because thermal infrared images have several unwanted attributes that make it difficult to obtain discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer convolutional neural networks pre-trained on visible images to thermal infrared tracking. We observe that features from the fully-connected layer are not suitable for thermal infrared tracking due to their lack of spatial information about the target, while features from the convolution layers are. Moreover, features from a single convolution layer are not robust to various challenges. Based on these observations, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). First, we use pre-trained convolutional neural networks to extract features from multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution-layer features; these weak trackers produce response maps of the target's location. Finally, we propose an ensemble method that coalesces these response maps into a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
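The ensemble step that coalesces the weak trackers' response maps can be sketched as a weighted sum of peak-normalized maps. This is a generic stand-in for illustration, not the paper's exact ensemble rule:

```python
import numpy as np

def fuse_responses(response_maps, weights):
    """Fuse weak-tracker response maps into one stronger map
    and return the fused map with its peak location."""
    fused = np.zeros_like(response_maps[0], dtype=float)
    for r, w in zip(response_maps, weights):
        fused += w * (r / (r.max() + 1e-12))  # normalize each map's peak to 1
    peak = np.unravel_index(fused.argmax(), fused.shape)
    return fused, peak
```

With layer-dependent weights, a confident deep-layer map can dominate a noisy shallow-layer one while both still contribute.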
Article
Thermal InfraRed (TIR) target trackers are easily interfered with by similar objects and are susceptible to target occlusion. To solve these problems, we propose a structural target-aware model (STAMT) for thermal infrared target tracking tasks. Specifically, the proposed STAMT tracker can learn a target-aware model, which pays more attention to the target area so as to accurately identify the target among similar objects. In addition, considering that the target may be partially occluded during tracking, a structural weight model is proposed to locate the target through its unoccluded, reliable parts. Ablation studies show the effectiveness of each component of the proposed tracker. Without bells and whistles, the experimental results demonstrate that our STAMT tracker performs favorably against state-of-the-art trackers on the PTB-TIR and LSOTB-TIR datasets.
Article
In recent years, Siamese network based trackers have significantly advanced the state-of-the-art in real-time tracking. Despite their success, Siamese trackers tend to suffer from high memory costs, which restrict their applicability to mobile devices with tight memory budgets. To address this issue, we propose a distilled Siamese tracking framework to learn small, fast and accurate trackers (students), which capture critical knowledge from large Siamese trackers (teachers) via a teacher-students knowledge distillation model. This model is intuitively inspired by the one-teacher-vs-multiple-students learning method typically employed in schools. In particular, our model contains a single teacher-student distillation module and a student-student knowledge sharing mechanism. The former is designed using a tracking-specific distillation strategy to transfer knowledge from a teacher to students. The latter is utilized for mutual learning between students to enable in-depth knowledge understanding. Extensive empirical evaluations on several popular Siamese trackers demonstrate the generality and effectiveness of our framework. Moreover, the results on five tracking benchmarks show that the proposed distilled trackers achieve compression rates of up to 18× and frame rates of 265 FPS, while obtaining tracking accuracy comparable to the base models.
Article
Tracking in unmanned aerial vehicle (UAV) scenarios is one of the main components of target tracking tasks. Unlike target tracking in general scenarios, target tracking in UAV scenarios is very challenging because of factors such as small target scale and the aerial viewpoint. Although DCFs-based trackers have achieved good results in general tracking scenarios, the boundary effect caused by the dense sampling method reduces tracking accuracy, especially in UAV tracking scenarios. In this work, we propose learning an adaptive spatial-temporal context-aware (ASTCA) model within the DCFs-based tracking framework to improve tracking accuracy and reduce the influence of the boundary effect, thereby enabling our tracker to handle UAV tracking tasks more appropriately. Specifically, our ASTCA model learns a spatial-temporal context weight, which can precisely distinguish the target from the background in UAV tracking scenarios. Moreover, considering the small target scale and aerial view in UAV tracking scenarios, our ASTCA model incorporates spatial context information within the DCFs-based tracker, which effectively alleviates background interference. Extensive experiments demonstrate that our ASTCA method performs favorably against state-of-the-art tracking methods on standard UAV datasets.
Article
Temporal and spatial contexts, characterizing target appearance variations and target-background differences, respectively, are crucial for improving the online adaptive ability and instance-level discriminative ability of object tracking. However, most existing trackers focus on either the temporal context or the spatial context during tracking and have not exploited these contexts simultaneously and effectively. In this paper, we propose a Spatial-TEmporal Memory (STEM) network to exploit these contexts jointly for object tracking. Specifically, we develop a key-value structured memory model equipped with a key-value index-based memory reading mechanism to model the spatial and temporal contexts simultaneously. To update the memory with new target states and ensure the diversity of the memory, we introduce a similarity-aware memory update scheme. In addition, we construct an entropy-guided ensemble strategy to fuse the prediction models based on these two contexts, such that these two contexts can be exploited to estimate the target state jointly. Extensive experimental results on eight challenging datasets, including OTB2015, TC128, UAV123, VOT2018, LaSOT, TrackingNet, GOT-10k, and OxUvA, demonstrate that the proposed method performs favorably against state-of-the-art trackers.
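The key-value memory reading described in this abstract is, at its core, attention over stored target states: a query is compared against all memory keys, and the matching values are blended. A minimal sketch (dimensions and names are illustrative, not the paper's architecture):

```python
import numpy as np

def read_memory(query, keys, values):
    """Soft key-value memory read: a softmax over query-key
    similarities weights the stored values."""
    scores = keys @ query            # dot-product similarity per memory slot
    w = np.exp(scores - scores.max())
    w /= w.sum()                     # softmax over memory slots
    return w @ values                # similarity-weighted combination of values
```

When one key clearly matches the query, the read essentially returns that slot's value; ambiguous queries return a blend, which is what makes the memory robust to appearance drift.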
Article
This brief focuses on the problem of fixed-time adaptive trajectory tracking control for a quadrotor unmanned aerial vehicle (QUAV) subject to error constraints. By virtue of the fixed-time command filter, the phenomenon of "explosion of complexity" (EOC) found in the conventional backstepping method is successfully eliminated, while the effect of the filtered error is removed via a new fractional-power error compensation mechanism. By combining prescribed performance control and the backstepping design method with the command filter technique, a fixed-time adaptive control strategy is established. It is proved that all signals of the closed-loop system are fixed-time bounded, and that the position and attitude tracking errors of the QUAV approach an arbitrarily small region around the origin within the predefined performance bounds in fixed time. Finally, simulation results show the effectiveness of the presented fixed-time control algorithm.
Article
This article intends to address an adaptive asymptotic tracking control with prescribed performance function for a class of nonaffine systems with unknown disturbances. First, the nonaffine system is transformed into an affine system by using a set of alternative state variables. Subsequently, a prescribed performance function with predefined convergence rate, maximum overshoot and steady-state error is introduced. To achieve the asymptotic tracking control performance, the robust integral of the sign of the error (RISE) feedback term is utilized in the control design to reject the unknown external disturbances and NN approximation errors. Finally, an adaptive controller is presented so that the asymptotic tracking performance with guaranteed prescribed performance is achieved. Comparative experiments are provided to show the effectiveness of the proposed control scheme.
Article
A low power real-time visual object tracking (VOT) processor using the Siamese network (SiamNet) is proposed for mobile devices. Two key features enable real-time VOT with low power consumption on mobile devices. First, correlation-based spatial early stopping (CSES) is proposed to reduce the computational workload; CSES removes ~56.8% of the SiamNet's overall computation by gradually eliminating the background. Second, the dual mode reuse core (DMRC) is proposed to support both the convolution layer and the cross-correlation layer with high core utilization. Finally, the proposed VOT processor is implemented in 28 nm CMOS technology and occupies 0.42 mm². The proposed processor achieves a success rate of 0.587 and a precision of 0.778 on the OTB-100 dataset with SiamRPN++-AlexNet. Compared to previous VOT processors, it shows state-of-the-art performance with lower power consumption, achieving 64.1 mW peak power and 58.2 mW tracking power at 32.1 frames per second (fps) of real-time VOT on mobile devices.
Chapter
In this paper, we provide a deep analysis of Siamese-based trackers and find that one core reason for their failure on challenging cases can be attributed to the problem of decisive samples missing during offline training. Furthermore, we notice that the samples given in the first frame can be viewed as the decisive samples for the sequence, since they contain rich sequence-specific information. To make full use of these sequence-specific samples, we propose a compact latent network to quickly adjust the tracking model to adapt to new scenes. A statistic-based compact latent feature is proposed to efficiently capture the sequence-specific information for this fast adjustment. In addition, we design a new training approach based on a diverse sample mining strategy to further improve the discrimination ability of our compact latent network. To evaluate the effectiveness of our method, we apply it to adjust a recent state-of-the-art tracker, SiamRPN++. Extensive experimental results on five recent benchmarks demonstrate that the adjusted tracker achieves promising improvements in tracking accuracy at almost the same speed. The code and models are available at https://github.com/xingpingdong/CLNet-tracking.
Chapter
Current state-of-the-art trackers rely only on a target appearance model in order to localize the object in each frame. Such approaches are however prone to fail in case of e.g. fast appearance changes or presence of distractor objects, where a target appearance model alone is insufficient for robust tracking. Having the knowledge about the presence and locations of other objects in the surrounding scene can be highly beneficial in such cases. This scene information can be propagated through the sequence and used to, for instance, explicitly avoid distractor objects and eliminate target candidate regions. In this work, we propose a novel tracking architecture which can utilize scene information for tracking. Our tracker represents such information as dense localized state vectors, which can encode, for example, if a local region is target, background, or distractor. These state vectors are propagated through the sequence and combined with the appearance model output to localize the target. Our network is learned to effectively utilize the scene information by directly maximizing tracking performance on video segments. The proposed approach sets a new state-of-the-art on 3 tracking benchmarks, achieving an AO score of 63.6% on the recent GOT-10k dataset.
Article
This brief proposes an adaptation of Generalized Predictive Control (GPC) for ramp-reference tracking. A second-order difference operation and the plant model are used to obtain an augmented model with two embedded integrators whose output is the tracking error. Unlike other GPC-based tracking algorithms, the proposed approach does not require information about the reference parameters, and the GPC prediction horizon is composed of predicted errors instead of expected plant outputs. Thus, the optimization function and receding-horizon strategy used in conventional GPC can be applied to derive the control law. Simulation and experimental results prove that the proposed approach can successfully track constant and ramp references. The proposed method is applicable to single-input single-output plants; however, the mathematical background presented in this brief can be used in the development of new GPC strategies.
Article
Discriminative correlation filters (DCFs) have been widely used in the visual tracking community in recent years. DCFs-based trackers determine the target location through a response map generated by the correlation filters and determine the target scale by a fixed scale factor. However, the response map is vulnerable to noise interference, and the fixed scale factor cannot reflect the real scale change of the target, both of which can obviously reduce tracking performance. In this paper, to solve these drawbacks, we propose learning a metric learning model in the correlation filters framework for visual tracking (called CFML). This model uses a metric learning function to solve the target scale problem. In particular, we adopt a hard negative mining strategy to alleviate the influence of noise on the response map, which effectively improves tracking accuracy. Extensive experimental results demonstrate that the proposed CFML tracker achieves competitive performance compared with state-of-the-art trackers.
Article
Discriminative correlation filters (DCFs) have been widely used in the tracking community recently. DCFs-based trackers utilize samples generated by circular shifts of an image patch to train a ridge regression model and estimate the target location using a response map generated by the correlation filters. However, the generated samples produce some negative effects, and the response map is vulnerable to noise interference, which degrades tracking performance. In this paper, to solve these drawbacks, we propose a target-focusing convolutional regression (CR) model for visual object tracking (called TFCR). This model uses a target-focusing loss function to alleviate the influence of background noise on the response map of the current frame, which effectively improves tracking accuracy. In particular, it can balance the imbalance between positive and negative samples by reducing some of the effects that negative samples exert on the object appearance model. Extensive experimental results illustrate that our TFCR tracker achieves competitive performance compared with state-of-the-art trackers.
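A loss that down-weights background responses, in the spirit of the target-focusing idea above, can be sketched as a spatially weighted L2 regression loss. This is a generic illustration, not the paper's exact formulation:

```python
import numpy as np

def target_focused_loss(response, label, target_mask, bg_weight=0.1):
    """Weighted L2 between response and label: full weight inside the
    target region, reduced weight on background positions."""
    w = np.where(target_mask, 1.0, bg_weight)  # per-pixel loss weights
    return float(np.sum(w * (response - label) ** 2))
```

Shrinking `bg_weight` toward zero makes background clutter contribute less to the fit, which is one simple way to rebalance positive and negative samples.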
Article
This paper proposes a DC-DC buck converter using an analog coarse-fine self-tracking zero-current detection (AST-ZCD) scheme. The AST-ZCD detects the zero-current by measuring the voltage level across a freewheeling transistor. It adjusts the nMOS turn-off time using an amplifier, capacitors, and current sources instead of large numbers of shift register bits and unit delay cells in the conventional digital self-tracking zero-current detection (DST-ZCD). It also reduces the zero-current self-tracking time by using coarse-fine current-sources when the output current transition is large. The proposed DC-DC buck converter was fabricated with a 0.18 μm CMOS process. The AST-ZCD reduces the area by 94%, the power consumption by 80%, and the zero-current self-tracking time by 82% compared to the DST-ZCD.
Chapter
We introduce Spatial-Temporal Memory Networks for video object detection. At its core, a novel Spatial-Temporal Memory module (STMM) serves as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM’s design enables full integration of pretrained backbone CNN weights, which we find to be critical for accurate detection. Furthermore, in order to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame. Our method produces state-of-the-art results on the benchmark ImageNet VID dataset, and our ablative studies clearly demonstrate the contribution of our different design choices. We release our code and models at http://fanyix.cs.ucdavis.edu/project/stmn/project.html.
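The MatchTrans-style alignment described above can be sketched as follows: each position in the current frame pulls memory from a small neighborhood of the previous frame, weighted by normalized feature affinity. This is a simplified, loop-based illustration with our own names, not the released STMN code:

```python
import numpy as np

def align_memory(mem_prev, feat_prev, feat_cur, k=1):
    """Warp the previous frame's memory to the current frame: each
    position averages memory vectors from a (2k+1)^2 neighborhood,
    weighted by softmax feature affinity (matching coefficients)."""
    H, W, _ = feat_cur.shape
    aligned = np.zeros_like(mem_prev)
    for i in range(H):
        for j in range(W):
            ys = range(max(0, i - k), min(H, i + k + 1))
            xs = range(max(0, j - k), min(W, j + k + 1))
            coords = [(y, x) for y in ys for x in xs]
            sims = np.array([feat_cur[i, j] @ feat_prev[y, x] for y, x in coords])
            w = np.exp(sims - sims.max())
            w /= w.sum()  # normalized matching coefficients
            aligned[i, j] = sum(wc * mem_prev[y, x] for wc, (y, x) in zip(w, coords))
    return aligned
```

If the target shifts by one cell between frames, the affinity weights concentrate on the shifted position, so the memory follows the motion instead of smearing.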
Article
Depth cameras have recently become popular, and many vision problems can be better solved with depth information. However, how to integrate depth information into a visual tracker to overcome challenges such as occlusion and background distraction is still under-investigated in the current visual tracking literature. In this paper, we investigate a 3D extension of the classical mean-shift tracker, whose greedy gradient ascent strategy is generally considered unreliable in conventional 2D tracking. Through careful study of the physical properties of 3D point clouds, we reveal that objects which may appear adjacent in a 2D image form distinctive modes in the 3D probability distribution approximated by kernel density estimation, so finding the nearest mode using 3D mean-shift can reliably work in tracking. Based on this understanding of 3D mean-shift, we propose two important mechanisms to further boost the tracker's robustness: one enables the tracker to be aware of potential distractions and make corresponding adjustments to the appearance model; the other enables the tracker to detect and recover from tracking failures caused by total occlusion. The proposed method is both effective and computationally efficient. On a conventional PC, it runs at more than 60 FPS without GPU acceleration.
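The kernel-density mode seeking at the heart of this 3D tracker can be sketched as plain Gaussian-kernel mean-shift on a point cloud (a generic illustration, without the paper's distraction-awareness or occlusion recovery):

```python
import numpy as np

def mean_shift_3d(points, start, bandwidth=1.0, iters=50):
    """Gaussian-kernel mean-shift on a 3D point cloud: iteratively
    move toward the kernel-weighted mean of nearby points until
    convergence to the nearest density mode."""
    x = np.asarray(start, dtype=float)
    for _ in range(iters):
        d2 = np.sum((points - x) ** 2, axis=1)          # squared distances
        w = np.exp(-d2 / (2 * bandwidth ** 2))          # Gaussian kernel weights
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < 1e-6:
            break
        x = x_new
    return x
```

Two spatially separated clusters form two modes; starting near either one converges to that cluster's mode, which is why adjacent-looking 2D objects stay separable in 3D.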
Article
Most thermal infrared (TIR) tracking methods are discriminative, which treat the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class difference of the arbitrary objects, while the tracking task mainly deals with the within-class difference of the same objects. In this paper, we cast the TIR tracking problem as a similarity verification task, which is well coupled to the objective of tracking task. We propose a TIR tracker via a hierarchical Siamese convolutional neural network (CNN), named HSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN coalescing the multiple hierarchical convolutional layers. Then, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before we transfer the network into the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the most similar one as the tracked target. Extensive experimental results on the benchmarks: VOT-TIR 2015 and VOT-TIR 2016, show that our proposed method achieves favorable performance against the state-of-the-art methods.
Article
Infrared object tracking is a key technology in many surveillance applications. General visual tracking algorithms designed for color images cannot handle infrared targets very well due to their relatively low resolution and blurred edges. This paper presents a new tracking-by-detection method based on online structural learning. We show how to train the classifier efficiently with dense samples through Fourier techniques and careful implementation. Furthermore, we introduce an effective feature representation for infrared objects. Finally, we demonstrate the performance of the proposed tracker on public infrared sequences with top accuracy and robustness. Meanwhile, our single-threaded C++ implementation of the algorithm achieves an average tracking speed of 215 FPS on a modern CPU.
Article
Correlation Filters (CFs) have recently demonstrated excellent performance in rapidly tracking objects under challenging photometric and geometric variations. The strength of the approach comes from its ability to efficiently learn - "on the fly" - how the object is changing over time. A fundamental drawback of CFs, however, is that the background of the object is not modelled over time, which can result in suboptimal results. In this paper we propose a Background-Aware CF that can model how both the foreground and background of the object vary over time. Our approach, like conventional CFs, is extremely computationally efficient - and extensive experiments over multiple tracking benchmarks demonstrate the superior accuracy and real-time performance of our method compared to state-of-the-art trackers, including those based on a deep learning paradigm.
Conference Paper
This paper proposes a novel sparse representation-based infrared target tracking method using multi-feature fusion to compensate for the incomplete description provided by any single feature. In the proposed method, we extract the intensity histogram together with Local Entropy and Local Contrast Mean Difference information for feature representation. To combine the various features, particle candidates and the multiple feature descriptors of the dictionary templates are encoded as kernel matrices. Each candidate particle is sparsely represented as a linear combination of a set of atom vectors of a dictionary, and the sparse target template representation model is efficiently constructed using a kernel trick. Finally, within the particle filter framework, the weights of the particles are determined by their sparse-coefficient reconstruction errors for tracking. For template updating, a strategy employing Adaptive Structural Local Sparse Appearance tracking (ASLAS) is implemented. The experimental results on benchmark datasets demonstrate better performance over many existing methods.
Active learning for deep visual tracking
  • D Yuan
  • X Chang
  • Q Liu
  • D Wang
  • Z He
D. Yuan, X. Chang, Q. Liu, D. Wang, and Z. He, "Active learning for deep visual tracking," arXiv preprint arXiv:2110.13259, 2021.
Video object detection with an aligned spatial-temporal memory
  • F Xiao
  • Y J Lee
F. Xiao and Y. J. Lee, "Video object detection with an aligned spatial-temporal memory," in ECCV, 2018, pp. 485-501.