Article

Aligned Spatial-Temporal Memory Network for Thermal Infrared Target Tracking

Abstract

Thermal infrared (TIR) target tracking is susceptible to occlusion and similarity interference, both of which noticeably degrade tracking results. To address this problem, we develop an Aligned Spatial-Temporal Memory network-based Tracking method (ASTMT) for the TIR target tracking task. Specifically, we model the scene information in the TIR target tracking scenario using a spatial-temporal memory network, which effectively stores the scene information and reduces similarity interference, which benefits target estimation. In addition, we use an aligned matching module to correct the parameters of the spatial-temporal memory network model, which alleviates the impact of occlusion on target estimation and further boosts tracking accuracy. Ablation experiments demonstrate that both the spatial-temporal memory network and the aligned matching module in the proposed ASTMT tracker are effective. Our ASTMT tracking method performs well on the PTB-TIR and LSOTB-TIR benchmarks compared with other tracking methods.
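The abstract does not specify how the memory is read, so the following is only a generic sketch of a key-value memory read of the kind commonly used in space-time memory networks, not the authors' implementation. The function names (`softmax`, `memory_read`), the scaled dot-product similarity, and the toy vectors are all assumptions made for illustration. It shows how a query feature from the current frame can retrieve a weighted mix of values stored from earlier frames, so that a memory entry resembling the query dominates the readout.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query_key, memory_keys, memory_values):
    """Attention-weighted readout of stored values for one query location.

    query_key:     feature vector of one location in the current frame
    memory_keys:   key vectors stored from earlier frames
    memory_values: value vectors stored alongside those keys
    (All names and the scaled dot-product similarity are illustrative assumptions.)
    """
    scale = math.sqrt(len(query_key))
    # Similarity between the query and every stored key.
    scores = [sum(q * k for q, k in zip(query_key, key)) / scale
              for key in memory_keys]
    weights = softmax(scores)
    dim = len(memory_values[0])
    # Weighted sum of the stored values, one output dimension at a time.
    readout = [sum(w * v[d] for w, v in zip(weights, memory_values))
               for d in range(dim)]
    return readout, weights


# Hypothetical toy memory: one target-like entry and one distractor-like entry.
memory_keys = [[1.0, 0.0], [0.0, 1.0]]
memory_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
readout, weights = memory_read([4.0, 0.0], memory_keys, memory_values)
```

In this toy case the query aligns with the first stored key, so the first weight is the larger one and the readout is dominated by the first stored value.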

... The HSSNet [22] tracker refers to the Siamese architecture [24][25][26][27] which views the tracking task as similarity template matching and uses the Siamese-based convolutional neural network with multi-convolutional layers to obtain the semantic and spatial features of the tracking TIR target. These TIR target trackers utilize the off-the-shelf network for feature extraction which achieved some performance improvements; however, the tracking performance of these trackers is still significantly limited because this network is not trained for the TIR target-tracking task [21,28,29]. To make the feature extraction network adapt to the TIR tracking task, it is necessary to use TIR training samples for their convolutional feature extraction network training [30][31][32]. ...
... Deep Thermal Infrared Tracking: Deep learning has been widely used in image processing tasks because of its powerful representation ability [29,37,38]. Nowadays, deep learning-based trackers have attracted more and more researchers due to their good representation ability [39][40][41]. ...
... RGB images are sensitive to lighting changes, while TIR images work well in varying light conditions, including complete darkness. Consequently, applying RGB-trained models to TIR data can lead to performance degradation due to differences in feature distribution [28,29,49]. To bridge this gap, domain adaptation techniques are needed to adapt RGB models to TIR features or retrain models specifically for TIR [50,51]. ...
Article
Full-text available
The limited availability of thermal infrared (TIR) training samples leads to suboptimal target representation by convolutional feature extraction networks, which adversely impacts the accuracy of TIR target tracking methods. To address this issue, we propose an unsupervised cross-domain model (UCDT) for TIR tracking. Our approach leverages labeled training samples from the RGB domain (source domain) to train a general feature extraction network. We then employ a cross-domain model to adapt this network for effective target feature extraction in the TIR domain (target domain). This cross-domain strategy addresses the challenge of limited TIR training samples effectively. Additionally, we utilize an unsupervised learning technique to generate pseudo-labels for unlabeled training samples in the source domain, which helps overcome the limitations imposed by the scarcity of annotated training data. Extensive experiments demonstrate that our UCDT tracking method outperforms existing tracking approaches on the PTB-TIR and LSOTB-TIR benchmarks.
... The traffic congestion factor was calculated by estimating traffic density, traffic flow, road occupancy, and traffic speed to estimate the road traffic condition. In predicting traffic conditions, some LSTM-based models were used to predict traffic states with time-series input data [10], [11]. Apart from forecasting, some efforts were made to estimate traffic conditions through detecting and tracking vehicle locations [12]. ...
... There is a trade-off between the inference speed and the accuracy of the intermediate results. Furthermore, in order to detect the lanes accurately, the system has to accumulate sufficient initial data and corresponding time duration to converge and conclude the formations of lanes on ...
Article
Full-text available
The Macao Government provides web-based streaming videos for the public to monitor live traffic and road conditions across the city. This allows individuals to review the latest road traffic conditions online before planning their travels. To let road users make better and faster decisions, it is desirable to design an automated subsystem in an Intelligent Transportation System (ITS) that analyzes the available live video streams and instantly recommends multiple travel routes, if possible, for each query. In the paper, we propose a real-time road traffic condition evaluation system. Its design is based on a combination of deep learning models (YOLO and BoTSORT) and a modified Non-Maximum Suppression (mNMS) algorithm. The mNMS strategy removes the need to manually tune the NMS parameters. By deploying YOLO with our mNMS, the object detection efficiency on live videos improves significantly. Together with the BoTSORT method, we can track the moving vehicles, create the corresponding motion trajectories, and identify traffic lanes with high correctness. The generated trajectory then operates as a filtering mechanism in assessing real-time road traffic conditions. By separating the lanes based on observation angles and using a per-lane status score independently, we further enhance the overall system performance. Through thorough experiments on the live videos, our design correctly estimates traffic status with high accuracy and without needing any manual parametric adjustments.
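For context on the parameter tuning mentioned in this abstract, the sketch below shows standard greedy NMS, not the paper's mNMS, whose details are not given here. The function names, the corner-format boxes, and the default threshold of 0.5 are assumptions for illustration. The `iou_threshold` argument is the kind of hand-tuned parameter that a modified NMS would aim to eliminate.

```python
def iou_xyxy(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Standard greedy NMS; returns indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou_xyxy(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: two overlapping boxes on one vehicle, one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept_default = greedy_nms(boxes, scores)                    # threshold 0.5
kept_loose = greedy_nms(boxes, scores, iou_threshold=0.7)   # threshold 0.7
```

The two overlapping boxes have an IoU of about 0.68, so the result changes with the threshold: the lower-scoring duplicate is suppressed at 0.5 but survives at 0.7, which illustrates why the threshold usually needs tuning per scene.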
... Based on different issues of the scene complexity, different methods of tracking such as feature-based tracking [7,8,9,10], model-based tracking [11,12,13,14,15,16,17], region-based tracking [18], and deformable template-based tracking [19,20,21] have been proposed. It is found from the literature that illumination variation and occlusion are the two crucial issues of the scene during object tracking [7,8,9,10,22,23,24,25]. In case of occlusion, incomplete feature extraction of the target object at the occluded region leads to unstable tracking, whereas a sudden change of intensity in the scene due to illumination variation results in distortions in feature distribution thus resulting in an improper target model. ...
... A multi-level feature enhancement unit together with a global channel attention network is proposed by Gu et al. [25] to strengthen the target model. Yuan et al. [24] proposed a video object tracking scheme for thermal infrared target tracking. Hence, researchers have been motivated to handle these two key issues of the scene by adapting the notion of fusion of multiple potential features to model the target in a particle filter framework. ...
Article
Full-text available
Video object tracking in real-world scenarios is one of the challenging problems of computer vision. The issue is compounded in the presence of varying illumination conditions, dynamic entities of the background, and bad weather conditions. In this research, a particle filter based new video object tracking scheme is developed with the proposed notions of target remodeling and reinitialization. During the tracking phase, the target is remodeled in each frame to take care of the changing scene dynamics over frames. The target is remodeled by fused feature distributions chosen from the created bank of fused feature distributions having discriminating potential to differentiate the target and the background in a given frame. The fused feature bank is created by fusing two features from the set consisting of Color, LBP, and HOG features. The features are fused probabilistically where the weights are determined based on the discriminating ability of a given feature. In order to achieve high tracking accuracy, the deviation of the tracker is evaluated in each frame using the notion of time motion history while reinitialization of the tracker position takes place when the deviation is above a preselected threshold. Besides, the proposed algorithm has been implemented successfully on a Raspberry Pi based hardware setup and thus becomes a potential candidate for real time implementation. The proposed scheme is successfully tested on videos from DAVIS 2016, LASIESTA, OTB 100, and CDnet 2014 databases and in most of the cases the tracking accuracy is found to be higher than those of the existing algorithms.
... The novel FAAM subnetwork incorporated into the tracking system plays a crucial role in improving performance [27]. Yuan et al. (2022) developed an Aligned Spatial-Temporal Memory network-based Tracking method (ASTMT) for thermal infrared (TIR) target tracking. This research focuses on the challenges of occlusion and similarity interference in TIR target tracking and proposes a spatial-temporal memory network to effectively store scene information and decrease interference, thereby enhancing detection accuracy in complex scenarios [28]. ...
... Yuan et al. (2022) developed an Aligned Spatial-Temporal Memory network-based Tracking method (ASTMT) for thermal infrared (TIR) target tracking. This research focuses on the challenges of occlusion and similarity interference in TIR target tracking and proposes a spatial-temporal memory network to effectively store scene information and decrease interference, thereby enhancing detection accuracy in complex scenarios [28]. Gu et al. (2022) presented RPformer, a robust parallel transformer for visual tracking in complex scenes. ...
Article
Full-text available
This research presents a groundbreaking approach in aerial image analysis by integrating the Real-Time Detection and Recognition (RT-DETR-X) model with the Slicing Aided Hyper Inference (SAHI) methodology, utilizing the VisDrone-DET dataset. Aimed at enhancing the efficiency of drone technology across a spectrum of applications, including water conservancy, geological exploration, and military operations, this study focuses on harnessing the real-time, end-to-end object detection capabilities of RT-DETR-X. Characterized by its high-speed and high-accuracy performance, particularly in UAV aerial photography, RT-DETR-X demonstrates a remarkable 54.8% Average Precision (AP) and 74 frames per second (FPS), surpassing similar models in both speed and accuracy. The research thoroughly examines the VisDrone-DET dataset, which encompasses a diverse range of small targets in UAV aerial photography scenes. Covering 10 distinct categories, the dataset provides a robust platform for rigorous model testing. The study emphasizes the utilization of the original image dataset for comprehensive training and evaluation, alongside the practical implementation of the SAHI method for enhanced detection of small-scale objects. Through an in-depth exploration of the model’s performance in various scenarios and a detailed analysis of the environmental setup, this paper underscores the impact of integrating RT-DETR with the SAHI approach. The findings reveal significant progress in drone detection technologies, offering a holistic framework for effective and efficient aerial surveillance. The integration not only boosts the model’s detection accuracy but also opens new avenues for advanced image analysis in UAV applications.
... By incorporating active learning techniques, future lane detection models could achieve comparable accuracy with significantly less labeled data. Finally, temporal modeling has emerged as a powerful tool for handling dynamic environments, as demonstrated in Aligned Spatial-Temporal Memory Network for Thermal Infrared Target Tracking [18]. This work leverages spatial-temporal memory networks to store and retrieve relevant features over time, improving tracking accuracy in challenging conditions like low visibility and dynamic backgrounds. ...
Article
Full-text available
Lane detection plays a crucial role in autonomous driving systems by enabling vehicles to comprehend road structure and ensure safe navigation. However, the current performance of lane line detection models, such as CCNet, exhibits limitations in handling difficult driving conditions like shadows, nighttime, no lines, and dazzle, which significantly impact the safety of autonomous driving. In addition, due to the lack of attention to both the global and local aspects of road images, this issue becomes even more pronounced. To address these challenges, we propose a novel network architecture named Criss-Cross Attention Enhanced Cross-Layer Refinement Network (CCCNet). By integrating the strengths of criss-cross attention and cross-layer refinement mechanisms, CCCNet effectively captures long-range dependencies and global context information from the input images, leading to more reliable lane detection in complex environments. Extensive evaluations on standard datasets, including CULane and TuSimple, demonstrate that CCCNet outperforms CLRNet and other leading models by achieving higher accuracy and robustness, especially in challenging scenarios. In addition, we publicly release our code and models to encourage further research advancements in lane detection technologies at https://github.com/grass2440/CCCNet.
... It is noteworthy that, PMBM filters seemingly do not provide explicit track continuity between time steps [11]. However, during the filtering recursion, there is an implicit PMBM on the set of trajectories. ...
Article
Full-text available
Poisson multi‐Bernoulli mixture (PMBM) filtering is an effective method for multi‐extended target tracking. However, it makes little use of information from multiple time scans, leading to potential errors such as jumps and discontinuities in the target trajectories. To address this, we propose a gamma Gaussian inverse Wishart PMBM filter that uses multiple time scans for smoothing. The results of continuous tracking are smoothed and corrected at regular intervals, resulting in more accurate tracking performance. Simulation results show that the proposed algorithm effectively reduces generalized optimal sub‐pattern assignment errors compared to the original PMBM filter.
... Lu et al. [11] proposed a composite transformer that fully utilizes target feature information through spatiotemporal feature fusion. Di et al. [12] employed a spatiotemporal memory network to model scene information, effectively store scene features, and reduce interference from similarities that are detrimental to target tracking. To more effectively enhance the features of small targets, we designed a feature enhancement module based on the attention mechanism. ...
Article
Full-text available
To address the challenges of diverse defect shapes, small defect sizes, and real-time detection on ceramic tile surfaces, this paper proposes an efficient anchor-free detector that integrates adaptive receptive fields and feature enhancement. First, the anchor-free YOLOv8 is proposed as the detection framework to eliminate the need for setting anchor-related hyperparameters, thus avoiding their impact on performance. Second, an adaptive receptive field module (ARFM) is introduced, which extracts defect features from multiple scales through constructing several parallel branches and fuses the feature maps from different receptive fields, thereby dynamically adjusting the receptive field to detect various defects. Finally, a feature enhancement module (FEM) is added to the neck part, enhancing feature representation from both global and spatial dimensions to reduce feature information loss and improve defect detection performance. Experiments show that our detector achieves a mean average precision of 70.4%, an improvement of 6.1% over YOLOv8, while maintaining a high detection speed of 121.2 frames per second. This performance surpasses the state-of-the-art detection methods. These results indicate that our proposed model can achieve satisfactory defect detection performance, meeting the industry’s real-time detection needs.
... Therefore, researchers have begun to concentrate on extracting spatiotemporal information and delving deeper into temporal information in video sequences. ASTMT tracker [46] develops an aligned spatial-temporal memory network to utilize temporal information, compensating for the limitations of appearance-based trackers. SwinTrack [18] records historical tracking results and fuses them into a motion token. ...
Article
Full-text available
Extracting additional spatiotemporal information from video sequences is critical for accurately perceiving target appearance changes during visual tracking. However, most learning-based trackers utilize only a single search image and template from a video for training, resulting in a lack of temporal information and low data utilization. To address these issues, we present an innovative Trajectory Guided Tracking (TGTrack) framework, which leverages the historical states of the target to predict its current location. Specifically, we construct trajectory tokens derived from tracking results in historical frames, integrating the position and scale information of the target. We propose a trajectory prediction module to utilize these trajectory tokens to generate the potential scope of current target. Furthermore, to enhance the inference efficiency of the tracker, we eliminate manually customized heads and post-processing steps. Consequently, we achieve a good balance between inference speed and effectiveness. Extensive experimental results demonstrate that our TGTrack achieves leading performance across multiple benchmarks.
... Furthermore, MMNet [65] performs multi-level feature matching for TIR targets by adopting a discriminative matching module for inter-class recognition and a fine-grained aware module for intra-class recognition. To reduce the influence of occlusion and distractors, Yuan et al. [66] design a spatial-temporal memory network, collecting high-quality results in previous frames for tracking. ...
Article
Full-text available
With the development of unmanned aerial vehicle (UAV) technology, the threat of UAV intrusion is no longer negligible. Therefore, drone perception, especially anti-UAV tracking technology, has gathered considerable attention. However, both traditional Siamese and transformer-based trackers struggle in anti-UAV tasks due to the small target size, clutter backgrounds and model degradation. To alleviate these challenges, a novel contrastive-augmented memory network (CAMTracker) is proposed for anti-UAV tracking tasks in thermal infrared (TIR) videos. The proposed CAMTracker conducts tracking through a two-stage scheme, searching for possible candidates in the first stage and matching the candidates with the template for final prediction. In the first stage, an instance-guided region proposal network (IG-RPN) is employed to calculate the correlation features between the templates and the searching images and further generate candidate proposals. In the second stage, a contrastive-augmented matching module (CAM), along with a refined contrastive loss function, is designed to enhance the discrimination ability of the tracker under the instruction of contrastive learning strategy. Moreover, to avoid model degradation, an adaptive dynamic memory module (ADM) is proposed to maintain a dynamic template to cope with the feature variation of the target in long sequences. Comprehensive experiments have been conducted on the Anti-UAV410 dataset, where the proposed CAMTracker achieves the best performance compared to advanced tracking algorithms, with significant advantages on all the evaluation metrics, including at least 2.40%, 4.12%, 5.43% and 5.48% on precision, success rate, success AUC and state accuracy, respectively.
... LMSCO [6] integrated the motion information of the pedestrian into VGGNet to overcome the background clutter and motion blur of the TIR image. ASTMT [7] designed an aligned spatial-temporal memory network based on ResNet to take advantage of learning scene information for pedestrian localization. Well-performing handcrafted neural networks require substantial effort from human experts. ...
Article
Full-text available
Manually-designed network architectures for thermal infrared pedestrian tracking (TIR-PT) require substantial effort from human experts. AlexNet and ResNet are widely used as backbone networks in TIR-PT applications. However, these architectures were originally designed for image classification and object detection tasks, which are less complex than the challenges presented by TIR-PT. This paper makes an early attempt to search an optimal network architecture for TIR-PT automatically, employing single-bottom and dual-bottom cells as basic search units and incorporating eight operation candidates within the search space. To expedite the search process, a random channel selection strategy is employed prior to assessing operation candidates. Classification, batch hard triplet, and center loss are jointly used to retrain the searched architecture. The outcome is a high-performance network architecture that is both parameter- and computation-efficient. Extensive experiments proved the effectiveness of the automated method.
... Furthermore, the high computational complexity of current deep learning MOT algorithms hinders their ability to meet real-time application requirements. This issue primarily arises from the widespread adoption of the Transformer [19] architecture in mainstream MOT algorithms [20][21][22][23]. ...
Article
Full-text available
This study proposes FLSTrack, an end-to-end multi-object tracking algorithm that integrates Focused Linear Attention with dual decoders. The algorithm aims to address the limitations of current multi-object tracking methods, including poor performance in complex scenarios, inadequate data association, and high computational complexity. Initially, the SwinTransformer is paired with a Focused Linear Attention module to enhance the network’s ability to extract both local and global information, thereby reducing computational costs. Subsequently, a dual-branch decoder based on window attention is developed, with one branch dedicated to tracking and the other to detecting targets in image frames. To further enhance the algorithm’s speed, the complex feature re-identification (ReID) network is replaced with the BYTE data association method. To compensate for the loss of feature appearance resulting from omitting the ReID network, the SIoU loss function is introduced, significantly improving target localization accuracy. The experimental results of FLSTrack on the MOT17, MOT20, DanceTrack, and KITTI datasets show superior performance. Moreover, with an inference speed nearing 30 FPS, the algorithm achieves an optimal balance between tracking accuracy and real-time performance.
... On the other hand, visible images excel in representing object texture details. Due to the lack of color vision patterns in thermal infrared images, which are characterized by low resolution and blurred contours, single-modal thermal infrared object tracking [36,43,65,69] also faces many challenges. As shown in Fig. 1, features in thermal infrared images are clearer during nighttime, while features in RGB images perform better under well-lit conditions. ...
Article
Supervised RGBT (SRGBT) tracking tasks need both expensive and time-consuming annotations. Therefore, the implementation of Self-Supervised RGBT (SSRGBT) tracking methods has become increasingly important. Straightforward SSRGBT tracking methods use pseudo-labels for tracking, but inaccurate pseudo-labels can lead to object drift, which severely affects tracking performance. This paper proposes a self-supervised RGBT object tracking method (S2OTFormer) to bridge the gap between tracking methods supervised under pseudo-labels and ground truth labels. Firstly, to provide more robust appearance features for motion cues, we introduce a Multi-Modal Hierarchical Transformer module (MHT) for feature fusion. This module allocates weights to both modalities and strengthens the expressive capability of the MHT module through multiple nonlinear layers to fully utilize the complementary information of the two modalities. Secondly, in order to solve the problems of motion blur caused by camera motion and inaccurate appearance information caused by pseudo-labels, we introduce a Motion-Aware Mechanism (MAM). The MAM extracts the average motion vectors from the previous multi-frame search frame features and constructs the consistency loss with the motion vectors of the current search frame features. The motion vectors of inter-frame objects are obtained by reusing the inter-frame attention map to predict coordinate positions. Finally, to further reduce the effect of inaccurate pseudo-labels, we propose an Attention-Based Multi-Scale Enhancement Module. By introducing cross-attention to achieve more precise and accurate object tracking, this module overcomes the receptive field limitations of traditional CNN tracking heads. We demonstrate the effectiveness of S2OTFormer on four large-scale public datasets through extensive comparisons as well as numerous ablation experiments. The source code is available at https://github.com/LiShenglana/S2OTFormer .
... Li et al. [18] proposed to learn hierarchical spatial-aware features suitable for TIR tracking under the matching task. The ASTMT [19] tracker uses a spatiotemporal memory network to model the TIR target tracking scene information and reduce the similarity interference that is favorable to the target. The feature space contains both the spatial structure features of shallow convolution and the semantic features of deep convolution, so it has a stronger representation ability. ...
Article
Full-text available
Thermal infrared (TIR) target tracking is an important topic in the computer vision area. The TIR images are not affected by ambient light and have strong environmental adaptability, making them widely used in battlefield perception, video surveillance, assisted driving, etc. However, TIR target tracking faces problems such as relatively insufficient information and lack of target texture information, which significantly affects the tracking accuracy of the TIR tracking methods. To solve the above problems, we propose a TIR target tracking method based on a Siamese network with a hierarchical attention mechanism (called: SiamHAN). Specifically, the CIoU Loss is introduced to make full use of the regression box information to calculate the loss function more accurately. The GCNet attention mechanism is introduced to reconstruct the feature extraction structure of fine-grained information for the fine-grained information of thermal infrared images. Meanwhile, for the feature information of the hierarchical backbone network of the Siamese network, the ECANet attention mechanism is used for hierarchical feature fusion, so that it can fully utilize the feature information of the multi-layer backbone network to represent the target. On the LSOTB-TIR, the hierarchical attention Siamese network achieved a 2.9% increase in success rate and a 4.3% increase in precision relative to the baseline tracker. Experiments show that the proposed SiamHAN method has achieved competitive tracking results on the thermal infrared testing datasets.
... Deep learning technology, having made significant breakthroughs in numerous fields, has demonstrated its efficacy in processing high-dimensional data and extracting features [28][29][30][31]. These advances have prompted scholars to explore deep learning's potential for signal processing applications [32,33]. ...
Article
Full-text available
With technological advancements and scientific progress, mobile robots have found widespread applications across various fields. To enable robots to perform tasks safely and effectively in diverse and unknown environments, this paper proposes a ground medium classification algorithm for robots based on feature fusion and an adaptive spatio-temporal cascade network. Specifically, the original directional features in the dataset are first transformed into quaternion form. Then, spatio-temporal forward and reverse neighbors are identified using KD trees, and their connection strengths are evaluated via a kernel density estimation algorithm to determine the final set of neighbors. Subsequently, based on the connection strengths determined in the previous step, we perform noise reduction on the features using discrete wavelet transform. The noise-reduced features are then weighted and fused to generate a new feature representation. After feature fusion, the Adaptive Dynamic Convolutional Neural Network (ADC) proposed in this paper is cascaded with the Long Short-Term Memory (LSTM) network to further extract hybrid spatio-temporal feature information from the dataset, culminating in the final terrain classification. Experiments on the terrain type classification dataset demonstrate that our method achieves an average accuracy of 97.46% and an AUC of 99.80%, significantly outperforming other commonly used algorithms in the field. Furthermore, the effectiveness of each module in the proposed method is further demonstrated through ablation experiments.
... The popularity of Siamese trackers [9,[17][18][19][20][21][22] has been steadily increasing in recent years. These trackers typically employ a two-stream pipeline to separately extract features from the template and search region. ...
Article
Full-text available
The introduction of the one-stream one-stage framework has led to remarkable advances in visual object tracking, resulting in exceptional tracking performance. Most existing one-stream one-stage tracking pipelines have achieved a relative balance between accuracy and speed. However, they focus solely on integrating feature learning and relational modelling. In complex scenes, the tracking performance often falls short due to confounding factors such as changes in target scale, occlusion, and fast motion. In these cases, numerous trackers cannot sufficiently exploit the target feature information and face the dilemma of information loss. To address these challenges, we propose a screening enrichment for transformer-based tracking. Our method incorporates a screening enrichment module as an additional processing operation in the integration of feature learning and relational modelling. The module effectively distinguishes target areas within the search regions. It also enriches the associations between tokens of target area information. In addition, we introduce our box validation module. This module uses the target position information from the previous frame to validate and revise the target position in the current frame. This process enables more accurate target localization. Through these innovations, we have developed a powerful and efficient tracker. It achieves state-of-the-art performance on six benchmark datasets, including GOT-10K, LaSOT, TrackingNet, UAV123, TNL2K and VOT2020. Specifically, on the GOT-10K benchmark, our proposed tracker reaches an impressive Success Rate ($SR_{0.5}$) of 85.4 and an Average Overlap (AO) of 75.3. Experimental results show that our proposed tracker outperforms other state-of-the-art trackers in terms of tracking accuracy.
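To make the two reported metrics concrete, here is a minimal sketch of how an Average Overlap and a success rate at an overlap threshold of 0.5 are commonly computed from per-frame IoU values. The function names, the (x, y, w, h) box format, the strict "greater than" comparison, and the toy boxes are assumptions for illustration; official benchmark toolkits may differ in details such as how sequences or classes are averaged.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def average_overlap(overlaps):
    """Mean IoU over all evaluated frames."""
    return sum(overlaps) / len(overlaps)


def success_rate(overlaps, threshold=0.5):
    """Fraction of frames whose IoU exceeds the threshold."""
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)


# Hypothetical four-frame sequence with a fixed ground-truth box.
ground_truth = [(0, 0, 10, 10)] * 4
predictions = [(0, 0, 10, 10), (5, 0, 10, 10), (0, 0, 10, 8), (20, 20, 10, 10)]
overlaps = [iou(p, g) for p, g in zip(predictions, ground_truth)]
ao = average_overlap(overlaps)
sr_05 = success_rate(overlaps, threshold=0.5)
```

Here the per-frame overlaps are 1.0, 1/3, 0.8, and 0.0, so two of the four frames count as successes at the 0.5 threshold. Benchmarks typically report both values as percentages.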
... In recent years, passive tracking techniques have witnessed significant advancements in terms of both accuracy and speed. Various approaches have been proposed to address challenges such as occlusion, blurring, changes in lighting conditions, and target deformation, which commonly occur during the tracking process [4][5][6][7][8]. Early tracking algorithms predominantly relied on optical flow methods [9], filtering techniques [10], and kernel-based algorithms [11]. ...
Article
Full-text available
This paper addresses the challenge of active tracking of space non-cooperative targets, a critical task in various aerospace applications. Traditional active tracking algorithms often require extensive data and suffer from limited generalization ability, making them inefficient for tracking targets with diverse characteristics. To overcome these limitations, we propose an end-to-end active target tracking method named Meta-Reinforcement Learning based Active Visual Tracking (MRLAVT). This approach integrates meta-reinforcement learning, enabling the system to quickly adapt to new tasks by leveraging experiences from previous tasks. By employing convolutional neural networks to extract information from images and generate corresponding actions, MRLAVT demonstrates strong adaptability and robustness in tracking targets with varying characteristics. Experimental results confirm the effectiveness of our proposed algorithm, showcasing superior performance in scenarios involving both few adaptations and non-adaptation. Overall, MRLAVT significantly reduces the complexity of system integration while achieving high-quality tracking results.
... Although DFMT-Net [102] successfully integrates the temporal information of the sequence, it requires manual adjustment of the update frame number of the tracker, which requires us to have prior knowledge of the tracked data, such as the sequence length of the target appearance change. Yuan et al. [129] utilized a spatiotemporal memory network to model scene information, aiming to mitigate similarity interference and consequently enhance the performance of target tracking. In addition, Yuan et al. [130] utilized a model update strategy to accommodate target variations during tracking. ...
Article
RGBT tracking seeks to leverage both visible and thermal infrared images to enhance the robustness of target tracking. This method makes up for the limitations of single-sensor tracking. The RGB and thermal infrared images complement each other effectively, enabling the tracker to operate seamlessly in complex environments day and night. However, RGBT tracking still faces challenges such as thermal crossover, occlusion, illumination changes, rapid target movement, and background clutter. These problems hinder the performance improvement of RGBT trackers. Researchers have proposed numerous RGBT tracking methods in response to these challenges over time. This paper aims to comprehensively review and summarize the latest research progress in RGBT tracking. First, we categorize the existing tracking methods according to the RGBT tracking framework. Then, we introduce the datasets/benchmarks used to evaluate RGBT tracking methods. Subsequently, we present the results of representative tracking methods on multiple testing benchmarks, providing a more intuitive presentation of current research advances in the field of RGBT tracking. Finally, research directions are discussed to further advance the development of RGBT tracking.
... Inevitably, additional label of challenging attributes inspired many studies to overcome complex scenarios in visual tracking. [36] combined spatial-temporal memory network and aligned matching module to decrease the interference of occlusion and similarity interference. RPformer [30] and EANTrack [29] enhanced tracking performance in complex scenarios by introducing transformer module and attention mechanism into Siamese tracker which show satisfactory results on challenging attributes of several tracking benchmarks. ...
Article
Full-text available
In recent years, Siamese network-based trackers have achieved significant improvements in real-time tracking. Despite their success, performance bottlenecks caused by unavoidably complex scenarios in target-tracking tasks are becoming increasingly non-negligible. For example, occlusion and fast motion are factors that can easily cause tracking failures and are labeled in many high-quality tracking databases as challenging attributes. In addition, Siamese trackers tend to suffer from high memory costs, which restricts their applicability to mobile devices with tight memory budgets. To address these issues, we propose a Specialized teachers Distilled Siamese Tracker (SDST) framework to learn a student tracker, which is small, fast, and has enhanced performance in challenging attributes. SDST introduces two types of teachers for multi-teacher distillation: general teacher and specialized teachers. The former imparts basic knowledge to the students. The latter is used to transfer specialized knowledge to students, which helps improve their performance in challenging attributes. For students to efficiently capture critical knowledge from the two types of teachers, SDST is equipped with a carefully designed multi-teacher knowledge distillation model. Our model contains two processes: general teacher-student knowledge transfer and specialized teachers-student knowledge transfer. Extensive empirical evaluations of several popular Siamese trackers demonstrated the generality and effectiveness of our framework. Moreover, the results on Large-scale Single Object Tracking (LaSOT) show that the proposed method achieves a significant improvement of more than 2–4% in most challenging attributes. SDST also maintained high overall performance while achieving compression rates of up to 8x and framerates of 252 FPS and obtaining outstanding accuracy on all challenging attributes.
... In the realm of visual tracking, certain works have stood out due to their focus on refining the attention mechanisms in Transformers 8,9,[13][14][15][16][17] . Most of these trackers excel in managing lengthy sequential relationships and capturing comprehensive global information, yet they tend to overlook target-specific details within the designated search region. ...
Article
Full-text available
The Transformer-based Siamese networks have excelled in the field of object tracking. Nevertheless, a notable limitation persists in their reliance on ResNet as backbone, which lacks the capacity to effectively capture global information and exhibits constraints in feature representation. Furthermore, these trackers struggle to effectively attend to target-relevant information within the search region using multi-head self-attention (MSA). Additionally, they are prone to robustness challenges during online tracking and tend to exhibit significant model complexity. To address these limitations, we propose a novel tracker named ASACTT, which includes a backbone network, feature fusion network and prediction head. First, we improve the Swin-Transformer-Tiny to enhance its global information extraction capabilities. Second, we propose an adaptive sparse attention (ASA) to focus on target-specific details within the search region. Third, we leverage position encoding and historical candidate data to develop a dynamic template updater (DTU), which ensures the preservation of the initial frame’s integrity while gracefully adapting to variations in the target’s appearance. Finally, we optimize the network model to maintain accuracy while minimizing complexity. Experiments on five benchmark datasets verified the effectiveness of ASACTT and demonstrated that it was highly comparable to other state-of-the-art methods. Notably, in the GOT-10K evaluation, our tracker achieved an outstanding success score of 75.3% at 36 FPS, significantly surpassing other trackers with comparable model parameters.
... First, [13] introduces a robust Stereoscopic Transformer network for enhanced tracking, which includes a Channel Feature Awareness Network (CFAN), a Global Channel Attention Network (GCAN), and a Multi-Level Feature Enhancement Unit (MFEU). On the other hand, ASTMT [14] is a tracking method for TIR targets. Utilizing a spatio-temporal memory network, it effectively stores scene information, minimizes interference, and provides a corrective aligned matching module. ...
Article
Full-text available
Multi-Object Tracking, also known as Multi-Target Tracking, is an important area of computer vision with various applications in different domains. The advent of deep learning has had a profound impact on this field, forcing researchers to explore innovative avenues. Deep learning methods have become the cornerstone of today's state-of-the-art solutions, consistently delivering exceptional tracking results. However, the significant computational demands of deep learning models require powerful hardware resources that do not always match real-time tracking requirements, limiting their practical applicability in real-world scenarios. Thus, there is an imperative to strike a balance by merging robust deep learning strategies with conventional approaches to enable more accessible, cost-effective solutions that meet real-time requirements. This paper embarks on this endeavor by presenting a hybrid strategy for real-time multi-target tracking. It effectively combines a classical optical flow algorithm with a deep learning architecture tailored for human crowd tracking systems. This hybrid approach achieves a commendable balance between tracking accuracy and computational efficiency. The proposed architecture, subjected to extensive experimentation in various settings, demonstrated notable results, achieving a Mean Object Tracking Accuracy (MOTA) of 0.608. This level of performance placed it as the highest ranking solution on the MOT15 benchmark, surpassing the state-of-the-art benchmark of 0.549, and consistently ranked among the superior models on the MOT17 and MOT20 benchmarks. Additionally, the incorporation of the optical flow phase resulted in a substantial reduction in processing time, nearly halving the duration, while simultaneously maintaining accuracy levels comparable to established techniques.
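For readers unfamiliar with the MOTA figure quoted in this abstract, the standard definition is one minus the ratio of accumulated errors (missed targets, false positives, identity switches) to the total number of ground-truth objects. The sketch below illustrates that formula only; the function name `mota` and the example counts are hypothetical and are not taken from the paper.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_ground_truth: int) -> float:
    """MOTA = 1 - (FN + FP + IDSW) / GT, with counts summed over all frames."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


if __name__ == "__main__":
    # Made-up counts for illustration: 40 errors over 100 ground-truth objects.
    print(mota(false_negatives=30, false_positives=8,
               id_switches=2, num_ground_truth=100))
```

Because the error count can exceed the number of ground-truth objects, MOTA can be negative; a value such as 0.608 means accumulated errors amount to about 39% of the ground-truth object count.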
... The visible part is fully utilized for detection, thereby effectively reducing the effect of occlusions. To cope with the occlusion problem of target tracking, Yuan et al. [49] developed an Aligned Spatial-Temporal Memory network-based Tracking method (ASTMT), and Gu et al. [50] proposed a novel shared-encoder dual-pipeline Transformer architecture. Compared to the Common Mean Square Error Loss Function, L1 Loss Function, and L2 Loss Function, more new loss strategies are proven to be useful, such as IoU Loss [51], Focal Loss [52], GIoU Loss [53], DIoU Loss [54], EIoU Loss [55] and so on. ...
Article
Full-text available
Classroom learning behavior recognition can provide effective technical support for teaching and learning. However, in natural classroom teaching scenarios, classroom learning behaviors are often missed or falsely detected due to occlusion between people and the small size of objects. To tackle these issues, this study proposed an improved classroom learning behavior recognition algorithm (YOLOv8n_BT) based on YOLOv8n. On the one hand, for the occlusion problem of classroom learning behaviors, this study incorporated the BRA into the Backbone to better capture feature information; on the other hand, for the small object problem of classroom learning behaviors for back-row students, this study added a Tiny Object Detection Layer (TODL) to detect small targets better. Experiments show that the BRA and the TODL can significantly improve the model performance. The YOLOv8n_BT model, which incorporated both the BRA and the TODL into the YOLOv8n (baseline) model simultaneously, has the most significant performance improvement. Compared with the YOLOv8n (baseline), the YOLOv8n_BT model improved by 3.0%, 6.7%, 5.0%, 3.6%, and 9.0% on P, R, F1, mAP50, and mAP50-90, respectively. The detection performance of YOLOv8n_BT also outperforms other state-of-the-art methods.
... While existing research in the field of infrared object detection and tracking has made significant strides, much of it has focused either on detection [3][4][5] or tracking [6][7][8] in isolation. Furthermore, studies specifically addressing infrared tracking in complex environments, such as dense urban traffic scenarios, are relatively scarce. ...
Article
Full-text available
Infrared object detection and tracking in dense urban traffic remain a challenge due to factors such as low contrast, small intra‐class differences, and frequent false positives and negatives. To overcome these, the authors introduce YOLO‐IR, an algorithm based on the enhanced YOLOv8s, and YOLO‐DeepOC‐IR, a comprehensive infrared multi‐object tracking method for urban traffic, integrating both detection and tracking. During preprocessing, three infrared image enhancement techniques, local contrast multi‐scale enhancement, non‐local means, and contrast limited adaptive histogram equalization, are applied for better reliability in dense scenes. To further improve the performance, the original YOLOv8s backbone is replaced with MobileVITv3 to enhance detection accuracy and robustness. This infrared feature extraction module, incorporated into the detector, combines canny edge detection, Gabor filtering, and open operation layers, significantly boosting object detection in infrared imagery. The tracker's feature processing capabilities are improved using the learned arrangements of three patch codes descriptor and locality‐sensitive hashing for feature extraction and matching. Experimental results on FLIR ADAS v2 and InfiRay datasets indicate superior performance of this method, achieving 78.6% mAP and 151.1 FPS in detection, and up to 80.8% moving object tracking accuracy, 78.6% identification F1 score, and 62.1% higher order tracking accuracy in multi‐object tracking.
... Shang et al. [11] added an occlusion prediction branch to Siamese-based network to detect the target occlusion degree and correct the target prediction position. Yuan et al. [12] introduced a spatial-temporal memory network and an aligned matching model aimed at alleviating the impact of distractors and occlusion. While there have been some advancements in addressing challenges related to target occlusion, these methods are insufficient to effectively tackle severe or complete occlusion challenges. ...
Article
Full-text available
To address the issue of tracking drift and failures in thermal infrared (TIR) tracking tasks caused by target occlusion, this study proposes an anti‐occlusion TIR target tracker named AODiMP‐TIR. This approach involves an anti‐occlusion strategy that relies on target occlusion status determination and trajectory prediction. This enables the prediction of the target's current position when it is identified as occluded, ensuring swift recapture upon reappearance. A criterion is introduced for occlusion status determination based on the classification response map of SuperDiMP. Additionally, a trajectory mapping module designed to decouple target motion from camera motion is presented, enhancing the precision of trajectory prediction. Comparative experiments with other state‐of‐the‐art trackers are conducted on the large‐scale high‐diversity thermal infrared object tracking benchmark (LSOTB‐TIR), LSOTB‐TIR100, and thermal infrared pedestrian tracking benchmark (PTB‐TIR) datasets. The results indicate that the AODiMP‐TIR performs well across all three datasets, particularly exhibiting outstanding performance in occlusion sequences. Furthermore, ablation study experiments confirm the effectiveness of the anti‐occlusion strategy, occlusion determination criterion and trajectory mapping module.
... M. P. Muresan et al., developed a Siamese network to implement real-time pedestrian detection and tracking from thermal images with the help of original edge-based descriptor and data association method [28]. Yuan et al., proposed an efficient thermal infrared target tracking method by utilizing a spatial-temporal memory network model and an alignment matching module to model and spatially correct information in the infrared target tracking scenario [29]. ...
Article
Large-scale deployed cameras in the automated container terminal (ACT) area help on-site staff better identify unexpected emergency events by monitoring port personnel trajectories. Rainy weather is a common problem which may significantly deteriorate trajectory extraction performance. To tackle the problem, the study proposes an ensemble framework to extract personnel trajectories from port-like surveillance videos under varied rainy weather scenarios. Firstly, the proposed framework learns fine-grained personnel features with the help of the object query and transformer encoder-decoder module from the input port-like image sequences, and thus obtains port personnel locations from the input low-visibility images. Secondly, the personnel positions are further associated in a frame-by-frame manner with the help of neighboring kinematic movement information and feature information. Finally, a memory mechanism is introduced in the proposed framework to suppress personnel trajectory discontinuity outliers. In that manner, we can obtain accurate yet consistent personnel trajectories, and each person is assigned a unique ID. We verified the proposed model performance on three port-like rainy videos involving interference from rain, rain streaks and fog. Experimental results show that the proposed port personnel trajectory extraction framework can obtain satisfactory performance, considering that the average multi-target accuracy (MOTA), the average value of judging the same target (IDF1), average recall rate (IDR) and average precision (IDP) were larger than 92%.
... Gu et al. [15] created the EANTrack tracker by proposing an efficient attention network, which achieves more robust tracking performance in complex scenarios. Yuan et al. [16] presented an Aligned Spatial-Temporal Memory network-based Tracking (ASTMT) method tailored for the thermal infrared target tracking. RPformer [17] serves as a straightforward and robust tracking framework, operating seamlessly without the need for prior knowledge and eliminating the hassle of hyperparameter adjustments. ...
Article
Full-text available
For multi-object tracking (MOT), jointly learning the detector and embedding model (JDE) is one of the mainstream solutions. However, an inherent problem in this architecture arises as the tasks of target detection and appearance feature extraction compete with each other. FairMOT, as a representative method, attempts to address this issue by employing two homogeneous branches, but it overlooks the essential difference between these two tasks. Upon the original network architecture, we propose an adaptive dual decoder structure. Our objective is to separately learn more focused features for the target detection and the appearance feature extraction. Furthermore, we introduce a noise-adaptive Kalman filter based on the width estimation. In the motion information matching stage, we enhance the affinity matrix of motion information by employing an expanded-width strategy, combined with a more accurate overlap measure. We verify the effectiveness of our proposed approach through extensive experiments using the MOT17 dataset.
... In their work, B. Ding et al. (2023) introduced an unsupervised transmission-aware dehazing module designed to enhance visibility and mitigate depth-dependent noise propagation within the dehazing process. D. Yuan et al. (2022a, 2022b) have devised an Aligned Spatial-Temporal Memory network-based Tracking approach specifically tailored for the task of tracking Thermal Infrared targets and an adaptive spatial-temporal context-aware model within the DCF-based tracking framework which aimed at enhancing tracking precision and minimising the impact of boundary effects. These studies showed that expanding small objects are effective enough while requiring fewer processing resources. ...
Article
Full-text available
The surge in demand for advanced operations in sports video analysis has underscored the crucial role of multiple object tracking. This study addresses the escalating need for efficient and accurate player and referee identification in sports video analysis. The challenge of identity switching among players, especially those with similar appearances, complicates multi-player tracking. Existing algorithms relying on manually labeled data face limitations, particularly with changes in jersey colors. This paper introduces an automated algorithm employing Intersection over Union (IoU) loss and Euclidean Distance (EUD), termed EIoU-Distance Loss, to track players and referees. The method prioritizes identity coherence, aiming to mitigate challenges associated with player and referee recognition. Comprising BackgroundSubtractionMOG2 for player and referee detection and IoU with EUD for connecting nodes across frames, the proposed approach enhances tracking performance, ensuring a clear distinction between different identities. This innovative method addresses critical issues in sports video analysis, offering a robust solution for tracking players and referees in dynamic game scenarios.
Article
Full-text available
Drone aerial imaging has become increasingly important across numerous fields as drone optical sensor technology continues to advance. One critical challenge in this domain is achieving both accurate and efficient multi-object tracking. Traditional deep learning methods often separate object identification from tracking, leading to increased complexity and potential performance degradation. Conventional approaches rely heavily on manual feature engineering and intricate algorithms, which can further limit efficiency. To overcome these limitations, we propose a novel Transformer-based end-to-end multi-object tracking framework. This innovative method leverages self-attention mechanisms to capture complex inter-object relationships, seamlessly integrating object detection and tracking into a unified process. By utilizing end-to-end training, our approach simplifies the tracking pipeline, leading to significant performance improvements. A key innovation in our system is the introduction of a trajectory detection label matching technique. This technique assigns labels based on a comprehensive assessment of object appearance, spatial characteristics, and Gaussian features, ensuring more precise and logical label assignments. Additionally, we incorporate cross-frame self-attention mechanisms to extract long-term object properties, providing robust information for stable and consistent tracking. We further enhance tracking performance through a newly developed self-characteristics module, which extracts semantic features from trajectory information across both current and previous frames. This module ensures that the long-term interaction modules maintain semantic consistency, allowing for more accurate and continuous tracking over time. The refined data and stored trajectories are then used as input for subsequent frame processing, creating a feedback loop that sustains tracking accuracy. 
Extensive experiments conducted on the VisDrone and UAVDT datasets demonstrate the superior performance of our approach in drone-based multi-object tracking.
Article
Multiple object tracking (MOT) has emerged as a crucial component of the rapidly developing computer vision. However, existing multi-object tracking methods often overlook the relationship between features and motion, hindering the ability to strike a performance balance between coupled motion and complex scenes. In this work, we propose a novel end-to-end multi-object tracking method that integrates motion and feature information. To achieve this, we introduce a motion prior generator that transforms motion information into attention masks. Additionally, we leverage prior-posterior fusion multi-head attention to combine the motion-derived priors and attention-based posteriors. Our proposed method is extensively evaluated on MOT17 and DanceTrack datasets through comprehensive experiments and ablation studies, demonstrating state-of-the-art performance in the feature-based method with reasonable speed.
Article
Full-text available
In response to the issue of infrared ground target tracking failure caused by background occlusion, a novel anti-occlusion tracker for infrared ground targets is proposed based on an enhanced trajectory prediction network. Initially, an occlusion assessment criterion is proposed to accurately assess the occlusion status of infrared ground targets. Subsequently, enhancements are made to the BiTrap trajectory prediction network. On one hand, velocity information is introduced through a Siamese network structure, adopting a unidirectional prediction method, building the SiamTrap trajectory prediction network that improves trajectory prediction accuracy. On the other hand, refining both the training and application methods enables more precise predictions of ground target trajectories. For short-term occlusion, the SiamTrap network uses temporal context information to predict the occluded position of the target. For long-term occlusion, a search expansion strategy is introduced to address prediction errors accumulated due to a lack of real target information. Finally, a "second verification" criterion is introduced, realizing accurate target capture and normal tracking. Comparative tests are conducted on infrared target tracking sequences with occlusion. Compared to baseline trackers, the proposed algorithm shows a 5.2% improvement in success rate and a 5.9% improvement in accuracy under the OPE evaluation metric. This indicates the robustness of the proposed algorithm in handling occlusion scenarios for infrared ground targets.
Article
In response to the complex situations encountered in real traffic scenarios, such as sudden acceleration or deceleration of vehicles, similar-looking vehicles, and occlusions caused by other vehicles or obstacles, which severely affect tracking accuracy, a long-term vehicle tracking algorithm based on correlation filter (CF) is proposed and optimized using the swarm intelligence (SI) tracking framework. The fast-discriminative scale space tracking (FDSST) of CF is adopted as the core tracker. Multiple features are adaptively weighted and fused. The optimal feature template is dynamically updated based on confidence. An in-depth analysis of the intrinsic relationship between SI and object tracking is conducted, leading to the design of an SI tracking framework. Within this framework, the carnivorous plant algorithm (CPA) is employed as an optimization method, and its functionality is further enhanced through a phototaxis strategy and a population partition mechanism. During FDSST tracking, when tracking uncertainty surpasses a set threshold, the SI tracking framework is integrated to rectify tracking outcomes, and a short-term memory module is designed to predict the object position if the object disappears for dozens of frames. Experimental results on benchmark datasets (UAV20L and LaSOT) demonstrate a success rate of 67.45%, a precision rate of 70.50%, and a speed of 22.22 FPS. Comparative analyses with other notable tracking algorithms confirm the exceptional accuracy and robustness of the proposed approach, which can effectively address diverse challenges in vehicle tracking scenarios by achieving highly reliable tracking outcomes and enhancing long-term vehicle tracking capability.
Article
The Infrared Object Tracking (IOT) task aims to locate objects in infrared sequences. Since color and texture information is unavailable in infrared modality, most existing infrared trackers merely rely on capturing spatial contexts from the image to enhance feature representation, where other complementary information is rarely deployed. To fill this gap, we in this paper propose a novel Asymmetric Deformable Spatio-Temporal Framework (ADSF) to fully exploit collaborative shape and temporal clues in terms of the objects. Firstly, an asymmetric deformable cross-attention module is designed to extract shape information, which attends to the deformable correlations between distinct frames in an asymmetric manner. Secondly, a spatio-temporal tracking framework is coined to learn the temporal variance trend of the object during the training process and store the template information closest to the tracking frame when testing. Comprehensive experiments demonstrate that ADSF outperforms state-of-the-art methods on three public datasets. Extensive ablation experiments further confirm the effectiveness of each component in ADSF. Furthermore, we conduct generalization validation to demonstrate that the proposed method also achieves performance gains in RGB-based tracking scenarios.
Article
Many RGBT tracking studies primarily focus on modal fusion design, while overlooking the effective handling of target appearance changes. While some approaches have introduced historical frames or fuse and replace initial templates to incorporate temporal information, they risk disrupting the original target appearance and accumulating errors over time. To alleviate these limitations, we propose a novel Transformer RGBT tracking approach, which mixes spatio-temporal multimodal tokens from the static multimodal templates and multimodal search regions in Transformer to handle target appearance changes, for robust RGBT tracking. We introduce independent dynamic template tokens to interact with the search region, embedding temporal information to address appearance changes, while also retaining the involvement of the initial static template tokens in the joint feature extraction process. This preserves the original reliable target appearance information and prevents the deviations from the target appearance caused by traditional temporal updates. We also use attention mechanisms to enhance the target features of multimodal template tokens by incorporating supplementary modal cues, and make the multimodal search region tokens interact with multimodal dynamic template tokens via attention mechanisms, which facilitates the conveyance of multimodal-enhanced target change information. Our module is inserted into the transformer backbone network and inherits joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three RGBT benchmark datasets show that the proposed approach maintains competitive performance compared to other state-of-the-art tracking algorithms while running at 39.1 FPS. The project-related materials are available at: https://github.com/yinghaidada/STMT .
Article
Domain Adaptive Object Detection (DAOD) alleviates the challenge of labor-intensive annotations by transferring semantic information from a labeled source domain to an unlabeled target domain. However, DAOD suffers from biased discrimination and negative transfer in the thermal domain due to the inherent heterogeneity between RGB and thermal images. To address these issues, we propose the Unbiased Granularity Alignment (UGA) framework to facilitate unified alignment for RGB-Thermal DAOD. Specifically, we devise a Channel Self-encoding Adaptation (CSA) module to mitigate biased discrimination from the discriminative enhancement perspective. CSA aligns the intra-domain channel subspace for inter-domain channel harmonizing. Upon revisiting instance alignment, we uncovered inaccurate proposals and unstable positive samples. Therefore, we propose the Relative Relationship Adaptation (RRA) module to mitigate negative transfer. RRA ensures inter-domain semantic consistency through sparse instance alignment. Extensive experiments are conducted on visible-to-thermal and visible-to-visible benchmarks to validate the effectiveness, and our UGA framework outperforms the state of the art by a remarkable margin. The code of our UGA is available at https://github.com/zyfone/UGA.
Article
Adversarial attack on convolutional neural networks (CNNs) is a technique for deceiving models with perturbations, which provides a way to evaluate the robustness of models. Adversarial attack research has primarily focused on single images. However, videos are more widely used. Existing attack methods generally require iterative optimization on different video sequences, which is highly time-consuming. In this paper, we propose a simple and effective approach for attacking video sequences, called Ghost Adversarial Attack (GAA), to greatly degrade the tracking performance of state-of-the-art (SOTA) CNN-based trackers with minimal ghost perturbations. Considering the timeliness of the attack, we generate the ghost adversarial example only once with a novel ghost-generator and use a computationally cheaper attack in subsequent frames. The ghost-generator is used to extract the target region and generate the indistinguishable ghost noise of the target, hence misleading the tracker. Moreover, we propose a novel combined loss that includes the content loss, the ghost loss, and the transferred-fixed loss, which are used in different parts of the proposed method. The combined loss helps to generate similar adversarial examples with slight noise, like a ghost of the real target. Experiments were conducted on six benchmark datasets (UAV123, UAV20L, NFS, LaSOT, OTB50, and OTB100). The experimental results indicate that the ghost adversarial examples produced by GAA are stealthy while remaining effective in fooling SOTA trackers with high transferability. GAA can reduce the tracking success rate by an average of 66.6% and the precision rate by an average of 68.3%.
Article
In the realm of unmanned aerial vehicle (UAV) tracking, Siamese-based approaches have gained traction due to their optimal balance between efficiency and precision. However, UAV scenarios often present challenges such as insufficient sampling resolution, fast motion and small objects with limited feature information. As a result, temporal context in UAV tracking tasks plays a pivotal role in target location, overshadowing the target’s precise features. In this paper, we introduce MT-Track, a streamlined and efficient multi-step temporal modeling framework designed to harness the temporal context from historical frames for enhanced UAV tracking. This temporal integration occurs in two steps: correlation map generation and correlation map refinement. Specifically, we unveil a unique temporal correlation module that dynamically assesses the interplay between the template and search region features. This module leverages temporal information to refresh the template feature, yielding a more precise correlation map. Subsequently, we propose a mutual transformer module to refine the correlation maps of historical and current frames by modeling the temporal knowledge in the tracking sequence. This method significantly trims computational demands compared to the raw transformer. The compact yet potent nature of our tracking framework ensures commendable tracking outcomes, particularly in extended tracking scenarios. Comprehensive tests across four renowned UAV benchmarks substantiate the superior efficacy of our approach, delivering real-time performance at 84.7 FPS on a single GPU. Real-world test on the NVIDIA AGX hardware platform achieves a speed exceeding 30 FPS, validating the practicality of our method.
Article
Objectives: Respiratory motion-induced displacement of internal organs poses a significant challenge in image-guided radiation therapy, particularly affecting liver landmark tracking accuracy. Methods: To address this concern, we propose a self-supervised method for robust landmark tracking in long liver ultrasound sequences. Our approach leverages a Siamese-based context-aware correlation filter network, trained using the consistency loss between forward tracking and back verification. By effectively utilizing both labeled and unlabeled liver ultrasound images, our model, Siam-CCF, mitigates the impact of speckle noise and artifacts on ultrasonic image tracking by means of a context-aware correlation filter. Additionally, a fusion strategy for template patch features helps the tracker obtain rich appearance information around the point-landmark. Results: Siam-CCF achieves a mean tracking error of 0.79 ± 0.83 mm at a frame rate of 118.6 fps, exhibiting a superior speed-accuracy trade-off on the public MICCAI 2015 Challenge on Liver Ultrasound Tracking (CLUST2015) 2D dataset. This performance ranked 5th on the CLUST2015 2D point-landmark tracking task. Conclusions: Extensive experiments validate the effectiveness of our proposed approach, establishing it as one of the top-performing techniques on the CLUST2015 online leaderboard at the time of this submission.
Article
Satellite video object tracking is a critical task in satellite video analysis, focused on locating targets within the imagery. Current research in satellite video single-object tracking primarily centers around traditional correlation filter-based methods and deep learning techniques based on Siamese networks. However, these methods encounter challenges, such as template inconsistency due to target appearance variations and imprecise object tracking caused by target-background similarities. In this paper, we introduce a novel approach called SiamOBR (Satellite Video Object Tracking Siamese Network with Online Background Discriminative Learning and Bounding Box Optimization) to address these issues. Building upon the baseline twin network framework, our model leverages an online background discriminative learning module to initialize using input satellite video frames. This module improves the network’s ability to distinguish the target by dynamically updating the template, effectively addressing target deformations resulting from motion. To tackle inaccuracies in bounding box estimation, we introduce a bounding box optimization module that refines tracking results obtained from the baseline tracker, further enhancing tracking accuracy. During network training, we employ probability regression in place of confidence regression and utilize a cross-entropy loss function to rectify labeling errors for small targets where the labeled box’s center point may not align perfectly with the object. We conducted quantitative experiments on real-world satellite video datasets, demonstrating that the SiamOBR framework outperforms existing satellite video object tracking models, showcasing its effectiveness in this domain.
Article
This paper considers the problem of detecting and tracking objects in a sequence of images. The problem is formulated in a filtering framework, using the output of object-detection algorithms as measurements. An extension to the filtering formulation is proposed that incorporates class information from the previous frame to robustify the classification. Further, the properties of the object-detection algorithm are exploited to quantify the uncertainty of the bounding box detection in each frame. The complete filtering method is evaluated on camera trap images of the four large Swedish carnivores: bear, lynx, wolf, and wolverine. The experiments show that the class tracking formulation leads to a more robust classification.
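The idea of carrying class information from the previous frame into the current one can be illustrated with a discrete Bayes filter over class labels. The sketch below is a minimal illustration under stated assumptions, not the formulation used in the paper: the function name `update_class_belief`, the `stay_prob` transition parameter, and the per-frame detector scores are all hypothetical.

```python
CLASSES = ["bear", "lynx", "wolf", "wolverine"]

def update_class_belief(prior, likelihood, stay_prob=0.9):
    """One predict/update step of a discrete Bayes filter over class labels.

    prior      -- class probabilities from the previous frame (sums to 1)
    likelihood -- per-class detector scores for the current frame
    stay_prob  -- assumed probability that the tracked object keeps its class
    """
    n = len(prior)
    # Predict: keep the class with probability stay_prob, otherwise
    # switch uniformly to one of the other classes.
    switch = (1.0 - stay_prob) / (n - 1)
    predicted = [stay_prob * p + switch * (1.0 - p) for p in prior]
    # Update: weight the prediction by the detector scores and renormalize.
    unnormalized = [p * l for p, l in zip(predicted, likelihood)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Hypothetical detector scores for three frames; the second frame is a
# noisy detection that favors "lynx" when taken on its own.
frames = [
    [0.10, 0.20, 0.60, 0.10],
    [0.10, 0.50, 0.30, 0.10],
    [0.05, 0.15, 0.70, 0.10],
]

belief = [1.0 / len(CLASSES)] * len(CLASSES)
history = []
for scores in frames:
    belief = update_class_belief(belief, scores)
    history.append(CLASSES[belief.index(max(belief))])
```

In this toy run the filtered label stays on "wolf" in the second frame even though the raw detector scores favor "lynx", which is the kind of robustness the abstract attributes to using class information from the previous frame.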
Article
Full-text available
Small moving objects at a far distance always occupy only one or a few pixels in an image and exhibit extremely limited visual features, which poses great challenges to motion detection. Highly evolved visual systems endow flying insects with a remarkable ability to pursue tiny mates and prey, providing a good template for developing image processing methods for small target motion detection. The insects’ excellent sensitivity to small moving objects is believed to come from a class of specific neurons called small target motion detectors (STMDs). However, existing STMD-based methods often experience performance degradation when coping with complex natural scenes. In this paper, we propose a bio-inspired visual system with a spatio-temporal feedback mechanism (called Spatio-Temporal Feedback STMD) to suppress false positive background movement while enhancing system responses to small targets. Specifically, the proposed visual system is composed of two complementary subnetworks and a feedback loop. The first subnetwork is designed to extract spatial and temporal movement patterns of cluttered background by neuronal ensemble coding. The second subnetwork is developed to capture small target motion information, where its output and the signal from the first subnetwork are integrated via the feedback loop to filter out background false positives in a recurrent manner. Experimental results demonstrate that the proposed spatio-temporal feedback visual system is more competitive than existing methods in discriminating small moving targets from complex natural environments.
Article
Full-text available
Visual perception plays an important role in the industrial information field, especially in robotic grasping applications. In order to detect the object to be grasped quickly and accurately, salient object detection (SOD) is employed for this task. Although the existing SOD methods have achieved impressive performance, they still have some limitations in the complex interference environments of practical applications. To better deal with such environments, a novel triple-modal image fusion strategy is proposed to implement SOD for robotic visual perception, namely visible-depth-thermal (VDT) SOD. Meanwhile, we build an image acquisition system for variable lighting scenes and construct a novel benchmark dataset for VDT SOD (VDT-2048 dataset). Images from multiple modalities are introduced to assist each other in highlighting the salient regions, but interference is inevitably introduced as well. In order to achieve effective cross-modal feature fusion while suppressing information interference, a hierarchical weighted suppress interference (HWSI) method is proposed. The comprehensive experimental results prove that our method achieves better performance than the state-of-the-art methods.
Article
Full-text available
When dealing with complex thermal infrared (TIR) tracking scenarios, a single category of feature is not sufficient to portray the appearance of the target, which drastically affects the accuracy of the TIR target tracking method. In order to address these problems, we propose an adaptively multi-feature fusion model (AMFT) for the TIR tracking task. Specifically, our AMFT tracking method adaptively integrates hand-crafted features and deep convolutional neural network (CNN) features. In order to accurately locate the target position, it takes advantage of the complementarity between different features. Additionally, the model is updated using a simple but effective model update strategy to adapt to changes in the target during tracking. We have shown through ablation studies that the adaptively multi-feature fusion model in our AMFT tracking method is very effective. Our AMFT tracker performs favorably on PTB-TIR and LSOTB-TIR benchmarks compared with state-of-the-art trackers.
Article
Full-text available
The feature models used by existing Thermal InfraRed (TIR) tracking methods are usually learned from RGB images due to the lack of a large-scale TIR image training dataset. However, these feature models are less effective in representing TIR objects, and they struggle to distinguish distractors because they do not contain fine-grained discriminative information. To this end, we propose a dual-level feature model containing the TIR-specific discriminative feature and fine-grained correlation feature for robust TIR object tracking. Specifically, to distinguish inter-class TIR objects, we first design an auxiliary multi-classification network to learn the TIR-specific discriminative feature. Then, to recognize intra-class TIR objects, we propose a fine-grained aware module to learn the fine-grained correlation feature. These two kinds of features complement each other and represent TIR objects at the inter-class and intra-class levels respectively. These two feature models are constructed using a multi-task matching framework and are jointly optimized on the TIR object tracking task. In addition, we develop a large-scale TIR image dataset to train the network for learning TIR-specific feature patterns. To the best of our knowledge, this is the largest TIR tracking training dataset with the richest object classes and scenarios. To verify the effectiveness of the proposed dual-level feature model, we propose an offline TIR tracker (MMNet) and an online TIR tracker (ECO-MM) based on the feature model and evaluate them on three TIR tracking benchmarks. Extensive experimental results on these benchmarks demonstrate that the proposed algorithms perform favorably against the state-of-the-art methods.
Article
Full-text available
RGB salient object detection (SOD) has made great progress. However, the performance of single-modal salient object detection decreases significantly in challenging scenes such as low light or darkness. To deal with these challenges, the thermal infrared (T) image is introduced into salient object detection. This fused modality is called RGB-T salient object detection. To achieve deep mining of the unique characteristics of each modality and the full integration of cross-modality information, a novel Cross-Guided Fusion Network (CGFNet) for RGB-T salient object detection is proposed. Specifically, a Cross-Scale Alternate Guiding Fusion (CSAGF) module is proposed to mine the high-level semantic information and provide global context support. Subsequently, we design a Guidance Fusion Module (GFM) to achieve sufficient cross-modality fusion by using one modality as the main guidance and the other as auxiliary. Finally, the Cross-Guided Fusion Module (CGFM) is presented and serves as the main decoding block. Each decoding block consists of two parts, with each of the two modalities in turn providing the main guidance, i.e., cross-shared Cross-Level Enhancement (CLE) and Global Auxiliary Enhancement (GAE). The main difference between the two parts is that the GFM uses different modalities as the main guide. The comprehensive experimental results prove that our method achieves better performance than the state-of-the-art salient detection methods. The source code has been released at: https://github.com/wangjie0825/CGFNet.git.
Article
Full-text available
The training of a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named: self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning a feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and utilize the proposed multi-cycle consistency loss to learn a feature extraction network. Furthermore, we propose a similarity dropout strategy to enable some low-quality training sample pairs to be dropped and also adopt a cycle trajectory consistency loss in each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and utilize a Siamese correlation tracking framework to locate the target using forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance compared with state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.
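The forward-backward consistency that motivates this kind of self-supervision can be shown geometrically: track a point forward through a few frames, track it back, and measure how far it lands from where it started. The sketch below is only an illustration of that idea under simplifying assumptions; it works on 2D positions rather than on feature or response maps, it is not the multi-cycle loss from the paper, and the function name and the example shifts are hypothetical.

```python
import math

def cycle_consistency_error(start, forward_shifts, backward_shifts):
    """Distance between the start position and the position reached after
    tracking forward and then backward.

    start           -- (x, y) position in the first frame
    forward_shifts  -- per-frame (dx, dy) estimates while tracking forward
    backward_shifts -- per-frame (dx, dy) estimates while tracking back
    """
    x, y = start
    for dx, dy in forward_shifts:
        x, y = x + dx, y + dy
    for dx, dy in backward_shifts:
        x, y = x + dx, y + dy
    return math.hypot(x - start[0], y - start[1])

# A consistent tracker: the backward shifts exactly undo the forward ones.
consistent = cycle_consistency_error(
    (10.0, 5.0), [(2.0, 1.0), (3.0, -1.0)], [(-3.0, 1.0), (-2.0, -1.0)]
)

# A drifting tracker: the last backward step falls one pixel short in x.
drifting = cycle_consistency_error(
    (10.0, 5.0), [(2.0, 1.0), (3.0, -1.0)], [(-3.0, 1.0), (-1.0, -1.0)]
)
```

A training signal built on this principle penalizes the second case and not the first, so no manual bounding-box labels are needed to tell a good trajectory from a drifting one.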
Article
Full-text available
Real-time object tracking is demanded in versatile embedded computer vision applications. To this end, a low-cost high-speed VLSI system is proposed for object tracking, based on robust and computationally simple unified textural and dynamic compressive sensing features and simple elliptic matching with fast online template updating capability. The system introduces a memory-centric architectural paradigm, multiple-level pipelines and parallel processing circuits to achieve high frame rate while consuming as few hardware resources as possible. We implemented an FPGA prototype of the proposed VLSI tracking system. Under a 100 MHz clock frequency, the prototype achieves over 600 frame/s processing speed under 320 × 240 image resolution and obtains robust tracking results.
Conference Paper
Full-text available
In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over 730K bounding boxes in total. To the best of our knowledge, LSOTB-TIR is the largest and most diverse TIR object tracking benchmark to date. To evaluate a tracker on different attributes, we define 4 scenario attributes and 12 challenge attributes in the evaluation dataset. By releasing LSOTB-TIR, we encourage the community to develop deep learning based TIR trackers and evaluate them fairly and comprehensively. We evaluate and analyze more than 30 trackers on LSOTB-TIR to provide a series of baselines, and the results show that deep trackers achieve promising performance. Furthermore, we retrain several representative deep trackers on LSOTB-TIR, and their results demonstrate that the proposed training dataset significantly improves the performance of deep TIR trackers. Codes and dataset are available at https://github.com/QiaoLiuHit/LSOTB-TIR.
Article
Full-text available
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is only trained on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One of them focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative entropy based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptively learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and the largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits the training for TIR object tracking but also can be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
Conference Paper
Full-text available
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor taking fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing the TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing the TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects in the levels of inter-class and intra-class respectively. These two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to train the network for adapting the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against the state-of-the-art methods. Codes and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Article
Full-text available
Hyperparameters are numerical pre-sets whose values are assigned prior to the commencement of a learning process. Selecting appropriate hyperparameters is often critical for achieving satisfactory performance in many vision problems such as deep learning-based visual object tracking. Yet it is difficult to determine their optimal values, in particular, adaptive ones for each specific video input. Most hyperparameter optimization algorithms depend on searching a generic range and they are imposed blindly on all sequences. In this paper, we propose a novel dynamical hyperparameter optimization method that adaptively optimizes hyperparameters for a given sequence using an action-prediction network leveraged on continuous deep Q-learning. Since the observation space for visual object tracking is significantly more complex than those in traditional control problems, existing continuous deep Q-learning algorithms cannot be directly applied. To overcome this challenge, we introduce an efficient heuristic strategy to handle high dimensional state space and meanwhile accelerate the convergence behavior. The proposed algorithm is applied to improve two representative trackers, a Siamese-based one and a correlation-filter-based one, to evaluate its generality. Their superior performances on several popular benchmarks are clearly demonstrated.
Article
Full-text available
Thermal infrared (TIR) target tracking is a challenging task as it entails learning an effective model to identify the target under poor target visibility and cluttered backgrounds. Sparse representation, as a typical appearance modeling approach, has been successfully exploited in TIR target tracking. However, the discriminative information of the target and its surrounding background is usually neglected in the sparse coding process. To address this issue, we propose a mask sparse representation (MaskSR) model, which combines sparse coding with high-level semantic features for TIR target tracking. We first obtain the pixel-wise labeling results of the target and its surrounding background in the last frame, and then use these results to train target-specific deep networks in a supervised manner. According to the output features of the deep networks, the high-level pixel-wise discriminative map of the target area is obtained. We introduce the binarized discriminative map as a mask template to the sparse representation and develop a novel algorithm to collaboratively represent the reliable target part and unreliable target part partitioned with the mask template, which explicitly indicates different discriminant capabilities by labels 1 and 0. The proposed MaskSR model controls the superiority of the reliable target part in the reconstruction process via a weighted scheme. We solve this multi-parameter constrained problem with a customized alternating direction method of multipliers (ADMM). This model is applied to achieve TIR target tracking in the particle filter framework. To improve the sampling effectiveness and decrease the computation cost at the same time, a discriminative particle selection strategy based on the kernelized correlation filter is proposed to replace the previous random sampling for searching useful candidates. Our proposed tracking method was tested on the VOT-TIR2016 benchmark. The experimental results show that the proposed method is significantly superior to various state-of-the-art methods in TIR target tracking.
Article
Full-text available
Thermal infrared (TIR) pedestrian tracking is one of the important components among the numerous applications of computer vision, and it has a major advantage: it can track pedestrians in total darkness. The ability to evaluate a TIR pedestrian tracker fairly on a benchmark dataset is significant for the development of this field. However, no such benchmark dataset exists. In this paper, we develop a TIR pedestrian tracking dataset for TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations. Each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark dataset using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. In addition, in order to gain more insight into the TIR pedestrian tracker, we divided its functions into three components: feature extractor, motion model, and observation model. Then, we conducted three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
Chapter
Full-text available
Object tracking is still a critical and challenging problem with many applications in computer vision. To meet this challenge, more and more researchers are applying deep learning to obtain powerful features for better tracking accuracy. In this paper, a novel triplet loss is proposed to extract expressive deep features for object tracking by adding it into the Siamese network framework in place of the pairwise loss for training. Without adding any inputs, our approach is able to utilize more elements for training to achieve more powerful features via the combination of original samples. Furthermore, we propose a theoretical analysis by combining comparison of gradients and back-propagation, to prove the effectiveness of our method. In experiments, we apply the proposed triplet loss to three real-time trackers based on the Siamese network. The results on several popular tracking benchmarks show that our variants operate at almost the same frame rate as the baseline trackers and achieve superior tracking performance, as well as accuracy comparable to recent state-of-the-art real-time trackers.
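To make the "more elements via the combination of original samples" point concrete, the sketch below pairs every positive score with every negative score, so M positives and N negatives yield M × N training terms instead of the M + N terms a pairwise loss would see. The exact loss in the paper is not given in this abstract; the softmax-style form used here, `log(1 + exp(v_neg - v_pos))`, is an assumption chosen for illustration, and the function name and scores are hypothetical.

```python
import math

def triplet_loss(pos_scores, neg_scores):
    """Average loss over all (positive, negative) score combinations.

    Each term is -log(exp(vp) / (exp(vp) + exp(vn))), written in the
    equivalent form log(1 + exp(vn - vp)).
    """
    total = 0.0
    for vp in pos_scores:
        for vn in neg_scores:
            total += math.log1p(math.exp(vn - vp))
    return total / (len(pos_scores) * len(neg_scores))

# Hypothetical similarity scores from a Siamese response map.
well_separated = triplet_loss([2.0, 1.5], [-1.0, -2.0, -0.5])
poorly_separated = triplet_loss([0.2, 0.1], [0.0, 0.3, 0.1])
```

The loss shrinks as positive scores move above negative ones, and equals log 2 when a positive and a negative score are identical.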
Article
Full-text available
Discriminative correlation filters (DCFs) have been shown to perform superiorly in visual tracking. They only need a small set of training samples from the initial frame to generate an appearance model. However, existing DCFs learn the filters separately from feature extraction, and update these filters using a moving average operation with an empirical weight. These DCF trackers hardly benefit from the end-to-end training. In this paper, we propose the CREST algorithm to reformulate DCFs as a one-layer convolutional neural network. Our method integrates feature extraction, response map generation as well as model update into the neural networks for an end-to-end training. To reduce model degradation during online update, we apply residual learning to take appearance changes into account. Extensive experiments on the benchmark datasets demonstrate that our CREST tracker performs favorably against state-of-the-art trackers.
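The conventional update that the abstract contrasts with end-to-end training, a moving average with an empirical weight, can be written in a few lines. This is a minimal sketch of that baseline update rule only, not of the CREST algorithm; the function name, the flat-list filter representation, and the `learning_rate` value are assumptions made for illustration.

```python
def update_filter(old_filter, new_filter, learning_rate=0.01):
    """Blend the filter learned on the current frame into the running model.

    learning_rate is the empirical weight: larger values adapt faster to
    appearance changes but also absorb errors (drift) more quickly.
    """
    return [
        (1.0 - learning_rate) * old + learning_rate * new
        for old, new in zip(old_filter, new_filter)
    ]

# Toy two-coefficient filter blended with a weight of 0.25.
model = update_filter([1.0, 0.0], [0.0, 1.0], learning_rate=0.25)
```

Because the weight is fixed by hand rather than learned, it cannot react to how reliable each new frame is, which is one motivation the abstract gives for folding the update into the network.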
Article
Full-text available
Unlike the visual object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as in rescue and video surveillance at night. However, there are few studies in this field mainly because thermal infrared images have several unwanted attributes, which make it difficult to obtain the discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer the pre-trained convolutional neural networks based on visible images to thermal infrared tracking. We observe that the features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while the features from the convolution layers are. Besides, the features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). Firstly, we use pre-trained convolutional neural networks to extract the features of the multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give the response maps of the target’s location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
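The step of combining response maps from several weak trackers can be sketched as follows. The abstract does not specify the ensemble rule, so this example substitutes a plain weighted sum of peak-normalized maps; the function names, the uniform default weights, and the 3 × 3 toy maps are all assumptions for illustration.

```python
def fuse_response_maps(maps, weights=None):
    """Weighted sum of response maps after scaling each map to a peak of 1."""
    if weights is None:
        weights = [1.0 / len(maps)] * len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for response, weight in zip(maps, weights):
        peak = max(max(row) for row in response)
        scale = weight / peak if peak > 0 else 0.0
        for r in range(rows):
            for c in range(cols):
                fused[r][c] += scale * response[r][c]
    return fused

def peak_location(response):
    """Return the (row, col) of the largest response value."""
    _, row, col = max(
        (value, r, c)
        for r, values in enumerate(response)
        for c, value in enumerate(values)
    )
    return row, col

# Toy response maps from three hypothetical layers; the second one is
# distracted by a false peak in the top-right corner.
map_a = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.3], [0.1, 0.2, 0.1]]
map_b = [[0.0, 0.1, 0.8], [0.1, 0.6, 0.2], [0.0, 0.1, 0.1]]
map_c = [[0.1, 0.1, 0.2], [0.2, 1.0, 0.2], [0.1, 0.3, 0.1]]

fused = fuse_response_maps([map_a, map_b, map_c])
target = peak_location(fused)
```

Here the fused peak stays at the center even though one map on its own would point to the corner, which is the benefit of not relying on a single convolution layer.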
Article
Thermal InfraRed (TIR) target trackers are easily disturbed by similar objects and are susceptible to target occlusion. To solve these problems, we propose a structural target-aware model (STAMT) for thermal infrared target tracking tasks. Specifically, the proposed STAMT tracker learns a target-aware model that pays more attention to the target area, so that the target can be accurately distinguished from similar objects. In addition, for the case in which the target is partially occluded during tracking, a structural weight model is proposed to locate the target through its unoccluded, reliable parts. Ablation studies show the effectiveness of each component in the proposed tracker. Without bells and whistles, the experimental results demonstrate that our STAMT tracker performs favorably against state-of-the-art trackers on the PTB-TIR and LSOTB-TIR datasets.
Article
In recent years, Siamese network based trackers have significantly advanced the state-of-the-art in real-time tracking. Despite their success, Siamese trackers tend to suffer from high memory costs, which restrict their applicability to mobile devices with tight memory budgets. To address this issue, we propose a distilled Siamese tracking framework to learn small, fast and accurate trackers (students), which capture critical knowledge from large Siamese trackers (teachers) by a teacher-students knowledge distillation model. This model is intuitively inspired by the one teacher vs. multiple students learning method typically employed in schools. In particular, our model contains a single teacher-student distillation module and a student-student knowledge sharing mechanism. The former is designed using a tracking-specific distillation strategy to transfer knowledge from a teacher to students. The latter is utilized for mutual learning between students to enable in-depth knowledge understanding. Extensive empirical evaluations on several popular Siamese trackers demonstrate the generality and effectiveness of our framework. Moreover, the results on five tracking benchmarks show that the proposed distilled trackers achieve compression rates of up to 18× and frame-rates of 265 FPS, while obtaining comparable tracking accuracy compared to base models.
Article
Tracking in the unmanned aerial vehicle (UAV) scenarios is one of the main components of target tracking tasks. Different from the target tracking task in the general scenarios, the target tracking task in the UAV scenarios is very challenging because of factors such as small scale and aerial view. Although the DCFs-based tracker has achieved good results in tracking tasks in general scenarios, the boundary effect caused by the dense sampling method will reduce the tracking accuracy, especially in UAV tracking scenarios. In this work, we propose learning an adaptive spatial-temporal context-aware (ASTCA) model in the DCFs-based tracking framework to improve the tracking accuracy and reduce the influence of boundary effect, thereby enabling our tracker to more appropriately handle UAV tracking tasks. Specifically, our ASTCA model can learn a spatial-temporal context weight, which can precisely distinguish the target and background in the UAV tracking scenarios. Besides, considering the small target scale and the aerial view in UAV tracking scenarios, our ASTCA model incorporates spatial context information within the DCFs-based tracker, which could effectively alleviate background interference. Extensive experiments demonstrate that our ASTCA method performs favorably against state-of-the-art tracking methods on some standard UAV datasets.
Article
Temporal and spatial contexts, characterizing target appearance variations and target-background differences, respectively, are crucial for improving the online adaptive ability and instance-level discriminative ability of object tracking. However, most existing trackers focus on either the temporal context or the spatial context during tracking and have not exploited these contexts simultaneously and effectively. In this paper, we propose a Spatial-TEmporal Memory (STEM) network to exploit these contexts jointly for object tracking. Specifically, we develop a key-value structured memory model equipped with a key-value index-based memory reading mechanism to model the spatial and temporal contexts simultaneously. To update the memory with new target states and ensure the diversity of the memory, we introduce a similarity-aware memory update scheme. In addition, we construct an entropy-guided ensemble strategy to fuse the prediction models based on these two contexts, such that these two contexts can be exploited to estimate the target state jointly. Extensive experimental results on eight challenging datasets, including OTB2015, TC128, UAV123, VOT2018, LaSOT, TrackingNet, GOT-10k, and OxUvA, demonstrate that the proposed method performs favorably against state-of-the-art trackers.
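A key-value memory read of the kind mentioned above can be sketched generically: the query is compared with every stored key, the similarities are turned into weights, and the weighted sum of the stored values is returned. This is a standard attention-style read written under stated assumptions (dot-product similarity, softmax weighting); it is not claimed to be the exact STEM reading mechanism, and all names are hypothetical.

```python
import math


def read_memory(query, keys, values):
    """Soft read from a key-value memory.

    Assumptions: dot-product similarity between the query and each key,
    softmax normalisation, and a weighted sum over the stored values.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(values[0])
    readout = [sum(w * v[d] for w, v in zip(weights, values))
               for d in range(dim)]
    return weights, readout


# Two memory slots with made-up keys and values.
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, readout = read_memory([1.0, 0.0], keys, values)
print(weights, readout)
```

The query matches the first key far better than the second, so the readout is dominated by the first stored value.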
Article
This brief focuses on the problem of fixed-time adaptive trajectory tracking control for a quadrotor unmanned aerial vehicle (QUAV) subject to error constraints. By virtue of the fixed-time command filter, the "explosion of complexity" (EOC) phenomenon in the conventional backstepping method is eliminated, and the effect of the filtered error is removed via a new fractional power error compensation mechanism. By combining prescribed performance control and the backstepping design method with the command filter technique, a fixed-time adaptive control strategy is established. It is proved that all signals of the closed-loop system are fixed-time bounded, and that the position and attitude tracking errors of the QUAV converge to an arbitrarily small region around the origin within the predefined performance bounds in a fixed time. Finally, simulation results are given to show the effectiveness of the presented fixed-time control algorithm.
Article
This article intends to address an adaptive asymptotic tracking control with prescribed performance function for a class of nonaffine systems with unknown disturbances. First, the nonaffine system is transformed into an affine system by using a set of alternative state variables. Subsequently, a prescribed performance function with predefined convergence rate, maximum overshoot and steady-state error is introduced. To achieve the asymptotic tracking control performance, the robust integral of the sign of the error (RISE) feedback term is utilized in the control design to reject the unknown external disturbances and NN approximation errors. Finally, an adaptive controller is presented so that the asymptotic tracking performance with guaranteed prescribed performance is achieved. Comparative experiments are provided to show the effectiveness of the proposed control scheme.
Article
A low power real-time visual object tracking (VOT) processor using the Siamese network (SiamNet) is proposed for mobile devices. Two key features enable real-time VOT with low power consumption on mobile devices. First, correlation-based spatial early stopping (CSES) is proposed to reduce the computational workload. CSES reduces ~56.8% of the overall computation of the SiamNet by gradually eliminating the background. Second, the dual mode reuse core (DMRC) is proposed for supporting both the convolution layer and the cross-correlation layer with high core utilization. Finally, the proposed VOT processor is implemented in 28 nm CMOS technology and occupies 0.42 mm². The proposed processor achieves 0.587 for the success rate and 0.778 for the precision in the OTB-100 dataset with SiamRPN++-AlexNet. Compared to previous VOT processors, the proposed processor shows state-of-the-art performance while showing lower power consumption. The proposed processor achieves 64.1 mW peak power and 58.2 mW tracking power consumption at 32.1 frames per second (fps) real-time VOT on mobile devices.
Chapter
In this paper, we provide a deep analysis for Siamese-based trackers and find that the one core reason for their failure on challenging cases can be attributed to the problem of decisive samples missing during offline training. Furthermore, we notice that the samples given in the first frame can be viewed as the decisive samples for the sequence since they contain rich sequence-specific information. To make full use of these sequence-specific samples, we propose a compact latent network to quickly adjust the tracking model to adapt to new scenes. A statistic-based compact latent feature is proposed to efficiently capture the sequence-specific information for the fast adjustment. In addition, we design a new training approach based on a diverse sample mining strategy to further improve the discrimination ability of our compact latent network. To evaluate the effectiveness of our method, we apply it to adjust a recent state-of-the-art tracker, SiamRPN++. Extensive experimental results on five recent benchmarks demonstrate that the adjusted tracker achieves promising improvement in terms of tracking accuracy, with almost the same speed. The code and models are available at https://github.com/xingpingdong/CLNet-tracking.
Chapter
Current state-of-the-art trackers rely only on a target appearance model in order to localize the object in each frame. Such approaches are however prone to fail in case of e.g. fast appearance changes or presence of distractor objects, where a target appearance model alone is insufficient for robust tracking. Having the knowledge about the presence and locations of other objects in the surrounding scene can be highly beneficial in such cases. This scene information can be propagated through the sequence and used to, for instance, explicitly avoid distractor objects and eliminate target candidate regions. In this work, we propose a novel tracking architecture which can utilize scene information for tracking. Our tracker represents such information as dense localized state vectors, which can encode, for example, if a local region is target, background, or distractor. These state vectors are propagated through the sequence and combined with the appearance model output to localize the target. Our network is learned to effectively utilize the scene information by directly maximizing tracking performance on video segments. The proposed approach sets a new state-of-the-art on 3 tracking benchmarks, achieving an AO score of 63.6% on the recent GOT-10k dataset.
Article
This brief proposes an adaptation of the Generalized Predictive Control (GPC) for ramp-reference tracking. The second-order difference operation and the plant model are used to get an augmented model with two embedded integrators and whose output is the tracking error. Differently from other GPC-based tracking algorithms, the proposed approach does not require information about the reference parameters, and the GPC prediction horizon is composed of the predicted errors instead of the expected plant outputs. Thus, the optimization function and the receding horizon strategy used in conventional GPC can be applied to get the control law. Simulation and experimental results prove that the proposed approach can successfully track constant and ramp references. The proposed method is applicable for single-input single-outputs plants. However, the mathematical background presented in this brief can be used in the development of new GPC strategies.
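One way to see why a second-order difference suits ramp references: applying it to any constant or ramp sequence gives exactly zero, so the reference parameters (offset and slope) drop out. The sketch below only demonstrates this property of the operator; it is not the GPC design itself, and the sequences are made-up examples.

```python
def second_difference(seq):
    """Apply the second-order difference: s[k] - 2*s[k-1] + s[k-2]."""
    return [seq[k] - 2 * seq[k - 1] + seq[k - 2] for k in range(2, len(seq))]


constant = [5 for k in range(6)]          # constant reference
ramp = [3 + 2 * k for k in range(6)]      # ramp with offset 3 and slope 2
parabola = [k * k for k in range(6)]      # not annihilated by the operator

print(second_difference(constant))
print(second_difference(ramp))
print(second_difference(parabola))
```

Constants and ramps are annihilated regardless of their offset or slope, while a parabola leaves a nonzero residue, which is consistent with the abstract's claim of tracking constant and ramp references.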
Article
Discriminative correlation filters (DCFs) have been widely used in the visual tracking community in recent years. The DCFs-based trackers determine the target location through a response map generated by the correlation filters and determine the target scale by a fixed scale factor. However, the response map is vulnerable to noise interference and the fixed scale factor also cannot reflect the real scale change of the target, which can obviously reduce the tracking performance. In this paper, to solve the aforementioned drawbacks, we propose to learn a metric learning model in correlation filters framework for visual tracking (called CFML). This model can use a metric learning function to solve the target scale problem. In particular, we adopt a hard negative mining strategy to alleviate the influence of the noise on the response map, which can effectively improve the tracking accuracy. Extensive experimental results demonstrate that the proposed CFML tracker achieves competitive performance compared with the state-of-the-art trackers.
Article
Discriminative correlation filters (DCFs) have been widely used in the tracking community recently. DCFs-based trackers utilize samples generated by circularly shifting from an image patch to train a ridge regression model, and estimate target location using a response map generated by the correlation filters. However, the generated samples produce some negative effects and the response map is vulnerable to noise interference, which degrades tracking performance. In this paper, to solve the aforementioned drawbacks, we propose a target-focusing convolutional regression (CR) model for visual object tracking tasks (called TFCR). This model uses a target-focusing loss function to alleviate the influence of background noise on the response map of the current tracking image frame, which effectively improves the tracking accuracy. In particular, it can effectively balance the disequilibrium of positive and negative samples by reducing some effects of the negative samples that act on the object appearance model. Extensive experimental results illustrate that our TFCR tracker achieves competitive performance compared with state-of-the-art trackers.
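The baseline described above, ridge regression trained on all circular shifts of one patch, has a closed-form solution per frequency. The following is a minimal 1-D sketch of that standard DCF baseline, not of the TFCR model; the signal, the labels, and the regularisation value `lam` are illustrative assumptions, and a naive DFT is used to keep the example dependency-free.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform; fine for a toy signal."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def train_filter(x, y, lam):
    """Ridge regression over all cyclic shifts of x, solved per frequency.

    Training sample i is x shifted right by i, with desired output y[i].
    """
    xf, yf = dft(x), dft(y)
    wf = [a * b / (abs(a) ** 2 + lam) for a, b in zip(xf, yf)]
    return [c.real for c in dft(wf, inverse=True)]


def response(z, w):
    """Score every cyclic shift of z: entry i uses z shifted right by i."""
    n = len(z)
    return [sum(z[(j - i) % n] * w[j] for j in range(n)) for i in range(n)]


x = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]   # training patch
y = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]   # labels peaked at zero shift
w = train_filter(x, y, lam=0.01)

z = x[-3:] + x[:-3]                  # same patch, moved right by 3 samples
r = response(z, w)
peak = r.index(max(r))
displacement = (-peak) % len(z)      # undo the shift convention of response()
print(peak, displacement)
```

A peak at index i means z has to be shifted right by i more samples to line up with the training patch, so the estimated displacement is (-i) modulo the signal length. Every shifted sample, including wrapped-around ones, enters the regression, which is where the unwanted effects of the generated samples mentioned in the abstract come from.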
Article
This paper proposes a DC-DC buck converter using an analog coarse-fine self-tracking zero-current detection (AST-ZCD) scheme. The AST-ZCD detects the zero-current by measuring the voltage level across a freewheeling transistor. It adjusts the nMOS turn-off time using an amplifier, capacitors, and current sources instead of large numbers of shift register bits and unit delay cells in the conventional digital self-tracking zero-current detection (DST-ZCD). It also reduces the zero-current self-tracking time by using coarse-fine current-sources when the output current transition is large. The proposed DC-DC buck converter was fabricated with a 0.18 μm CMOS process. The AST-ZCD reduces the area by 94%, the power consumption by 80%, and the zero-current self-tracking time by 82% compared to the DST-ZCD.
Chapter
We introduce Spatial-Temporal Memory Networks for video object detection. At its core, a novel Spatial-Temporal Memory module (STMM) serves as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM’s design enables full integration of pretrained backbone CNN weights, which we find to be critical for accurate detection. Furthermore, in order to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame. Our method produces state-of-the-art results on the benchmark ImageNet VID dataset, and our ablative studies clearly demonstrate the contribution of our different design choices. We release our code and models at http://fanyix.cs.ucdavis.edu/project/stmn/project.html.
Article
Depth cameras have recently become popular and many vision problems can be better solved with depth information. But how to integrate depth information into a visual tracker to overcome challenges such as occlusion and background distraction is still under-investigated in the current visual tracking literature. In this paper, we investigate a 3D extension of the classical mean-shift tracker, whose greedy gradient ascent strategy is generally considered unreliable in conventional 2D tracking. However, through careful study of the physical properties of 3D point clouds, we reveal that objects which may appear to be adjacent on a 2D image will form distinctive modes in the 3D probability distribution approximated by kernel density estimation, and finding the nearest mode using 3D mean-shift can always work in tracking. Based on this understanding of 3D mean-shift, we propose two important mechanisms to further boost the tracker's robustness: one is to make the tracker aware of potential distractions so that it can make corresponding adjustments to the appearance model; the other is to enable the tracker to detect and recover from tracking failures caused by total occlusion. The proposed method is both effective and computationally efficient. On a conventional PC, it runs at more than 60 FPS without GPU acceleration.
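The claim that objects adjacent in the image form separate modes in 3D can be illustrated with a plain mean-shift iteration. This is a generic mean-shift sketch with a Gaussian kernel, not the tracker from the paper; the point cloud, the starting position, and the bandwidth are made-up values.

```python
import math


def mean_shift(points, start, bandwidth, iters=50, tol=1e-6):
    """Move `start` uphill to the nearest mode of a Gaussian kernel density."""
    pos = list(start)
    for _ in range(iters):
        weights = [
            math.exp(-sum((p - q) ** 2 for p, q in zip(pt, pos))
                     / (2 * bandwidth ** 2))
            for pt in points
        ]
        total = sum(weights)
        new = [sum(w * pt[d] for w, pt in zip(weights, points)) / total
               for d in range(len(pos))]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new, pos)))
        pos = new
        if shift < tol:
            break
    return pos


# Two objects that overlap in (x, y) but sit at different depths z.
target = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0), (-0.1, 0.0, 1.0),
          (0.0, 0.1, 1.1), (0.0, -0.1, 0.9)]
distractor = [(0.2, 0.0, 3.0), (0.3, 0.0, 3.0), (0.1, 0.0, 3.0),
              (0.2, 0.1, 3.1), (0.2, -0.1, 2.9)]

mode = mean_shift(target + distractor, start=(0.1, 0.0, 1.2), bandwidth=0.3)
print(mode)
```

Started near the target, the iteration settles on the target's cluster and is essentially unaffected by the distractor, because the two clusters are far apart along the depth axis even though they overlap in the image plane.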
Article
Most thermal infrared (TIR) tracking methods are discriminative, which treat the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class difference of the arbitrary objects, while the tracking task mainly deals with the within-class difference of the same objects. In this paper, we cast the TIR tracking problem as a similarity verification task, which is well coupled to the objective of tracking task. We propose a TIR tracker via a hierarchical Siamese convolutional neural network (CNN), named HSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN coalescing the multiple hierarchical convolutional layers. Then, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before we transfer the network into the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the most similar one as the tracked target. Extensive experimental results on the benchmarks: VOT-TIR 2015 and VOT-TIR 2016, show that our proposed method achieves favorable performance against the state-of-the-art methods.
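The final two steps described above, scoring each candidate against the template and keeping the most similar one, can be sketched as follows. In the paper the features come from the Siamese CNN; here they are hand-made vectors, and cosine similarity is an assumed stand-in for the learned similarity function. All names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def locate_target(template, candidates):
    """Return the index of the candidate most similar to the template."""
    scores = [cosine_similarity(template, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Made-up feature vectors standing in for network embeddings.
template_feat = [1.0, 2.0, 3.0]
candidate_feats = [
    [3.0, 2.0, 1.0],
    [2.0, 4.0, 6.1],   # nearly parallel to the template
    [0.0, 1.0, 0.0],
]
best_index, scores = locate_target(template_feat, candidate_feats)
print(best_index, scores)
```

Framing tracking as similarity verification means the score directly ranks candidate locations, rather than predicting a class label that is only indirectly tied to location.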
Article
Infrared object tracking is a key technology in many surveillance applications. General visual tracking algorithms designed for color images cannot handle infrared targets very well due to their relatively low resolutions and blurred edges. This paper presents a new tracking-by-detection method based on online structural learning. We show how to train the classifier efficiently with dense samples through Fourier techniques and careful implementation. Furthermore, we introduce an effective feature representation for infrared objects. Finally, we demonstrate the performance of the proposed tracker on public infrared sequences with top accuracy and robustness. Meanwhile, our single-thread C++ implementation of the algorithm achieves an average tracking speed of 215 FPS on a modern CPU.
Article
Correlation Filters (CFs) have recently demonstrated excellent performance in terms of rapidly tracking objects under challenging photometric and geometric variations. The strength of the approach comes from its ability to efficiently learn - "on the fly" - how the object is changing over time. A fundamental drawback to CFs, however, is that the background of the object is not modelled over time, which can result in suboptimal results. In this paper we propose a Background-Aware CF that can model how both the foreground and background of the object vary over time. Our approach, like conventional CFs, is extremely computationally efficient - and extensive experiments over multiple tracking benchmarks demonstrate the superior accuracy and real-time performance of our method compared to the state-of-the-art trackers including those based on a deep learning paradigm.
Conference Paper
This paper proposes a novel sparse representation-based infrared target tracking method that uses multi-feature fusion to compensate for the incomplete description provided by a single feature. In the proposed method, we extract the intensity histogram together with Local Entropy and Local Contrast Mean Difference information for feature representation. To combine the various features, particle candidates and multiple feature descriptors of dictionary templates are encoded as kernel matrices. Every candidate particle is sparsely represented as a linear combination of a set of atom vectors of a dictionary. Then, the sparse target template representation model is efficiently constructed using a kernel trick method. Finally, under the framework of the particle filter, the weights of the particles are determined by the sparse coefficient reconstruction errors. For tracking, a template update strategy employing Adaptive Structural Local Sparse Appearance Tracking (ASLAS) is implemented. Experimental results on a benchmark data set demonstrate better performance than many existing methods.
D. Yuan, X. Chang, Q. Liu, D. Wang, and Z. He, "Active learning for deep visual tracking," arXiv preprint arXiv:2110.13259, 2021.
F. Xiao and Y. J. Lee, "Video object detection with an aligned spatial-temporal memory," in ECCV, 2018, pp. 485-501.