Article
PDF Available

Abstract

Existing regression-based tracking methods built on correlation filter or convolution models do not take both accuracy and robustness into account at the same time. In this paper, we propose a dual regression framework comprising a discriminative fully convolutional module and a fine-grained correlation filter component for visual tracking. The convolutional module, trained in a classification manner with hard negative mining, ensures the discriminative ability of the proposed tracker, which facilitates the handling of several challenging problems such as drastic deformation, distractors, and complicated backgrounds. The correlation filter component, built on shallow, fine-grained features, enables accurate localization. By fusing these two branches in a coarse-to-fine manner, the proposed dual-regression tracking framework achieves robust and accurate tracking performance. Extensive experiments on the OTB2013, OTB2015, and VOT2015 datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
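For intuition, a minimal numpy sketch of the coarse-to-fine fusion described in the abstract follows (the response-map inputs, window radius, and function names are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def coarse_to_fine_locate(coarse_resp, fine_resp, radius=8):
    """Illustrative coarse-to-fine fusion of two response maps.

    coarse_resp: response of the discriminative convolutional branch
                 (robust, but coarsely localized).
    fine_resp:   response of the correlation filter branch on shallow,
                 fine-grained features (accurate, but less robust).
    Both are assumed to be 2-D arrays over the same search region.
    """
    # Coarse stage: the robust branch proposes a candidate position.
    cy, cx = np.unravel_index(np.argmax(coarse_resp), coarse_resp.shape)

    # Fine stage: only trust the correlation filter response inside a
    # small window around the coarse estimate.
    mask = np.zeros_like(fine_resp)
    y0, y1 = max(cy - radius, 0), min(cy + radius + 1, fine_resp.shape[0])
    x0, x1 = max(cx - radius, 0), min(cx + radius + 1, fine_resp.shape[1])
    mask[y0:y1, x0:x1] = 1.0

    masked = np.where(mask > 0, fine_resp, -np.inf)
    fy, fx = np.unravel_index(np.argmax(masked), masked.shape)
    return fy, fx
```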
... Motivated by the success of visual tracking methods [6,7,8,9,10,11], some TIR trackers adapt deep features to represent TIR objects. For example, DSST-TIR [12] combines deep features with a traditional tracker to construct a TIR tracker. ...
... CenterTrack [44] uses a pair of images as inputs and predicts their associations with the previous frame. Reference [45] proposed a dual regression tracking framework for object tracking using a classification-based convolutional neural network (CNN) and a correlation filter (CF) module to improve tracking performance and reduce computational costs. Reference [49] demonstrated the effectiveness of using a dual-level deep representation model for thermal IR tracking and highlighted the importance of considering both image and motion information for accurate tracking. ...
Article
Full-text available
Multiple object tracking (MOT) for unmanned aerial vehicle (UAV) systems is essential for both defense and civilian applications. As drone technology moves towards real-time operation, conventional tracking algorithms cannot be directly applied to UAV videos due to limited computational resources and the unstable movement of UAVs in dynamic environments. These challenges lead to blurry video frames, object occlusion, scale changes, and biased data distributions over object classes and samples, resulting in poor tracking accuracy for non-representative classes. Therefore, in this study, we present a deep learning multiple object tracking model for UAV aerial videos that achieves real-time performance. Our approach combines detection and tracking, using adjacent frame pairs as inputs with shared features to reduce computational time. We also employ a multi-loss function to address the imbalance between challenging classes and samples. To associate objects between frames, a dual regression bounding box method that considers the center distance of objects, rather than just their areas, is adopted. This enables the proposed model to perform object ID verification and movement forecasting via a single regression. In addition, our model can perform online tracking by predicting the position of an object in the next video frame. By exploiting both low- and high-quality detections to locate the same object across frames, more accurate tracking within the video is attained. The proposed method achieves real-time tracking at 77 frames per second. The testing results demonstrate that our approach outperforms the state-of-the-art on the VisDrone2019 test-dev dataset for all ten object categories. In particular, the multiple object tracking accuracy (MOTA) score and the F1 score increased by 8.7 and 5.3 percent, respectively, in comparison to earlier work.
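As an illustration of the center-distance association idea above, here is a minimal greedy-matching sketch (the box format, gating threshold, and function names are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def associate_by_center(prev_boxes, curr_boxes, max_dist=50.0):
    """Greedy association of detections across two frames by the
    Euclidean distance between box centers (a sketch of the idea,
    not the paper's exact matching procedure).

    Boxes are (x, y, w, h); returns a list of (prev_idx, curr_idx) pairs.
    """
    def center(b):
        return np.array([b[0] + b[2] / 2.0, b[1] + b[3] / 2.0])

    pairs, used = [], set()
    for i, pb in enumerate(prev_boxes):
        dists = [np.linalg.norm(center(pb) - center(cb)) for cb in curr_boxes]
        for j in np.argsort(dists):
            # Match to the nearest unused detection within the gate.
            if int(j) not in used and dists[j] <= max_dist:
                pairs.append((i, int(j)))
                used.add(int(j))
                break
    return pairs
```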
... Several recent works also utilize regression-based target tracking methods [8], [44]. Dual-regression-based frameworks [45] and dual-margin models [46] have been shown to optimize both accuracy and robustness. Learning dual-level deep representation has been shown to be effective for infrared thermal tracking as well [47]. ...
Article
Full-text available
Recent innovations in ROI camera systems have opened up the avenue for exploring energy optimization techniques like adaptive subsampling. Generally speaking, image frame capture and read-out demand high power consumption. ROI camera systems make it possible to exploit the inverse relation between energy consumption and spatiotemporal pixel readout to optimize the power efficiency of the image sensor. To this end, we develop a reinforcement learning (RL) based adaptive subsampling framework which predicts ROI trajectories and reconfigures the image sensor on-the-fly for improved power efficiency of the image sensing pipeline. In our proposed framework, a pre-trained convolutional neural network (CNN) extracts rich visual features from incoming frames and a long short-term memory (LSTM) network predicts the region of interest (ROI) and subsampling pattern for the consecutive image frame. Based on the application and the difficulty level of object motion trajectory, the user can utilize either the predicted ROI or coarse subsampling pattern to switch off the pixels for sequential frame capture, thus saving energy. We have validated our proposed method by adapting existing trackers for the adaptive subsampling framework and evaluating them as competing baselines. As a proof-of-concept, our method outperforms the baselines and achieves an average AUC score of 0.5090 on three benchmarking datasets. We also characterize the energy-accuracy tradeoff of our method vs. the baselines and show that our approach is best suited for applications that demand both high visual tracking precision and low power consumption. On the TB100 dataset, our method achieves the highest AUC score of 0.5113 out of all the competing algorithms and requires a medium-level power consumption of approximately 4 W as per a generic energy model and an energy consumption of 1.9 mJ as per a mobile system energy model. Although other baselines are shown to have better performance in terms of power consumption, they are ill-suited for applications that require considerable tracking precision, making our method the ideal candidate in terms of power-accuracy tradeoff.
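As a rough illustration of how a predicted ROI can drive pixel readout, a sketch is given below (the mask layout, stride, and names are assumptions; the actual sensor reconfiguration interface is hardware-specific):

```python
import numpy as np

def subsample_frame(frame, roi, coarse_stride=4):
    """Sketch of ROI-driven adaptive subsampling: pixels inside the
    predicted ROI are read at full resolution, pixels outside are read
    on a coarse grid.

    frame: H x W array; roi: (x, y, w, h) predicted by the CNN-LSTM.
    Returns the subsampled frame and the fraction of pixels read out.
    """
    h, w = frame.shape
    mask = np.zeros((h, w), dtype=bool)
    mask[::coarse_stride, ::coarse_stride] = True   # coarse background grid
    x, y, bw, bh = roi
    mask[y:y + bh, x:x + bw] = True                 # full-resolution ROI

    out = np.zeros_like(frame)
    out[mask] = frame[mask]                         # pixels actually read out
    return out, mask.mean()
```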
... The spatial-temporal regularized correlation filter (STRCF) can handle unwanted boundary effects by integrating temporal and spatial regularization [17]. Li et al. presented a dual-regression framework that fuses a discriminative fully convolutional module and a fine-grained correlation filter component to realize robust and accurate visual tracking results [18]. After the widespread use of correlation filters, Siamese neural networks have become the focus of generative tracking approaches in recent years owing to their high performance and efficiency. ...
Article
Full-text available
Deep learning algorithms provide visual tracking robustness at an unprecedented level, but realizing acceptable performance is still challenging because the features of foreground and background objects change continuously over a video. One of the factors that most affects the robustness of tracking algorithms is the choice of network architecture parameters, especially the depth. A robust visual tracking model using a very deep generator (RTDG) was proposed in this study. We constructed our model on an ordinary convolutional neural network (CNN), which consists of feature extraction and binary classifier networks. We integrated a generative adversarial network (GAN) into the CNN to enhance the tracking results through an adversarial learning process performed during the training phase. We used the discriminator as a classifier and the generator as a store that produces unlabeled feature-level data with different appearances by applying masks to the extracted features. In this study, we investigated the role of increasing the number of fully connected (FC) layers in adversarial generative networks and their impact on robustness. We used a very deep FC network with 22 layers as a high-performance generator for the first time. This generator is used via adversarial learning to augment the positive samples, reducing the gap between the data-hungry deep learning algorithm and the available training data to achieve robust visual tracking. The experiments showed that the proposed framework performed well against state-of-the-art trackers on the OTB-100, VOT2019, LaSOT and UAVDT benchmark datasets.
... In addition, CREST [10] and DeepSTRCF [11] incorporated spatial and temporal information to improve model generalization. To achieve robust and reliable visual tracking results, Li et al. presented a dual-regression architecture that fuses a discriminative fully convolutional module and a fine-grained correlation filter component [12]. ...
Article
Full-text available
Visual tracking is an open and exciting field of research. Researchers have made great efforts to approach the ideal of stable object tracking regardless of changing appearances or circumstances. Owing to their attractive advantages, generative adversarial networks (GANs) have been a promising area of research in many fields. However, GAN architectures have not been thoroughly investigated in the visual tracking research community. Inspired by visual tracking via adversarial learning (VITAL), we present a novel network that generates randomly initialized masks for building augmented feature maps using multilayer perceptron (MLP) generative models. To obtain more robust tracking, these augmented masks help extract robust features that do not change over a long temporal span. Some models, such as deep convolutional generative adversarial networks (DCGANs), have been proposed to obtain powerful generator architectures by eliminating or minimizing the use of fully connected layers. This study demonstrates that using an MLP architecture for the generator is more robust and efficient than a convolution-only architecture. Also, to realize better performance, we used one-sided label smoothing to regularize the discriminator in the training stage and the label smoothing regularization (LSR) method to reduce overfitting of the classifier in the online tracking stage. The experiments show that the proposed model is more robust than the DCGAN model and offers satisfactory performance compared with state-of-the-art deep visual trackers on the OTB-100, VOT2019 and LaSOT datasets.
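One-sided label smoothing, mentioned above, is a standard GAN regularizer; a minimal PyTorch sketch follows (the 0.9 target and the function name are conventional choices, not necessarily the paper's exact values):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake, real_label=0.9):
    """One-sided label smoothing for the discriminator (sketch).

    Real targets are softened to 0.9 while fake targets stay at 0.0,
    which regularizes the discriminator and stabilizes GAN training.
    d_real / d_fake are discriminator logits on real and
    generator-augmented samples.
    """
    real_targets = torch.full_like(d_real, real_label)  # smoothed, not 1.0
    fake_targets = torch.zeros_like(d_fake)             # one-sided: left at 0
    loss_real = F.binary_cross_entropy_with_logits(d_real, real_targets)
    loss_fake = F.binary_cross_entropy_with_logits(d_fake, fake_targets)
    return loss_real + loss_fake
```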
Article
Correlation filter (CF)-based approaches have been widely applied to online object tracking tasks for unmanned aerial vehicles (UAVs) due to their high computational efficiency and low memory consumption. One of the key steps is to perform correlation operations between the appearance model (AM) and the filter. However, because the learning rate of the AM is difficult to control, most existing trackers are prone to degradation. In this paper, we propose a novel complementary AM (CAM) consisting of a primary model (PM) and a secondary model (SM). Specifically, the learning rates of the PM and SM are approximately complementary, allowing the CAM to consider both past and current information. Moreover, to take full advantage of historical information, a CAM-based reversibility reasoning approach is proposed for CF training. It can robustly handle variations in object appearance. We further create a deep tracker by fusing convolutional features, which demonstrates even better performance. We also embed the CAM into two advanced trackers to validate its scalability. Comprehensive experiments on six challenging UAV tracking benchmarks indicate the superiority of our method compared with 36 other state-of-the-art CPU- and GPU-based trackers, at a speed of 45 FPS on a cheap CPU.
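The abstract does not give the exact CAM update rule, so the following is only a speculative sketch of what "approximately complementary learning rates" for the PM and SM could look like (the names and the linear interpolation are assumptions):

```python
import numpy as np

def update_cam(pm, sm, z, eta=0.02):
    """Speculative complementary appearance-model update (the exact
    rule is not stated in the abstract; this is an assumption).

    pm: primary model, updated slowly  -> remembers past appearance.
    sm: secondary model, updated fast  -> tracks current appearance.
    z:  appearance observed in the current frame.
    """
    pm_new = (1.0 - eta) * pm + eta * z    # slow rate eta
    sm_new = eta * sm + (1.0 - eta) * z    # complementary rate 1 - eta
    return pm_new, sm_new
```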
Article
Different from visible cameras, which record intensity images frame by frame, the biologically inspired event camera produces a stream of asynchronous and sparse events with much lower latency. In practice, visible cameras can better perceive texture details and slow motion, while event cameras are free from motion blur and have a larger dynamic range, which enables them to work well under fast motion and low illumination (LI). Therefore, the two sensors can cooperate with each other to achieve more reliable object tracking. In this work, we propose a large-scale Visible-Event benchmark (termed VisEvent) to address the lack of a realistic and scaled dataset for this task. Our dataset consists of 820 video pairs captured under LI, high-speed, and background-clutter scenarios, and it is divided into training and testing subsets, which contain 500 and 320 videos, respectively. Based on VisEvent, we transform the event flows into event images and construct more than 30 baseline methods by extending current single-modality trackers into dual-modality versions. More importantly, we build a simple but effective tracking algorithm by proposing a cross-modality transformer to achieve more effective feature fusion between visible and event data. Extensive experiments on the proposed VisEvent dataset, FE108, COESOT, and two simulated datasets (i.e., OTB-DVS and VOT-DVS) validate the effectiveness of our model. The dataset and source code have been released at: https://github.com/wangxiao5791509/VisEvent_SOT_Benchmark .
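Converting event flows into event images, as mentioned above, is commonly done by accumulating event counts per pixel and polarity; a minimal sketch follows (the exact representation used for the VisEvent baselines may differ):

```python
import numpy as np

def events_to_image(events, height, width):
    """Accumulate an asynchronous event stream into a 2-channel event
    image (a common conversion, not necessarily the paper's exact one).

    events: iterable of (x, y, t, p) tuples with polarity p in {-1, +1}.
    Returns an array of shape (2, H, W): positive- and negative-event counts.
    """
    img = np.zeros((2, height, width), dtype=np.float32)
    for x, y, _t, p in events:
        channel = 0 if p > 0 else 1     # channel 0: positive, 1: negative
        img[channel, int(y), int(x)] += 1.0
    return img
```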
Article
Spatial boundary effects can significantly reduce the performance of a learned discriminative correlation filter (DCF) model. A commonly used method to relieve these effects is to extract appearance features from a wider region around the target. However, this introduces unexpected features from background pixels and noise, which decreases the filter's discriminative power. To address this shortcoming, this paper proposes an innovative method called enhanced robust spatial feature selection and correlation filter learning (EFSCF), which performs jointly sparse feature learning to handle boundary effects effectively while suppressing the influence of background pixels and noise. Unlike ℓ2-norm-based tracking approaches, which are prone to non-Gaussian noise, the proposed method imposes the ℓ2,1-norm on the loss term to enhance robustness against training outliers. To enhance discrimination further, a jointly sparse feature selection scheme based on the ℓ2,1-norm is designed to regularize the filter in rows and columns simultaneously. To the best of the authors' knowledge, this is the first work to explore the structural sparsity in the rows and columns of a learned filter simultaneously. The proposed model can be solved efficiently by an alternating direction method of multipliers. The proposed EFSCF is verified by experiments on four challenging unmanned aerial vehicle datasets under severe noise and appearance changes, and the results show that the proposed method achieves better tracking performance than the state-of-the-art trackers.
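For reference, the norm used above has a standard definition: for a filter $W \in \mathbb{R}^{m \times n}$ with rows $w_i$,

$$\|W\|_{2,1} \;=\; \sum_{i=1}^{m} \sqrt{\sum_{j=1}^{n} W_{ij}^{2}} \;=\; \sum_{i=1}^{m} \|w_i\|_{2},$$

which drives entire rows of $W$ to zero. Applying the same norm to $W^{\top}$ enforces column sparsity, so a regularizer of the form $\|W\|_{2,1} + \|W^{\top}\|_{2,1}$ selects features jointly in rows and columns (how the paper weights these terms is not stated in the abstract).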
Article
Full-text available
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is trained only on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One of them focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptively learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
Conference Paper
Full-text available
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images neither represent TIR objects effectively nor take fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects at the inter-class and intra-class levels, respectively. The two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to adapt the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against the state-of-the-art methods. Code and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Article
Full-text available
Fine-grained image classification is a challenging task due to the small inter-class difference and large intra-class difference. In this paper, we propose a novel Cascade Attention Model using a deep convolutional neural network to address this problem. Our method first leverages Spatial Confusion Attention to identify ambiguous areas of the input image, for which two constraint loss functions are proposed: the Spatial Mask loss and the Spatial And loss. Second, Cross-network Attention applies different pre-trained parameters to the two-stream architecture, and two novel loss functions, the Cross-network Similarity loss and the Satisfied Rank loss, make the two-stream networks reinforce each other and yield better results. Finally, Network Fusion Attention merges the intermediate results with a novel entropy-add strategy to obtain the final predictions. All of these modules work together and can be trained end to end. Moreover, unlike previous works, our model is fully weakly supervised and fully parallel, which leads to easier generalization and faster computation. We obtain state-of-the-art performance on three challenging benchmark datasets (CUB-200-2011, FGVC-Aircraft and Flower 102) with results of 90.8%, 92.1%, and 98.5%, respectively. The model will be publicly available at https://github.com/billzyx/LCA-CNN.
Article
Full-text available
Unlike visual object tracking, thermal infrared object tracking can track a target object in total darkness. It therefore has broad applications, such as rescue and video surveillance at night. However, there are few studies in this field, mainly because thermal infrared images have several unwanted attributes that make it difficult to obtain discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer pre-trained convolutional neural networks based on visible images to thermal infrared tracking. We observe that features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information about the target, whereas features from the convolution layers are. Moreover, features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). First, we use pre-trained convolutional neural networks to extract features from multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give response maps of the target's location. Finally, we propose an ensemble method that coalesces these response maps to obtain a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
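A minimal sketch of the response-map ensemble idea follows (the paper's actual ensemble weights are not specified in the abstract; the peak-strength weighting below is an illustrative assumption):

```python
import numpy as np

def ensemble_response(responses):
    """Fuse per-layer correlation-filter response maps into a single,
    stronger map (sketch; peak-strength weighting is an assumption).

    responses: list of 2-D response maps, one per convolution layer.
    """
    peaks = np.array([r.max() for r in responses])
    weights = peaks / peaks.sum()              # stronger peak -> larger weight
    fused = sum(w * (r / (r.max() + 1e-12))    # normalize each map, then weight
                for w, r in zip(weights, responses))
    return fused
```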
Article
Visual trackers using deep neural networks have demonstrated favorable performance in object tracking. However, training a deep classification network using overlapped initial target regions may lead to an overfitted model. To increase model generalization, we propose an appearance variation adaptation (AVA) tracker that aligns the feature distributions of target regions over time by learning an adaptation mask in an adversarial network. The proposed adversarial network consists of a generator and a discriminator network that compete with each other in a mini-max optimization over a discriminator loss. Specifically, the discriminator network aims to distinguish recent target regions from earlier ones by minimizing the discriminator loss, while the generator network aims to produce an adaptation mask that maximizes the discriminator loss. We incorporate a gradient reverse layer in the adversarial network to solve the aforementioned mini-max optimization in an end-to-end manner. We compare the performance of the proposed AVA tracker with the most recent state-of-the-art trackers through extensive experiments on the OTB50, OTB100, and VOT2016 tracking benchmarks. Among the compared methods, AVA yields the highest area under curve (AUC) score of 0.712 and the highest average precision score of 0.951 on the OTB50 tracking benchmark. It achieves the second-best AUC score of 0.688 and the best precision score of 0.924 on the OTB100 tracking benchmark. AVA also achieves the second-best expected average overlap (EAO) score of 0.366, the best failure rate of 0.68, and the second-best accuracy of 0.53 on the VOT2016 tracking benchmark.
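The gradient reverse layer mentioned above is a standard construct; a PyTorch sketch consistent with its usual definition (identity in the forward pass, negated and scaled gradient in the backward pass) is given below:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity forward, gradient multiplied by
    -lambda backward, so generator and discriminator can be optimized
    end-to-end within one mini-max objective."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) the gradient flowing back into the generator.
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```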
Article
Deep convolutional neural networks (CNNs) have attracted considerable interest in low-level computer vision. Research is usually devoted to improving performance via very deep CNNs. However, as depth increases, the influence of the shallow layers on the deep layers weakens. Inspired by this fact, we propose an attention-guided denoising convolutional neural network (ADNet), mainly comprising a sparse block (SB), a feature enhancement block (FEB), an attention block (AB) and a reconstruction block (RB) for image denoising. Specifically, the SB makes a tradeoff between performance and efficiency by using dilated and common convolutions to remove noise. The FEB integrates global and local feature information via a long path to enhance the expressive ability of the denoising model. The AB is used to finely extract the noise information hidden in complex backgrounds, which is very effective for complex noisy images, especially real noisy images and blind denoising. The FEB is also integrated with the AB to improve efficiency and reduce the complexity of training a denoising model. Finally, the RB aims to construct the clean image from the obtained noise mapping and the given noisy image. Comprehensive experiments show that the proposed ADNet performs very well on three tasks (i.e., synthetic noisy images, real noisy images, and blind denoising) in terms of both quantitative and qualitative evaluations. The code of ADNet is accessible at http://www.yongxu.org/lunwen.html.
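The reconstruction step described above follows the residual-learning convention common in CNN denoisers: the trunk predicts the noise map and the clean image is recovered by subtraction. A minimal sketch follows (layer sizes and names are placeholders, not ADNet's actual blocks):

```python
import torch
import torch.nn as nn

class ResidualDenoiser(nn.Module):
    """Minimal residual-denoising skeleton: the trunk estimates the noise
    map, and the reconstruction step subtracts it from the noisy input
    (module sizes here are placeholders, not ADNet's design)."""

    def __init__(self, channels=1, width=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, channels, 3, padding=1),
        )

    def forward(self, noisy):
        noise = self.trunk(noisy)   # predicted noise mapping
        return noisy - noise        # reconstruction: clean-image estimate
```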
Article
Correlation filtering-based visual tracking has achieved impressive success in terms of both tracking accuracy and computational efficiency. In this paper, a novel correlation filtering approach is proposed by means of joint learning to bridge the gap between circulant filtering and classical filtering methods. The circulant structure of tracking and the information from successive frames are exploited simultaneously in the proposed work. A new formulation of correlation filter learning is proposed to enhance the discrimination of the learned filter by integrating both the kernel and image feature domains. The proposed approach is computationally efficient since a closed-form solution is derived for the new formulation. Extensive experiments are conducted on two popular tracking benchmarks, and the results demonstrate that the proposed tracker outperforms most state-of-the-art trackers.
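As background for the closed-form solution mentioned above: the classical single-channel correlation filter already admits a closed form in the Fourier domain, which the paper's joint formulation is described as preserving. The standard ridge-regression filter (not the paper's new formulation) is

$$\hat{w} \;=\; \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda},$$

where $\hat{\cdot}$ denotes the discrete Fourier transform, $^{*}$ complex conjugation, $\odot$ element-wise multiplication, $y$ the desired Gaussian response, and $\lambda$ the regularization weight.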