
Abstract

Existing regression-based tracking methods built on a correlation filter model or a convolution model do not take both accuracy and robustness into account at the same time. In this paper, we propose a dual regression framework comprising a discriminative fully convolutional module and a fine-grained correlation filter component for visual tracking. The convolutional module, trained in a classification manner with hard negative mining, ensures the discriminative ability of the proposed tracker, which facilitates the handling of several challenging problems, such as drastic deformation, distractors, and complicated backgrounds. The correlation filter component, built on shallow features that carry fine-grained information, enables accurate localization. By fusing these two branches in a coarse-to-fine manner, the proposed dual-regression tracking framework achieves robust and accurate tracking performance. Extensive experiments on the OTB2013, OTB2015, and VOT2015 datasets demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
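The coarse-to-fine fusion mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the fusion rule, so everything here is an assumption made for illustration: two response maps of equal size (hypothetical names `coarse_map` and `fine_map`), a hypothetical `radius` parameter, and a simple rule in which the peak of the coarse map selects a search window and the fine map refines the position inside it.

```python
def argmax_2d(grid, rows, cols):
    """Return the (row, col) of the largest value among the given rows and cols."""
    return max(((r, c) for r in rows for c in cols),
               key=lambda rc: grid[rc[0]][rc[1]])


def coarse_to_fine_locate(coarse_map, fine_map, radius=1):
    """Pick the coarse peak, then refine it with the fine map inside a window.

    Assumption: both maps are 2D lists with the same shape. This is an
    illustrative fusion rule, not the one used in the paper.
    """
    height, width = len(coarse_map), len(coarse_map[0])
    # Coarse stage: a robust but imprecise estimate of the target position.
    cr, cc = argmax_2d(coarse_map, range(height), range(width))
    # Fine stage: search only near the coarse estimate, so strong responses
    # elsewhere in the fine map (e.g. distractors) are ignored.
    rows = range(max(0, cr - radius), min(height, cr + radius + 1))
    cols = range(max(0, cc - radius), min(width, cc + radius + 1))
    return argmax_2d(fine_map, rows, cols)
```

Restricting the fine search to a window around the coarse peak is one simple way to combine a robust branch with a precise one; a weighted sum of the two maps would be another.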
... CenterTrack [44] uses a pair of images as inputs and predicts their associations with the previous frame. Reference [45] proposed a dual regression tracking framework for object tracking using a classification-based convolutional neural network (CNN) and a correlation filter (CF) module to improve tracking performance and reduce computational costs. Reference [49] demonstrated the effectiveness of using a dual-level deep representation model for thermal IR tracking and highlighted the importance of considering both image and motion information for accurate tracking. ...
Article
Multiple object tracking (MOT) of unmanned aerial vehicle (UAV) systems is essential for both defense and civilian applications. As drone technology moves towards real-time, conventional tracking algorithms cannot be directly applied to UAV videos due to limited computational resources and the unstable movements of UAVs in dynamic environments. These challenges lead to blurry video frames, object occlusion, scale changes, and biased data distribution of object classes and samples, resulting in poor tracking accuracy for non-representative classes. Therefore, in this study, we present a deep learning multiple object tracking model for UAV aerial videos to achieve real-time performance. Our approach combines detection and tracking methods using adjacent frame pairs as inputs with shared features to reduce computational time. We also employed a multi-loss function to address the imbalance between the challenging classes and samples. To associate objects between frames, a dual regression bounding box method that considers the center distance of objects rather than just their areas was adopted. This enables the proposed model to perform object ID verification and movement forecasting via single regression. In addition, our model can perform online tracking by predicting the position of an object within the next video frame. By exploiting both low- and high-quality detection techniques to locate the same object across frames, more accurate tracking of objects within the video is attained. The proposed method achieved real-time tracking with a running time of 77 frames per second. The testing results have demonstrated that our approach outperformed the state-of-the-art on the VisDrone2019 test-dev dataset for all ten object categories. In particular, the multiple object tracking accuracy (MOTA) score and the F1 score both increased in comparison to earlier work by 8.7 and 5.3 percent, respectively.
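A box-matching score that accounts for center distance as well as overlap area can be sketched as follows. The abstract does not state the formula it uses; Distance-IoU (DIoU) is one well-known score of this kind and is used here only as a stand-in. The `(x1, y1, x2, y2)` box layout and the function names are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def diou(a, b):
    """Distance-IoU: IoU minus a normalized squared center distance."""
    # Squared distance between the two box centers.
    center_dist_sq = (((a[0] + a[2]) - (b[0] + b[2])) / 2.0) ** 2 \
        + (((a[1] + a[3]) - (b[1] + b[3])) / 2.0) ** 2
    # Squared diagonal of the smallest box enclosing both inputs.
    diag_sq = (max(a[2], b[2]) - min(a[0], b[0])) ** 2 \
        + (max(a[3], b[3]) - min(a[1], b[1])) ** 2
    penalty = center_dist_sq / diag_sq if diag_sq > 0 else 0.0
    return iou(a, b) - penalty
```

Unlike plain IoU, this score still separates two non-overlapping candidates: the one whose center is closer to the reference box scores higher, which is useful when associating fast-moving objects between adjacent frames.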
... Several recent works also utilize regression-based target tracking methods [8], [44]. Dual-regression-based frameworks [45] and dual-margin models [46] have been shown to optimize both accuracy and robustness. Learning dual-level deep representation has been shown to be effective for infrared thermal tracking as well [47]. ...
Article
Recent innovations in ROI camera systems have opened up the avenue for exploring energy optimization techniques like adaptive subsampling. Generally speaking, image frame capture and read-out demand high power consumption. ROI camera systems make it possible to exploit the inverse relation between energy consumption and spatiotemporal pixel readout to optimize the power efficiency of the image sensor. To this end, we develop a reinforcement learning (RL) based adaptive subsampling framework which predicts ROI trajectories and reconfigures the image sensor on-the-fly for improved power efficiency of the image sensing pipeline. In our proposed framework, a pre-trained convolutional neural network (CNN) extracts rich visual features from incoming frames and a long short-term memory (LSTM) network predicts the region of interest (ROI) and subsampling pattern for the consecutive image frame. Based on the application and the difficulty level of object motion trajectory, the user can utilize either the predicted ROI or coarse subsampling pattern to switch off the pixels for sequential frame capture, thus saving energy. We have validated our proposed method by adapting existing trackers for the adaptive subsampling framework and evaluating them as competing baselines. As a proof-of-concept, our method outperforms the baselines and achieves an average AUC score of 0.5090 on three benchmarking datasets. We also characterize the energy-accuracy tradeoff of our method vs. the baselines and show that our approach is best suited for applications that demand both high visual tracking precision and low power consumption. On the TB100 dataset, our method achieves the highest AUC score of 0.5113 out of all the competing algorithms and requires a medium-level power consumption of approximately 4 W as per a generic energy model and an energy consumption of 1.9 mJ as per a mobile system energy model. 
Although other baselines are shown to have better performance in terms of power consumption, they are ill-suited for applications that require considerable tracking precision, making our method the ideal candidate in terms of power-accuracy tradeoff.
... Spatial temporal regularized correlation filter (STRCF) can handle the unwanted boundary effects by integrating temporal and spatial regularization [17]. Li et al. presented a dual-regression framework that fuses a discriminative fully convolutional module and fine-grained correlation filter component to realize robust and accurate visual tracking results [18]. After the widespread use of correlation filters, Siamese neural networks have become the focus of generative tracking approaches in recent years owing to their high performance and efficiency. ...
Article
Deep learning algorithms provide visual tracking robustness at an unprecedented level, but realizing acceptable performance is still challenging because of the natural continuous changes in the features of foreground and background objects over videos. One of the factors that most affects the robustness of tracking algorithms is the choice of network architecture parameters, especially the depth. This study proposes a robust visual tracking model using a very deep generator (RTDG). We constructed our model on an ordinary convolutional neural network (CNN), which consists of feature extraction and binary classifier networks. We integrated a generative adversarial network (GAN) into the CNN to enhance the tracking results through an adversarial learning process performed during the training phase. We used the discriminator as a classifier and the generator as a store that produces unlabeled feature-level data with different appearances by applying masks to the extracted features. In this study, we investigated the role of increasing the number of fully connected (FC) layers in adversarial generative networks and their impact on robustness. We used a very deep FC network with 22 layers as a high-performance generator for the first time. Through adversarial learning, this generator augments the positive samples, narrowing the gap between data-hungry deep learning algorithms and the available training data to achieve robust visual tracking. The experiments showed that the proposed framework performed well against state-of-the-art trackers on the OTB-100, VOT2019, LaSOT, and UAVDT benchmark datasets.
... In addition, CREST [10] and DeepSTRCF [11] incorporated spatial and temporal information to improve model generalization. To achieve robust and reliable visual tracking results, Li et al. presented a dual-regression architecture that fuses a discriminative fully convolutional module and fine-grained correlation filter component [12]. ...
Article
Visual tracking is an open and exciting field of research. Researchers have made great efforts to approach the ideal of stably tracking objects regardless of appearance changes or circumstances. Owing to their attractive advantages, generative adversarial networks (GANs) have become a promising area of research in many fields. However, GAN network architecture has not been thoroughly investigated in the visual tracking research community. Inspired by visual tracking via adversarial learning (VITAL), we present a novel network to generate randomly initialized masks for building augmented feature maps using multilayer perceptron (MLP) generative models. To obtain more robust tracking, these augmented masks can extract robust features that do not change over a long temporal span. Some models, such as deep convolutional generative adversarial networks (DCGANs), have been proposed to obtain powerful generator architectures by eliminating or minimizing the use of fully connected layers. This study demonstrates that an MLP architecture for the generator is more robust and efficient than the convolution-only architecture. In addition, to realize better performance, we used one-sided label smoothing to regularize the discriminator in the training stage and the label smoothing regularization (LSR) method to reduce the overfitting of the classifier in the online tracking stage. The experiments show that the proposed model is more robust than the DCGAN model and offers satisfactory performance compared with state-of-the-art deep visual trackers on the OTB-100, VOT2019, and LaSOT datasets.
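One-sided label smoothing, as mentioned in this abstract, can be sketched in a few lines. This is a generic illustration and not the authors' training code: the smoothed target value of 0.9, the function names, and the use of plain binary cross-entropy are all assumptions.

```python
import math


def binary_cross_entropy(prob, target, eps=1e-7):
    """Binary cross-entropy for one predicted probability and a soft target."""
    prob = min(max(prob, eps), 1.0 - eps)  # keep log() finite
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def discriminator_loss(real_probs, fake_probs, real_target=0.9):
    """Discriminator loss with one-sided label smoothing.

    Only the targets for real samples are softened (1.0 -> real_target);
    targets for generated samples stay at 0.0, hence "one-sided".
    """
    real_term = sum(binary_cross_entropy(p, real_target) for p in real_probs) / len(real_probs)
    fake_term = sum(binary_cross_entropy(p, 0.0) for p in fake_probs) / len(fake_probs)
    return real_term + fake_term
```

With the softened target, a discriminator that outputs 0.999 on a real sample is penalized more than one that outputs 0.9, which discourages overconfident predictions.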
... The performance of the proposed tracker has been evaluated using two performance metrics, the Area-Under-the-Curve (AUC) and the Distance Precision (DP). To validate the proposed tracker's performance, it has been compared with trackers presented in previous works: ASLA [37], CSK [38], DSST [30], MEEM [39], MUSTER [40], SAMF [41], SRDCF [27], Struck [42], siamfc3s [43], HCFTs [4], HDT [11], Staple [44], CNN-SVM [45], CF2 [3], LCT [46], KCF [32], TLD [47], KCF_GaussHog [32], KCF_LinearHog [32], BACF [48], DeepSRDCF [49], DRVT [50], MemDTC [51], MemTrack [51], SRDCFdecon [52]. The results for the two metrics are shown as two One-Pass Evaluation (OPE) curves: the DP rate vs. the location error threshold, which measures the proportion of frames in which the distance between the tracking result and the ground truth is less than a given number of pixels, and the success rate vs. the overlap threshold, which describes the percentage of successful frames, as shown in Figures 4 and 5 for the OTB50 and OTB100 datasets, respectively. ...
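The two metrics described in this snippet can be computed from per-frame results roughly as follows. This is a minimal sketch: the 20-pixel default, the 21 evenly spaced overlap thresholds, and the function names are assumptions chosen for illustration, not values taken from the cited paper.

```python
def distance_precision(center_errors, threshold=20.0):
    """Fraction of frames whose center location error is below the threshold."""
    return sum(1 for e in center_errors if e < threshold) / len(center_errors)


def success_rate(overlaps, threshold):
    """Fraction of frames whose bounding-box overlap exceeds the threshold."""
    return sum(1 for o in overlaps if o > threshold) / len(overlaps)


def success_auc(overlaps, steps=20):
    """Area under the success plot, approximated as the mean success rate
    over evenly spaced overlap thresholds in [0, 1]."""
    thresholds = [i / steps for i in range(steps + 1)]
    return sum(success_rate(overlaps, t) for t in thresholds) / len(thresholds)
```

Here `center_errors` holds one center distance in pixels per frame and `overlaps` holds one intersection-over-union value per frame. Sweeping the threshold and plotting the resulting rates gives the two OPE curves.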
Article
In this paper, a new Visual Object Tracking (VOT) approach is proposed to overcome the main problem that existing approaches encounter, i.e., significant appearance changes, which are mainly caused by heavy occlusion and illumination variation. The proposed approach is based on a combination of Deep Convolutional Neural Networks (DCNNs), Histogram of Oriented Gradient (HOG) features, and discrete wavelet packet transforms. The problem of illumination variation is solved by feeding the coefficients of the image discrete wavelet packet transform, instead of the image template, to the input of the CNN in order to handle images with high saturation, whereas the inverse discrete wavelet packet transform is used at the output for extracting the CNN features. By combining four learned correlation filters with the convolutional features, the target location is deduced using multichannel correlation maps at the CNN output. In addition, the maximum value of the maps resulting from the correlation filters with convolutional features, produced by the previously obtained HOG feature of the image template, is calculated and used as an updating parameter of the correlation filters extracted from the CNN and from HOG. The major aim is to ensure long-term memory of the target appearance so that the target may be recovered if tracking fails. To increase the performance of HOG, the coefficients of the discrete wavelet packet transform are employed instead of the image template. The obtained results demonstrate the superiority of the proposed approach.
Article
The spatial boundary effect can significantly reduce the performance of a learned discriminative correlation filter (DCF) model. A commonly used method to relieve this effect is to extract appearance features from a wider region around the target. However, this introduces unwanted features from background pixels and noise, which decreases the filter's discrimination power. To address this shortcoming, this paper proposes a method called enhanced robust spatial feature selection and correlation filter learning (EFSCF), which performs jointly sparse feature learning to handle boundary effects effectively while suppressing the influence of background pixels and noise. Unlike ℓ2-norm-based tracking approaches, which are sensitive to non-Gaussian noise, the proposed method imposes the ℓ2,1-norm on the loss term to enhance robustness against training outliers. To enhance the discrimination further, a jointly sparse feature selection scheme based on the ℓ2,1-norm is designed to regularize the filter in rows and columns simultaneously. To the best of the authors' knowledge, this is the first work exploring structural sparsity in the rows and columns of a learned filter simultaneously. The proposed model can be efficiently solved by the alternating direction method of multipliers (ADMM). The proposed EFSCF is verified by experiments on four challenging unmanned aerial vehicle datasets under severe noise and appearance changes, and the results show that the proposed method achieves better tracking performance than state-of-the-art trackers.
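For readers unfamiliar with the ℓ2,1-norm, a short sketch of how it is computed may help. The norm itself is standard (the sum of the ℓ2 norms of the rows of a matrix); the combined row-plus-column penalty below is only an assumed illustration of regularizing rows and columns together, not the exact regularizer of EFSCF, and the function names are hypothetical.

```python
import math


def l21_norm(matrix):
    """l2,1-norm: the sum over rows of each row's Euclidean (l2) norm."""
    return sum(math.sqrt(sum(v * v for v in row)) for row in matrix)


def transpose(matrix):
    """Swap rows and columns of a rectangular list-of-lists matrix."""
    return [list(col) for col in zip(*matrix)]


def row_column_penalty(matrix):
    """Illustrative penalty that encourages both whole rows and whole
    columns of the matrix to become zero (assumed form)."""
    return l21_norm(matrix) + l21_norm(transpose(matrix))
```

Because each row contributes its unsquared ℓ2 norm, the penalty grows linearly with the size of an outlier row instead of quadratically, and minimizing it tends to zero out entire rows, which is what makes it useful for structured feature selection.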
Article
Triple loss is widely used to detect learned descriptors and achieves promising performance. However, triple loss fails to fully consider the influence of adjacent descriptors from the same type of sample, which is one of the main reasons for image mismatching. To solve this problem, we propose a descriptor network based on triple loss with a similar triangle constraint, named as STCDesc. This network not only considers the correlation between descriptors from different types of samples but also considers the relevance of descriptors from the same type of sample. Furthermore, we propose a normalized exponential algorithm to reduce the impact of negative samples and improve calculation speed. The proposed method can effectively improve the stability of learned descriptors using the proposed triangle constraint and normalized exponential algorithm. To verify the effectiveness of the proposed descriptor network, extensive experiments were conducted using four benchmarks. The experimental results demonstrate that the proposed descriptor network achieves favorable performance compared to state-of-the-art methods.
Article
Numerous tracking approaches attempt to improve target representation through target-aware or distractor-aware modeling. However, the unbalanced consideration of target or distractor information makes it difficult for these methods to benefit from both aspects at the same time. In this paper, we propose a target-distractor aware model with a discriminative enhancement learning loss to learn a target representation that can better distinguish the target in complex scenes. First, to enlarge the gap between the target and the distractor, we design a discriminative enhancement learning loss. By highlighting the hard negatives that are similar to the target and shrinking the easy negatives that are pure background, the features sensitive to the target or distractor representation can be mined more conveniently. On this basis, we further propose a target-distractor aware model. Unlike existing methods that favor either the target or the distractor, we construct the target-specific feature space by activating the target-sensitive and distractor-silent features. Therefore, the appearance model can not only represent the target well but also suppress background distractors. Finally, the target-distractor aware representation model is integrated with a Siamese matching network to achieve robust and real-time visual tracking. Extensive experiments performed on eight tracking benchmarks show that the proposed algorithm achieves favorable performance.
Article
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is trained only on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One of them focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptively learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and the largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits the training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
Conference Paper
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor able to take fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects at the inter-class and intra-class levels, respectively. These two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to train the network and adapt the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against state-of-the-art methods. Codes and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Article
Fine-grained image classification is a challenging task due to the small inter-class difference and large intra-class difference. In this paper, we propose a novel Cascade Attention Model using a Deep Convolutional Neural Network to address this problem. First, our method leverages the Spatial Confusion Attention to identify ambiguous areas of the input image, for which two constraint loss functions are proposed: the Spatial Mask loss and the Spatial And loss. Second, the Cross-network Attention applies different pre-trained parameters to the two-stream architecture. In addition, two novel loss functions called the Cross-network Similarity loss and the Satisfied Rank loss are proposed to make the two-stream networks reinforce each other and obtain better results. Finally, the Network Fusion Attention merges intermediate results with the novel entropy add strategy to obtain the final predictions. All of these modules work together and can be trained end to end. Moreover, unlike previous works, our model is fully weakly supervised and fully parallel, which leads to easier generalization and faster computation. We obtain state-of-the-art performance on three challenging benchmark datasets (CUB-200-2011, FGVC-Aircraft and Flower 102) with results of 90.8%, 92.1%, and 98.5%, respectively. The model will be publicly available at https://github.com/billzyx/LCA-CNN.
Article
Unlike visual object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as in rescue and video surveillance at night. However, there are few studies in this field, mainly because thermal infrared images have several unwanted attributes that make it difficult to obtain the discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer convolutional neural networks pre-trained on visible images to thermal infrared tracking. We observe that the features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while the features from the convolution layers are. In addition, the features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). First, we use pre-trained convolutional neural networks to extract the features of multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give the response maps of the target's location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
Article
Visual trackers using deep neural networks have demonstrated favorable performance in object tracking. However, training a deep classification network using overlapped initial target regions may lead to an overfitted model. To increase model generalization, we propose an appearance variation adaptation (AVA) tracker that aligns the feature distributions of target regions over time by learning an adaptation mask in an adversarial network. The proposed adversarial network consists of a generator and a discriminator network that compete with each other over a discriminator loss in a mini-max optimization problem. Specifically, the discriminator network aims to distinguish recent target regions from earlier ones by minimizing the discriminator loss, while the generator network aims to produce an adaptation mask that maximizes the discriminator loss. We incorporate a gradient reversal layer in the adversarial network to solve this mini-max optimization in an end-to-end manner. We compare the performance of the proposed AVA tracker with recent state-of-the-art trackers through extensive experiments on the OTB50, OTB100, and VOT2016 tracking benchmarks. Among the compared methods, AVA yields the highest area under curve (AUC) score of 0.712 and the highest average precision score of 0.951 on the OTB50 tracking benchmark. It achieves the second best AUC score of 0.688 and the best precision score of 0.924 on the OTB100 tracking benchmark. AVA also achieves the second best expected average overlap (EAO) score of 0.366, the best failure rate of 0.68, and the second best accuracy of 0.53 on the VOT2016 tracking benchmark.
Article
Deep convolutional neural networks (CNNs) have attracted considerable interest in low-level computer vision. Research is usually devoted to improving performance via very deep CNNs. However, as the depth increases, the influence of the shallow layers on the deep layers is weakened. Motivated by this fact, we propose an attention-guided denoising convolutional neural network (ADNet), mainly including a sparse block (SB), a feature enhancement block (FEB), an attention block (AB) and a reconstruction block (RB) for image denoising. Specifically, the SB makes a tradeoff between performance and efficiency by using dilated and common convolutions to remove the noise. The FEB integrates global and local feature information via a long path to enhance the expressive ability of the denoising model. The AB is used to finely extract the noise information hidden in the complex background, which is very effective for complex noisy images, especially real noisy images and blind denoising. Also, the FEB is integrated with the AB to improve the efficiency and reduce the complexity of training a denoising model. Finally, the RB constructs the clean image from the obtained noise mapping and the given noisy image. Additionally, comprehensive experiments show that the proposed ADNet performs very well in three tasks (i.e. synthetic and real noisy images, and blind denoising) in terms of both quantitative and qualitative evaluations. The code of ADNet is accessible at http://www.yongxu.org/lunwen.html.
Article
Correlation filtering based visual tracking has achieved impressive success in terms of both tracking accuracy and computational efficiency. In this paper, a novel correlation filtering approach is proposed that uses joint learning to bridge the gap between circulant filtering and classical filtering methods. The circulant structure of tracking and the information from successive frames are simultaneously exploited in the proposed work. A new formulation for correlation filter learning is proposed to enhance the discrimination of the learned filter by integrating both the kernel and the image feature domains. The proposed approach is computationally efficient since a closed-form solution is derived for the new formulation. Extensive experiments are conducted on two popular tracking benchmarks, and the experimental results demonstrate that the proposed tracker outperforms most state-of-the-art trackers.
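To make the idea of a closed-form correlation filter concrete, here is a minimal one-dimensional sketch of the classic ridge-regression filter solved in the Fourier domain (in the style of MOSSE and KCF with a linear kernel). It is not the joint formulation proposed in this abstract, which is not given here; the function names, the regularization value `lam`, and the naive DFT are assumptions made to keep the example self-contained.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform (O(n^2); fine for a tiny example)."""
    n = len(signal)
    return [sum(signal[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n) for f in range(n)) / n
            for k in range(n)]


def train_filter(template, desired, lam=1e-3):
    """Closed-form ridge regression over all cyclic shifts of the template.

    Per frequency: H = Y * conj(X) / (|X|^2 + lam).
    """
    X, Y = dft(template), dft(desired)
    return [y * x.conjugate() / ((x * x.conjugate()).real + lam) for x, y in zip(X, Y)]


def detect(filter_spectrum, patch):
    """Response over all cyclic shifts; the peak index estimates the shift."""
    Z = dft(patch)
    return [v.real for v in idft([h * z for h, z in zip(filter_spectrum, Z)])]
```

The circulant structure is what makes this cheap: training on every cyclic shift of the template reduces to an element-wise division per frequency, so no matrix inversion is needed. If `patch` is the template shifted by `s` samples, the peak of the response moves to index `s`.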