Article

Dual-regression model for visual tracking

Abstract

Existing regression-based tracking methods built on the correlation filter model or the convolution model do not take both accuracy and robustness into account at the same time. In this paper, we propose a dual-regression framework comprising a discriminative fully convolutional module and a fine-grained correlation filter component for visual tracking. The convolutional module, trained in a classification manner with hard negative mining, ensures the discriminative ability of the proposed tracker, which facilitates the handling of several challenging problems, such as drastic deformation, distractors, and complicated backgrounds. The correlation filter component, built on shallow features with fine-grained features, enables accurate localization. By fusing these two branches in a coarse-to-fine manner, the proposed dual-regression tracking framework achieves robust and accurate tracking performance. Extensive experiments on the OTB2013, OTB2015, and VOT2015 datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
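The coarse-to-fine fusion of the two branches can be illustrated with a toy sketch (not the authors' implementation): the classification branch's coarse response map proposes a candidate location, and the correlation filter branch's fine response map refines the peak within a small window around it. The `radius` window size and the map values are hypothetical.

```python
# Hedged sketch of coarse-to-fine response fusion (illustrative only).

def argmax2d(resp):
    """Return (row, col) of the maximum entry of a 2-D list."""
    best, pos = float("-inf"), (0, 0)
    for i, row in enumerate(resp):
        for j, v in enumerate(row):
            if v > best:
                best, pos = v, (i, j)
    return pos

def coarse_to_fine(coarse, fine, radius=1):
    """Locate the coarse peak, then search the fine-grained map only
    inside a small window around it (hypothetical radius `radius`)."""
    ci, cj = argmax2d(coarse)
    h, w = len(fine), len(fine[0])
    best, pos = float("-inf"), (ci, cj)
    for i in range(max(0, ci - radius), min(h, ci + radius + 1)):
        for j in range(max(0, cj - radius), min(w, cj + radius + 1)):
            if fine[i][j] > best:
                best, pos = fine[i][j], (i, j)
    return pos

coarse = [[0.1, 0.2, 0.1],   # discriminative (robust) response
          [0.2, 0.9, 0.3],
          [0.1, 0.3, 0.2]]
fine   = [[0.0, 0.1, 0.0],   # correlation filter (accurate) response
          [0.1, 0.4, 0.8],
          [0.0, 0.2, 0.1]]
print(coarse_to_fine(coarse, fine))  # (1, 2): fine peak near the coarse peak
```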
... The performance of the proposed tracker has been evaluated using two metrics, the Area Under the Curve (AUC) and the Distance Precision (DP). To validate the proposed tracker's performance, a comparison has been carried out against trackers presented in previous works: ASLA [37], CSK [38], DSST [30], MEEM [39], MUSTER [40], SAMF [41], SRDCF [27], Struck [42], siamfc3s [43], HCFTs [4], HDT [11], Staple [44], CNN-SVM [45], CF2 [3], LCT [46], KCF [32], TLD [47], KCF_GaussHog [32], KCF_LinearHog [32], BACF [48], DeepSRDCF [49], DRVT [50], MemDTC [51], MemTrack [51], SRDCFdecon [52]. The results for the two metrics are shown as two One-Pass Evaluation (OPE) curves: the DP rate vs. the location error threshold, which measures the proportion of frames whose distance between the tracking result and the ground truth is less than a given number of pixels, and the success rate vs. the overlap threshold, which describes the percentage of successful frames, as shown in Figures 4 and 5 for the OTB50 and OTB100 datasets, respectively. ...
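The two metrics referenced above can be sketched in a few lines of plain Python (a hedged illustration of the standard OPE definitions, not code from the paper); the 20-pixel DP cutoff and the 101 threshold samples for the AUC follow common OTB practice, and the per-frame values are made up.

```python
def distance_precision(center_errors, threshold=20.0):
    """Fraction of frames whose predicted-vs-ground-truth center
    distance is below `threshold` pixels (20 px is the usual DP cutoff)."""
    return sum(e <= threshold for e in center_errors) / len(center_errors)

def success_rate(overlaps, threshold):
    """Fraction of frames whose bounding-box overlap (IoU) exceeds `threshold`."""
    return sum(o > threshold for o in overlaps) / len(overlaps)

def auc(overlaps, steps=101):
    """Area under the success curve, sampled at `steps` overlap thresholds."""
    thresholds = [i / (steps - 1) for i in range(steps)]
    return sum(success_rate(overlaps, t) for t in thresholds) / steps

errors   = [5.0, 12.0, 18.0, 35.0]   # toy per-frame center errors (px)
overlaps = [0.9, 0.7, 0.5, 0.1]      # toy per-frame IoU values
print(distance_precision(errors))     # 0.75
```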
Article
Full-text available
In this paper, a new Visual Object Tracking (VOT) approach is proposed to overcome the main problem existing approaches encounter, i.e., the significant appearance changes mainly caused by heavy occlusion and illumination variation. The proposed approach is based on a combination of Deep Convolutional Neural Networks (DCNNs), Histogram of Oriented Gradients (HOG) features, and discrete wavelet packet transforms. The problem of illumination variation is solved by feeding the coefficients of the image's discrete wavelet packet transform, instead of the image template, into the CNN to handle images with high saturation, whereas the inverse discrete wavelet packet transform is used at the output for extracting the CNN features. By combining four learned correlation filters with the convolutional features, the target location is deduced using multichannel correlation maps at the CNN output. On the other hand, the maximum values of the maps resulting from correlating the filters with the convolutional features produced from the previously obtained HOG features of the image template are calculated and used as an updating parameter for the correlation filters extracted from the CNN and from HOG. The major aim is to ensure long-term memory of the target appearance so that the target may be recovered if tracking fails. To increase the performance of HOG, the coefficients of the discrete wavelet packet transform are employed instead of the image template. The obtained results demonstrate the superiority of the proposed approach.
... When it comes to processing speed, our proposed method satisfies the real-time criterion. Figure 10 shows the comparison between our tracker and other state-of-the-art trackers, including self-SDCF [74], UDT [75], CFNet, SRDCF, SiamFC, DSST, ARCF, SiamTri [76], TRACA [77], DCFNet, and DRM [78], on the UAV123 dataset. Among these trackers, ours achieves excellent scores in both the speed and AUC metrics. ...
Article
Full-text available
Siamese network based trackers are a hot topic in the field of visual object tracking. However, Siamese trackers still have a robustness gap compared with state-of-the-art algorithms. Focusing on this issue, this letter adds Frequency Channel Attention (FCA) and an adaptive template feature map to the Siamese neural network framework. FCA can enhance the feature representation of effective channels and improve feature discrimination by modeling the correlation between the channels of the image. Through theoretical analysis and experimental validation, the restriction is broken by a simple yet effective FCA network sampling strategy, and a Siamese-FCA tracker with a significant performance gain is successfully trained. Meanwhile, to better adjust the proportion between target and background, the tracker selects a suitable size for the target feature map. Moreover, extensive ablation studies are conducted to demonstrate the effectiveness of the proposed tracker. Experimental results on five test benchmarks, including the OTB2013, OTB2015, VOT2016, VOT2018 and UAV123 datasets, show that the proposed algorithm performs outstandingly. In particular, the issue of similarity and small-target tracking failure is overcome. The average running frame rate reaches 86 frames per second, which meets real-time requirements.
... Target tracking aims to estimate the motion trajectory of a target of interest in subsequent frames from its initial state in the video, and ultimately to realize the recognition, location, and tracking of the specified target. Currently there are more and more target tracking algorithms, such as [1][2][3][4], which achieve good performance. Target tracking technology also plays an increasingly important role in work and life and is widely used in autonomous driving [5][6][7][8], augmented reality [9,10], drones [11], sports competition [12], surgery [13], biology [14][15][16], and marine exploration [17]. ...
Article
Full-text available
Target tracking is currently a hot research topic in machine vision. Traditional target tracking algorithms based on generative models select target features manually; they have a simple structure and fast running speed but cannot meet accuracy requirements in complex scenes. Compared with traditional algorithms, tracking methods based on fully convolutional networks have, owing to their good performance, become one of the important approaches to target tracking. However, the RPN-based Siamese network lacks positional reliability when predicting the target area. Aiming at the low tracking accuracy of the RPN-based Siamese network, this paper proposes an improved framework named IoU-guided SiamRPN (IG-SiamRPN). In the proposed IG-SiamRPN, the IoU-guided branch is first constructed, and sample pairs are generated through data augmentation. Then, the Jittered RoI is constructed to train the network to directly predict the localization confidence of the candidate area. Subsequently, a target selection method based on predicted IoU scores is proposed, which uses predicted IoU scores instead of classification scores to optimize the target decision strategy of the Siamese network. Finally, an optimization-based fine-tuning method for the Siamese network is proposed, which solves the problem of location degradation and improves the performance of the algorithm. Experimental results on popular databases demonstrate that the proposed IG-SiamRPN achieves better performance in both tracking accuracy and robustness than other state-of-the-art target tracking algorithms.
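The IoU-guided target decision strategy described above can be sketched as follows (a minimal illustration, not the IG-SiamRPN implementation; the candidate boxes and both score lists are made up): candidates are ranked by predicted IoU rather than by classification score.

```python
def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Target decision: pick the candidate with the highest *predicted IoU*
# (localization confidence) instead of the highest classification score.
candidates = [(10, 10, 50, 50), (12, 12, 52, 52), (40, 40, 80, 80)]
cls_scores = [0.95, 0.90, 0.40]   # toy classification confidences
iou_scores = [0.60, 0.85, 0.10]   # toy predicted localization confidences
best = max(range(len(candidates)), key=lambda i: iou_scores[i])
print(candidates[best])  # (12, 12, 52, 52)
```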
... Tracking methods under the DCF framework have achieved significantly higher efficiency than other trackers [20,24,35,40,74] and have thus attracted the attention of many researchers [4,25,28,29,41,68,69]. DCF-based tracking methods use a trained filter to distinguish targets from the background. ...
Article
Tracking in unmanned aerial vehicle (UAV) scenarios is one of the main components of target tracking tasks. Unlike target tracking in general scenarios, the target tracking task in UAV scenarios is very challenging because of factors such as small target scale and the aerial view. Although DCF-based trackers have achieved good results in general tracking scenarios, the boundary effect caused by the dense sampling method reduces tracking accuracy, especially in UAV tracking scenarios. In this work, we propose learning an adaptive spatial-temporal context-aware (ASTCA) model within the DCF-based tracking framework to improve tracking accuracy and reduce the influence of the boundary effect, thereby enabling our tracker to handle UAV tracking tasks more appropriately. Specifically, our ASTCA model learns a spatial-temporal context weight, which can precisely distinguish the target from the background in UAV tracking scenarios. Besides, considering the small target scale and the aerial view in UAV tracking scenarios, our ASTCA model incorporates spatial context information within the DCF-based tracker, which effectively alleviates background interference. Extensive experiments demonstrate that our ASTCA method performs favorably against state-of-the-art tracking methods on standard UAV datasets.
Article
Triplet loss is widely used for learning descriptors and achieves promising performance. However, it fails to fully consider the influence of adjacent descriptors from the same type of sample, which is one of the main causes of image mismatching. To solve this problem, we propose a descriptor network based on a triplet loss with a similar-triangle constraint, named STCDesc. This network considers not only the correlation between descriptors from different types of samples but also the relevance of descriptors from the same type of sample. Furthermore, we propose a normalized exponential algorithm to reduce the impact of negative samples and improve calculation speed. The proposed method can effectively improve the stability of learned descriptors using the proposed triangle constraint and normalized exponential algorithm. To verify the effectiveness of the proposed descriptor network, extensive experiments were conducted on four benchmarks. The experimental results demonstrate that the proposed descriptor network achieves favorable performance compared to state-of-the-art methods.
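For reference, the standard margin-based triplet loss that the proposed similar-triangle constraint builds on can be sketched in plain Python (this is the common baseline loss only, not the STCDesc triangle constraint; the descriptors and margin are illustrative).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard margin-based triplet loss:
    max(0, d(a, p) - d(a, n) + margin)."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

a, p, n = [0.0, 0.0], [0.1, 0.0], [2.0, 0.0]
print(triplet_loss(a, p, n))  # 0.0: the negative is far enough beyond the margin
```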
Article
In this study, we propose a novel Wasserstein distributional tracking method that can balance approximation with accuracy in terms of Monte Carlo estimation. To achieve this goal, we present three different systems: sliced Wasserstein-based (SWT), projected Wasserstein-based (PWT), and orthogonal coupled Wasserstein-based (OCWT) visual tracking systems. Sliced Wasserstein-based visual trackers can find accurate target configurations using the optimal transport plan, which minimizes the discrepancy between appearance distributions described by the estimated and ground truth configurations. Because this plan involves a finite number of probability distributions, the computation costs can be considerably reduced. Projected Wasserstein-based and orthogonal coupled Wasserstein-based visual trackers further enhance the accuracy of visual trackers using bijective mapping functions and orthogonal Monte Carlo, respectively. Experimental results demonstrate that our approach can balance computational efficiency with accuracy, and the proposed visual trackers outperform other state-of-the-art visual trackers on several benchmark visual tracking datasets.
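The sliced Wasserstein distance at the core of the SWT tracker can be sketched with a Monte Carlo estimate over random 1-D projections (a minimal illustration, not the paper's tracking system; `n_proj` and the 2-D point sets are illustrative). In 1-D, the Wasserstein-1 distance between equal-size samples reduces to the mean absolute difference of the sorted samples, which is what makes the sliced variant cheap.

```python
import math, random

def wasserstein_1d(xs, ys):
    """1-D Wasserstein-1 distance between equal-size samples:
    mean absolute difference of the sorted samples."""
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

def sliced_wasserstein(p, q, n_proj=100, seed=0):
    """Monte Carlo sliced Wasserstein distance between 2-D point sets:
    average the 1-D distance over random projection directions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.uniform(0.0, math.pi)
        d = (math.cos(theta), math.sin(theta))
        xs = [x * d[0] + y * d[1] for x, y in p]
        ys = [x * d[0] + y * d[1] for x, y in q]
        total += wasserstein_1d(xs, ys)
    return total / n_proj

p = [(0.0, 0.0), (1.0, 0.0)]
q = [(0.0, 1.0), (1.0, 1.0)]   # same shape, shifted up by 1
print(sliced_wasserstein(p, q))  # roughly 2/pi for a unit vertical shift
```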
Article
The existing body of work on video object tracking (VOT) algorithms has studied various image conditions, such as occlusion, clutter, and object shape, which influence video quality and affect tracking performance. Nonetheless, there is no clear distinction between the performance reduction caused by scene-dependent challenges such as occlusion and clutter and the effect of authentic in-capture and post-capture distortions. Despite the plethora of VOT methods in the literature, there is a lack of detailed studies analyzing performance on videos with authentic in-capture and post-capture distortions. We introduce a new dataset of authentically distorted videos (AD-SVD) to address this issue. This dataset contains 4476 videos with different authentic distortions and surveillance activities. Furthermore, it provides benchmarking results for evaluating ten state-of-the-art visual object trackers (from the VOT 2017–2018 challenges) on the proposed dataset. In addition, this study develops an approach for performance prediction and quality-aware feature selection for single-object tracking in authentically distorted surveillance videos. The method predicts the performance of a VOT algorithm with high accuracy; the probability of obtaining the reference output is then maximized without executing the tracking algorithms. We also propose a framework to reduce a video tracker's computation resources (time and video storage space) by balancing processing time and tracking accuracy through performance prediction across a range of spatial resolutions. This approach can reduce the execution time by up to 34% with a slight performance decrease of 3%.
Article
Object tracking by Siamese networks has gained popularity for its outstanding performance and considerable potential. However, most existing Siamese architectures face great difficulties in scenes where the target undergoes dramatic shape or environmental changes. In this work, we propose a novel and concise generative adversarial learning method to solve this problem, especially when the target is undergoing drastic appearance changes, illumination variations, and background clutter. We consider these situations as distractors for tracking and incorporate a distractor generator into the traditional Siamese network. This component can simulate such distractors, and more robust tracking performance is achieved by eliminating the distractors from the input instance search image. Besides, we use the generalized intersection over union (GIoU) as our training loss. GIoU is a stricter metric for bounding box regression than the traditional IoU and can be used as a training loss for more accurate tracking results. Experiments on five challenging benchmarks show favorable, state-of-the-art results against other trackers in different aspects.
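The GIoU metric used as the training loss above can be computed directly from box coordinates. The sketch below follows the standard GIoU definition (IoU minus the fraction of the smallest enclosing box not covered by the union); the training loss is then 1 − GIoU. The example boxes are illustrative.

```python
def giou(a, b):
    """Generalized IoU for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Smallest enclosing box of the pair
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c_area - union) / c_area

# Disjoint boxes: plain IoU is 0 and gives no gradient signal,
# but GIoU still penalizes the gap between the boxes.
print(giou((0, 0, 1, 1), (2, 2, 3, 3)))  # negative, unlike plain IoU
```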
Article
The tracking-by-segmentation framework is widely used in visual tracking to handle severe appearance changes such as deformation and occlusion. Tracking-by-segmentation methods first segment the target object from the background and then use the segmentation result to estimate the target state. In existing methods, target segmentation is formulated as a superpixel labeling problem constrained by a target likelihood constraint, a spatial smoothness constraint, and a temporal consistency constraint. The target likelihood is calculated by a discriminative part model trained independently from the superpixel labeling framework and updated online using historical tracking results as pseudo-labels. Due to the lack of spatial and temporal constraints and inaccurate pseudo-labels, the discriminative model is unreliable and may lead to tracking failure. This paper addresses these problems by integrating the objective function of model training into the target segmentation optimization framework. Thus, during optimization, the discriminative model is constrained spatially and temporally and provides more accurate target likelihoods for part labeling, and the labeling results in turn produce more reliable pseudo-labels for model learning. Moreover, we propose a supervision switch mechanism to detect erroneous pseudo-labels caused by severe changes in data distribution and to switch the classifier to a semi-supervised setting in such cases. Evaluation results on the OTB2013, OTB2015, and TC-128 benchmarks demonstrate the effectiveness of the proposed tracking algorithm.
Article
Full-text available
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is trained only on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One of them focuses on the global semantic similarity, and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptively learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
Conference Paper
Full-text available
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images neither represent TIR objects effectively nor take fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn TIR-specific discriminative features and fine-grained correlation features for TIR tracking. Specifically, we first use an auxiliary classification network to guide the generation of TIR-specific discriminative features for distinguishing TIR objects belonging to different classes. Second, we design a fine-grained aware module to capture more subtle information for distinguishing TIR objects belonging to the same class. These two kinds of features complement each other and recognize TIR objects at the inter-class and intra-class levels, respectively. The two feature models are learned using a multi-task matching framework and are jointly optimized on the TIR tracking task. In addition, we develop a large-scale TIR training dataset to adapt the model to the TIR domain. Extensive experimental results on three benchmarks show that the proposed algorithm achieves a relative gain of 10% over the baseline and performs favorably against the state-of-the-art methods. Code and the proposed TIR dataset are available at https://github.com/QiaoLiuHit/MMNet.
Article
Full-text available
Unlike visual object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as rescue and video surveillance at night. However, there are few studies in this field, mainly because thermal infrared images have several unwanted attributes that make it difficult to obtain discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer pre-trained convolutional neural networks based on visible images to thermal infrared tracking. We observe that the features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while the features from the convolution layers are. Besides, the features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). Firstly, we use pre-trained convolutional neural networks to extract the features of multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give response maps of the target's location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
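The ensemble step — coalescing per-layer response maps into a stronger one — can be sketched with a simple weighted sum (the paper's coalescing rule may differ; the maps and weights below are illustrative toy values).

```python
def ensemble_response(maps, weights):
    """Weighted sum of per-layer correlation response maps
    (a simple ensemble sketch)."""
    h, w = len(maps[0]), len(maps[0][0])
    fused = [[0.0] * w for _ in range(h)]
    for m, wt in zip(maps, weights):
        for i in range(h):
            for j in range(w):
                fused[i][j] += wt * m[i][j]
    return fused

def peak(resp):
    """Predicted target location: the maximum of the fused map."""
    return max(((i, j) for i in range(len(resp)) for j in range(len(resp[0]))),
               key=lambda p: resp[p[0]][p[1]])

m1 = [[0.1, 0.6], [0.2, 0.3]]   # response of a shallow-layer weak tracker
m2 = [[0.2, 0.7], [0.1, 0.2]]   # response of a deep-layer weak tracker
print(peak(ensemble_response([m1, m2], [0.5, 0.5])))  # (0, 1)
```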
Article
Visual trackers using deep neural networks have demonstrated favorable performance in object tracking. However, training a deep classification network on overlapped initial target regions may lead to an overfitted model. To increase model generalization, we propose an appearance variation adaptation (AVA) tracker that aligns the feature distributions of target regions over time by learning an adaptation mask in an adversarial network. The proposed adversarial network consists of a generator and a discriminator network that compete with each other over a discriminator loss in a mini-max optimization problem. Specifically, the discriminator network aims to distinguish recent target regions from earlier ones by minimizing the discriminator loss, while the generator network aims to produce an adaptation mask that maximizes the discriminator loss. We incorporate a gradient reversal layer in the adversarial network to solve this mini-max optimization in an end-to-end manner. We compare the performance of the proposed AVA tracker with the most recent state-of-the-art trackers through extensive experiments on the OTB50, OTB100, and VOT2016 tracking benchmarks. Among the compared methods, AVA yields the highest area under curve (AUC) score of 0.712 and the highest average precision score of 0.951 on the OTB50 tracking benchmark. It achieves the second-best AUC score of 0.688 and the best precision score of 0.924 on the OTB100 tracking benchmark. AVA also achieves the second-best expected average overlap (EAO) score of 0.366, the best failure rate of 0.68, and the second-best accuracy of 0.53 on the VOT2016 tracking benchmark.
Article
Deep convolutional neural networks (CNNs) have attracted considerable interest in low-level computer vision. Research is usually devoted to improving performance via very deep CNNs. However, as the depth increases, the influence of the shallow layers on the deep layers is weakened. Inspired by this fact, we propose an attention-guided denoising convolutional neural network (ADNet) for image denoising, mainly comprising a sparse block (SB), a feature enhancement block (FEB), an attention block (AB), and a reconstruction block (RB). Specifically, the SB makes a tradeoff between performance and efficiency by using dilated and common convolutions to remove noise. The FEB integrates global and local feature information via a long path to enhance the expressive ability of the denoising model. The AB is used to finely extract the noise information hidden in complex backgrounds, which is very effective for complex noisy images, especially real noisy images and blind denoising. The FEB is integrated with the AB to improve efficiency and reduce the complexity of training a denoising model. Finally, the RB constructs the clean image from the obtained noise mapping and the given noisy image. Comprehensive experiments show that the proposed ADNet performs very well on three tasks (i.e., synthetic noisy images, real noisy images, and blind denoising) in terms of both quantitative and qualitative evaluations. The code of ADNet is accessible at http://www.yongxu.org/lunwen.html.
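The dilated convolutions used in the sparse block enlarge the receptive field without adding parameters; a minimal 1-D sketch (illustrative, not the ADNet code): with a 3-tap kernel, dilation 2 spans 5 input positions instead of 3.

```python
def dilated_conv1d(x, kernel, dilation=1):
    """Valid 1-D convolution with a dilation factor: kernel taps are
    spaced `dilation` apart, enlarging the receptive field."""
    k = len(kernel)
    span = (k - 1) * dilation + 1          # effective receptive field
    out = []
    for start in range(len(x) - span + 1):
        out.append(sum(kernel[t] * x[start + t * dilation] for t in range(k)))
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=1))  # [6.0, 9.0, 12.0, 15.0]
print(dilated_conv1d(x, [1.0, 1.0, 1.0], dilation=2))  # [9.0, 12.0]
```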
Article
Fine-grained image classification is a challenging task due to the large inter-class difference and small intra-class difference. In this paper, we propose a novel Cascade Attention Model using the Deep Convolutional Neural Network to address this problem. Our method first leverages the Spatial Confusion Attention to identify ambiguous areas of the input image. Two constraint loss functions are proposed: the Spatial Mask loss and the Spatial And loss; Second, the Cross-network Attention, applying different pre-train parameters to the two stream architecture. Also, two novel loss functions called Cross-network Similarity loss and Satisfied Rank loss are proposed to make the two-stream networks reinforce each other and get better results. Finally, the Network Fusion Attention merges intermediate results with the novel entropy add strategy to obtain the final predictions. All of these modules can work together and can be trained end to end. Besides, different from previous works, our model is fully weak-supervised and fully paralleled, which leads to easier generalization and faster computation. We obtain the state-of-the-art performance on three challenge benchmark datasets (CUB-200-2011, FGVC-Aircraft and Flower 102) with results of 90.8%, 92.1%, and 98.5%, respectively. The model will be publicly available at https://github.com/billzyx/LCA-CNN.
Article
Correlation-filtering-based visual tracking has achieved impressive success in terms of both tracking accuracy and computational efficiency. In this paper, a novel correlation filtering approach is proposed by means of joint learning to bridge the gap between circulant filtering and classical filtering methods. The circulant structure of tracking and the information from successive frames are exploited simultaneously in the proposed work. A new formulation for correlation filter learning is proposed to enhance the discrimination of the learned filter by integrating both the kernel and image feature domains. The proposed approach is computationally efficient since a closed-form solution is derived for the new formulation. Extensive experiments are conducted on two popular tracking benchmarks, and the results demonstrate that the proposed tracker outperforms most state-of-the-art trackers.
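The closed-form Fourier-domain solution that makes circulant correlation filters efficient can be illustrated with a classical single-channel (MOSSE-style) filter — a hedged sketch of the general mechanism, not the paper's joint kernel-and-feature formulation. The 1-D signals, desired response, and regularizer `lam` are illustrative.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (fine for tiny examples)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Naive inverse DFT."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def learn_filter(x, y, lam=0.01):
    """Closed-form MOSSE-style filter in the Fourier domain:
    H* = (Y . conj(X)) / (X . conj(X) + lambda)."""
    X, Y = dft(x), dft(y)
    return [Y[k] * X[k].conjugate() / (abs(X[k]) ** 2 + lam) for k in range(len(x))]

def respond(H, z):
    """Response map: inverse DFT of H . DFT(z)."""
    Z = dft(z)
    return [v.real for v in idft([H[k] * Z[k] for k in range(len(z))])]

x = [0.0, 1.0, 0.0, 0.0]   # training signal: target at index 1
y = [0.0, 1.0, 0.0, 0.0]   # desired response peaks at the target
H = learn_filter(x, y)
z = [0.0, 0.0, 0.0, 1.0]   # test signal: target shifted to index 3
r = respond(H, z)
print(r.index(max(r)))     # 3: the response peak tracks the shift
```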