Project

visual tracking


Project log

Di Yuan
added a research item
Tracking in unmanned aerial vehicle (UAV) scenarios is one of the main components of the target tracking task. Different from tracking in general scenarios, tracking in UAV scenarios is very challenging because of factors such as small target scale and the aerial viewpoint. Although discriminative correlation filter (DCF)-based trackers have achieved good results in general scenarios, the boundary effect caused by dense sampling reduces tracking accuracy, especially in UAV scenarios. In this work, we propose learning an adaptive spatial-temporal context-aware (ASTCA) model in the DCF-based tracking framework to improve tracking accuracy and reduce the influence of the boundary effect, thereby enabling our tracker to handle UAV tracking tasks more appropriately. Specifically, our ASTCA model learns a spatial-temporal context weight, which can precisely distinguish the target from the background in UAV tracking scenarios. Besides, considering the small target scale and the aerial view in UAV tracking, our ASTCA model incorporates spatial context information within the DCF-based tracker, which effectively alleviates background interference. Extensive experiments demonstrate that our ASTCA method performs favorably against state-of-the-art tracking methods on standard UAV datasets.
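For context, here is a minimal single-channel sketch of the dense-sampling formulation that underlies DCF trackers (standard DCF machinery, not the ASTCA model itself): every circular shift of the training patch acts as an implicit sample, which is what makes the closed-form Fourier solution fast and also what introduces the boundary effect the abstract refers to.

```python
import numpy as np

def train_dcf(x, y, lam=1e-2):
    # Ridge regression over all circular shifts of patch x, solved
    # per frequency: H = conj(X) * Y / (conj(X) * X + lambda).
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)

def respond(H, z):
    # Response map over all cyclic shifts of the search patch z.
    return np.real(np.fft.ifft2(H * np.fft.fft2(z)))

# Toy usage: a Gaussian label centered in a 64x64 patch.
patch = np.random.rand(64, 64)
gy, gx = np.mgrid[0:64, 0:64]
label = np.exp(-((gy - 32) ** 2 + (gx - 32) ** 2) / (2 * 4.0 ** 2))
H = train_dcf(patch, label)
resp = respond(H, patch)
print(np.unravel_index(resp.argmax(), resp.shape))  # peak near (32, 32)
```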
Qiao Liu
added a research item
Existing trackers usually exploit robust features or online updating mechanisms to deal with target variations, which are a key challenge in visual tracking. However, features that are robust to variations retain little spatial information, and existing online updating methods are prone to overfitting. In this paper, we propose a dual-margin model for robust and accurate visual tracking. The dual-margin model comprises an intra-object margin between different target appearances and an inter-object margin between the target and the background. The proposed method can not only distinguish the target from the background but also perceive target changes, which tracks appearance variation and facilitates accurate target state estimation. In addition, to exploit rich offline video data and learn general rules of target appearance variation, we train the dual-margin model on a large offline video dataset. We perform tracking under a Siamese framework using the constructed appearance set as templates. The proposed method achieves accurate and robust tracking performance on five public datasets while running in real time. The favorable performance against state-of-the-art methods demonstrates the effectiveness of the proposed algorithm.
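The abstract does not give the loss explicitly; the following is a hinge-style sketch of what a dual-margin objective could look like, with an intra-object margin that tolerates same-target appearance variation and an inter-object margin that repels background. The functional form and margin values are assumptions, not the paper's exact loss.

```python
import numpy as np

def dual_margin_loss(d_pos, d_neg, m_intra=0.5, m_inter=1.0):
    # Intra-object margin: appearances of the same target may vary,
    # but only up to m_intra in the embedding space.
    intra = np.maximum(0.0, d_pos - m_intra).mean()
    # Inter-object margin: background samples must stay at least
    # m_inter away from the template.
    inter = np.maximum(0.0, m_inter - d_neg).mean()
    return intra + inter

# d_pos: template-to-target-appearance distances;
# d_neg: template-to-background distances.
print(dual_margin_loss(np.array([0.3, 0.7]), np.array([0.8, 1.4])))
```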
Qiao Liu
added a research item
Recent years have witnessed significant improvements in ensemble trackers based on independent models. However, existing ensemble trackers only combine the responses of the independent models and pay less attention to the learning process, which keeps their performance from improving further. To this end, we propose an interactive learning framework that strengthens the independent models during the learning process. Specifically, in the interactive network, we force the convolutional filter models to interact with each other by sharing their responses during learning. This interaction between the convolutional filter models can mine hard samples and prevent easy samples from overwhelming them, which improves their discriminative capacity. In addition, to achieve a more accurate target location, we develop a fusion mechanism based on the confidences of the independent predictions. We evaluate the proposed method on six public datasets: OTB-2013, OTB-2015, VOT-2016, VOT-2017, Temple-Color-128, and LaSOT. The comprehensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods.
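A hedged sketch of confidence-based response fusion, using the peak-to-sidelobe ratio (a common correlation filter confidence proxy; the paper's exact confidence measure is not specified here) to weight each independent model's response map:

```python
import numpy as np

def psr(resp, exclude=5):
    # Peak-to-sidelobe ratio: peak height relative to the statistics
    # of the map away from the peak.
    py, px = np.unravel_index(resp.argmax(), resp.shape)
    mask = np.ones_like(resp, dtype=bool)
    mask[max(0, py - exclude):py + exclude + 1,
         max(0, px - exclude):px + exclude + 1] = False
    side = resp[mask]
    return (resp.max() - side.mean()) / (side.std() + 1e-8)

def fuse(responses):
    # Confidence-weighted sum of the independent models' responses.
    w = np.array([psr(r) for r in responses])
    w /= w.sum()
    return sum(wi * r for wi, r in zip(w, responses))

maps = [np.random.rand(50, 50) for _ in range(3)]
print(fuse(maps).shape)  # (50, 50)
```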
Di Yuan
added a research item
The training of a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning a feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and utilize the proposed multi-cycle consistency loss to learn the feature extraction network. Furthermore, we propose a similarity dropout strategy to drop low-quality training sample pairs and adopt a cycle trajectory consistency loss on each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and utilize a Siamese correlation tracking framework to locate the target using forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance compared with state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.
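A minimal sketch of the forward-backward consistency idea: track forward through a short clip, then backward, and penalize the deviation from the starting box. The `track_step(frame_a, frame_b, box)` interface is an assumption used for illustration; the paper's loss operates on pseudo-labels over multiple cycles.

```python
import numpy as np

def cycle_consistency_loss(track_step, frames, start_box):
    # track_step(frame_a, frame_b, box) propagates a box from
    # frame_a to frame_b (assumed interface).
    box = start_box
    for a, b in zip(frames[:-1], frames[1:]):          # forward pass
        box = track_step(a, b, box)
    rev = frames[::-1]
    for a, b in zip(rev[:-1], rev[1:]):                # backward pass
        box = track_step(a, b, box)
    # Deviation from the starting box is the self-supervised signal.
    return float(np.sum((np.asarray(box) - np.asarray(start_box)) ** 2))

# Sanity check with an identity "tracker": round-trip loss is zero.
frames = [np.zeros((8, 8)) for _ in range(4)]
print(cycle_consistency_loss(lambda a, b, box: box, frames, (2, 2, 4, 4)))
```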
Qiao Liu
added a research item
Existing regression-based tracking methods built on correlation filter or convolution models do not take both accuracy and robustness into account at the same time. In this paper, we propose a dual regression framework comprising a discriminative fully convolutional module and a fine-grained correlation filter component for visual tracking. The convolutional module, trained in a classification manner with hard negative mining, ensures the discriminative ability of the proposed tracker, which facilitates the handling of several challenging problems, such as drastic deformation, distractors, and complicated backgrounds. The correlation filter component, built on shallow fine-grained features, enables accurate localization. By fusing these two branches in a coarse-to-fine manner, the proposed dual-regression tracking framework achieves robust and accurate tracking performance. Extensive experiments on the OTB2013, OTB2015, and VOT2015 datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
Di Yuan
added a research item
Correlation filter-based trackers (CFTs) have recently shown remarkable performance in the field of visual object tracking. The advantage of these trackers originates from their ability to convert time-domain calculations into frequency-domain calculations. However, a significant problem of these CFTs is that the model is insufficiently robust when the tracking scenario is too complicated, so the ideal tracking performance cannot be achieved. Recent work has attempted to resolve this problem by reducing the boundary effect through effective modeling of the foreground and background of the target (e.g., CFLB, BACF, and CACF). Although these methods have demonstrated reasonable performance, they are often affected by occlusion, deformation, scale variation, and other challenging scenes. In this study, considering the relationship between the current frame and the previous frame of a moving target in a time series, we propose a temporal regularization strategy to improve the BACF tracker (denoted TRBACF), a typical representative of the aforementioned trackers. The TRBACF tracker can efficiently adjust the model to adapt to changes in the tracking scene, thereby enhancing its robustness and accuracy. Moreover, the objective function of our TRBACF tracker can be solved by an improved alternating direction method of multipliers (ADMM), which speeds up the calculation in the Fourier domain. Extensive experimental results demonstrate that the proposed TRBACF tracker achieves competitive tracking performance compared with state-of-the-art trackers.
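A single-channel sketch of how a temporal regularizer enters the CF objective: minimizing |HX - Y|^2 + lam|H|^2 + mu|H - H_prev|^2 per frequency has a closed form. TRBACF's full objective also includes BACF's background-aware cropping and is solved with ADMM; this isolates just the effect of the temporal term.

```python
import numpy as np

def train_temporal_cf(x, y, H_prev, lam=1e-2, mu=0.5):
    # Per-frequency minimizer of |HX - Y|^2 + lam|H|^2 + mu|H - H_prev|^2:
    # the new filter is pulled toward the previous frame's filter.
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return (np.conj(X) * Y + mu * H_prev) / (np.conj(X) * X + lam + mu)

# With mu -> 0 this reduces to the plain DCF solution; a larger mu
# makes the model change more slowly across frames.
x = np.random.rand(32, 32)
y = np.random.rand(32, 32)
H0 = np.zeros((32, 32), dtype=complex)
H1 = train_temporal_cf(x, y, H0)
print(train_temporal_cf(x, y, H1).shape)  # (32, 32)
```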
Di Yuan
added a research item
Most existing tracking methods use a classifier and multi-scale estimation to estimate the target state. Consequently, and as expected, trackers have become more stable while tracking accuracy has stagnated. While the ATOM tracker adopts a maximum-overlap method based on an intersection-over-union (IoU) loss to mitigate this problem, the IoU loss itself has a defect: the objective function can no longer be optimized when one bounding box completely contains another, which makes it very challenging to accurately estimate the target state. Accordingly, in this paper, we address the above-mentioned problem by proposing a novel tracking method based on a distance-IoU (DIoU) loss, such that the proposed tracker consists of a target estimation component and a target classification component. The target estimation component is trained to predict the DIoU score between the target ground-truth bounding box and the estimated bounding box. The DIoU loss maintains the advantage of the IoU loss while also minimizing the distance between the center points of the two bounding boxes, thereby making target estimation more accurate. Moreover, we introduce a classification component that is trained online to guarantee real-time tracking speed. Comprehensive experimental results demonstrate that our DIoUTrack achieves competitive tracking accuracy compared with state-of-the-art trackers while tracking at over 50 fps.
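The DIoU score itself is a published, well-defined quantity; a small sketch of it, showing why the gradient survives full containment where plain IoU stalls:

```python
def diou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2). DIoU = IoU - d^2 / c^2, where d is
    # the center distance and c the diagonal of the smallest enclosing
    # box, so a gradient survives even when one box contains the other.
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    d2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (max(ay2, by2) - min(ay1, by1)) ** 2
    return inter / union - d2 / c2

# Two predictions fully inside the ground truth share the same IoU,
# but DIoU prefers the better-centered one (the DIoU loss is 1 - DIoU).
print(diou((0, 0, 10, 10), (1, 1, 4, 4)))          # off-center: lower
print(diou((0, 0, 10, 10), (3.5, 3.5, 6.5, 6.5)))  # centered: higher
```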
Di Yuan
added a research item
Discriminative correlation filters (DCFs) have been widely used in the visual tracking community in recent years. DCF-based trackers determine the target location from a response map generated by the correlation filters and determine the target scale by a fixed scale factor. However, the response map is vulnerable to noise interference, and a fixed scale factor cannot reflect the real scale change of the target, which clearly reduces tracking performance. In this paper, to address these drawbacks, we propose learning a metric learning model within a correlation filter framework for visual tracking (called CFML). This model uses a metric learning function to solve the target scale problem. In particular, we adopt a hard negative mining strategy to alleviate the influence of noise on the response map, which effectively improves tracking accuracy. Extensive experimental results demonstrate that the proposed CFML tracker achieves competitive performance compared with state-of-the-art trackers.
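The abstract does not spell out the mining rule; a simple assumed version on a CF response map picks the strongest off-target peaks as hard negatives, since those are exactly the distractors the filter most needs to suppress:

```python
import numpy as np

def mine_hard_negatives(resp, target_yx, radius=8, k=16):
    # Keep only locations outside a radius around the target, then
    # take the k with the highest response as hard negatives.
    h, w = resp.shape
    yy, xx = np.mgrid[0:h, 0:w]
    far = (yy - target_yx[0]) ** 2 + (xx - target_yx[1]) ** 2 > radius ** 2
    flat = np.where(far.ravel(), resp.ravel(), -np.inf)
    idx = np.argpartition(flat, -k)[-k:]
    return np.column_stack(np.unravel_index(idx, resp.shape))

resp = np.random.rand(50, 50)
print(mine_hard_negatives(resp, (25, 25)).shape)  # (16, 2) of (y, x)
```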
Di Yuan
added a research item
Convolutional neural networks (CNNs) have been demonstrated to achieve state-of-the-art performance in the visual object tracking task. However, existing CNN-based trackers usually use holistic target samples to train their networks. Once the target undergoes complicated situations (e.g., occlusion, background clutter, and deformation), the tracking performance degrades badly. In this paper, we propose an adaptive structural convolutional filter model to enhance the robustness of deep regression trackers (named ASCT). Specifically, we first design a mask set to generate local filters that capture local structures of the target. Meanwhile, we adopt an adaptive weighting fusion strategy for these local filters to adapt to changes in target appearance, which effectively enhances the robustness of the tracker. Besides, we develop an end-to-end trainable network comprising feature extraction, decision making, and model updating modules for effective training. Extensive experimental results on large benchmark datasets demonstrate that the proposed ASCT tracker performs favorably against the state-of-the-art trackers.
Di Yuan
added a research item
Discriminative correlation filters (DCFs) have been widely used in the tracking community recently. DCF-based trackers utilize samples generated by circularly shifting an image patch to train a ridge regression model and estimate the target location using a response map generated by the correlation filters. However, the generated samples produce some negative effects, and the response map is vulnerable to noise interference, which degrades tracking performance. In this paper, to address these drawbacks, we propose a target-focusing convolutional regression (CR) model for visual object tracking (called TFCR). This model uses a target-focusing loss function to alleviate the influence of background noise on the response map of the current frame, which effectively improves tracking accuracy. In particular, it can effectively balance the disequilibrium of positive and negative samples by reducing some effects of the negative samples that act on the object appearance model. Extensive experimental results illustrate that our TFCR tracker achieves competitive performance compared with state-of-the-art trackers.
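A hedged sketch of what a focus-weighted regression loss could look like (the Gaussian weighting is an assumption, not the paper's exact loss): errors near the target keep full weight, while the abundant easy negatives far from the target are down-weighted, rebalancing positives and negatives.

```python
import numpy as np

def target_focusing_loss(resp, label, target_yx, sigma=6.0):
    # Gaussian focus weights: full weight near the target, decaying
    # over the background, so distant background errors cannot
    # dominate the regression loss.
    h, w = resp.shape
    yy, xx = np.mgrid[0:h, 0:w]
    focus = np.exp(-((yy - target_yx[0]) ** 2 + (xx - target_yx[1]) ** 2)
                   / (2 * sigma ** 2))
    return float(np.mean(focus * (resp - label) ** 2))

resp = np.random.rand(40, 40)
label = np.zeros((40, 40)); label[20, 20] = 1.0
print(target_focusing_loss(resp, label, (20, 20)))
```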
Qiao Liu
added 2 research items
The performance of the tracking task depends directly on the appearance features of the target object; therefore, a robust method for constructing appearance features is crucial for avoiding tracking failure. Tracking methods based on convolutional neural networks (CNNs) have exhibited excellent performance in the past years. However, the features from each original convolutional layer are not robust to size changes of the target object. Once the size of the target object changes significantly, the tracker drifts away from it. In this paper, we present a novel tracker based on multi-scale features, spatiotemporal features, and a deep residual network to accurately estimate the size of the target object, so that our tracker can successfully locate the target in consecutive video frames. To solve the multi-scale change issue in visual object tracking, we sample each input image with 67 templates of different sizes and resize the samples to a fixed size. These samples are then used to offline-train a deep residual network model with the multi-scale features we have built. Spatial and temporal features are then fused into this multi-scale deep residual network, yielding a deep multi-scale spatiotemporal feature model named MSST-ResNet. Finally, the MSST-ResNet feature model is transferred to the tracking task and combined with three different kernelized correlation filters (KCFs) to accurately locate the target in consecutive video frames. Unlike previous trackers, we directly learn the various changes of target appearance by building up the MSST-ResNet feature model. The experimental results demonstrate that the proposed tracking method outperforms the state-of-the-art tracking methods.
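A sketch of the multi-scale sampling step described here: 67 geometrically spaced sizes, each resized back to a fixed input. The abstract fixes only the count of 67; the 1.02 scale step and the 64x64 output size below are assumptions.

```python
import numpy as np
from scipy.ndimage import zoom

def multi_scale_samples(image, n_scales=67, step=1.02, out_size=64):
    # Geometrically spaced scale factors centered on 1.0; each scaled
    # crop is resized back to a fixed size for the network.
    half = n_scales // 2
    samples = []
    for s in step ** np.arange(-half, half + 1):
        scaled = zoom(image, s, order=1)             # rescale the patch
        back = (out_size / scaled.shape[0], out_size / scaled.shape[1])
        samples.append(zoom(scaled, back, order=1))  # back to fixed size
    return np.stack(samples)

print(multi_scale_samples(np.random.rand(64, 64)).shape)  # (67, 64, 64)
```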
Qiao Liu
added 2 research items
The performance of the tracking task depends directly on the appearance features of the target object, so a robust approach for constructing appearance features is crucial for adapting to appearance changes. To construct an accurate and robust appearance model for visual object tracking, we modify the original deep residual network architecture and name the result the Multi-Scale Residual Network (MSResNet). The first video frame and its related information from the current input video sequence are used to learn a multi-scale appearance model of the target object, and a loss function is minimized over the appearance features. Meanwhile, the spatial information of each video frame and the temporal information between successive video frames are effectively combined with MSResNet. The resulting features, generated by the Multi-Scale Spatio-Temporal Residual Network (named the MSSTResNet feature model), can adapt to scale variation, illumination variation, background clutter, severe deformation of the target object, and so on. We implement a robust tracking method based on the tracking-learning-detection framework using our proposed MSSTResNet feature model and name it the MSSTResNet-TLD tracker. Unlike previous tracking methods, the MSResNet architecture is not pre-trained offline on large auxiliary datasets but is learned directly end-to-end with a multi-task loss on the current input video sequence. Furthermore, the multi-task loss function combines a classification loss and a regression loss, which makes target localization more accurate. Our experimental results demonstrate that the proposed tracking method outperforms the current state-of-the-art tracking methods on the Visual Object Tracking Benchmark (VOT-2016), Object Tracking Benchmark (OTB-2015), and Unmanned Aerial Vehicles (UAV20L) test datasets. Furthermore, our MSSTResNet-TLD tracker is faster than most previous trackers based on deep convolutional neural networks (ConvNets or CNNs) and is extremely robust to tiny target objects. Our source code is available for download at https://github.com/binger1225/MSSTResNet-TLD-Tracker.
Di Yuan
added a research item
Common tracking algorithms use only a single feature to describe the target appearance, which makes the appearance model easily disturbed by noise and clearly limits the tracking performance and robustness of these trackers. In this paper, we propose a novel multiple-feature fusion model in a correlation filter framework for visual tracking to improve the tracking performance and robustness of the tracker. In different tracking scenarios, the response maps generated by the correlation filter framework differ for each feature. Based on these response maps, the different features can use an adaptive weighting function to eliminate noise interference and maintain their respective advantages, which efficiently enhances the tracking performance and robustness of the tracker. Meanwhile, the correlation filter framework provides a fast training and accurate locating mechanism. In addition, we give a simple yet effective scale variation detection method, which appropriately handles scale variation of the target in the tracking sequences. We evaluate our tracker on the OTB2013/OTB50/OTB2015 benchmarks, which include more than 100 video sequences. Extensive experiments on these benchmark datasets demonstrate that the proposed MFFT tracker performs favorably against the state-of-the-art trackers.
Di Yuan
added a research item
Most correlation filter-based tracking algorithms achieve good performance and maintain a fast computational speed. However, in some complicated tracking scenes, there is a fatal defect that causes the object to be located inaccurately: these trackers depend excessively on the maximum response value to determine the object location. To address this problem, we propose a particle filter redetection-based tracking approach for accurate object localization. During tracking, the kernelized correlation filter (KCF)-based tracker locates the object by relying on the maximum response value of the response map; when the response map becomes ambiguous, the tracking result becomes correspondingly unreliable. Our redetection model can then provide abundant object candidates through a particle resampling strategy to detect the object accordingly. Additionally, for the target scale variation problem, we give a new object scale evaluation mechanism, which merely considers the differences between the maximum response values in consecutive frames to determine the scale change of the target. Extensive experiments on the OTB2013 and OTB2015 datasets demonstrate that the proposed tracker performs favorably in relation to the state-of-the-art methods.
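A hedged sketch of the two mechanisms described above: a redetection trigger on a weak KCF peak with Gaussian particle resampling around the last reliable location, plus a scale step driven by consecutive maximum responses. The threshold, particle spread, and the exact response-to-scale mapping are all assumptions; the abstract gives only the cues.

```python
import numpy as np

def needs_redetection(resp, threshold=0.25):
    # An ambiguous response map shows a weak peak; fall back to the
    # particle filter when the KCF peak drops below the threshold.
    return resp.max() < threshold

def sample_particles(center_yx, n=200, sigma=10.0, rng=None):
    # Gaussian particle resampling around the last reliable location
    # supplies abundant candidates for re-locating the target.
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=center_yx, scale=sigma, size=(n, 2))

def scale_step(max_prev, max_curr, gamma=0.05):
    # Scale cue from consecutive peak responses only: grow the box a
    # little when the peak rises, shrink it when the peak falls.
    return 1.0 + gamma * np.sign(max_curr - max_prev)

resp = np.random.rand(50, 50) * 0.2
if needs_redetection(resp):
    print(sample_particles((25.0, 25.0)).shape, scale_step(0.31, 0.27))
```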
Qiao Liu
added a research item
Discriminative methods have been widely applied to construct the appearance model for visual tracking. Most existing methods incorporate an online updating strategy to adapt to the appearance variations of targets. The focus of online updating for discriminative methods is to select the positive samples that emerged in past frames to represent the appearances. However, the appearances of positive samples might be very dissimilar to each other, and traditional online updating strategies easily overfit to some appearances and neglect the others. To address this problem, we propose an effective method to learn a discriminative template that maintains the information of the target's multiple appearances over long-term variations. Our method is based on the observation that the target appearance varies very little over a certain number of successive video frames; therefore, we can use a few instances to represent the appearances within such a scope of successive frames. We propose an exclusive group sparse model to describe this observation and provide a novel algorithm, called coefficient-constrained exclusive group LASSO, to solve it in a single objective function. The experimental results on the CVPR2013 benchmark datasets demonstrate that our approach achieves promising performance.
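For reference, the exclusive group sparsity penalty has the standard form of a sum over groups of squared L1 norms, which encourages competition within each group so that only a few instances represent each scope of successive frames. The sketch below shows only this penalty; the paper's coefficient constraints and full solver are omitted.

```python
import numpy as np

def exclusive_group_penalty(w, groups):
    # Sum over groups of the squared L1 norm: sparsity *within* each
    # group, unlike group lasso, which zeroes out whole groups.
    return sum(np.sum(np.abs(w[g])) ** 2 for g in groups)

# Coefficients over 6 samples split into two temporal groups; one or
# two instances per group carry the weight.
w = np.array([0.9, 0.0, 0.1, 0.0, 0.8, 0.0])
print(exclusive_group_penalty(w, [np.arange(3), np.arange(3, 6)]))  # 1.64
```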
Di Yuan
added 4 research items
Robust visual object tracking is one of the most challenging issues in the field of computer vision. Because of the circular-shift strategy, correlation filter-based trackers show great efficiency in the tracking task and have thus received much attention. However, most correlation filter-based trackers fix the scale of the target in each frame and use a single template to update the filters, which makes them unreliable in the tracking task. In this paper, we aim to improve the robustness of the kernelized correlation filter (KCF) in the tracking task through a fast scale pyramid solution to the scale variation problem. Furthermore, we introduce a sparse model selection scheme on template sets to solve the problem of contaminated templates in single-template methods. We test our method on the OTB-2013 dataset, and the experimental results show its robustness. The proposed tracker achieves promising performance both in terms of accuracy and speed compared with the state-of-the-art trackers.
In this paper, we propose a novel thermal infrared (TIR) tracker based on a deep Siamese convolutional neural network (CNN), named Siamese TIR. Different from most existing discriminative TIR tracking methods, which treat the tracking problem as a classification problem, we treat the TIR tracking problem as a similarity verification problem. Specifically, we design a novel Siamese convolutional neural network that coalesces multiple convolution layers to obtain richer information for tracking. Then, we train this network end-to-end on a large video detection dataset to learn the similarity of two arbitrary objects. Next, this pre-trained Siamese network is used simply as a similarity function to evaluate the similarity between the initial target and candidates. Finally, we locate the most similar candidate without any adaptation during the tracking process. To evaluate the performance of our TIR tracker, we conduct experiments on the TIR tracking benchmark VOT-TIR2016. The experimental results show that the proposed method achieves very competitive performance.
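A minimal sketch of similarity verification by sliding correlation, which is the core of Siamese tracking. Identity "features" on zero-mean data stand in for the learned embeddings; a real Siamese tracker would correlate deep multi-layer features instead.

```python
import numpy as np
from scipy.signal import correlate2d

def siamese_score_map(template_feat, search_feat):
    # Slide the embedded template over the embedded search region;
    # the peak of the score map marks the most similar candidate.
    return correlate2d(search_feat, template_feat, mode='valid')

search = np.random.randn(32, 32)
template = search[10:18, 12:20]        # true location (10, 12)
score = siamese_score_map(template, search)
print(np.unravel_index(score.argmax(), score.shape))  # (10, 12)
```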
The accurate localization of a target is a challenging research issue in visual tracking. Most correlation filter-based tracking algorithms suffer degraded performance because of weaknesses in their search strategies. This paper investigates the problem of accurately locating the target in visual tracking sequences. We propose a novel particle filter redetection tracking approach that re-locates the target when the kernelized correlation filter tracking result becomes unreliable. Additionally, we give a new target scale evaluation: different from other proposed scale search strategies, our method merely considers the difference between the maximum values of the response maps of adjacent frames. Extensive experiments are performed on the OTB2013 dataset, on which the proposed approach achieves promising performance.
Di Yuan
added 2 research items
In most tracking approaches, a score function is utilized to determine which candidate is the optimal one by measuring the similarity between the candidate and the template. However, selecting representative samples for the template update is challenging. To address this problem, in this paper we treat the template as a linear combination of representative samples and propose a novel approach to select representative samples based on a coefficient-constrained model. We formulate the objective function as a non-negative least squares problem and obtain the solution using standard non-negative least squares. The experimental results show that the observation module of our approach outperforms several other observation modules under the same feature and motion modules, such as the support vector machine, logistic regression, ridge regression, and structured-output support vector machine.
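Non-negative least squares is a standard solver, so the core step can be shown directly with SciPy: express the current appearance as a non-negative combination of candidate samples, and read the representative samples off the support of the solution. The toy data below is illustrative only.

```python
import numpy as np
from scipy.optimize import nnls

# Columns of A are candidate sample features; b is the current target
# appearance. nnls solves min ||A w - b||_2 subject to w >= 0, and the
# nonzero entries of w select the representative samples.
rng = np.random.default_rng(0)
A = rng.random((128, 10))            # 10 candidates, 128-dim features
b = 0.6 * A[:, 2] + 0.4 * A[:, 7]    # target built from samples 2 and 7
w, residual = nnls(A, b)
print(np.flatnonzero(w > 1e-6), round(residual, 6))  # [2 7] 0.0
```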
Occlusion is one of the most challenging problems in visual object tracking. Recently, many discriminative methods have been proposed to deal with this problem. For discriminative methods, it is difficult to select representative samples for updating the target template. In general, the holistic bounding boxes that contain the tracked results are selected as positive samples. However, when the object is occluded, this simple strategy easily introduces noise into the training data set and the target template, and then leads the tracker to drift seriously away from the target. To address this problem, we propose a robust patch-based visual tracker with online representative sample selection. Different from previous works, we divide the object and the candidates into several patches uniformly and propose a score function to calculate the score of each patch independently. Then, the average score is adopted to determine the optimal candidate. Finally, we utilize the non-negative least squares method to find the representative samples, which are used to update the target template. The experimental results on the Object Tracking Benchmark 2013 and on 13 challenging sequences show that the proposed method is robust to occlusion and achieves promising results.
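A small sketch of the patch-based scoring idea, with negative L2 distance standing in for the paper's score function (an assumption): because each patch is scored independently and the scores are averaged, an occluded patch only dents the mean instead of ruining the whole candidate's score.

```python
import numpy as np

def patch_average_score(candidate, template, n=4):
    # Split both images into an n-by-n grid, score each patch pair
    # independently, and average the per-patch scores.
    ph = candidate.shape[0] // n
    pw = candidate.shape[1] // n
    scores = [-np.linalg.norm(candidate[i*ph:(i+1)*ph, j*pw:(j+1)*pw]
                              - template[i*ph:(i+1)*ph, j*pw:(j+1)*pw])
              for i in range(n) for j in range(n)]
    return float(np.mean(scores))

tmpl = np.random.rand(32, 32)
occluded = tmpl.copy(); occluded[:8, :8] = 0.0   # one occluded corner
print(patch_average_score(occluded, tmpl))       # only 1 of 16 patches hit
```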