Preprint

Learning Discriminative Model Prediction for Tracking

Authors:
Preprints and early-stage research may not have been peer reviewed yet.
To read the file of this research, you can request a copy directly from the authors.

Abstract

The current strive towards end-to-end trainable computer vision systems imposes major challenges for the task of visual tracking. In contrast to most other vision problems, tracking requires the learning of a robust target-specific appearance model online, during the inference stage. To be end-to-end trainable, the online learning of the target model thus needs to be embedded in the tracking architecture itself. Due to these difficulties, the popular Siamese paradigm simply predicts a target feature template. However, such a model possesses limited discriminative power due to its inability of integrating background information. We develop an end-to-end tracking architecture, capable of fully exploiting both target and background appearance information for target model prediction. Our architecture is derived from a discriminative learning loss by designing a dedicated optimization process that is capable of predicting a powerful model in only a few iterations. Furthermore, our approach is able to learn key aspects of the discriminative loss itself. The proposed tracker sets a new state-of-the-art on 6 tracking benchmarks, achieving an EAO score of 0.440 on VOT2018, while running at over 40 FPS.

No file available

Request Full-text Paper PDF

To read the file of this research,
you can request a copy directly from the authors.

ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
In the field of generic object tracking numerous attempts have been made to exploit deep features. Despite all expectations, deep trackers are yet to reach an outstanding level of performance compared to methods solely based on handcrafted features. In this paper, we investigate this key issue and propose an approach to unlock the true potential of deep features for tracking. We systematically study the characteristics of both deep and shallow features, and their relation to tracking accuracy and robustness. We identify the limited data and low spatial resolution as the main challenges, and propose strategies to counter these issues when integrating deep features for tracking. Furthermore, we propose a novel adaptive fusion approach that leverages the complementary properties of deep and shallow features to improve both robustness and accuracy. Extensive experiments are performed on four challenging datasets. On VOT2017, our approach significantly outperforms the top performing tracker from the challenge with a relative gain of 17% in EAO.
Article
Full-text available
We propose a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN). Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Our network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify target in each domain. We train each domain in the network iteratively to obtain generic target representations in the shared layers. When tracking a target in a new sequence, we construct a new network by combining the shared layers in the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state. The proposed algorithm illustrates outstanding performance in existing tracking benchmarks.
Article
Full-text available
The Correlation Filter is an algorithm that trains a linear template to discriminate between images and their translations. It is well suited to object tracking because its formulation in the Fourier domain provides a fast solution, enabling the detector to be re-trained once per frame. Previous works that use the Correlation Filter, however, have adopted features that were either manually designed or trained for a different task. This work is the first to overcome this limitation by interpreting the Correlation Filter learner, which has a closed-form solution, as a differentiable layer in a deep neural network. This enables learning deep features that are tightly coupled to the Correlation Filter. Experiments illustrate that our method has the important practical benefit of allowing lightweight architectures to achieve state-of-the-art performance at high framerates.
Conference Paper
Full-text available
The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object’s appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
Conference Paper
Full-text available
In this paper, we propose a new aerial video dataset and benchmark for low altitude UAV target tracking, as well as, a photo-realistic UAV simulator that can be coupled with tracking methods. Our benchmark provides the first evaluation of many state-of-the-art and popular trackers on 123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective. Among the compared trackers, we determine which ones are the most suitable for UAV tracking both in terms of tracking accuracy and run-time. The simulator can be used to evaluate tracking algorithms in real-time scenarios before they are deployed on a UAV “in the field”, as well as, generate synthetic but photo-realistic tracking datasets with automatic ground truth annotations to easily extend existing real-world datasets. Both the benchmark and simulator are made publicly available to the vision community on our website to further research in the area of object tracking from UAVs. (https:// ivul. kaust. edu. sa/ Pages/ pub-benchmark-simulator-uav. aspx.).
Article
Full-text available
Robust and accurate visual tracking is one of the most challenging computer vision problems. Due to the inherent lack of training data, a robust approach for constructing a target appearance model is crucial. Recently, discriminatively learned correlation filters (DCF) have been successfully applied to address this problem for tracking. These methods utilize a periodic assumption of the training samples to efficiently learn a classifier on all patches in the target neighborhood. However, the periodic assumption also introduces unwanted boundary effects, which severely degrade the quality of the tracking model. We propose Spatially Regularized Discriminative Correlation Filters (SRDCF) for tracking. A spatial regularization component is introduced in the learning to penalize correlation filter coefficients depending on their spatial location. Our SRDCF formulation allows the correlation filters to be learned on a significantly larger set of negative training samples, without corrupting the positive samples. We further propose an optimization strategy, based on the iterative Gauss-Seidel method, for efficient online learning of our SRDCF. Experiments are performed on four benchmark datasets: OTB-2013, ALOV++, OTB-2015, and VOT2014. Our approach achieves state-of-the-art results on all four datasets. On OTB-2013 and OTB-2015, we obtain an absolute gain of 8.0% and 8.2% respectively, in mean overlap precision, compared to the best existing trackers.
Article
Full-text available
Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual object tracking. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. However, the underlying DCF formulation is restricted to single-resolution feature maps, significantly limiting its potential. In this paper, we go beyond the conventional DCF framework and introduce a novel formulation for training continuous convolution filters. We employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. Our proposed formulation enables efficient integration of multi-resolution deep feature maps, leading to superior results on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color (+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate). Additionally, our approach is capable of sub-pixel localization, crucial for the task of accurate feature point tracking. We also demonstrate the effectiveness of our learning formulation in extensive feature point tracking experiments. Code and supplementary material are available at http://www.cvl.isy.liu.se/research/objrec/visualtracking/conttrack/index.html.
Conference Paper
Full-text available
Although not commonly used, correlation filters can track complex objects through rotations, occlusions and other distractions at over 20 times the rate of current state-of- the-art techniques. The oldest and simplest correlation filters use simple templates and generally fail when ap- plied to tracking. More modern approaches such as ASEF and UMACE perform better, but their training needs are poorly suited to tracking. Visual tracking requires robust filters to be trained from a single frame and dynamically adapted as the appearance of the target object changes. This paper presents a new type of correlation filter, a Minimum Output Sum of Squared Error (MOSSE) filter, which produces stable correlation filters when initialized using a single frame. A tracker based upon MOSSE fil- ters is robust to variations in lighting, scale, pose, and non-rigid deformations while operating at 669 frames per second. Occlusion is detected based upon the peak-to- sidelobe ratio, which enables the tracker to pause and re- sume where it left off when the object reappears. Note: This paper contains additional figures and con- tent that was excluded from CVPR 2010 to meet length requirements.
Article
We introduce here a large tracking database that offers an unprecedentedly wide coverage of common moving objects in the wild, called GOT-10k. Specifically, GOT-10k is built upon the backbone of WordNet structure [1] and it populates the majority of over 560 classes of moving objects and 87 motion patterns, magnitudes wider than the most recent similar-scale counterparts [19] , [20] , [23] , [26] . By releasing the large high-diversity database, we aim to provide a unified training and evaluation platform for the development of class-agnostic, generic purposed short-term trackers. The features of GOT-10k and the contributions of this article are summarized in the following. (1) GOT-10k offers over 10,000 video segments with more than 1.5 million manually labeled bounding boxes, enabling unified training and stable evaluation of deep trackers. (2) GOT-10k is by far the first video trajectory dataset that uses the semantic hierarchy of WordNet to guide class population, which ensures a comprehensive and relatively unbiased coverage of diverse moving objects. (3) For the first time, GOT-10k introduces the one-shot protocol for tracker evaluation, where the training and test classes are zero-overlapped . The protocol avoids biased evaluation results towards familiar objects and it promotes generalization in tracker development. (4) GOT-10k offers additional labels such as motion classes and object visible ratios, facilitating the development of motion-aware and occlusion-aware trackers. (5) We conduct extensive tracking experiments with 39 typical tracking algorithms and their variants on GOT-10k and analyze their results in this paper. (6) Finally, we develop a comprehensive platform for the tracking community that offers full-featured evaluation toolkits, an online evaluation server, and a responsive leaderboard. The annotations of GOT-10k’s test data are kept private to avoid tuning parameters on it.
Article
With efficient appearance learning models, Discriminative Correlation Filter (DCF) has been proven to be very successful in recent video object tracking benchmarks and competitions. However, the existing DCF paradigm suffers from two major issues, i.e., spatial boundary effect and temporal filter degradation. To mitigate these challenges, we propose a new DCF-based tracking method. The key innovations of the proposed method include adaptive spatial feature selection and temporal consistent constraints, with which the new tracker enables joint spatial-temporal filter learning in a lower dimensional discriminative manifold. More specifically, we apply structured spatial sparsity constraints to multi-channel filers. Consequently, the process of learning spatial filters can be approximated by the lasso regularisation. To encourage temporal consistency, the filter model is restricted to lie around its historical value and updated locally to preserve the global structure in the manifold. Last, a unified optimisation framework is proposed to jointly select temporal consistency preserving spatial features and learn discriminative filters with the augmented Lagrangian method. Qualitative and quantitative evaluations have been conducted on a number of well-known benchmarking datasets such as OTB2013, OTB50, OTB100, Temple-Colour, UAV123 and VOT2018. The experimental results demonstrate the superiority of the proposed method over the state-of-the-art approaches.
Article
For visual tracking, an ideal filter learned by the correlation filter (CF) method should take both discrimination and reliability information. However, existing attempts usually focus on the former one while pay less attention to reliability learning. This may make the learned filter be dominated by the unexpected salient regions on the feature map, thereby resulting in model degradation. To address this issue, we propose a novel CF-based optimization problem to jointly model the discrimination and reliability information. First, we treat the filter as the element-wise product of a base filter and a reliability term. The base filter is aimed to learn the discrimination information between the target and backgrounds, and the reliability term encourages the final filter to focus on more reliable regions. Second, we introduce a local response consistency regular term to emphasize equal contributions of different regions and avoid the tracker being dominated by unreliable regions. The proposed optimization problem can be solved using the alternating direction method and speeded up in the Fourier domain. We conduct extensive experiments on the OTB-2013, OTB-2015 and VOT-2016 datasets to evaluate the proposed tracker. Experimental results show that our tracker performs favorably against other state-of-the-art trackers.
Article
Despite the numerous developments in object tracking, further development of current tracking algorithms is limited by small and mostly saturated datasets. As a matter of fact, data-hungry trackers based on deep-learning currently rely on object detection datasets due to the scarcity of dedicated large-scale tracking datasets. In this work, we present TrackingNet, the first large-scale dataset and benchmark for object tracking in the wild. We provide more than 30K videos with more than 14 million dense bounding box annotations. Our dataset covers a wide selection of object classes in broad and diverse context. By releasing such a large-scale dataset, we expect deep trackers to further improve and generalize. In addition, we introduce a new benchmark composed of 500 novel videos, modeled with a distribution similar to our training dataset. By sequestering the annotation of the test set and providing an online evaluation server, we provide a fair benchmark for future development of object trackers. Deep trackers fine-tuned on a fraction of our dataset improve their performance by up to 1.6% on OTB100 and up to 1.7% on TrackingNet Test. We provide an extensive benchmark on TrackingNet by evaluating more than 20 trackers. Our results suggest that object tracking in the wild is far from being solved.
Article
Offline training for object tracking has recently shown great potentials in balancing tracking accuracy and speed. However, it is still difficult to adapt an offline trained model to a target tracked online. This work presents a Residual Attentional Siamese Network (RASNet) for high performance object tracking. The RASNet model reformulates the correlation filter within a Siamese tracking framework, and introduces different kinds of the attention mechanisms to adapt the model without updating the model online. In particular, by exploiting the offline trained general attention, the target adapted residual attention, and the channel favored feature attention, the RASNet not only mitigates the over-fitting problem in deep network training, but also enhances its discriminative capacity and adaptability due to the separation of representation learning and discriminator learning. The proposed deep architecture is trained from end to end and takes full advantage of the rich spatial temporal information to achieve robust visual tracking. Experimental results on two latest benchmarks, OTB-2015 and VOT2017, show that the RASNet tracker has the state-of-the-art tracking accuracy while runs at more than 80 frames per second.
Article
This paper improves state-of-the-art on-line trackers that use deep learning. Such trackers train a deep network to pick a specified object out from the background in an initial frame (initialization) and then keep training the model as tracking proceeds (updates). Our core contribution is a meta-learning-based method to adjust deep networks for tracking using off-line training. First, we learn initial parameters and per-parameter coefficients for fast online adaptation. Second, we use training signal from future frames for robustness to target appearance variations and environment changes. The resulting networks train significantly faster during the initialization, while improving robustness and accuracy. We demonstrate this approach on top of the current highest accuracy tracking approach, tracking-by-detection based MDNet and close competitor, the correlation-based CREST. Experimental results on both standard benchmarks, OTB and VOT2016, show improvements in speed, accuracy, and robustness on both trackers.
Conference Paper
How to effectively learn temporal variation of target appearance, to exclude the interference of cluttered background , while maintaining real-time response, is an essential problem of visual object tracking. Recently, Siamese networks have shown great potentials of matching based trackers in achieving balanced accuracy and beyond real-time speed. However, they still have a big gap to classification & updating based trackers in tolerating the temporal changes of objects and imaging conditions. In this paper, we propose dynamic Siamese network, via a fast transformation learning model that enables effective online learning of target appearance variation and background suppression from previous frames. We then present elementwise multi-layer fusion to adaptively integrate the network outputs using multi-level deep features. Unlike state-of-the-art trackers, our approach allows the usage of any feasible generally-or particularly-trained features, such as SiamFC and VGG. More importantly, the proposed dynamic Siamese network can be jointly trained as a whole directly on the labeled video sequences, thus can take full advantage of the rich spatial temporal information of moving objects. As a result, our approach achieves state-of-the-art performance on OTB-2013 and VOT-2015 benchmarks, while exhibits superiorly balanced accuracy and real-time response over state-of-the-art competitors.
Conference Paper
Machine learning techniques are often used in computer vision due to their ability to leverage large amounts of training data to improve performance. Unfortunately, most generic object trackers are still trained from scratch online and do not benefit from the large number of videos that are readily available for offline training. We propose a method for offline training of neural networks that can track novel objects at test-time at 100 fps. Our tracker is significantly faster than previous methods that use neural networks for tracking, which are typically very slow to run and not practical for real-time applications. Our tracker uses a simple feed-forward network with no online training required. The tracker learns a generic relationship between object motion and appearance and can be used to track novel objects that do not appear in the training set. We test our network on a standard tracking benchmark to demonstrate our tracker’s state-of-the-art performance. Further, our performance improves as we add more videos to our offline training set. To the best of our knowledge, our tracker (Our tracker is available at http:// davheld. github. io/ GOTURN/ GOTURN. html) is the first neural-network tracker that learns to track generic objects at 100 fps.
Article
In this paper we present a tracker, which is radically different from state-of-the-art trackers: we apply no model updating, no occlusion detection, no combination of trackers, no geometric matching, and still deliver state-of-the-art tracking performance, as demonstrated on the popular online tracking benchmark (OTB) and six very challenging YouTube videos. The presented tracker simply matches the initial patch of the target in the first frame with candidates in a new frame and returns the most similar patch by a learned matching function. The strength of the matching function comes from being extensively trained generically, i.e., without any data of the target, using a Siamese deep neural network, which we design for tracking. Once learned, the matching function is used as is, without any adapting, to track previously unseen targets. It turns out that the learned matching function is so powerful that a simple tracker built upon it, coined Siamese INstance search Tracker, SINT, which only uses the original observation of the target from the first frame, suffices to reach state-of-the-art performance. Further, we show the proposed tracker even allows for target re-identification after the target was absent for a complete video shot.
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based an adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements and is well suited for problems that are large in terms of data and/or parameters. The method is also ap- propriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 objects types that would be easily recognizable by a 4 year old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Article
The core component of most modern trackers is a discriminative classifier, tasked with distinguishing between the target and the surrounding environment. To cope with natural image changes, this classifier is typically trained with translated and scaled sample patches. Such sets of samples are riddled with redundancies -- any overlapping pixels are constrained to be the same. Based on this simple observation, we propose an analytic model for datasets of thousands of translated patches. By showing that the resulting data matrix is circulant, we can diagonalize it with the Discrete Fourier Transform, reducing both storage and computation by several orders of magnitude. Interestingly, for linear regression our formulation is equivalent to a correlation filter, used by some of the fastest competitive trackers. For kernel regression, however, we derive a new Kernelized Correlation Filter (KCF), that unlike other kernel algorithms has the exact same complexity as its linear counterpart. Building on it, we also propose a fast multi-channel extension of linear correlation filters, via a linear kernel, which we call Dual Correlation Filter (DCF). Both KCF and DCF outperform top-ranking trackers such as Struck or TLD on a 50 videos benchmark, despite running at hundreds of frames-per-second, and being implemented in a few lines of code (Algorithm 1). To encourage further developments, our tracking framework was made open-source.
Acquisition of localization confidence for accurate object detection
  • B Jiang
  • R Luo
  • J Mao
  • T Xiao
  • Y Jiang
B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In ECCV, 2018. 5, 6
Joint representation and truncated inference learning for correlation filter based tracking
  • Y Yao
  • X Wu
  • S Shan
  • W Zuo
Y. Yao, X. Wu, S. Shan, and W. Zuo. Joint representation and truncated inference learning for correlation filter based tracking. In ECCV, 2018. 2
Distractor-aware siamese networks for visual object tracking
  • Z Zhu
  • Q Wang
  • L Bo
  • W Wu
  • J Yan
  • W Hu
Z. Zhu, Q. Wang, L. Bo, W. Wu, J. Yan, and W. Hu. Distractor-aware siamese networks for visual object tracking. In ECCV, 2018. 2, 7, 8 [0.643] ETCOM50 [0.610] ETCOM18 [0.576] ATOM [0.569] SiamRPN++ [0.460] MDNet [0.453] VITAL [0.420] SiamFC [0.418] StructSiam [0.405] DSiam [0.338] ECO