
Visual object tracking via coefficients constrained exclusive group LASSO


Abstract and Figures

Discriminative methods have been widely applied to construct the appearance model for visual tracking. Most existing methods incorporate an online updating strategy to adapt to the appearance variations of targets. The focus of online updating for discriminative methods is to select positive samples that emerged in past frames to represent the appearances. However, the appearances of the positive samples may be very dissimilar to each other, so traditional online updating strategies easily overfit some appearances and neglect the others. To address this problem, we propose an effective method to learn a discriminative template that maintains information about the multiple appearances of the target over long-term variations. Our method is based on the observation that the target appearance varies very little over a certain number of successive video frames; therefore, a few instances can represent the appearances within such a span of frames. We propose an exclusive group sparsity model to describe this observation and provide a novel algorithm, called coefficients constrained exclusive group LASSO, to solve it in a single objective function. Experimental results on the CVPR2013 benchmark datasets demonstrate that our approach achieves promising performance.
The framework of our approach. During the tracking process, we maintain a set of key samples, called the dictionary $\mathbf{D}$, and its corresponding coefficient vector $\boldsymbol{\alpha}$. The atoms in $\mathbf{D}$ are divided into different groups according to their scores. The coefficients of the appearance groups are positive, and those of the negative group are negative. Furthermore, the entries within each group are encouraged to be sparse. In the updating stage, we design the coefficients constrained exclusive group LASSO to solve for $\boldsymbol{\alpha}$. Using the group LASSO, we obtain a compact discriminative template set, which is adopted to find the optimal tracking results.
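The exclusive-group-sparsity idea in the abstract and caption can be summarized by an objective of the following shape. This is a hedged reconstruction, not the paper's exact formulation: the squared intra-group $l_1$ penalty is the standard exclusive LASSO form, the grouping $\mathcal{G}$ and symbols $\mathbf{y}$, $\lambda$ are our notation, and the sign constraints follow the caption's description of positive and negative groups:

```latex
\min_{\boldsymbol{\alpha}} \;
  \|\mathbf{y} - \mathbf{D}\boldsymbol{\alpha}\|_2^2
  + \lambda \sum_{g \in \mathcal{G}} \|\boldsymbol{\alpha}_g\|_1^2
\quad \text{s.t.} \quad
  \alpha_i \ge 0 \ \text{for atoms in appearance groups}, \qquad
  \alpha_i \le 0 \ \text{for atoms in the negative group}.
```

The squared $l_1$ norm within each group penalizes having many nonzero entries in the same group, which matches the caption's statement that intra-group entries are encouraged to be sparse.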
... In order to improve the tracking performance, it is necessary to develop an optimal appearance model of the target [35,36]. Most existing object appearance models can be classified into two camps: appearance models based on conventional hand-crafted features [24,25,27], and appearance models based on CNN features [13,23,34]. ...
Article
Full-text available
The performance of the tracking task directly depends on the appearance features of the target object. Therefore, a robust method for constructing appearance features is crucial for avoiding tracking failure. Tracking methods based on Convolutional Neural Networks (CNNs) have exhibited excellent performance in the past years. However, the features from the original convolutional layers usually represent spatial information but not temporal information; such methods only incorporate temporal information additionally at the testing stage. To remedy this lack in the pretrained network, we train both the spatial features and the temporal information at the pretraining stage. First, the spatial features are trained by domain-wise learning with augmented data to prepare the training data for learning the temporal information. Second, posterior probability maps are calculated by the particle filter and the pretrained model above; these maps are used as the prior and the posterior, corresponding respectively to the input and the output of the final network at the next stage. Third, the temporal information is trained using the augmented image sequences and the probability maps. Experimental results demonstrate that the proposed tracking method outperforms state-of-the-art tracking methods.
... To improve the tracking performance, one may need to address all of these challenges by developing an optimal appearance model of the target [3,4,7]. Most existing object appearance models can be divided into two major categories: appearance models based on conventional hand-crafted features [19][20][21], and appearance models based on CNN features [22][23][24]. ...
... In the past decade, numerous TIR pedestrian tracking methods have been proposed to solve various challenges. Similar to visual object tracking [17]-[25] and grayscale-thermal tracking [26], there are two categories of TIR pedestrian trackers: generative and discriminative. Generative TIR pedestrian trackers focus on modeling the pedestrian's appearance in the current frame and search for the most similar candidate in the next frame. ...
Article
Full-text available
Thermal infrared (TIR) pedestrian tracking is one of the important components among the numerous applications of computer vision, with a major advantage: it can track pedestrians in total darkness. The ability to evaluate TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field; however, no such benchmark dataset existed. In this paper, we develop a TIR pedestrian tracking dataset for TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations, and each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. Moreover, to gain more insight into TIR pedestrian trackers, we divided the tracker into three components: feature extractor, motion model, and observation model. We then conducted three comparison experiments on our benchmark to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
... Thermal InfraRed (TIR) object tracking is an important branch of visual object tracking that has received increasing attention recently. Compared with visual tracking [1,2,3,4,5,6,7,8,9,10], TIR tracking has several advantages, such as insensitivity to illumination and privacy protection. Since TIR tracking methods can track the object in total darkness, they have a wide range of applications, such as video surveillance, maritime rescue, and driver assistance at night [11]. ...
Preprint
Full-text available
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to describe the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is trained only on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities on two convolutional layers using the proposed multi-level similarity network. One focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities; this subnetwork adaptively learns the weights of the two similarities at the training stage. To further enhance the discriminative capacity of the tracker, we construct the first large-scale TIR video sequence dataset for training the proposed model. The proposed TIR dataset not only benefits training for TIR tracking but can also be applied to numerous TIR vision tasks. Extensive experimental results on the VOT-TIR2015 and VOT-TIR2017 benchmarks demonstrate that the proposed algorithm performs favorably against state-of-the-art methods.
... In our work, all matrix-based features such as images need to be pre-transformed into vector-based features by stacking the columns of the corresponding matrix. For a matrix A ∈ R^{m×n}, ‖A‖_F, ‖A‖_1, and ‖A‖_{2,1} denote the Frobenius norm, l_1 norm, and l_{2,1} norm of A, respectively [49][50][51][52]. Tr(⋅) denotes the trace operation. ...
Article
Full-text available
Conventional linear discriminant analysis methods commonly ignore information loss and locality preserving, which greatly limits their performance. To address these issues, we propose a novel discriminant analysis method for feature extraction. Specifically, the proposed method simultaneously exploits local information and label information to guide the projection learning, by constraining the margins of samples from the same class with an adaptively learned weight matrix; this enables the method to obtain a more compact and discriminative projection. To capture as much discriminant information as possible, a variant of the principal component analysis (PCA) term is further introduced to constrain the projection. Besides, to reduce the negative influence of noise and redundant features, a sparse error term and a sparse projection constraint are simultaneously introduced into the framework, which enables the method to adaptively select the important features during feature extraction. Compared with other methods, the proposed method simultaneously holds many good properties, including discriminability, locality, data reconstruction, and feature selection, in one framework, and is robust to noise. These properties encourage the method to perform better than the others. Extensive experimental results on face, object, scene, and noisy databases verify the effectiveness of the proposed method in feature extraction.
... Thermal infrared (TIR) object tracking is often used as a subroutine that plays an important role in these vision tasks. It has several advantages over visual object tracking [3,4,5,6,7,8,9,10,11,12,13,14]. For example, TIR tracking is not sensitive to illumination variation, whereas visual tracking usually fails in poor visibility. ...
Article
Full-text available
Considering the problems of motion blur, partial occlusion, and fast motion in target tracking, a target tracking method based on adaptive structured sparse representation with attention is proposed. Under the particle filtering framework, the performance of high-quality templates is enhanced through an attention mechanism. Structured sparsity is used to build candidate target sets and sparse models between candidate samples and local patches of the target templates. Combined with the sparse residual method, the reconstruction error is reduced. After optimally solving the model, the particle with the highest similarity is selected as the predicted target, and the most appropriate scale is selected according to the multiscale factor method. Experiments show that the proposed algorithm performs strongly when dealing with motion blur, fast motion, and partial occlusion.
Article
The performance of the tracking task directly depends on the appearance features of the target object, so a robust approach for constructing appearance features is crucial for adapting to appearance changes. To construct an accurate and robust appearance model for visual object tracking, we modify the original deep residual learning network architecture and name it Multi-Scale Residual Network (MSResNet). The first video frame and its related information from the current input video sequence are used to learn a multi-scale appearance model of the target object, and a loss function is minimized over the appearance features. Meanwhile, spatial information within each video frame and temporal information between successive video frames are effectively combined with MSResNet. The resulting features, generated by a Multi-Scale Spatio-Temporal Residual Network (the MSSTResNet feature model), can adapt to scale variation, illumination variation, background clutter, severe deformation of the target object, and so on. We implement a robust tracking method based on the tracking-learning-detection framework using the proposed MSSTResNet feature model and name it the MSSTResNet-TLD tracker. Unlike previous tracking methods, the MSResNet architecture is not pre-trained offline on large auxiliary datasets but is directly learned end-to-end with a multi-task loss on the current input video sequence. Furthermore, the multi-task loss function combines a classification loss and a regression loss, which is more accurate for target localization. Our experimental results demonstrate that the proposed tracking method outperforms current state-of-the-art tracking methods on the Visual Object Tracking Benchmark (VOT-2016), Object Tracking Benchmark (OTB-2015), and Unmanned Aerial Vehicles (UAV20L) test datasets. Furthermore, our MSSTResNet-TLD tracker is faster than most previous trackers based on deep Convolutional Neural Networks (ConvNets or CNNs), and it is extremely robust to tiny target objects. Our source code is available for download at https://github.com/binger1225/MSSTResNet-TLD-Tracker.
Article
Correlation filters (CFs) have demonstrated good performance in visual tracking. However, the base training sample region is larger than the object region and includes an interference region (IR). IRs in the training samples generated by cyclic shifts of the base training sample severely degrade the quality of the tracking model. In this paper, a region-filtering correlation tracking (RFCT) algorithm is proposed to address this problem. In this algorithm, we filter the training samples by introducing a spatial map into the standard CF formulation. Compared with existing correlation filter trackers, the proposed tracker has the following advantages. (1) Using a spatial map, the correlation filter can be learned on a larger search region without interference from the IR. (2) Because the training samples are processed by a spatial map, it offers a more general way to control background and target information in the training samples; in addition, a better spatial map, whose values are not restricted, can be explored. Quantitative evaluations are performed on four benchmark datasets: OTB-2013, OTB-2015, VOT2015, and VOT2016. Experimental results demonstrate that the proposed RFCT algorithm performs favorably against several state-of-the-art methods.
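The spatial-map idea can be illustrated with a MOSSE-style single-channel correlation filter, where a map `m` multiplies the training sample before the closed-form Fourier-domain solve. This is a toy sketch, not the RFCT formulation; the function names, the Gaussian response, and the trivial all-ones map are our own assumptions:

```python
import numpy as np

def train_cf(x, y, m, lam=1e-2):
    """Learn a single-channel correlation filter in the Fourier domain.

    x : 2-D training patch, m : spatial map down-weighting the interference
    region (hypothetical; all-ones here), y : desired Gaussian response.
    Returns the filter in the frequency domain (ridge-regression closed form).
    """
    X = np.fft.fft2(m * x)                 # spatial map filters the sample first
    Y = np.fft.fft2(y)
    return (np.conj(X) * Y) / (np.conj(X) * X + lam)

def detect(H, z):
    """Correlate the filter with a search patch; the response peak locates the target."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(z)))

# Toy run: the response to the training sample itself peaks at the target center.
n = 32
g = np.arange(n)
yy, xx = np.meshgrid(g, g, indexing="ij")
y = np.exp(-((yy - n // 2) ** 2 + (xx - n // 2) ** 2) / 8.0)  # desired response
rng = np.random.default_rng(1)
x = rng.standard_normal((n, n))
m = np.ones((n, n))                        # trivial map: nothing filtered out
H = train_cf(x, y, m)
resp = detect(H, m * x)
```

With a non-trivial `m` (small values over the interference region), the same closed form would learn the filter on the filtered sample, which is the spirit of the region-filtering idea described above.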
Article
Full-text available
In most tracking approaches, a score function is utilized to determine the optimal candidate by measuring the similarity between the candidate and the template. However, selecting representative samples for the template update is challenging. To address this problem, in this paper we treat the template as a linear combination of representative samples and propose a novel approach to select representative samples based on a coefficient-constrained model. We formulate the objective function as a non-negative least-squares problem and obtain the solution using standard non-negative least squares. The experimental results show that the observation module of our approach outperforms several other observation modules under the same feature and motion modules, such as the support vector machine, logistic regression, ridge regression, and structured-output support vector machine.
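The template-update formulation above reduces to non-negative least squares. A minimal sketch with a generic projected-gradient solver (our own illustration, not the paper's implementation; the toy dictionary and coefficients are fabricated):

```python
import numpy as np

def nnls_pg(D, y, n_iter=2000):
    """Solve min_a ||D a - y||^2 subject to a >= 0 by projected gradient descent."""
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - y)
        a = np.maximum(a - grad / L, 0.0)  # gradient step + projection onto a >= 0
    return a

# Toy example: y is an exact non-negative combination of the columns of D.
rng = np.random.default_rng(0)
D = np.abs(rng.standard_normal((8, 3)))    # stand-in for representative samples
a_true = np.array([0.5, 0.0, 2.0])
y = D @ a_true                             # stand-in for the template
a_hat = nnls_pg(D, y)
```

In practice one would use a dedicated NNLS routine (e.g. an active-set solver); the projected-gradient loop is just the shortest self-contained way to show the constraint.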
Article
Full-text available
In the past years, discriminative methods have been popular in visual tracking. The main idea of a discriminative method is to learn a classifier to distinguish the target from the background, and the key step is the update of the classifier. Usually, the tracked results are chosen as the positive samples to update the classifier; when the tracked results are not accurate, the update fails and the tracker drifts away from the target. Additionally, a large number of training samples would hinder the online updating of the classifier without an appropriate sample selection strategy. To address the drift problem, we propose a score function to predict the optimal candidate directly instead of learning a classifier. Furthermore, to handle the large number of training samples, we design a sparsity-constrained sample selection strategy to choose representative support samples at the updating stage. To evaluate the effectiveness and robustness of the proposed method, we conduct experiments on the object tracking benchmark and 12 challenging sequences. The results demonstrate that our approach achieves promising performance.
Article
Full-text available
Robustness and efficiency are the two main goals of existing trackers. Most robust trackers are implemented with combined features or models, accompanied by a high computational cost. To achieve robust and efficient tracking, we propose a multi-view correlation tracker. On one hand, the robustness of the tracker is enhanced by the multi-view model, which fuses several features and selects the more discriminative ones for tracking. On the other hand, the correlation filter framework provides fast training and efficient target locating. The multiple features are fused at the model level of the correlation filter, which is effective and efficient. In addition, we introduce a simple but effective scale-variation detection mechanism, which strengthens the stability of tracking under scale variation. We evaluate our tracker on the online tracking benchmark (OTB) and two visual object tracking benchmarks (VOT2014, VOT2015); these three datasets contain more than 100 video sequences in total. On all three datasets, the proposed approach achieves promising performance.
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
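The HOG pipeline described above can be illustrated with a stripped-down per-cell orientation histogram. This omits block normalization and overlapping blocks, so it is only a rough sketch of the first stages; the function name, cell size, and test image are our own:

```python
import numpy as np

def hog_cells(img, cell=8, bins=9):
    """Minimal HOG-style descriptor: per-cell histograms of unsigned gradient
    orientations, weighted by gradient magnitude (no block normalization)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0          # unsigned orientation
    h, w = img.shape
    H = np.zeros((h // cell, w // cell, bins))
    for i in range(h // cell):
        for j in range(w // cell):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            a = ang[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            idx = np.minimum((a / (180.0 / bins)).astype(int), bins - 1)
            np.add.at(H[i, j], idx, m)                    # magnitude-weighted vote
    return H

# A vertical step edge yields gradients of a single (horizontal) orientation,
# so every cell's histogram concentrates in the first orientation bin.
img = np.zeros((16, 16))
img[:, 8:] = 1.0
H = hog_cells(img, cell=8, bins=9)
```

A full HOG implementation would add trilinear interpolation of votes and L2 block normalization over overlapping 2x2 cell blocks, which the abstract identifies as important for performance.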
Conference Paper
Training example collection is of great importance for discriminative trackers. Most existing algorithms use a sampling-and-labeling strategy, and treat the training example collection as a task that is independent of classifier learning. However, the examples collected directly by sampling are not guaranteed to be useful for classifier learning, and updating the classifier with these examples might introduce ambiguity to the tracker. In this paper, we introduce an active example selection stage between sampling and labeling, and propose a novel online object tracking algorithm that explicitly couples the objectives of semi-supervised learning and example selection. Our method uses Laplacian Regularized Least Squares (LapRLS) to learn a robust classifier that can sufficiently exploit unlabeled data and preserve the local geometrical structure of the feature space. To ensure high classification confidence of the classifier, we propose an active example selection approach to automatically select the most informative examples for LapRLS. The selected examples that satisfy strict constraints are labeled to enhance the adaptivity of our tracker, which provides robust supervisory information to guide semi-supervised learning. With active example selection, we are able to avoid the ambiguity introduced by an independent example collection strategy, and to alleviate the drift problem caused by misaligned examples. Comparison with the state-of-the-art trackers on the comprehensive benchmark demonstrates that our tracking algorithm is more effective and accurate.
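The LapRLS classifier at the core of this method has a primal closed form. A toy sketch under our own notation: the adjacency `W`, regularization weights, and the two-cluster data are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def laprls(X, y, labeled, W, lam=1e-2, gam=1e-2):
    """Laplacian Regularized Least Squares, primal closed form (a sketch).

    Minimizes ||X_l w - y||^2 + lam ||w||^2 + gam w^T X^T L X w,
    where X holds all (labeled + unlabeled) examples as rows, labeled indexes
    the labeled rows, and L is the graph Laplacian built from adjacency W.
    """
    L = np.diag(W.sum(axis=1)) - W                 # unnormalized graph Laplacian
    Xl = X[labeled]
    A = Xl.T @ Xl + lam * np.eye(X.shape[1]) + gam * X.T @ L @ X
    return np.linalg.solve(A, Xl.T @ y)

# Two clusters; one labeled point per cluster, one unlabeled neighbor each.
X = np.array([[1.0, 0.0], [1.1, 0.0], [0.0, 1.0], [0.0, 1.1]])
W = np.zeros((4, 4))
W[0, 1] = W[1, 0] = 1.0                            # edge inside cluster 1
W[2, 3] = W[3, 2] = 1.0                            # edge inside cluster 2
w = laprls(X, np.array([1.0, -1.0]), [0, 2], W)
scores = X @ w                                     # unlabeled points inherit the cluster label sign
```

The Laplacian term is what lets the unlabeled examples shape the decision function, which is why the abstract stresses exploiting unlabeled data and preserving local geometry.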
Recently, sparse representation has been applied to visual tracking to find the target with the minimum reconstruction error from the target template subspace. Though effective, these L1 trackers require high computational costs due to the numerous calculations needed for L1 minimization. In addition, the inherent occlusion insensitivity of the L1 minimization has not been fully utilized. In this paper, we propose an efficient L1 tracker with a minimum error bound and occlusion detection, which we call the Bounded Particle Resampling (BPR)-L1 tracker. First, the minimum error bound is quickly calculated from a linear least-squares equation and serves as a guide for particle resampling in a particle filter framework. Without loss of precision during resampling, most insignificant samples are removed before solving the computationally expensive L1 minimization function. The BPR technique enables us to speed up the L1 tracker without sacrificing accuracy. Second, we perform occlusion detection by investigating the trivial coefficients in the L1 minimization. These coefficients, by design, contain rich information about image corruptions, including occlusion. Detected occlusions enhance the template updates to effectively reduce the drifting problem. The proposed method shows good performance as compared with several state-of-the-art trackers on challenging benchmark sequences.
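The minimum error bound rests on a simple fact: the unconstrained least-squares residual lower-bounds the reconstruction error of any coefficient vector, in particular the L1-minimizing one, so particles whose bound already exceeds the current best error can be discarded before the expensive L1 solve. A small numerical check (the template matrix is random and the "sparse" vector is a stand-in, not an actual L1 solution):

```python
import numpy as np

rng = np.random.default_rng(2)
T = rng.standard_normal((20, 5))            # stand-in template (dictionary) matrix
y = rng.standard_normal(20)                 # stand-in observed candidate patch

# Minimum error bound: the unconstrained least-squares residual is the
# smallest reconstruction error achievable by ANY coefficient vector.
c_ls, *_ = np.linalg.lstsq(T, y, rcond=None)
bound = np.linalg.norm(T @ c_ls - y)

# Any other coefficient vector (e.g. a sparse one from L1 minimization)
# can only reconstruct y with at least this much error.
c_sparse = np.zeros(5)
c_sparse[0] = 1.0
err = np.linalg.norm(T @ c_sparse - y)
```

Because the bound comes from one cheap linear solve per particle, it filters the particle set before any L1 minimization is run, which is the source of the speed-up described above.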
Article
Nonlocal image representation methods, including group-based sparse coding and BM3D, have shown great performance in low-level vision tasks. The nonlocal prior is extracted from each group, which consists of patches with similar intensities. Grouping patches based on intensity similarity, however, gives rise to disturbance and inaccuracy in estimating the true image. To address this problem, we propose a structure-based low-rank model with graph nuclear norm regularization. We exploit the local manifold structure inside a patch and group the patches by a manifold-structure distance metric. With the manifold structure information, a graph nuclear norm regularization is established and incorporated into a low-rank approximation model. We then prove that the graph-based regularization is equivalent to a weighted nuclear norm and that the proposed model can be solved by a weighted singular-value thresholding algorithm. Extensive experiments on additive white Gaussian noise removal and mixed noise removal demonstrate that the proposed method achieves better performance than several state-of-the-art algorithms.
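Weighted singular-value thresholding, mentioned as the solver here, shrinks each singular value by its weight. A minimal sketch (the uniform weights and the toy rank-1-plus-noise matrix are our own assumptions, not the paper's graph-derived weights):

```python
import numpy as np

def weighted_svt(Y, w):
    """Weighted singular-value thresholding: soft-threshold each singular
    value of Y by its weight w_i (the proximal operator of the weighted
    nuclear norm for non-descending weights)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s = np.maximum(s - w, 0.0)              # shrink each singular value by its weight
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(3)
# Rank-1 signal plus small noise; thresholding wipes out the noise directions.
Y = np.outer(rng.standard_normal(10), rng.standard_normal(10)) \
    + 0.01 * rng.standard_normal((10, 10))
X = weighted_svt(Y, w=np.full(10, 0.5))
s_after = np.linalg.svd(X, compute_uv=False)
```

In the paper's setting the weights would come from the graph nuclear norm rather than being uniform, but the thresholding step itself takes this form.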
Article
Changes in wound area over multiple weeks are highly predictive of the wound healing process, and a big-data eHealth system would be very helpful in evaluating these changes. Images of the wound bed are usually analyzed to diagnose the injury; unfortunately, accurately measuring changes in the wound region from images is difficult, since many factors affect image quality, such as intensity inhomogeneity and color distortion. To this end, we propose a fast level-set-model-based method for intensity inhomogeneity correction and a spectral-properties-based color correction method to overcome these obstacles. State-of-the-art level set methods can segment objects well but are time-consuming and inefficient. In contrast to conventional approaches, the proposed model integrates a new signed energy force function that can detect contours at weak or blurred edges efficiently; it ensures the smoothness of the level set function and reduces the computational complexity of re-initialization. To further increase the speed of the algorithm, we also include an additive operator-splitting algorithm in our fast level set model. In addition, we use the camera, lighting, and spectral properties to recover the actual color. Synthetic and real-world images demonstrate the advantages of the proposed method over state-of-the-art methods. Experimental results also show that the proposed model is at least twice as fast as widely used methods.