Visual object tracking via coefficients constrained exclusive group LASSO


Abstract and Figures

Discriminative methods have been widely applied to construct the appearance model for visual tracking. Most existing methods incorporate an online updating strategy to adapt to the appearance variations of targets. The focus of online updating for discriminative methods is to select the positive samples that emerged in past frames to represent the appearances. However, the appearances of positive samples might be very dissimilar to each other; traditional online updating strategies easily overfit on some appearances and neglect the others. To address this problem, we propose an effective method to learn a discriminative template, which maintains the multiple-appearance information of targets over long-term variations. Our method is based on the observation that the target appearance varies very little over a certain number of successive video frames. Therefore, we can use a few instances to represent the appearances within the scope of those successive frames. We propose exclusive group sparsity to describe this observation and provide a novel algorithm, called coefficients constrained exclusive group LASSO, to solve it in a single objective function. The experimental results on the CVPR2013 benchmark datasets demonstrate that our approach achieves promising performance.
The framework of our approach. During the tracking process, we maintain a set of key samples, called the dictionary $\mathbf{D}$, and its corresponding coefficient vector $\boldsymbol{\alpha}$. The atoms in $\mathbf{D}$ are divided into different groups according to the scores. For the appearance groups, the corresponding coefficients are positive, and the coefficients of the negative group are negative. Furthermore, the entries within each group are encouraged to be sparse. In the updating stage, we design the coefficients constrained exclusive group LASSO to solve for $\boldsymbol{\alpha}$. Utilizing the group LASSO, we can obtain a compact discriminative template set, which is adopted to find the optimal tracked results.
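The abstract and caption do not give the objective function, so the following is only a minimal sketch of one common form of a sign-constrained exclusive group LASSO, not the authors' algorithm. It assumes a squared reconstruction loss, the usual exclusive penalty (the sum over groups of the squared L1 norm of each group's coefficients), and a plain projected gradient solver. The names `solve_constrained_egl`, `exclusive_group_penalty`, `signs`, and `lam` are hypothetical.

```python
import numpy as np

def exclusive_group_penalty(alpha, groups):
    """Sum over groups of the squared L1 norm of that group's coefficients."""
    return sum(np.abs(alpha[g]).sum() ** 2 for g in groups)

def solve_constrained_egl(D, y, groups, signs, lam=0.1, iters=2000):
    """Projected gradient descent (an assumed solver) for

        min_alpha ||D @ alpha - y||^2 + lam * sum_g ||alpha_g||_1^2
        subject to signs[i] * alpha[i] >= 0.

    Writing alpha = signs * beta with beta >= 0 turns each group's L1 norm
    into a plain sum of beta, so the penalty becomes a smooth quadratic.
    """
    n = D.shape[1]
    A = D * signs  # flip columns by sign, so A @ beta == D @ alpha
    # Group-indicator matrix: (G @ beta)[k] is the sum of beta over group k.
    G = np.zeros((len(groups), n))
    for k, g in enumerate(groups):
        G[k, g] = 1.0
    H = 2.0 * (A.T @ A + lam * G.T @ G)       # Hessian of the objective
    step = 1.0 / np.linalg.eigvalsh(H)[-1]    # 1 / largest eigenvalue
    beta = np.zeros(n)
    for _ in range(iters):
        grad = H @ beta - 2.0 * A.T @ y
        beta = np.maximum(beta - step * grad, 0.0)  # project onto beta >= 0
    return signs * beta
```

With `signs` set to +1 for atoms in appearance groups and -1 for atoms in the negative group, the returned coefficients respect the sign pattern described in the caption, and the squared-L1 penalty makes atoms inside the same group compete, which encourages intra-group sparsity.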
... Thermal infrared (TIR) object tracking is often used as a subroutine that plays an important role in these vision tasks. It has several advantages over visual object tracking [3,4,5,6,7,8,9,10,11,12,13,14]. For example, TIR tracking is not sensitive to variation of the illumination, whereas visual tracking usually fails in poor visibility. ...
Most thermal infrared (TIR) tracking methods are discriminative, treating the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class difference of the arbitrary objects, while the tracking task mainly deals with the within-class difference of the same objects. In this paper, we cast the TIR tracking problem as a similarity verification task, which is coupled well to the objective of the tracking task. We propose a TIR tracker via a Hierarchical Spatial-aware Siamese Convolutional Neural Network (CNN), named HSSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN that coalesces the multiple hierarchical convolutional layers. Then, we propose a spatial-aware network to enhance the discriminative ability of the coalesced hierarchical feature. Subsequently, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before we transfer the network into the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and target candidates. Finally, we locate the candidate that is most similar to the tracked target. Extensive experimental results on the benchmarks VOT-TIR 2015 and VOT-TIR 2016 show that our proposed method achieves favourable performance compared to the state-of-the-art methods.
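The final two steps of the abstract above (scoring candidates against the template, then picking the most similar one) can be sketched as follows. This assumes the Siamese network has already produced flattened feature vectors, and it uses cosine similarity as a stand-in, since the abstract does not state the similarity function. The function names are hypothetical.

```python
import numpy as np

def cosine_similarity(template, candidates):
    """Cosine similarity between one template vector and each candidate row."""
    t = template / (np.linalg.norm(template) + 1e-12)
    c = candidates / (np.linalg.norm(candidates, axis=1, keepdims=True) + 1e-12)
    return c @ t

def locate_target(template, candidates):
    """Return the index of the most similar candidate and all scores."""
    scores = cosine_similarity(template, candidates)
    return int(np.argmax(scores)), scores
```

Casting tracking as similarity verification means no classifier is retrained online: the embedding network is fixed, and only this comparison runs per frame.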
... In contrast, discriminative tracking algorithms have received unprecedented research interest in the last decade. Most discriminative algorithms follow the tracking-by-detection paradigm [6,7,8,9,10,11], which treats the tracking task as a detection problem. They employ a classifier or a regressor to process both target and background representations, and produce an optimal decision boundary that can efficiently discriminate the target from the background. ...
In this paper, a novel circular and structural operator tracker (CSOT) is proposed for high-performance visual tracking. It not only possesses the powerful discriminative capability of SOSVM but also inherits the superior computational efficiency of DCF. Based on the proposed circular and structural operators, a set of primal confidence score maps can be obtained by circularly correlating feature maps with their corresponding structural correlation filters. Furthermore, an implicit interpolation is applied to convert the multi-resolution feature maps to the continuous domain and make all primal confidence score maps have the same spatial resolution. Then, we exploit an efficient ensemble post-processor based on relative entropy, which can coalesce primal confidence score maps and create an optimal confidence score map for more accurate localization. The target is localized at the peak of the optimal confidence score map. In addition, we introduce a collaborative optimization strategy to update the circular and structural operators by iteratively training structural correlation filters, which significantly reduces computational complexity and improves robustness. Experimental results demonstrate that our approach achieves state-of-the-art performance, with mean AUC scores of 71.5% and 69.4% on the OTB-2013 and OTB-2015 benchmarks respectively, and obtains the third-best expected average overlap (EAO) score of 29.8% on the VOT-2017 benchmark.
... In order to improve the tracking performance, it is required to develop an optimal appearance model of object targets [35,36]. Most of the existing object appearance models can be classified into two camps: appearance models based on conventional hand-crafted features [24,25,27], and appearance models based on CNN features [13,23,34]. ...
The performance of the tracking task directly depends on target object appearance features. Therefore, a robust method for constructing appearance features is crucial for avoiding tracking failure. The tracking methods based on Convolutional Neural Network (CNN) have exhibited excellent performance in the past years. However, the features from each original convolutional layer can usually represent spatial information, but not temporal information; existing methods use temporal information only as an addition at the testing stage. To address this lack of prediction in the pretrained networks, we train both the spatial features and the temporal information at the pretraining stage. Firstly, the spatial features are trained by domain-wise learning with augmented data, which prepares the training data for learning the temporal information. Secondly, the posterior probability maps are calculated by the particle filter and the above pretrained model. The posterior probability maps are used as the prior and the posterior, corresponding respectively to the input and the output of the final network at the next stage. Thirdly, the temporal information is trained by using the augmented image sequences and the probability maps. The experimental results demonstrate that the proposed tracking method outperforms the state-of-the-art tracking methods.
... To improve the tracking performance, one may need to address all of these challenges by developing an optimal appearance model of object targets [3,4,7]. Most of the existing object appearance models can be divided into two major categories: appearance models based on conventional hand-crafted features [19][20][21], and appearance models based on CNN features [22][23][24]. ...
... In the past decade, numerous TIR pedestrian tracking methods have been proposed to solve various challenges. Similar to visual object tracking [17]-[25] and grayscale-thermal tracking [26], there are two categories of TIR pedestrian trackers: generative and discriminative. Generative TIR pedestrian trackers focus on modeling the pedestrian's appearance in the current frame and search for the most similar candidates in the next frame. ...
Thermal infrared (TIR) pedestrian tracking is an important component of numerous computer vision applications, and it has a major advantage: it can track pedestrians in total darkness. The ability to evaluate TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field. However, no such benchmark dataset exists. In this paper, we develop a TIR pedestrian tracking dataset for TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations. Each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark dataset using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. In addition, in order to gain more insight into the TIR pedestrian tracker, we divided its functions into three components: feature extractor, motion model, and observation model. Then, we conducted three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
... Thermal InfraRed (TIR) object tracking is an important branch of visual object tracking, which has received increasing attention recently. Compared with visual tracking [1,2,3,4,5,6,7,8,9,10], TIR tracking has several advantages, such as illumination insensitivity and privacy protection. Since the TIR tracking method can track the object in total darkness, it has a wide range of applications such as video surveillance, maritime rescue, and driver assistance at night [11]. ...
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to describe the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is only trained on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities on two convolutional layers using the proposed multi-level similarity network. One of them focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative entropy based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptively learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we construct the first large-scale TIR video sequence dataset for training the proposed model. The proposed TIR dataset not only benefits the training for TIR tracking but also can be applied to numerous TIR vision tasks. Extensive experimental results on the VOT-TIR2015 and VOT-TIR2017 benchmarks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
The aging of the population is one of the unavoidable practical problems of today's society, and the deepening of aging in China also brings varying degrees of mental health problems, especially in nursing homes. This study mainly discusses the emotion detection of elderly people in nursing homes based on AI robot vision. It adopts the emotional research method of measuring the physical and psychological indicators of the elderly in nursing homes. Through comprehensive comparison and analysis of objective physiological indicators such as heart rate and blood pressure, as well as the final measured data of positive and negative emotion scales, elderly negative emotion scales in nursing homes, and score scales, it quantitatively evaluates and tests the effectiveness of the selected optimization design method, and then reveals the mechanism by which changes in the interior space design of nursing homes influence the emotional health of their elderly residents. Combining the information on older adults in their personal files with daily activities and physiological indicators, this paper designs and implements a system that can monitor the negative emotions of the elderly, attend to their mental health in the aging society, and help serve the elderly better. In this paper, OpenCV is used to realize the video surveillance function. The system consists of three parts: cloud server, elderly client and nurse client. The cloud server is responsible for processing data and storing models. The client for the older adults provides mental health-related services for the elderly in nursing institutions, and the client for the nursing staff provides assistance for the nursing staff to carry out nursing work. The AI emotion detection scene designed in this study is better than manual intervention in triggering positive emotions and improving emotions in the elderly (p < 0.01). This research helps to promote the wide use of AI technology in institutions for the elderly and helps aging populations, especially those who have mental health problems, get good emotional experiences.
Considering the problems of motion blur, partial occlusion and fast motion in target tracking, a target tracking method based on adaptive structured sparse representation with attention is proposed. Under the framework of particle filtering, the performance of high-quality templates is enhanced through an attention mechanism. Structured sparsity is used to build candidate target sets and sparse models between candidate samples and local patches of target templates. Combined with the sparse residual method, reconstruction error is reduced. After optimally solving the model, the particle with the highest similarity is selected as the prediction target. The most appropriate scale is selected according to the multiscale factor method. Experiments show that the proposed algorithm performs strongly when dealing with motion blur, fast motion, and partial occlusion.
The performance of a tracking task depends directly on the appearance features of the target object, so a robust approach for constructing appearance features is crucial for adapting to appearance change. To construct an accurate and robust appearance model for visual object tracking, we modify the original deep residual learning network architecture and name it Multi-Scale Residual Network (MSResNet). The first video frame image and its related information of the current input video sequence are used to learn a multi-scale appearance model of the target object, and a loss function is minimized over the appearance features. Meanwhile, spatial information of each video frame and temporal information between successive video frames are effectively combined with MSResNet. Thus the features generated by the Multi-Scale Spatio-Temporal Residual Network, which we name the MSSTResNet feature model, can adapt to scale variation, illumination variation, background clutter, severe deformation of the target object, and so on. We implement a robust tracking method based on the tracking-learning-detection framework by using our proposed MSSTResNet feature model and name it the MSSTResNet-TLD tracker. Unlike previous tracking methods, the MSResNet architecture is not pre-trained offline on a large auxiliary dataset but is learned directly end-to-end with a multi-task loss by using the current input video sequence. Furthermore, the multi-task loss function utilizes the classification loss and regression loss, which is more accurate for target localization. Our experimental results demonstrate that the proposed tracking method outperforms the current state-of-the-art tracking methods on the Visual Object Tracking Benchmark (VOT-2016), Object Tracking Benchmark (OTB-2015), and Unmanned Aerial Vehicles (UAV20L) test datasets. Furthermore, our MSSTResNet-TLD tracker is faster than most previous trackers based on deep Convolutional Neural Network (ConvNet or CNN), and our tracker is extremely robust to tiny target objects. Our source code is available for download at
In most tracking approaches, a score function is utilized to determine which candidate is the optimal one by measuring the similarity between the candidate and the template. However, the selection of representative samples in the template update is challenging. To address this problem, in this paper, we treat the template as a linear combination of representative samples and propose a novel approach to select representative samples based on the coefficient constrained model. We formulate the objective function as a non-negative least squares problem and obtain the solution using a standard non-negative least squares solver. The experimental results show that the observation module of our approach outperforms several other observation modules under the same feature and motion module, such as support vector machine, logistic regression, ridge regression and structured output support vector machine.
In the past years, discriminative methods have been popular in visual tracking. The main idea of the discriminative method is to learn a classifier to distinguish the target from the background. The key step is the update of the classifier. Usually, the tracked results are chosen as the positive samples to update the classifier, so the classifier update fails when the tracked results are not accurate. After that, the tracker will drift away from the target. Additionally, without an appropriate sample selection strategy, a large number of training samples would hinder the online updating of the classifier. To address the drift problem, we propose a score function to predict the optimal candidate directly instead of learning a classifier. Furthermore, to solve the problem of a large number of training samples, we design a sparsity-constrained sample selection strategy to choose some representative support samples from the large number of training samples at the updating stage. To evaluate the effectiveness and robustness of the proposed method, we conduct experiments on the object tracking benchmark and 12 challenging sequences. The experimental results demonstrate that our approach achieves promising performance.
Robustness and efficiency are the two main goals of existing trackers. Most robust trackers are implemented with combined features or models accompanied by a high computational cost. To achieve robust and efficient tracking performance, we propose a multi-view correlation tracker. On one hand, the robustness of the tracker is enhanced by the multi-view model, which fuses several features and selects the more discriminative features for tracking. On the other hand, the correlation filter framework provides fast training and efficient target localization. The multiple features are fused at the model level of the correlation filter, which is both effective and efficient. In addition, we propose a simple but effective scale-variation detection mechanism, which strengthens the stability of tracking under scale variation. We evaluate our tracker on the online tracking benchmark (OTB) and two visual object tracking benchmarks (VOT2014, VOT2015). These three datasets contain more than 100 video sequences in total. On all three datasets, the proposed approach achieves promising performance.
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
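To make the stages named in this abstract concrete, here is a minimal sketch of the two innermost steps: a magnitude-weighted orientation histogram for one cell, and L2 normalization of a block of such histograms. The centered-difference gradient, unsigned 0 to 180 degree orientations, 9 bins, and hard (non-interpolated) bin assignment are assumptions chosen for brevity; the abstract itself does not fix these values.

```python
import numpy as np

def cell_orientation_histogram(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations."""
    cell = np.asarray(cell, dtype=float)
    gx = np.zeros_like(cell)
    gy = np.zeros_like(cell)
    # Centered differences in the interior; border gradients are left at zero.
    gx[:, 1:-1] = cell[:, 2:] - cell[:, :-2]
    gy[1:-1, :] = cell[2:, :] - cell[:-2, :]
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    bins = np.minimum((angle / (180.0 / n_bins)).astype(int), n_bins - 1)
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

def l2_normalize(block_histogram, eps=1e-6):
    """Contrast-normalize the concatenated histograms of one block."""
    v = np.asarray(block_histogram, dtype=float)
    return v / np.sqrt((v ** 2).sum() + eps ** 2)
```

A full descriptor would tile the detection window into cells, group neighboring cells into overlapping blocks, normalize each block, and concatenate the results before passing them to the linear SVM.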
Conference Paper
Training example collection is of great importance for discriminative trackers. Most existing algorithms use a sampling-and-labeling strategy, and treat the training example collection as a task that is independent of classifier learning. However, the examples collected directly by sampling are not intended to be useful for classifier learning. Updating the classifier with these examples might introduce ambiguity to the tracker. In this paper, we introduce an active example selection stage between sampling and labeling, and propose a novel online object tracking algorithm which explicitly couples the objectives of semi-supervised learning and example selection. Our method uses Laplacian Regularized Least Squares (LapRLS) to learn a robust classifier that can sufficiently exploit unlabeled data and preserve the local geometrical structure of feature space. To ensure the high classification confidence of the classifier, we propose an active example selection approach to automatically select the most informative examples for LapRLS. Part of the selected examples that satisfy strict constraints are labeled to enhance the adaptivity of our tracker, which actually provides robust supervisory information to guide semi-supervised learning. With active example selection, we are able to avoid the ambiguity introduced by an independent example collection strategy, and to alleviate the drift problem caused by misaligned examples. Comparison with the state-of-the-art trackers on the comprehensive benchmark demonstrates that our tracking algorithm is more effective and accurate.
Recently, sparse representation has been applied to visual tracking to find the target with the minimum reconstruction error from the target template subspace. Though effective, these L1 trackers require high computational costs due to numerous calculations for L1 minimization. In addition, the inherent occlusion insensitivity of the L1 minimization has not been fully utilized. In this paper, we propose an efficient L1 tracker with minimum error bound and occlusion detection which we call Bounded Particle Resampling (BPR)-L1 tracker. First, the minimum error bound is quickly calculated from a linear least squares equation, and serves as a guide for particle resampling in a particle filter framework. Without loss of precision during resampling, most insignificant samples are removed before solving the computationally expensive L1 minimization function. The BPR technique enables us to speed up the L1 tracker without sacrificing accuracy. Second, we perform occlusion detection by investigating the trivial coefficients in the L1 minimization. These coefficients, by design, contain rich information about image corruptions including occlusion. Detected occlusions enhance the template updates to effectively reduce the drifting problem. The proposed method shows good performance as compared with several state-of-the-art trackers on challenging benchmark sequences.
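The idea behind the minimum error bound can be sketched as follows. For a fixed template matrix, the unconstrained least squares residual can never be larger than the residual of any other coefficient vector, including the L1-regularized one, so it gives a cheap lower bound on a candidate's reconstruction error. The fixed threshold `max_error` and the function names are hypothetical simplifications; the abstract describes the bound as a guide for particle resampling rather than a hard cut-off.

```python
import numpy as np

def least_squares_error_bound(T, y):
    """Lower bound on the squared reconstruction error of y from templates T."""
    coef, *_ = np.linalg.lstsq(T, y, rcond=None)
    residual = y - T @ coef
    return float(residual @ residual)

def prune_candidates(T, candidates, max_error):
    """Indices of candidates whose lower bound does not exceed max_error."""
    return [i for i, y in enumerate(candidates)
            if least_squares_error_bound(T, y) <= max_error]
```

Only the candidates that survive this test need to be passed to the expensive L1 solver, which is where the speed-up comes from.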
Nonlocal image representation methods, including group-based sparse coding and BM3D, have shown their great performance in application to low-level tasks. The nonlocal prior is extracted from each group consisting of patches with similar intensities. Grouping patches based on intensity similarity, however, gives rise to disturbance and inaccuracy in estimation of the true images. To address this problem, we propose a structure-based low-rank model with graph nuclear norm regularization. We exploit the local manifold structure inside a patch and group the patches by the distance metric of manifold structure. With the manifold structure information, a graph nuclear norm regularization is established and incorporated into a low-rank approximation model. We then prove that the graph-based regularization is equivalent to a weighted nuclear norm and the proposed model can be solved by a weighted singular-value thresholding algorithm. Extensive experiments on additive white Gaussian noise removal and mixed noise removal demonstrate that the proposed method achieves better performance than several state-of-the-art algorithms.
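The thresholding step mentioned at the end of this abstract can be sketched as below: each singular value of a patch-group matrix is shrunk by its own weight and clipped at zero. How the paper derives the weights from the graph regularizer is not stated in the abstract, so `weights` is a hypothetical input here, as is the function name.

```python
import numpy as np

def weighted_svt(Y, weights):
    """Shrink each singular value of Y by its own weight, clipping at zero."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - np.asarray(weights, dtype=float), 0.0)
    return (U * s_shrunk) @ Vt   # scale the columns of U, then recompose
```

Larger weights on the small singular values suppress noise-dominated components more aggressively while leaving the dominant structure largely intact.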
Wound area changes over multiple weeks are highly predictive of the wound healing process. A big data eHealth system would be very helpful in evaluating these changes. We usually analyze images of the wound bed for diagnosing injury. Unfortunately, accurate measurements of wound region changes from images are difficult. Many factors affect the quality of images, such as intensity inhomogeneity and color distortion. To this end, we propose a fast level set model-based method for intensity inhomogeneity correction and a spectral properties-based color correction method to overcome these obstacles. State-of-the-art level set methods can segment objects well. However, such methods are time-consuming and inefficient. In contrast to conventional approaches, the proposed model integrates a new signed energy force function that can detect contours at weak or blurred edges efficiently. It ensures the smoothness of the level set function and reduces the computational complexity of re-initialization. To increase the speed of the algorithm further, we also include an additive operator-splitting algorithm in our fast level set model. In addition, we consider using a camera, lighting, and spectral properties to recover the actual color. Numerical synthetic and real-world images demonstrate the advantages of the proposed method over state-of-the-art methods. Experimental results also show that the proposed model is at least twice as fast as methods used widely.