Article

Visual object tracking with adaptive structural convolutional network

Abstract

Convolutional Neural Networks (CNNs) have been demonstrated to achieve state-of-the-art performance in the visual object tracking task. However, existing CNN-based trackers usually use holistic target samples to train their networks. Once the target undergoes complicated situations (e.g., occlusion, background clutter, and deformation), the tracking performance degrades badly. In this paper, we propose an adaptive structural convolutional filter model to enhance the robustness of deep regression trackers (named ASCT). Specifically, we first design a mask set to generate local filters to capture local structures of the target. Meanwhile, we adopt an adaptive weighting fusion strategy for these local filters to adapt to the changes in the target appearance, which can enhance the robustness of the tracker effectively. Besides, we develop an end-to-end trainable network comprising feature extraction, decision making, and model updating modules for effective training. Extensive experimental results on large benchmark datasets demonstrate that the proposed ASCT tracker performs favorably against state-of-the-art trackers.

... Compared with the thermal infrared target tracking task, the general target tracking task has been studied extensively. A large number of excellent tracking methods have emerged in the general target tracking task, such as the discriminative correlation filters based tracking methods [10,11,12,13,14,15,16] and the deep learning based tracking methods [17,18,19,20,21,22]. The discriminative correlation filters (DCFs) based trackers attempt to train filters to learn the correlation between features and Gaussian-shaped response maps, which can improve the computational efficiency [11,23]. ...
... There are a lot of reviews that describe the target tracking task from different aspects in detail [30,31,32,33]. In this section, we mainly discuss the works most relevant to our tracker, which include tracking methods based on the correlation filter framework [11,12,13,15] and tracking methods based on the deep learning framework [19,20,21,22,34]. ...
... We determine the trustworthiness of each search patch based on the PSR (peak sidelobe ratio) [21,53,54], and then determine the weight $\eta_P$ of the corresponding patch in the target searching area. The PSR could be calculated as: ...
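The formula itself is elided in the snippet above. As a minimal sketch, the PSR is commonly defined in the correlation-filter literature as the peak of the response map minus the sidelobe mean, divided by the sidelobe standard deviation. The function name, the `exclude` window size, and the list-of-lists input are assumptions made for illustration and are not taken from the cited paper.

```python
import math


def peak_to_sidelobe_ratio(response, exclude=1):
    """Return (peak - mean(sidelobe)) / std(sidelobe) for a 2-D response map.

    `response` is a list of equal-length rows. `exclude` is a hypothetical
    parameter: cells within `exclude` rows and columns of the peak are
    treated as the main lobe and left out of the sidelobe statistics.
    """
    rows, cols = len(response), len(response[0])
    # Locate the peak value and its position.
    peak, pr, pc = max(
        (response[r][c], r, c) for r in range(rows) for c in range(cols)
    )
    # Sidelobe: every cell outside the window around the peak.
    sidelobe = [
        response[r][c]
        for r in range(rows)
        for c in range(cols)
        if abs(r - pr) > exclude or abs(c - pc) > exclude
    ]
    if not sidelobe:
        raise ValueError("response map is too small for the chosen window")
    mean = sum(sidelobe) / len(sidelobe)
    std = math.sqrt(sum((v - mean) ** 2 for v in sidelobe) / len(sidelobe))
    # A small epsilon avoids division by zero on perfectly flat sidelobes.
    return (peak - mean) / (std + 1e-12)
```

A sharp, isolated peak yields a high PSR, while a flat or multi-modal response (typical under occlusion) yields a low one, which is why the value can serve as a per-patch trust score.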
Article
Thermal infrared (TIR) target trackers are easily disturbed by similar objects and are susceptible to target occlusion. To solve these problems, we propose a structural target-aware model (STAMT) for thermal infrared target tracking tasks. Specifically, the proposed STAMT tracker can learn a target-aware model, which pays more attention to the target area to accurately distinguish the target from similar objects. In addition, considering that the target may be partially occluded during tracking, a structural weight model is proposed to locate the target through the unoccluded, reliable target part. Ablation studies show the effectiveness of each component in the proposed tracker. Without bells and whistles, the experimental results demonstrate that our STAMT tracker performs favorably against state-of-the-art trackers on the PTB-TIR and LSOTB-TIR datasets.
... and its goal is to estimate the position and size of the target in subsequent frames on the basis of a given state in the initial frame. Despite the great advances in recent years, which can be summarized in three aspects: 1) regressors [1], [2]; 2) classifiers [3], [4]; and 3) deep convolutional networks [5]-[8], tracking remains challenging due to several issues, such as illumination variations, occlusions, scale variations, background clutter, etc. ...
... , K, is introduced to simplify the model, where F is an L × L constant matrix that is employed to map any L-dimensional vectorized signal to the Fourier domain. Therefore, we re-express (8) in the frequency domain as follows: ...
... SiamATLwithCC_fix is the same method as SiamATLwithCC except that it uses (3) to update the template. SiamATLwithITDCF_fix is the same method as SiamATLwithITDCF except that it ignores the last term in (8) and updates the filter like BACF [37]. SiamATLnonATL represents the method that is not equipped with an ATL framework, and it utilizes both ITDCF and CC as decision-making layers. ...
Article
Full-text available
Visual object tracking with semantic deep features has recently attracted much attention in computer vision. In particular, Siamese trackers, which aim to learn a decision-making-based similarity evaluation, are widely utilized in the tracking community. However, online updating of Siamese trackers remains a tricky issue because of the tradeoff between model adaptation and degradation. To address this issue, in this article, we propose a novel attentional transfer learning-based Siamese network (SiamATL), which fully exploits previous knowledge to inspire the current tracker learning in the decision-making module. First, we explicitly model the template and surroundings by using an attentional online update strategy to avoid template pollution. Then, we introduce an instance-transfer discriminative correlation filter (ITDCF) to enhance the distinguishing ability of the tracker. Finally, we suggest a mutual compensation mechanism that integrates cross-correlation matching and ITDCF detection into the decision-making subnetwork to achieve online tracking. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art tracking algorithms on multiple large-scale tracking datasets.
... where $N_{gr}$ is the total number of images for each frame, $(X_i^g, Y_i^g)$ is the position of the ground truth, and $(X_i^r, Y_i^r)$ is the tracking result at frame $I_i$. According to (22), the smaller the CLE, the better the algorithm. Moreover, as defined in (22), the size of the bounding boxes is neglected by this quantitative metric. ...
... According to (22), the smaller the CLE, the better the algorithm. Moreover, as defined in (22), the size of the bounding boxes is neglected by this quantitative metric. Furthermore, we also use the overlap ratio (OR) [35] for evaluation; it measures the success rate and also gives an idea of how our proposal performs on sequences in which the object changes its size. ...
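Equation (22) is not reproduced in the snippets above. As a hedged sketch, the two metrics are usually computed as follows: the center location error (CLE) as the mean Euclidean distance between ground-truth and predicted centers, and the overlap ratio as intersection over union of the two boxes. The function names and the `(x, y, w, h)` box layout are assumptions for illustration only.

```python
import math


def center_location_error(gt_centers, pred_centers):
    """Mean Euclidean distance between ground-truth and predicted centers."""
    if not gt_centers or len(gt_centers) != len(pred_centers):
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(
        math.hypot(xg - xr, yg - yr)
        for (xg, yg), (xr, yr) in zip(gt_centers, pred_centers)
    )
    return total / len(gt_centers)


def overlap_ratio(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (zero if disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0
```

Because the CLE only uses centers, two predictions with the same center but very different sizes score identically, which is the limitation the text points out and the reason the overlap ratio is reported alongside it.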
Article
Full-text available
Tracking a mobile object is one of the important topics in pattern recognition, but it still has some obstacles. A reliable tracking system must adjust its tracking window in real time according to appearance changes of the tracked object. Furthermore, it has to deal with many challenges when one or multiple objects need to be tracked, for instance when the target is partially or fully occluded, when the background is cluttered, or when some target region is blurred. In this paper, we present a novel approach for single object tracking, named the multi-scale adaptive particle filter tracker, which combines a particle filter algorithm with a kernel distribution and updates its tracking window according to object scale changes. We demonstrate that using a particle filter combined with a kernel distribution inside the resampling process provides more accurate object localization within a search area. Furthermore, its average error for target localization was significantly lower than 21.37 pixels as the mean value. We have conducted several experiments on real video sequences and compared the acquired results to other existing state-of-the-art trackers to demonstrate the effectiveness of the multi-scale adaptive particle filter tracker.
... Visual object tracking (VOT) is one of the most fundamental tasks of computer vision and is widely used in a large number of applications, such as autonomous driving, surveillance, pedestrian detection and UAVs [1][2][3][4][5]. Despite great progress in recent years, visual tracking is still prone to failure under challenges such as occlusion, scale variation, fast motion and low resolution. ...
... After the anchor and the ground-truth are regressed, the translation [0], [1] and scaling [2], [3] are obtained. ...
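The snippet does not say how the four regression outputs are parameterized. A minimal sketch, assuming the parameterization that is common in anchor-based detectors and trackers (center offsets normalized by the anchor size at indices 0 and 1, log-scale ratios at indices 2 and 3), might look like this. The function names and the `(cx, cy, w, h)` layout are illustrative assumptions.

```python
import math


def encode_box(anchor, gt):
    """Return [dx, dy, dw, dh] mapping `anchor` to `gt`, both (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return [
        (gx - ax) / aw,      # [0] horizontal translation, in anchor widths
        (gy - ay) / ah,      # [1] vertical translation, in anchor heights
        math.log(gw / aw),   # [2] log width ratio
        math.log(gh / ah),   # [3] log height ratio
    ]


def decode_box(anchor, deltas):
    """Invert `encode_box`: apply [dx, dy, dw, dh] to `anchor`."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = deltas
    return (ax + dx * aw, ay + dy * ah, aw * math.exp(dw), ah * math.exp(dh))
```

Normalizing by the anchor size keeps the regression targets on a similar scale for small and large anchors, and the logarithm keeps predicted widths and heights positive after decoding.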
Article
Full-text available
Existing Siamese trackers usually do not update templates or adopt single-updating strategies. However, historical information cannot be effectively utilized when using these strategies, and model drift from complex tracking challenges cannot be addressed. To address this issue, a novel tracking framework that learns the model update with local trusted templates is proposed in this paper. The authors propose a complementary confidence evaluation method to select local trusted templates in a sliding window. This provides high-confidence historical information. The authors also propose a method that combines linear learning and deep learning to learn model updates. Different from traditional update strategies, the authors' method combines non-linear and linear updates to obtain reliable templates with the most abundant historical information, which addresses the complex tracking challenges to a certain extent. Finally, the adaptively fused response maps of the two strategies determine the final tracking result based on the confidence evaluation. Experimental results on NFS, UAVDT, UAV123, UAV20L and VOT2016 show that the method performs favourably when compared with current state-of-the-art methods.
... Additionally, deep learning methods have been applied to estimate age and gender from electrocardiogram signals [10], and food recognition has been automated via deep learning models [11]. Over the past decade, deep learning has made significant advancements in medical information science and image analysis [12], particularly through the use of convolutional neural networks (CNNs). CNNs have made notable advancements in fields like computer vision and speech recognition. ...
Article
Full-text available
Background: In recent years, as deep learning has received widespread attention in the field of heart disease, some studies have explored the potential of deep learning based on coronary angiography (CAG) or coronary CT angiography (CCTA) images in detecting the extent of coronary artery stenosis. However, there is still a lack of a systematic understanding of its diagnostic accuracy, impeding the advancement of intelligent diagnosis of coronary artery stenosis. Therefore, we conducted this study to review the accuracy of image-based deep learning in detecting coronary artery stenosis. Methods: We searched PubMed, Cochrane, Embase, and Web of Science up to April 11, 2023. The risk of bias in the included studies was appraised using the QUADAS-2 tool. We extracted the accuracy of deep learning in the test set and performed subgroup analyses by binary and multiclass classification scenarios. We performed a subgroup analysis based on different degrees of stenosis and applied a double arcsine transformation to process the data. The analysis was done using R. Results: Our systematic review finally included 18 studies, involving 3568 patients and 13,362 images. In the included studies, deep learning models were constructed based on CAG and CCTA. In binary classification tasks, the accuracy for detecting > 25%, > 50% and > 70% degrees of stenosis at the vessel level was 0.81 (95% CI: 0.71–0.85), 0.73 (95% CI: 0.58–0.88) and 0.61 (95% CI: 0.56–0.65), respectively. In multiclass classification tasks, the accuracy for detecting 0–25%, 25–50%, 50–70%, and 70–100% degrees of stenosis at the vessel level was 0.78 (95% CI: 0.73–0.84), 0.86 (95% CI: 0.78–0.93), 0.83 (95% CI: 0.70–0.97), and 0.70 (95% CI: 0.42–0.98), respectively. Conclusions: Our study shows that deep learning models based on CAG and CCTA appear to be relatively accurate in diagnosing different degrees of coronary artery stenosis. However, for various degrees of stenosis, their accuracy still needs to be further improved.
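The abstract does not give the formula or the R function behind its double arcsine transformation. A minimal sketch, assuming it refers to the Freeman-Tukey transform that is widely used to stabilize the variance of proportions before pooling, is shown below. Some software applies an additional factor of 1/2 (with the variance scaled accordingly), so the exact form here is an assumption, as are the function names.

```python
import math


def double_arcsine(events, n):
    """Freeman-Tukey double arcsine transform of the proportion events / n."""
    if n <= 0 or not 0 <= events <= n:
        raise ValueError("need n > 0 and 0 <= events <= n")
    return math.asin(math.sqrt(events / (n + 1))) + math.asin(
        math.sqrt((events + 1) / (n + 1))
    )


def double_arcsine_variance(n):
    """Approximate sampling variance of the transformed value."""
    return 1.0 / (n + 0.5)
```

The variance depends only on the sample size and not on the proportion itself, which is what makes the transformed accuracies convenient to pool when some studies report values close to 0 or 1.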
... AlexNet was successful in its bid to win the ImageNet Large-Scale Visual Recognition Competition (ILSVRC) in the year 2012 [39]. After that, deep network designs such as VGG [42] and GoogLeNet [43] were utilized in a wide variety of different fields, such as low-level computer vision [44,45] and image [19,12], video [35,46], natural language processing [47], and voice processing [48]. In 2015, deep neural networks were utilized for the first time to do the task of photo denoising [49,28]. ...
Article
The utilization of deep learning techniques has garnered significant attention in the domain of image denoising. Each kind of deep learning method for image denoising possesses distinct qualities that differentiate it significantly from the others. To be more precise, discriminative learning based on deep comprehension can effectively tackle the issue of Gaussian noise and other types of noise. This is the case because deep learning utilizes a larger and more comprehensive training set. Subsequently, a study conducted by researchers and published in the journal Science unveiled this potential. Optimization algorithms based on profound comprehension offer several advantages, such as the ability to produce precise assessments of the ambient noise. However, limited research has been conducted in this domain to categorize the many types of deep learning algorithms employed for image denoising. This is an area that needs future improvement. This article seeks to examine different advanced techniques that can be used to effectively remove noise from photos. Initially, we categorize the actual noisy photographs based on the blind denoising capabilities of deep convolutional neural networks (CNNs) for both noisy hybrid images and additive white noisy photos. Subsequently, the grainy, hazy, and low-resolution images were merged to produce composite photos with significant noise. Our next step is to examine different methodologies for deep learning, with a specific focus on the underlying ideas and assumptions that drive these methodologies. Subsequently, we provide a comprehensive analysis of the most advanced approaches for reducing noise in data, utilizing publicly accessible datasets. We then proceed to compare these techniques. To summarize, we have examined many obstacles and opportunities for further investigation that may be explored in the near or far future.
... One of the keys to object tracking is to robustly track the object with minimal object label information. Among the many existing methods [1][2][3][4][5], the Discriminative Model Prediction (DiMP) [5] tracker has achieved much success. This series of methods [5][6][7][8] are focused on training a target model to achieve precise target localization in each frame while dynamically updating the target model based on prediction results through the minimization of a discriminative objective function. ...
Article
Full-text available
The discriminative model prediction (DiMP) object tracking model is an excellent end-to-end tracking framework and has achieved the best results of its time. However, there are two problems with DiMP in actual use: (1) DiMP is prone to interference from similar objects during the tracking process, and (2) DiMP requires a large amount of labeled data for training. In this paper, we propose two methods to enhance robustness against interference from similar objects in target tracking: multi-scale region search and Gaussian convolution-based response map processing. In addition, to tackle the need for a large amount of labeled training data, we implement self-supervised training based on forward-backward tracking for the DiMP tracking method. Furthermore, a new consistency loss function is designed to improve self-supervised training. Extensive experiments show that the enhancements implemented in the DiMP tracking framework can bolster its robustness, and that the tracker based on self-supervised training has outstanding tracking performance.
... When the target deviates greatly from the center of the sampling window due to factors such as deformation of the target or occlusion, such a method may severely weaken the target information and enhance the background information. Inspired by the probability map window and the wide application of filtered windows in the field of deep learning [41,42], this paper presents a power-law probability map (PPM), which uses the actual pixel distribution of the target block to replace the traditional Gaussian-like distribution, to filter the visual features of the target. We first extract a target block and generate a probability map (PM) from it. ...
Article
Full-text available
Traditional tracking algorithms based on correlation filtering usually use a filter based on a Gaussian-like distribution to highlight the information of the target and weaken the interference of the background on the target. Such a method highlights the importance of the target center, but it weakens the marginal information of the target. To overcome this problem, this paper proposes a new filter based on a power-law probability map and a new tracking algorithm based on the power-law probability map filter and ridge regression. The filter can effectively screen the overall information of the target, including its marginal information. First, a target block is extracted from the actual image at time t according to the position of the target at time t-1, and then a probability map is generated from the target block. Next, the power-law probability map filter is generated by several image processing techniques. On this basis, two parts of HOG features are extracted from the target block, and these two parts of HOG features are then filtered and combined with the ridge regression model to locate and track the target. Finally, a large number of experiments show that our proposed tracking algorithm achieves competitive performance compared with state-of-the-art tracking algorithms.
... In the past decade, computer aided technology has enabled fully automated segmentation of fundus vascular images. Recent advances in machine learning techniques, specifically deep neural networks, have led to remarkable breakthroughs in information science and image analysis [11]. Especially, AlexNet [12] emerged as a deep convolutional neural network, paving the way for numerous follow-up networks, including VGGNet [13], ResNet [14,15], and ResNeXt [16]. ...
Preprint
Full-text available
Accurately segmenting blood vessels in retinal fundus images is crucial for the early screening, diagnosis, and evaluation of some ocular diseases. However, significant light variations and non-uniform contrast in these images make segmentation quite challenging. Thus, this paper employs an attention fusion mechanism that combines the channel attention and spatial attention mechanisms constructed by Transformer to extract information from retinal fundus images in both spatial and channel dimensions. To eliminate noise from the encoder image, a spatial attention mechanism is introduced in the skip connection. Moreover, a Dropout layer is employed to randomly discard some neurons, which can prevent overfitting of the neural network and improve its generalization performance. Experiments were conducted on the publicly available datasets DRIVE, STARE, and CHASEDB1. The results demonstrate that our method produces satisfactory results compared to some recent retinal fundus image segmentation algorithms.
... After that, the process of object tracking 54,55 is adopted to continuously estimate the state of the object in subsequent video sequences based on the given position and size of the object in the initial frame. Moreover, AI-enabled computer vision technology [56][57][58][59][60] is evolving rapidly, which can solve more complicated problems and serve as an aid to intelligent communications. ...
Article
Full-text available
The fifth-generation (5G) wireless communication has an urgent need for target tracking. Digital programmable metasurface (DPM) may offer an intelligent and efficient solution owing to its powerful and flexible controls of electromagnetic waves and advantages of lower cost, less complexity and smaller size than the traditional antenna array. Here, we report an intelligent metasurface system to perform target tracking and wireless communications, in which computer vision integrated with a convolutional neural network (CNN) is used to automatically detect the locations of moving targets, and the dual-polarized DPM integrated with a pre-trained artificial neural network (ANN) serves to realize the smart beam tracking and wireless communications. Three groups of experiments are conducted for demonstrating the intelligent system: detection and identification of moving targets, detection of radio-frequency signals, and real-time wireless communications. The proposed method sets the stage for an integrated implementation of target identification, radio environment tracking, and wireless communications. This strategy opens up an avenue for intelligent wireless networks and self-adaptive systems.
... As a representative architecture, convolutional neural networks (CNNs) have achieved remarkable results in visual tracking due to their powerful feature expression ability. Three types of CNNs have been used for VOT: (a) CNNs based on pure convolutional features [23]; (b) Siamese network-based trackers [21,24,25]; (c) DCF-based trackers based on the VGG-Net [26] and other network training features [22,27,28]. These trackers provide high-precision results by utilizing the graphics card. ...
Article
Full-text available
In recent years, discriminative correlation filters (DCF) based trackers have been widely used in mobile robots due to their efficiency. However, underground coal mines are typically a low illumination environment, and tracking in this environment is a challenging problem that has not been adequately addressed in the literature. Thus, this paper proposes a Low-illumination Long-term Correlation Tracker (LLCT) and designs a visual tracking system for coal mine drilling robots. A low-illumination tracking framework combining image enhancement strategies and long-time tracking is proposed. A long-term memory correlation filter tracker with an interval update strategy is utilized. In addition, a local area illumination detection method is proposed to prevent the failure of the enhancement algorithm due to local over-exposure. A convenient image enhancement method is proposed to boost efficiency. Extensive experiments on popular object tracking benchmark datasets demonstrate that the proposed tracker significantly outperforms the baseline trackers, achieving high real-time performance. The tracker’s performance is verified on an underground drilling robot in a coal mine. The results of the field experiment demonstrate that the performance of the novel tracking framework is better than that of state-of-the-art trackers in low-illumination environments.
... They extracted features from different layers of ResNet to produce response maps fused based on the AdaBoost algorithm, prevented the filters from updating when occlusion occurs, and used a scale filter to estimate the target scale [29]. In 2020, Di Yuan et al. designed a mask set to generate local filters to capture the local structures of the target and adopted an adaptive weighting fusion strategy for these local filters to adapt to the changes in the target appearance, which could enhance the robustness of the tracker effectively [30]. In 2021, Yuan Tai et al. constructed the subspace with image patches of the search window in previous frames. ...
Article
Full-text available
In the field of computer vision and robotics, scholars use object tracking technology to track objects of interest in various video streams and extend practical applications, such as unmanned vehicles, self-driving cars, robotics, drones, and security surveillance. Object tracking is a mature technology in the field of computer vision and robotics; however, there is still no one object tracking algorithm that can comprehensively and simultaneously solve the four problems encountered by tracking objects, namely deformation, illumination variation, motion blur, and occlusion. We propose an algorithm called an adaptive dynamic multi-template correlation filter (ADMTCF) which can simultaneously solve the above four difficulties encountered in tracking moving objects. The ADMTCF encodes local binary pattern (LBP) features in the HSV color space, so the encoded features can resist the pollution of the tracking image caused by illumination variation. The ADMTCF has four templates that can be adaptively and dynamically resized to maintain tracking accuracy to combat tracking problems such as deformation, motion blur, and occlusion. In this paper, we experimented with our ADMTCF algorithm and various state-of-the-art tracking algorithms in scenarios such as deformation, illumination variation, motion blur, and occlusion. Experimental results show that our proposed ADMTCF exhibits excellent performance, stability, and robustness in various scenarios.
... In order to obtain the desired output response map, Bibi et al. [39] used the score of the real sample to replace the score of the cyclic shift sample, which made up for the shortcomings of the manually set response map. Based on the good properties of the CF-based tracking framework, many attempts are made to introduce it into the TIR target tracking task [1,4,6,29,40]. He et al. [1] introduce a weighted correlation filter-based infrared target tracking method to obtain efficient tracking results. ...
Article
Full-text available
When dealing with complex thermal infrared (TIR) tracking scenarios, a single category of feature is not sufficient to portray the appearance of the target, which drastically affects the accuracy of the TIR target tracking method. In order to address this problem, we propose an adaptive multi-feature fusion model (AMFT) for the TIR tracking task. Specifically, our AMFT tracking method adaptively integrates hand-crafted features and deep convolutional neural network (CNN) features. In order to accurately locate the target position, it takes advantage of the complementarity between different features. Additionally, the model is updated using a simple but effective model update strategy to adapt to changes in the target during tracking. We have shown through ablation studies that the adaptive multi-feature fusion model in our AMFT tracking method is very effective. Our AMFT tracker performs favorably on the PTB-TIR and LSOTB-TIR benchmarks compared with state-of-the-art trackers.
... There appears to be a lack of interest in manual design information that necessitates lengthy preprocessing. To make good predictions, we used flocs images as input data because a deep learning model, such as a convolutional neural network (CNN), can extract image features without the need for any pre-treatment (Yamamura et al. 2020), which has been successful in the development of other learning networks, particularly in the field of computer vision (Traore et al. 2018;Yuan et al. 2020). Few studies have looked into the use of floc image features extracted by a CNN model to shorten the flocculation time-delay. ...
Article
Full-text available
The increasing quantities of polluted waters are calling for advanced purification methods. Flocculation is an essential component of the water purification process, yet flocculation is commonly not optimal due to our poor understanding of the flocculation process. In particular, there is little knowledge on the mechanisms ruling the migration of pollutants during treatment. Here we have created the first tensor diagram, a mathematical framework for the flocculation process, analyzed its properties with a deep learning model, and developed a classification scheme for its relationship with pollutants. The tensor was constructed by combining pixel matrices from a variety of floc images, each with a particular flocculation period. Changing the factors used to make flocs images, such as coagulant dose and pH, resulted in tensors, which were used to generate matrices, that is the tensor diagram. Our deep learning algorithm employed a tensor diagram to identify pollution levels. Results show tensor map attributes with over 98% of sample images correctly classified. This approach offers potential to reduce the time delay of feedback from the flocculation process with deep learning categorization based on its clustering capabilities. The advantage of the tensor data from the flocculation process improves the efficiency and speed of response for commercial water treatment.
... In the past decade, deep learning [1][2][3] and evolutionary computation [4][5][6][7] have made great progress in information sciences and image analysis [8]. This is largely thanks to the revival of neural networks, particularly convolutional neural networks, which have achieved distinct progress in the fields of computer vision and speech recognition [9]. ...
Article
Recently, ConvNeXt, which is constructed from standard ConvNet modules, has produced competitive performance in various image applications. In this paper, an efficient model based on the classical UNet, which can achieve promising results with a low number of parameters, is proposed for medical image segmentation. Inspired by ConvNeXt, the designed model is called ConvUNeXt and aims to reduce the number of parameters while retaining outstanding segmentation performance. Specifically, we first improve the convolution block of UNet by using large convolution kernels and depth-wise separable convolution to considerably decrease the number of parameters; then residual connections are added in both the encoder and the decoder, and pooling is abandoned in favor of convolution for down-sampling; in the skip connections, a lightweight attention mechanism is designed to filter out noise in low-level semantic information and suppress irrelevant features, so that the network can pay more attention to the target area. Compared to the standard UNet, our model has 20% fewer parameters; meanwhile, experimental results on different datasets show that it exhibits superior segmentation performance whether the amount of data is scarce or sufficient. Code will be available at https://github.com/1914669687/ConvUNeXt.
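To illustrate why depth-wise separable convolution reduces the parameter count, the arithmetic can be sketched as follows. This is a generic comparison for a single layer with biases ignored; the kernel size and channel counts are hypothetical and are not taken from ConvUNeXt, and the result is not meant to reproduce the model-level figure quoted in the abstract.

```python
def standard_conv_params(kernel_size, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel_size * kernel_size * c_in * c_out


def depthwise_separable_params(kernel_size, c_in, c_out):
    """Weights in a k x k depth-wise conv followed by a 1 x 1 point-wise conv."""
    depthwise = kernel_size * kernel_size * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out                      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 7 x 7 kernel, 64 input channels, 64 output channels.
standard = standard_conv_params(7, 64, 64)
separable = depthwise_separable_params(7, 64, 64)
print(standard, separable, separable / standard)
```

The saving grows with the kernel size, which is why large kernels and depth-wise separable convolution are a natural pairing: the spatial filtering cost scales with the number of channels rather than with the product of input and output channels.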
... To improve tracking accuracy, a group feature selection strategy has been proposed under the DCF-based tracking framework that can select group features across channels and spatial dimensions to determine the structural correlation between feature channel and filter system [1]. The DCF-based trackers mentioned above are only able to determine the target center location. Most of these trackers use a multi-scale search strategy to predict the target state, which usually results in relatively inaccurate tracking results [34,35]. The recently proposed ATOM [14] tracker incorporates IoU modulation and IoU prediction to improve tracking performance. ...
Article
Full-text available
Most existing trackers are based on using a classifier and multi-scale estimation to estimate the target state. Consequently, and as expected, trackers have become more stable while tracking accuracy has stagnated. While trackers adopt a maximum overlap method based on an intersection-over-union (IoU) loss to mitigate this problem, there are defects in the IoU loss itself that make it impossible to continue to optimize the objective function when a given bounding box is completely contained within/without another bounding box; this makes it very challenging to accurately estimate the target state. Accordingly, in this paper, we address the above-mentioned problem by proposing a novel tracking method based on a distance-IoU (DIoU) loss, such that the proposed tracker consists of target estimation and target classification. The target estimation part is trained to predict the DIoU score between the target ground-truth bounding box and the estimated bounding box. The DIoU loss can maintain the advantage provided by the IoU loss while minimizing the distance between the center points of two bounding boxes, thereby making the target estimation more accurate. Moreover, we introduce a classification part that is trained online and optimized with a Conjugate-Gradient-based strategy to guarantee real-time tracking speed. Comprehensive experimental results demonstrate that the proposed method achieves competitive tracking accuracy when compared to state-of-the-art trackers while running at real-time tracking speed.
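A minimal sketch of the DIoU score described above, following the commonly used definition (IoU minus the squared distance between box centers divided by the squared diagonal of the smallest enclosing box). The function names and the `(x1, y1, x2, y2)` box layout are assumptions for illustration, and this is not the authors' implementation.

```python
def diou(box_a, box_b):
    """Distance-IoU of two boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Plain IoU.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Squared distance between the two box centers.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 + (
        (ay1 + ay2) / 2 - (by1 + by2) / 2
    ) ** 2
    # Squared diagonal of the smallest box enclosing both.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 + (
        max(ay2, by2) - min(ay1, by1)
    ) ** 2
    return iou - d2 / c2


def diou_loss(box_a, box_b):
    """Loss form: 0 for identical boxes, larger as the boxes drift apart."""
    return 1.0 - diou(box_a, box_b)
```

When one box lies entirely inside the other, the plain IoU is the same wherever the inner box sits, so it carries no signal about position. The center-distance term still changes with position, so a centered inner box scores higher than one pushed into a corner.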
... Therefore, whether it is image processing or medical diagnosis, accurate image segmentation is particularly important. Image segmentation plays an important role not only in lesion segmentation and organ segmentation but also in pattern recognition [1,2] and object tracking [3,4]. ...
Article
Medical image segmentation is highly challenging due to intensity inhomogeneity and the similarity between the background and the object. To meet this challenge, we propose an improved active contour model that combines the level set method and the split Bregman method, and we provide two-phase, multi-phase, and 3D formulations. The proposed model is presented in a level set framework that includes neighbor region information for segmenting medical images; its energy functional contains a data fitting term and a length term. The neighbor region and the local intensity variances in the data fitting term are designed to optimize the minimization process. To minimize the energy functional, we apply the split Bregman method, which yields faster convergence. Besides, we extend our model to a multi-phase segmentation model and a 3D segmentation model for cardiac MR images, both of which achieve good results. Experimental results show that the new model is not only robust to the effects of other cardiac tissue and to image intensity inhomogeneity, but also extracts the tissues of interest more effectively. As expected, our model achieves higher segmentation accuracy and efficiency for medical image segmentation.
... e network not only serves as a feature extraction tool but also judges the target candidate position to obtain the target position. Methods such as [15] generate a series of candidate samples through a particle filter or sliding window and score these candidate samples to get the final tracking result. Reference [16] uses a sliding window to obtain a series of candidate samples and then uses a convolutional neural network to evaluate the maximum likelihood estimation of the samples to obtain the final tracking result. ...
Article
Full-text available
With the advent of the artificial intelligence era, target adaptive tracking technology has developed rapidly in the fields of human-computer interaction, intelligent monitoring, and autonomous driving. Aiming at the low tracking accuracy and poor robustness of the current Generic Object Tracking Using Regression Network (GOTURN) tracking algorithm, this paper takes the most popular convolutional neural network in the current target-tracking field as the basic network structure and proposes an improved GOTURN target-tracking algorithm based on a residual attention mechanism and the fusion of spatiotemporal context information. The algorithm feeds the target template, prediction area, and search area to the network at the same time to extract a general feature map and predicts the location of the tracking target in the current frame through the fully connected layer. In addition, a residual attention network is added to the target template branch to enhance the feature expression ability of the network and improve the overall performance of the algorithm. A large number of experiments conducted on current mainstream target-tracking test data sets show that the proposed tracking algorithm significantly improves on the overall performance of the original tracking algorithm.
... By introducing an online learning scale filter [2], a segmentation-based regularized module [3], an adaptive spatial feature selection module [4] and a hierarchical attentional module with contextual attentional correlation filter [5], the performance of the discriminant correlation filters has been greatly improved. Another kind of tracker transforms the tracking problem into a target matching problem; we call these Siamese trackers [6][7][8][9][10][11][12][13][14][15][16]. The network structure of a Siamese tracker consists of two branches that share the same network structure and parameters. ...
Article
Full-text available
Most current tracking methods use a bounding box to describe objects, which provides only a rough outline and cannot accurately capture the shape and posture of the target. Instead of using a bounding box directly, we use points adaptively positioned on the target to describe the target and transform these points into bounding boxes. In this way, we can use a training strategy based on bounding boxes while describing the target more accurately. Furthermore, a Coarse-Fine classification module is employed to improve robustness, which is important in the case of scale variation and deformation. Combining the above modules, we propose SiamPCF, an anchor-free tracking method that avoids the carefully selected hyperparameters needed to design anchors. Extensive experiments conducted on five benchmarks show that SiamPCF achieves state-of-the-art results. In the analysis of video attributes, SiamPCF ranks first in scale variation, which demonstrates its effectiveness. SiamPCF runs at more than 45 frames per second.
... Yuan et al. [23] proposed a target-focusing convolutional regression model to handle complicated situations and achieved better performance. Moreover, Yuan et al. [24] used multiple features to build the target model for object tracking to deal with the tracking accuracy problem caused by a single feature. Zhang et al. [25] combined texture features and color features to perform optimal similarity matching. ...
Article
Full-text available
Aiming at the problems of illumination changes, target deformation and background clutter in the target tracking field, a visual tracking algorithm based on the peak sidelobe ratio is proposed. An object-interference model is used to represent the target appearance, and context information is added to the correlation filtering framework. The filter is trained internally to enhance its discriminative ability. Because the model update process can easily introduce samples that do not characterize the target, the peak sidelobe ratio is used to update the tracking parameters, which enhances the generalization ability of the model. Compared with several classic and recent algorithms on the OTB50, OTB100, UAV123, and TC128 video data sets, the experimental results show that the proposed visual tracking algorithm tracks the target more accurately. The method has important research value for the development of intelligent video surveillance.
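The peak sidelobe ratio mentioned above can be sketched as follows. This is a minimal illustration using the widely used definition (peak value minus sidelobe mean, divided by sidelobe standard deviation); the exclusion window size, the threshold value, and the function names are assumptions for illustration and are not taken from the cited paper.

```python
from statistics import mean, pstdev


def peak_sidelobe_ratio(response, exclude=1):
    """PSR of a 2-D response map given as a list of rows.

    The sidelobe is every cell outside a (2*exclude+1)-wide square
    centered on the peak. The window size is an assumed parameter.
    """
    rows, cols = len(response), len(response[0])
    peak, pr, pc = max(
        (response[r][c], r, c) for r in range(rows) for c in range(cols)
    )
    sidelobe = [
        response[r][c]
        for r in range(rows)
        for c in range(cols)
        if abs(r - pr) > exclude or abs(c - pc) > exclude
    ]
    if not sidelobe:
        raise ValueError("response map too small for the exclusion window")
    sigma = pstdev(sidelobe)
    if sigma == 0:
        return float("inf")
    return (peak - mean(sidelobe)) / sigma


def should_update(response, threshold=5.0, exclude=1):
    """Hypothetical update rule: update the model only for confident peaks."""
    return peak_sidelobe_ratio(response, exclude) >= threshold


def make_response(peak_value):
    """Toy 5x5 response: checkerboard background with a peak at the center."""
    grid = [[0.1 * ((r + c) % 2) for c in range(5)] for r in range(5)]
    grid[2][2] = peak_value
    return grid


if __name__ == "__main__":
    print(should_update(make_response(1.0)))  # sharp peak
    print(should_update(make_response(0.2)))  # weak peak
```

A sharp, isolated peak gives a high ratio, while a weak peak over a noisy background gives a low one, so gating the update on this ratio is one way to avoid learning from frames that do not represent the target.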
... Owing to end-to-end connection architectures, CNNs with flexible plugins are widely used in many tasks, e.g., image [23,24], video [25,26] and text applications [27]. Specifically, modules or blocks in CNNs are used in low-level computer vision, especially image super-resolution [28][29][30] and denoising [16]. ...
Article
Deep convolutional neural networks (CNNs) for image denoising have recently attracted increasing research interest. However, plain networks cannot recover fine details for a complex task, such as real noisy images. In this paper, we propose a Dual denoising Network (DudeNet) to recover a clean image. Specifically, DudeNet consists of four modules: a feature extraction block, an enhancement block, a compression block, and a reconstruction block. The feature extraction block with a sparse mechanism extracts global and local features via two sub-networks. The enhancement block gathers and fuses the global and local features to provide complementary information for the latter network. The compression block refines the extracted information and compresses the network. Finally, the reconstruction block is utilized to reconstruct a denoised image. The DudeNet has the following advantages: (1) The dual networks with a sparse mechanism can extract complementary features to enhance the generalization ability of the denoiser. (2) Fusing global and local features can extract salient features to recover fine details for complex noisy images. (3) A small-size filter is used to reduce the complexity of the denoiser. Extensive experiments demonstrate the superiority of DudeNet over existing state-of-the-art denoising methods.
... It is known that differences in textures and edges of different LR images have great influence on the SR model. To address this problem, image augmentation has obtained good performance in image [47,51] and video applications [52]. Based on this idea, a two-step mechanism [19,35] is used to enlarge the training dataset for improving the generalization ability of the SR model. ...
Article
Full-text available
Deep convolutional neural networks (CNNs) have been widely applied for low-level vision over the past five years. CNN architectures are typically designed according to the nature of each application. However, customized architectures gather different features by treating all pixel points as equal to improve the performance of a given application, which ignores the effects of local power pixel points and results in low training efficiency. In this article, we propose an asymmetric CNN (ACNet) comprising an asymmetric block (AB), a memory enhancement block (MEB), and a high-frequency feature enhancement block (HFFEB) for image super-resolution (SR). The AB utilizes one-dimensional (1-D) asymmetric convolutions to intensify the square convolution kernels in horizontal and vertical directions, promoting the influence of local salient features for single image SR (SISR). The MEB fuses all hierarchical low-frequency features from the AB via a residual learning technique to resolve the long-term dependency problem and transforms the obtained low-frequency features into high-frequency features. The HFFEB exploits low- and high-frequency features to obtain more robust SR features and address the excessive feature enhancement problem. Additionally, it takes charge of reconstructing a high-resolution image. Extensive experiments show that our ACNet can effectively address SISR, blind SISR, and blind SISR of blind noise problems. The code of the ACNet is available at https://github.com/hellloxiaotian/ACNet.
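One way to read "1-D asymmetric convolutions intensify the square convolution kernels" is that 1x3 and 3x1 branches add extra weight to the middle row and middle column of a 3x3 kernel. The sketch below illustrates that reading only; the abstract does not specify how ACNet combines the branches, so the summation, the zero padding, and all names here are assumptions. By linearity of convolution, the sum of the three branch outputs equals a single convolution with a fused kernel.

```python
def conv2d_same(image, kernel):
    """Zero-padded 2-D cross-correlation; output has the input's size.

    Kernel height and width are assumed to be odd.
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for u in range(kh):
                for v in range(kw):
                    y, x = i + u - ph, j + v - pw
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[u][v]
            out[i][j] = acc
    return out


def three_branch_output(image, square, horizontal, vertical):
    """Sum of a 3x3 branch, a 1x3 branch, and a 3x1 branch (assumed fusion)."""
    a = conv2d_same(image, square)
    b = conv2d_same(image, horizontal)
    c = conv2d_same(image, vertical)
    return [[a[i][j] + b[i][j] + c[i][j] for j in range(len(image[0]))]
            for i in range(len(image))]


def fuse_kernels(square, horizontal, vertical):
    """Fold the 1x3 kernel into the middle row and the 3x1 into the middle column."""
    fused = [row[:] for row in square]
    for v in range(3):
        fused[1][v] += horizontal[0][v]
    for u in range(3):
        fused[u][1] += vertical[u][0]
    return fused


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    square = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
    horizontal = [[1, 2, 1]]
    vertical = [[1], [0], [-1]]
    fused = fuse_kernels(square, horizontal, vertical)
    same = three_branch_output(image, square, horizontal, vertical) == conv2d_same(image, fused)
    print(fused, same)
```

The fused kernel has larger-magnitude entries along its central cross, which is one concrete sense in which the horizontal and vertical directions of a square kernel are strengthened.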
... It is known that differences in textures and edges of different LR images have great influence on the SR model. To address this problem, image augmentation has obtained good performance in image [47,51] and video applications [52]. Based on this idea, a two-step mechanism [19,35] is used to enlarge the training dataset for improving the generalization ability of the SR model. ...
Preprint
Full-text available
Deep convolutional neural networks (CNNs) have been widely applied for low-level vision over the past five years. CNN architectures are typically designed according to the nature of each application. However, customized architectures gather different features by treating all pixel points as equal to improve the performance of a given application, which ignores the effects of local power pixel points and results in low training efficiency. In this paper, we propose an asymmetric CNN (ACNet) comprising an asymmetric block (AB), a memory enhancement block (MEB) and a high-frequency feature enhancement block (HFFEB) for image super-resolution. The AB utilizes one-dimensional asymmetric convolutions to intensify the square convolution kernels in horizontal and vertical directions, promoting the influence of local salient features for SISR. The MEB fuses all hierarchical low-frequency features from the AB via a residual learning (RL) technique to resolve the long-term dependency problem and transforms the obtained low-frequency features into high-frequency features. The HFFEB exploits low- and high-frequency features to obtain more robust super-resolution features and address the excessive feature enhancement problem. Additionally, it takes charge of reconstructing a high-resolution (HR) image. Extensive experiments show that our ACNet can effectively address single image super-resolution (SISR), blind SISR and blind SISR of blind noise problems. The code of the ACNet is available at https://github.com/hellloxiaotian/ACNet.
... Deep learning-based methods have been widely used in feature extraction [10][11][12][13], image recognition [14,15], fault diagnosis [16,17], target tracking [18] and other fields. Although classical segmentation methods [19][20][21] can solve some segmentation problems, with the rise of deep learning, end-to-end medical image segmentation methods [22][23][24][25] have become a good choice. ...
Article
Deep convolutional neural networks have shown great potential in medical image segmentation. However, automatic cardiac segmentation is still challenging due to the heterogeneous intensity distributions and indistinct boundaries in the images. In this paper, we propose a multiscale dual-path feature aggregation network (MDFA-Net) to solve misclassification and shape discontinuity problems. The proposed network aims to maintain a realistic shape in the segmentation results and is divided into two parts: the first part is a non-downsampling multiscale nested network (MN-Net), which constrains the continuous cardiac shape and maintains the shallow information, and the second part is a non-symmetric encoding and decoding network (nSED-Net), which retains deep details and overcomes misclassification. We conducted four-fold cross-validation experiments on the balanced steady-state free precession cine cardiac magnetic resonance (bSSFP cine CMR) sequence, the edema-sensitive T2-weighted, black blood spectral presaturation attenuated inversion-recovery (T2-SPAIR) CMR sequence and the late gadolinium enhancement (LGE) CMR sequence, each of which includes 45 cases. The data are provided by the organizer of the Multi-sequence Cardiac MR Segmentation Challenge (MS-CMRSeg 2019) in conjunction with 2019 Medical Image Computing and Computer Assisted Interventions (MICCAI). We also conducted external validation experiments on the data of the 2020 MICCAI myocardial pathology segmentation challenge (MyoPS 2020). In both the four-fold cross-validation experiments and the external validation experiments, the proposed method ranks first or second in the segmentation tasks of multi-sequence CMR images. The subjective evaluation also shows the same results as the objective evaluation metrics. The code will be posted at https://github.com/fly1995/MDFA-Net/.
Article
Full-text available
Medical ultrasound imaging involves the use of high-frequency sound waves to produce images of various body parts. A transducer generates these sound waves, which traverse bodily tissues, providing measurements of soft tissue and organ dimensions, shapes, and consistencies. The quality of an ultrasound image depends on the frequency of the transducer used. Higher-frequency transducers yield better resolution, but they are limited in their ability to penetrate deeply into the body due to their shorter wavelengths, which make them more susceptible to absorption. Lower-frequency transducers can penetrate deeper because they have longer wavelengths, although their resolution isn’t as sharp as that of high-frequency transducers. The primary drawback associated with ultrasound imaging is the introduction of noise during the signal processing stage, which can lead to images that are challenging to interpret. Efficient medical image processing plays a pivotal role in improving the quality and utility of ultrasound imaging for medical applications, enhancing image comprehension, and aiding in accurate diagnoses. One critical aspect of the pre-processing stage in medical image analysis, particularly in ultrasound images, is speckle noise reduction. To improve analysis and diagnosis in various applications, it has become essential to employ software tools for speckle noise removal. In this paper, we propose a hybrid filter, a combination of filters used together with convolutional neural networks to remove noise. Experiments were performed at different speckle variances, and the results showed that the hybrid filter performed well when the speckle variance was between 0.1 and 0.3, while DnCNNL5 performed well when the speckle variance was high (between 0.5 and 0.9).
Article
Multi-object tracking (MOT) is essential for solving the majority of computer vision issues related to crowd analytics. In designing an MOT system, object detection and association are the two main steps. In the first step, every frame of the video stream is examined to find the desired objects. In the second step, their trajectories are determined by comparing the detected objects in the current frame to those in the previous frame. An object detection system with high accuracy produces fewer missed detections, which results in fewer segmented tracks. In this research, we propose a new deep learning-based model for improving the performance of object detection and object tracking. First, object detection is performed using the adaptive Mask-RCNN model. After that, the ResNet-50 model is used to extract more reliable and significant features of the objects. Then an adaptive feature channel selection method is employed to select the feature channels that determine the final response map. Finally, an adaptive combination kernel correlation filter is used for multiple object tracking. Extensive experiments were conducted on large object tracking databases such as MOT-20 and KITTI-MOTS. According to the experimental results, the proposed tracker performs better than other cutting-edge trackers when faced with various problems. The experimental simulation is carried out in Python. The overall success rate and precision of the proposed algorithm are 95.36% and 93.27%.
Article
Maintaining the identity of multiple objects in real-time video is a challenging task, as it is not always feasible to run a detector on every frame. Thus, motion estimation systems are often employed, which either do not scale well with the number of targets or produce features with limited semantic information. To solve the aforementioned problems and allow the tracking of dozens of arbitrary objects in real-time, we propose SiamMOTION. SiamMOTION includes a novel proposal engine that produces quality features through an attention mechanism and a region-of-interest extractor fed by an inertia module and powered by a feature pyramid network. Finally, the extracted tensors enter a comparison head that efficiently matches pairs of exemplars and search areas, generating quality predictions via a pairwise depthwise region proposal network and a multi-object penalization module. SiamMOTION has been validated on five public benchmarks, achieving leading performance against current state-of-the-art trackers. Code available at: https://www.github.com/lorenzovaquero/SiamMOTION
Article
In recent years, convolutional regression trackers have attracted increasing attention for visual object tracking due to their favorable performance and easy implementation. However, most of them are restricted to features from a certain layer and hardly benefit from temporal and spatial information, which limits their ability to handle significant appearance changes. In this work, we go beyond the traditional deep regression trackers and build a novel twofold tracking network, which exploits rich hierarchical features and incorporates both temporal and spatial information to boost the tracking performance. The proposed network is composed of two streams, i.e., an appearance stream and a semantic stream; each stream is independently learned from different convolutional layers. Specifically, we propose a temporal and spatial mechanism for robust target representation that considers historical information in previous frames as well as spatial information. By design, the proposed twofold convolutional regression tracking network with the spatial and temporal mechanism can better tolerate target appearance changes and improve the tracking accuracy. Extensive experimental results on the benchmarks OTB-2015, Temple-Color, UAV123, and VOT-2018 demonstrate the effectiveness of our method compared with a number of state-of-the-art trackers.
Article
Despite the great success achieved in visual tracking, it is still hard for most trackers to handle scenes in which targets undergo large scale changes or similar objects are present. First, existing methods lack the capacity to efficiently extract multi-scale features. Second, convolutional neural networks focus primarily on local characteristics while easily ignoring global characteristics, which are essential for visual tracking. Furthermore, the recently popular tracking methods based on Siamese-like networks match the images of the two branches through simple cross-correlation operations and cannot effectively establish the connection between them. An improved Siamese tracking network, called GSiamMS, is proposed to address these challenges via the integration of Res2Net blocks and transformer modules. Within this network, a feature extraction module based on Res2Net blocks is constructed to obtain multi-scale information at the granular level without relying on multi-layer outputs. Then, the cross-attention mechanism is utilized to learn the connection between template features and search features, while the self-attention mechanism, focusing on global information, establishes long-range dependencies between the object and the background. Finally, numerous experiments on visual tracking benchmarks including TrackingNet, GOT-10k, LaSOT, NFS, UAV123, and TNL2K verify that the developed method, running at 38 fps, achieves superior performance compared with several state-of-the-art methods.
Preprint
Full-text available
The fifth-generation (5G) wireless communication has an urgent need for target tracking. Digital programmable metasurface (DPM) may offer an intelligent and efficient solution owing to its powerful and flexible controls of electromagnetic waves and advantages of lower cost, less complexity and smaller size than the traditional antenna array. Here, we report an intelligent metasurface system to perform target tracking and wireless communications, in which computer vision integrated with a convolutional neural network (CNN) is used to automatically detect the locations of moving targets, and the dual-polarized DPM integrated with a pre-trained artificial neural network (ANN) serves to realize smart beam tracking and wireless communications. Three groups of experiments are conducted for demonstrating the intelligent system: detection of moving targets, detection of radio-frequency signals, and real-time wireless communications. The proposed method sets the stage for an integrated implementation of target identification, radio environment tracking, and wireless communications. This strategy opens up a new avenue for intelligent wireless networks and self-adaptive systems.
Article
The Siamese architecture has shown remarkable performance in the field of visual tracking. Although the existing Siamese-based tracking methods have achieved a relative balance between accuracy and speed, the performance of many trackers in complex scenes is often unsatisfactory, which is mainly caused by interference factors such as target scale changes, occlusion, and fast movement. In these cases, many trackers cannot sufficiently exploit the target feature information and face the dilemma of information loss. In this work, we propose a novel parallel Transformer network architecture to achieve robust visual tracking. The proposed method designs the Transformer-1 module, the Transformer-2 module, and the feature fusion head (FFH) based on the attention mechanism. The Transformer-1 module and the Transformer-2 module are regarded as complementary branches in the parallel architecture. The FFH is used to integrate the feature information of the two parallel branches, which can efficiently exploit the feature dependence relationship between the template and the search region, and comprehensively explore rich contextual information. Finally, by combining the core ideas of Siamese and Transformer, we present a simple and robust tracking framework called RPformer, which does not require any prior knowledge and avoids the trouble of adjusting hyperparameters. Numerous experiments show that the proposed tracking method outperforms the state-of-the-art trackers on seven tracking benchmarks while meeting real-time requirements at a running speed exceeding 50.0 frames/s.
Article
Moving object tracking is one of the applied fields in artificial intelligence and robotics. The main objective of object tracking is to detect and locate targets in video frames of real scenes. Although various methods have been proposed for object tracking so far, tracking in challenging conditions remains an open issue. Recently, different evolutionary and heuristic algorithms, such as swarm intelligence, have been used to address the tracking challenges and have shown promising performance. In this paper, a new approach based on a modified biogeography-based optimization (mBBO) method is introduced. The BBO algorithm includes migration and mutation steps. In the migration phase, the search space is properly explored by sharing information between habitats and weaker solutions to improve their position. On the other hand, the mutation phase leads to diversity and change in solutions. In this algorithm, the elitist method is also used to keep better solutions. The performance of the modified BBO tracker has been evaluated on benchmark video datasets and compared with several other tracking methods. Experimental results demonstrate that the proposed method estimates the location of targets with high accuracy and achieves better performance and robustness compared to other trackers.
Article
Full-text available
Visual ship tracking provides crucial kinematic traffic information to maritime traffic participants, which helps to accurately predict ship traveling behaviors in the near future. Traditional ship tracking models obtain a satisfactory performance by exploiting distinct features from maritime images, which may fail when the ship scale varies in image sequences. Moreover, previous frameworks have not paid much attention to weather condition interferences (e.g., visibility). To address this challenge, we propose a scale-adaptive ship tracking framework with the help of a kernelized correlation filter (KCF) and a log-polar transformation operation. First, the proposed ship tracker employs a conventional KCF model to obtain the raw ship position in the current maritime image. Second, both the previous step output and ship training sample are transformed into a log-polar coordinate system, which are further processed with the correlation filter to determine ship scale factor and to suppress the negative influence of the weather conditions. We verify the proposed ship tracker performance on three typical maritime scenarios under typical navigational weather conditions (i.e., sunny, fog). The findings of the study can help traffic participants efficiently obtain maritime situation awareness information from maritime videos, in real time, under different visibility weather conditions.
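The reason a log-polar transformation helps with scale estimation can be shown with a small sketch: scaling a point about the center changes only its log-radius coordinate, by the logarithm of the scale factor, so a scale change appears as a shift that a correlation filter can locate. The sampling layout (a log-radius axis spanning 0 to log(max_radius) over num_bins bins) and the function names are assumptions for illustration, not the cited paper's implementation.

```python
import math


def to_log_polar(x, y, cx, cy):
    """Map a Cartesian point to (log-radius, angle) about center (cx, cy)."""
    dx, dy = x - cx, y - cy
    r = math.hypot(dx, dy)
    if r == 0:
        raise ValueError("the center itself has no log-polar coordinate")
    return math.log(r), math.atan2(dy, dx)


def scale_from_shift(shift_bins, num_bins, max_radius):
    """Convert a shift along the log-radius axis into a scale factor.

    Assumes the axis samples log-radius uniformly from 0 to
    log(max_radius) using num_bins bins.
    """
    return math.exp(shift_bins * math.log(max_radius) / num_bins)


if __name__ == "__main__":
    rho1, theta1 = to_log_polar(3.0, 4.0, 0.0, 0.0)
    rho2, theta2 = to_log_polar(6.0, 8.0, 0.0, 0.0)  # same point, scaled by 2
    print(rho2 - rho1, math.log(2.0))   # the two values agree
    print(theta2 - theta1)              # angle is unchanged
    print(scale_from_shift(8, 64, 256.0))
```

Under these assumptions, the bin offset of the correlation peak along the log-radius axis converts directly into a multiplicative scale factor, with a zero shift meaning no scale change.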
Article
Compressed image quality enhancement has attracted a large amount of attention in recent years. In general, the primary goal of compression is artifact reduction to produce a higher-quality output from a low-quality input. Information loss and compression artifacts are mostly due to quantization. The quantization matrix is determined by the compression quality factor (QF). However, there has thus far been little related research to estimate the compression quality factor for JPEG images. To address this issue, in this paper, we propose a deep dual-domain semi-blind network (D³SN) that combines compression quality factor detection and compressed image quality enhancement. Specifically, a quality factor detection (QFD) module is designed to capture contextual information of the space and frequency domains. Furthermore, we build a novel deep dual-domain compressed image quality enhancement network to remove the compression artifacts by using the prior in terms of both the discrete cosine transform (DCT) and pixel domains. Different from previous algorithms, our proposed approach can remove compression artifacts generated at different quality factors by inferring the image quality. Experimental results demonstrate the superiority of our deep dual-domain semi-blind network over state-of-the-art methods in terms of objective quality and visual results.
Article
Tracking in the unmanned aerial vehicle (UAV) scenarios is one of the main components of target tracking tasks. Different from the target tracking task in the general scenarios, the target tracking task in the UAV scenarios is very challenging because of factors such as small scale and aerial view. Although the DCFs-based tracker has achieved good results in tracking tasks in general scenarios, the boundary effect caused by the dense sampling method will reduce the tracking accuracy, especially in UAV tracking scenarios. In this work, we propose learning an adaptive spatial-temporal context-aware (ASTCA) model in the DCFs-based tracking framework to improve the tracking accuracy and reduce the influence of boundary effect, thereby enabling our tracker to more appropriately handle UAV tracking tasks. Specifically, our ASTCA model can learn a spatial-temporal context weight, which can precisely distinguish the target and background in the UAV tracking scenarios. Besides, considering the small target scale and the aerial view in UAV tracking scenarios, our ASTCA model incorporates spatial context information within the DCFs-based tracker, which could effectively alleviate background interference. Extensive experiments demonstrate that our ASTCA method performs favorably against state-of-the-art tracking methods on some standard UAV datasets.
Article
Full-text available
Automation has spread throughout daily life and business activities to facilitate human life and working conditions. Robots, automated cars, unmanned vehicles, robot arms, automated factories, etc. are taking their place in our lives. For these automated actors, one important task is recognizing objects and obstacles in the target environment. Object detection, which determines the objects and their locations in the environment, is one of the most important solutions for this task. With deep learning techniques such as Convolutional Neural Networks and GPU processing, object detection has become more accurate and faster and is attracting the attention of researchers. In recent years, many articles about object detection algorithms and their usage have been published. There are surveys of object detection algorithms, but they introduce the algorithms and focus on common application areas. With this survey, we aim to show that object detection algorithms have a very large and diverse range of application areas. In this study, we give a brief introduction to deep learning. We then focus on standard object detection algorithms based on deep learning and their applications in different research areas in recent years to give ideas for future work. The datasets and evaluation metrics used in the research are also listed.
Article
Most video analytics applications rely on object detectors to localize objects in frames. However, when real-time operation is a requirement, running the detector on every frame is usually not possible. This is somewhat circumvented by instantiating visual object trackers between detector calls, but this does not scale with the number of objects. To tackle this problem, we present SiamMT, a new deep learning multiple visual object tracking solution that applies single-object tracking principles to multiple arbitrary objects in real-time. To achieve this, SiamMT reuses feature computations, implements a novel crop-and-resize operator, and defines a new and efficient pairwise similarity operator. SiamMT naturally scales up to several dozens of targets, reaching 25 fps with 122 simultaneous objects for VGA videos, or up to 100 simultaneous objects in HD720 video. SiamMT has been validated on five large real-time benchmarks, achieving leading performance against current state-of-the-art trackers.
Article
As a typical ill-posed problem, JPEG compressed image restoration (CIR) aims to recover a high-quality (HQ) image from the compressed version. Although many model-based and learning-based methods have been proposed for conventional image restoration (IR), proposing a general and effective framework for various CIR tasks is still a challenging work. The model-based methods are flexible for handling different IR tasks, but they suffer from high complexity and the difficulty in designing sophisticated priors. The learning-based methods have shown promising results in various IR tasks. However, most of them need to retrain their models for each IR task separately, which sacrifices methods’ flexibility. In this paper, we propose a novel and high-performance deep deblocker driven unified framework to flexibly address various CIR tasks without retraining. First, a novel fidelity (NF) is introduced into CIR, and then the CIR problem is divided into inversion and deblocking subproblems by our improved split Bregman iteration (ISBI) algorithm. Next, we design a set of compact yet effective deep deblockers. Since simultaneously modelling the data fidelity term and implicit priors via deblockers is necessary, these deblockers are used as implicit priors and also used for NF in the CIR problem. The convergence of our method is proved as well. To the best of our knowledge, our method is the first work to use deblockers as implicit priors, and it could also contribute to other deblocking methods to obtain better flexibility. The effectiveness of the proposed method is demonstrated both visually and quantitatively.
Article
Object tracking is challenging, and correlation filter methods have recently been proposed for this task. Most of these methods focus on the central portion of the target and are negatively affected by changes in the target size and shape. This work proposes a collaborative scheme that combines several local correlation filters with a global correlation filter to improve the performance of correlation-filter-based object tracking methods. The proposed correlation filter used in this scheme is based on features extracted from multiple layers of deep convolutional neural networks, and a strategy to identify when these models should be updated is also presented. Experiments show that the proposed scheme tends to be consistent and to achieve better results than the other tracking approaches compared. The proposed collaborative approach can be applied to other correlation filters, which tends to further improve the tracker performance.
Article
Local features have been widely used in visual tracking to improve robustness in the presence of partial occlusion, deformation, and rotation. In this paper, a structured object tracking algorithm, which uses local discriminative color (LoDC) patch representation and discriminative patch attributed relational graph (DPARG) matching, is proposed. Unlike several existing local feature-based algorithms that divide an object into some rectangular patches of fixed sizes while separately locating each patch to track the object, the proposed algorithm relies on a discriminative color model to distinguish the outstanding colors of the given object. Thus, the multimodal color object is represented by multiple unimodal, homogeneous, and discriminative patches. Moreover, these patches are assembled in a structured DPARG, in which vertexes describe the object’s local discriminative patches while encoding the appearance information, and edges express the relations between vertexes while encoding inner geometric structure information. The object tracking is then formulated as inexact matching of the dynamic undirected graph. The changes of DPARG, along with dynamic environments, are used to filter out invalid patches at the current frame, which usually correspond to those abnormal patches emerging from partial occlusion, similar color disturbances, etc. Finally, the valid patches are assembled to locate the object. The experimental results on the popular tracking benchmark datasets exhibit that the proposed algorithm is reliable enough in tracking even in the presence of serious appearance changes, partial occlusion, and background clutter.
Article
Nighttime construction has been widely conducted in many construction scenarios, but it is also much riskier due to low lighting conditions and fatiguing environments. Therefore, this study proposes a vision-based method specifically for automatic tracking of construction machines at nighttime by integrating deep learning illumination enhancement. The proposed method comprises five main modules: illumination enhancement, machine detection, Kalman filter tracking, machine association, and linear assignment. A testing experiment based on nine nighttime videos is then conducted to evaluate the tracking performance of this approach. The results show that the method achieves 95.1% in MOTA and 75.9% in MOTP. Compared with the baseline method SORT, the proposed method improves tracking robustness by 21.7% in nighttime construction scenarios. The proposed methodology can also be used to help accomplish automated surveillance tasks in nighttime construction to improve productivity and safety performance.
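The association and linear assignment modules named in the abstract above can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as `(x1, y1, x2, y2)`, uses intersection over union (IoU) as the matching score, and solves the assignment by brute force over permutations, which is only practical for a handful of objects. The function names `iou` and `associate` are hypothetical, and the abstract does not state which cost or solver the authors use.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections):
    """Return (track_index, detection_index) pairs that maximise total IoU.

    Brute-force search: assumes len(detections) >= len(tracks) and that
    both lists are small. A production tracker would use a polynomial-time
    assignment solver instead.
    """
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(detections)), len(tracks)):
        score = sum(iou(tracks[t], detections[d]) for t, d in enumerate(perm))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(perm))
    return best_pairs


# Two predicted track boxes and two detections listed in the opposite order.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 1, 11, 11)]
pairs = associate(tracks, detections)
```

In a SORT-style pipeline the track boxes would come from the Kalman filter prediction step, and pairs with an IoU below some threshold would normally be rejected; both details are omitted here.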
Article
Deep learning technology has greatly improved the performance of target tracking, but most recently developed tracking algorithms are short-term tracking algorithms, which cannot meet the actual engineering needs. Based on the Siamese network structure, this paper proposes a long-term tracking framework with a persistent tracking capability. The global proposal module extends the search area globally through the construction of a feature pyramid. The local regression module is mainly responsible for the confidence evaluation of the candidate regions and for performing more accurate bounding box regression. To improve the discriminative ability of the regression network, the error samples are eliminated by synthesizing the temporal information and are then classified through a verification module in advance. Experiments on the VOT long-term tracking dataset and the UAV20L aerial dataset show that the proposed algorithm achieves state-of-the-art performance.
Article
To tackle the problem that traditional particle-filter- or correlation-filter-based trackers are prone to low tracking accuracy and poor robustness when the target faces challenges such as occlusion, rotation and scale variation in the case of complex scenes, an accurate reliable-patch-based tracker is proposed through exploiting and complementing the advantages of particle filter and correlation filter. Specifically, to cope with the challenge of continuous full occlusion, the target is divided into numerous patches by combining random with hand-crafted partition methods, and then, an effective target position estimation strategy is presented. Subsequently, according to the motion law between the patch and global target in the particle filter framework, two effective resampling rules are designed to remove unreliable particles to avoid tracking drift, and then, the target position can be estimated by the most reliable patches identified. Finally, an effective scale estimation approach is presented, in which the Manhattan distance between the reliable patches is utilized to estimate the target scale, including the target width and height, respectively. Experimental results illustrate that our tracker can not only be robust against the challenges of occlusion, rotation and scale variation, but also outperform state-of-the-art trackers for comparison in overall performance.
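The scale estimation idea in the abstract above, using distances between reliable patches to estimate width and height separately, might look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' procedure: it splits the Manhattan distance into its horizontal and vertical components, averages each over all pairs of patch centres, and takes the ratio between the current frame and a reference frame. The names `mean_axis_distance` and `estimate_scale` are hypothetical.

```python
def mean_axis_distance(centres, axis):
    """Mean absolute difference along one axis over all pairs of patch centres."""
    dists = [
        abs(p[axis] - q[axis])
        for i, p in enumerate(centres)
        for q in centres[i + 1:]
    ]
    return sum(dists) / len(dists)


def estimate_scale(ref_centres, cur_centres):
    """Return (width_scale, height_scale) relative to the reference frame.

    Assumes at least two patches and a non-zero spread of the reference
    centres along both axes.
    """
    sx = mean_axis_distance(cur_centres, 0) / mean_axis_distance(ref_centres, 0)
    sy = mean_axis_distance(cur_centres, 1) / mean_axis_distance(ref_centres, 1)
    return sx, sy


# Patch centres (x, y): the current layout is twice as wide and 1.5 times as tall.
ref = [(10.0, 10.0), (30.0, 10.0), (20.0, 40.0)]
cur = [(20.0, 15.0), (60.0, 15.0), (40.0, 60.0)]
width_scale, height_scale = estimate_scale(ref, cur)
```

Because only differences between centres are used, the estimate is unaffected by translation of the target between frames.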
Article
Although numerous recent tracking approaches have made tremendous advances in the last decade, achieving high-performance visual tracking remains a challenge. In this paper, we propose an end-to-end network model to learn reinforced attentional representation for accurate target object discrimination and localization. We utilize a novel hierarchical attentional module with long short-term memory and multi-layer perceptrons to leverage both inter- and intra-frame attention to effectively facilitate visual pattern emphasis. Moreover, we incorporate a contextual attentional correlation filter into the backbone network to make our model trainable in an end-to-end fashion. Our proposed approach not only takes full advantage of informative geometries and semantics but also updates correlation filters online without fine-tuning the backbone network to enable the adaptation of variations in the target object's appearance. Extensive experiments conducted on several popular benchmark datasets demonstrate that our proposed approach is effective and computationally efficient.
Article
Visual tracking is one of the most fundamental topics in computer vision. Numerous tracking approaches based on discriminative correlation filters or Siamese convolutional networks have attained remarkable performance over the past decade. However, developing robust and effective trackers that achieve satisfactory performance with high computational and memory efficiency in real-world scenarios is still commonly recognized as an open research problem. In this paper, we investigate the impacts of three main aspects of visual tracking, i.e., the backbone network, the attentional mechanism, and the detection component, and propose a Siamese Attentional Keypoint Network, dubbed SATIN, for efficient tracking and accurate localization. Firstly, a new Siamese lightweight hourglass network is specially designed for visual tracking. It takes advantage of repeated bottom-up and top-down inference to capture more global and local contextual information at multiple scales. Secondly, a novel cross-attentional module is utilized to leverage both channel-wise and spatial intermediate attentional information, which can enhance both the discriminative and localization capabilities of feature maps. Thirdly, a keypoint detection approach is devised to trace any target object by detecting the top-left corner point, the centroid point, and the bottom-right corner point of its bounding box. Therefore, our SATIN tracker not only has a strong capability to learn more effective object representations, but is also computationally and memory efficient during both the training and testing stages. To the best of our knowledge, we are the first to propose this approach. Without bells and whistles, experimental results demonstrate that our approach achieves state-of-the-art performance on several recent benchmark datasets, at a speed far exceeding 27 frames per second.
Article
Thermal infrared (TIR) pedestrian tracking is one of the important components among the numerous applications of computer vision, which has a major advantage: it can track pedestrians in total darkness. The ability to evaluate the TIR pedestrian tracker fairly, on a benchmark dataset, is significant for the development of this field. However, there is not a benchmark dataset. In this paper, we develop a TIR pedestrian tracking dataset for the TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations. Each sequence has nine attribute labels for the attribute based evaluation. In addition to the dataset, we carried out the large-scale evaluation experiments on our benchmark dataset using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. In addition, in order to gain more insight into the TIR pedestrian tracker, we divided its functions into three components: feature extractor, motion model, and observation model. Then, we conducted three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
Article
Common tracking algorithms use only a single feature to describe the target appearance, which makes the appearance model easily disturbed by noise and limits the tracking performance and robustness of these trackers. In this paper, we propose a novel multiple-feature fusion model within a correlation filter framework for visual tracking to improve the tracking performance and robustness of the tracker. In different tracking scenarios, the response maps generated by the correlation filter framework differ for each feature. Based on these response maps, an adaptive weighting function is applied to the different features to eliminate noise interference while maintaining their respective advantages, which enhances the tracking performance and robustness of the tracker efficiently. Meanwhile, the correlation filter framework provides a fast training and accurate locating mechanism. In addition, we give a simple yet effective scale variation detection method, which can appropriately handle scale variation of the target in the tracking sequences. We evaluate our tracker on the OTB2013/OTB50/OTB2015 benchmarks, which include more than 100 video sequences. Extensive experiments on these benchmark datasets demonstrate that the proposed MFFT tracker performs favorably against the state-of-the-art trackers.
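Adaptive fusion of per-feature response maps, as described in the abstract above, can be sketched as follows. The abstract does not give the weighting function, so this example assumes a simple confidence score, the peak height above the mean divided by the standard deviation, and normalises the scores into weights. The function names `confidence`, `fuse`, and `argmax2d` are hypothetical.

```python
import math


def confidence(resp):
    """Peak sharpness of a response map: (max - mean) / std (an assumed measure)."""
    vals = [v for row in resp for v in row]
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return (max(vals) - mean) / (std + 1e-8)


def fuse(responses):
    """Weight each response map by its normalised confidence and sum them."""
    conf = [confidence(r) for r in responses]
    total = sum(conf)
    # Fall back to uniform weights if every map is completely flat.
    weights = [c / total for c in conf] if total > 0 else [1.0 / len(conf)] * len(conf)
    rows, cols = len(responses[0]), len(responses[0][0])
    fused = [
        [sum(w * r[i][j] for w, r in zip(weights, responses)) for j in range(cols)]
        for i in range(rows)
    ]
    return fused, weights


def argmax2d(resp):
    """Return the (row, col) index of the largest value."""
    cells = ((i, j) for i in range(len(resp)) for j in range(len(resp[0])))
    return max(cells, key=lambda p: resp[p[0]][p[1]])


# One clean response with a single peak, one ambiguous response with many peaks.
sharp = [[0.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
noisy = [[0.6, 0.1, 0.6], [0.1, 0.6, 0.1], [0.6, 0.1, 0.5]]
fused, weights = fuse([sharp, noisy])
peak = argmax2d(fused)
```

The ambiguous map receives the smaller weight, so the fused peak follows the feature whose response is more reliable in this frame.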
Article
Most correlation filter based tracking algorithms achieve good performance while maintaining fast computational speed. However, in some complicated tracking scenes, these trackers have a serious defect that causes the object to be located inaccurately: they depend excessively on the maximum response value to determine the object location. To address this problem, we propose a particle filter redetection based tracking approach for accurate object localization. During the tracking process, the kernelized correlation filter (KCF) based tracker locates the object by relying on the maximum response value of the response map; when the response map becomes ambiguous, the tracking result becomes correspondingly unreliable. Our redetection model provides abundant object candidates through a particle resampling strategy to detect the object accordingly. Additionally, for the target scale variation problem, we give a new object scale evaluation mechanism, which considers only the differences between the maximum response values in consecutive frames to determine the scale change of the target. Extensive experiments on the OTB2013 and OTB2015 datasets demonstrate that the proposed tracker performs favorably in relation to the state-of-the-art methods.
Conference Paper
We propose a new context-aware correlation filter based tracking framework to achieve both high computational speed and state-of-the-art performance among real-time trackers. The major contribution to the high computational speed lies in the proposed deep feature compression that is achieved by a context-aware scheme utilizing multiple expert auto-encoders; a context in our framework refers to the coarse category of the tracking target according to appearance patterns. In the pre-training phase, one expert auto-encoder is trained per category. In the tracking phase, the best expert auto-encoder is selected for a given target, and only this auto-encoder is used. To achieve high tracking performance with the compressed feature map, we introduce extrinsic denoising processes and a new orthogonality loss term for pre-training and fine-tuning of the expert auto-encoders. We validate the proposed context-aware framework through a number of experiments, where our method achieves a comparable performance to state-of-the-art trackers which cannot run in real-time, while running at a significantly fast speed of over 100 fps.
Article
We propose a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN). Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Our network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify target in each domain. We train each domain in the network iteratively to obtain generic target representations in the shared layers. When tracking a target in a new sequence, we construct a new network by combining the shared layers in the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state. The proposed algorithm illustrates outstanding performance in existing tracking benchmarks.
Conference Paper
We propose a new tracking framework with an attentional mechanism that chooses a subset of the associated correlation filters for increased robustness and computational efficiency. The subset of filters is adaptively selected by a deep attentional network according to the dynamic properties of the tracking target. Our contributions are manifold, and are summarised as follows: (i) Introducing the Attentional Correlation Filter Network which allows adaptive tracking of dynamic targets. (ii) Utilising an attentional network which shifts the attention to the best candidate modules, as well as predicting the estimated accuracy of currently inactive modules. (iii) Enlarging the variety of correlation filters which cover target drift, blurriness, occlusion, scale changes, and flexible aspect ratio. (iv) Validating the robustness and efficiency of the attentional mechanism for visual tracking through a number of experiments. Our method achieves similar performance to non real-time trackers, and state-of-the-art performance amongst real-time trackers.
Article
Unlike the visual object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as in rescue and video surveillance at night. However, there are few studies in this field mainly because thermal infrared images have several unwanted attributes, which make it difficult to obtain the discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer the pre-trained convolutional neural networks based on visible images to thermal infrared tracking. We observe that the features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while the features from the convolution layers are. Besides, the features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). Firstly, we use pre-trained convolutional neural networks to extract the features of the multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give the response maps of the target’s location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
Article
In recent years, Discriminative Correlation Filter (DCF) based methods have significantly advanced the state-of-the-art in tracking. However, in the pursuit of ever increasing tracking performance, their characteristic speed and real-time capability have gradually faded. Further, the increasingly complex models, with massive number of trainable parameters, have introduced the risk of severe over-fitting. In this work, we tackle the key causes behind the problems of computational complexity and over-fitting, with the aim of simultaneously improving both speed and performance. We revisit the core DCF formulation and introduce: (i) a factorized convolution operator, which drastically reduces the number of parameters in the model; (ii) a compact generative model of the training sample distribution, that significantly reduces memory and time complexity, while providing better diversity of samples; (iii) a conservative model update strategy with improved robustness and reduced complexity. We perform comprehensive experiments on four benchmarks: VOT2016, UAV123, OTB-2015, and TempleColor. When using expensive deep features, our tracker provides a 20-fold speedup and achieves a 13.3% relative gain in Expected Average Overlap compared to the top ranked method in the VOT2016 challenge. Moreover, our fast variant, using hand-crafted features, operates at 60 Hz on a single CPU, while obtaining 64.8% AUC on OTB-2015.
Conference Paper
The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object’s appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
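The matching step at the heart of a fully-convolutional Siamese tracker can be illustrated by sliding a template over a larger search region and scoring each offset by cross-correlation. The sketch below is a simplification under stated assumptions: it correlates raw values, whereas the tracker described above correlates embeddings produced by a learned network, and the function name `cross_correlate` is hypothetical.

```python
def cross_correlate(search, template):
    """Dense cross-correlation score map (valid positions only)."""
    th, tw = len(template), len(template[0])
    out_h = len(search) - th + 1
    out_w = len(search[0]) - tw + 1
    return [
        [
            sum(template[i][j] * search[r + i][c + j]
                for i in range(th) for j in range(tw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# A 2x2 template placed at row 2, column 1 of an otherwise empty 5x5 search region.
template = [[1.0, 2.0], [3.0, 4.0]]
search = [[0.0] * 5 for _ in range(5)]
for i in range(2):
    for j in range(2):
        search[2 + i][1 + j] = template[i][j]

scores = cross_correlate(search, template)
best = max(
    ((r, c) for r in range(len(scores)) for c in range(len(scores[0]))),
    key=lambda p: scores[p[0]][p[1]],
)
```

The location of the maximum in the score map gives the estimated target offset. On real images raw correlation is biased towards bright regions, which is one reason learned embeddings are used in place of pixel values.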
Conference Paper
The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers being published at major computer vision conferences and journals in the recent years. The number of tested state-of-the-art trackers makes the VOT 2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. The VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit as well as the results are publicly available at the challenge website (http://votchallenge.net).
Article
Robustness and efficiency are the two main goals of existing trackers. Most robust trackers are implemented with combined features or models, at a high computational cost. To achieve robust and efficient tracking, we propose a multi-view correlation tracker. On one hand, the robustness of the tracker is enhanced by the multi-view model, which fuses several features and selects the more discriminative ones for tracking. On the other hand, the correlation filter framework provides fast training and efficient target locating. The multiple features are fused at the model level of the correlation filter, which is both effective and efficient. In addition, we introduce a simple but effective scale-variation detection mechanism, which strengthens the stability of tracking under scale variation. We evaluate our tracker on the online tracking benchmark (OTB) and two visual object tracking benchmarks (VOT2014, VOT2015). These three datasets contain more than 100 video sequences in total. On all three datasets, the proposed approach achieves promising performance.
Conference Paper
In this paper, we propose a new aerial video dataset and benchmark for low altitude UAV target tracking, as well as, a photo-realistic UAV simulator that can be coupled with tracking methods. Our benchmark provides the first evaluation of many state-of-the-art and popular trackers on 123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective. Among the compared trackers, we determine which ones are the most suitable for UAV tracking both in terms of tracking accuracy and run-time. The simulator can be used to evaluate tracking algorithms in real-time scenarios before they are deployed on a UAV “in the field”, as well as, generate synthetic but photo-realistic tracking datasets with automatic ground truth annotations to easily extend existing real-world datasets. Both the benchmark and simulator are made publicly available to the vision community on our website to further research in the area of object tracking from UAVs (https://ivul.kaust.edu.sa/Pages/pub-benchmark-simulator-uav.aspx).
Article
Robust and accurate visual tracking is one of the most challenging computer vision problems. Due to the inherent lack of training data, a robust approach for constructing a target appearance model is crucial. Recently, discriminatively learned correlation filters (DCF) have been successfully applied to address this problem for tracking. These methods utilize a periodic assumption of the training samples to efficiently learn a classifier on all patches in the target neighborhood. However, the periodic assumption also introduces unwanted boundary effects, which severely degrade the quality of the tracking model. We propose Spatially Regularized Discriminative Correlation Filters (SRDCF) for tracking. A spatial regularization component is introduced in the learning to penalize correlation filter coefficients depending on their spatial location. Our SRDCF formulation allows the correlation filters to be learned on a significantly larger set of negative training samples, without corrupting the positive samples. We further propose an optimization strategy, based on the iterative Gauss-Seidel method, for efficient online learning of our SRDCF. Experiments are performed on four benchmark datasets: OTB-2013, ALOV++, OTB-2015, and VOT2014. Our approach achieves state-of-the-art results on all four datasets. On OTB-2013 and OTB-2015, we obtain an absolute gain of 8.0% and 8.2% respectively, in mean overlap precision, compared to the best existing trackers.
Article
Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual object tracking. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. However, the underlying DCF formulation is restricted to single-resolution feature maps, significantly limiting its potential. In this paper, we go beyond the conventional DCF framework and introduce a novel formulation for training continuous convolution filters. We employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. Our proposed formulation enables efficient integration of multi-resolution deep feature maps, leading to superior results on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color (+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate). Additionally, our approach is capable of sub-pixel localization, crucial for the task of accurate feature point tracking. We also demonstrate the effectiveness of our learning formulation in extensive feature point tracking experiments. Code and supplementary material are available at http://www.cvl.isy.liu.se/research/objrec/visualtracking/conttrack/index.html.
Article
Discriminative correlation filters (DCFs) have been widely used in the tracking community recently. DCFs-based trackers utilize samples generated by circularly shifting from an image patch to train a ridge regression model, and estimate target location using a response map generated by the correlation filters. However, the generated samples produce some negative effects and the response map is vulnerable to noise interference, which degrades tracking performance. In this paper, to solve the aforementioned drawbacks, we propose a target-focusing convolutional regression (CR) model for visual object tracking tasks (called TFCR). This model uses a target-focusing loss function to alleviate the influence of background noise on the response map of the current tracking image frame, which effectively improves the tracking accuracy. In particular, it can effectively balance the disequilibrium of positive and negative samples by reducing some effects of the negative samples that act on the object appearance model. Extensive experimental results illustrate that our TFCR tracker achieves competitive performance compared with state-of-the-art trackers.
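The standard DCF training step mentioned in the abstract above, ridge regression over all circular shifts of a patch, has a closed-form solution in the Fourier domain. The following one-dimensional sketch illustrates that baseline only, not the target-focusing loss of TFCR. It assumes a real-valued signal, a Gaussian label peaked at shift 0, and a regularisation weight `lam`; the naive DFT helpers and the names `train_filter` and `respond` are hypothetical and written for clarity rather than speed.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def train_filter(x, y, lam=1e-2):
    """Ridge regression over all circular shifts of x, solved per frequency.

    Returns H = Y * conj(X) / (|X|^2 + lam), the filter in the Fourier domain.
    """
    x_hat, y_hat = dft(x), dft(y)
    return [yk * xk.conjugate() / (abs(xk) ** 2 + lam) for xk, yk in zip(x_hat, y_hat)]


def respond(h_hat, z):
    """Response of the learned filter to a new patch z at every circular shift."""
    z_hat = dft(z)
    return [v.real for v in idft([hk * zk for hk, zk in zip(h_hat, z_hat)])]


n = 8
x = [1.0, 3.0, -2.0, 0.5, 4.0, -1.0, 2.0, -3.0]                # training patch
y = [math.exp(-0.5 * min(t, n - t) ** 2) for t in range(n)]    # label peaked at shift 0
h_hat = train_filter(x, y)

shift = 3
z = [x[(t - shift) % n] for t in range(n)]                     # target moved by 3 samples
response = respond(h_hat, z)
peak = max(range(n), key=lambda s: response[s])
```

The response peak lands at the displacement of the target, which is how the tracker reads off the new location. The same per-frequency division is what makes DCF training cheap, and the implicit circular shifts are the source of the boundary effects that several of the abstracts in this list set out to reduce.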
Article
Deep convolutional neural networks (CNNs) have attracted considerable interest in low-level computer vision. Research is usually devoted to improving performance via very deep CNNs. However, as the depth increases, the influence of the shallow layers on the deep layers is weakened. Inspired by this fact, we propose an attention-guided denoising convolutional neural network (ADNet), mainly including a sparse block (SB), a feature enhancement block (FEB), an attention block (AB) and a reconstruction block (RB) for image denoising. Specifically, the SB makes a tradeoff between performance and efficiency by using dilated and common convolutions to remove the noise. The FEB integrates global and local feature information via a long path to enhance the expressive ability of the denoising model. The AB is used to finely extract the noise information hidden in the complex background, which is very effective for complex noisy images, especially real noisy images and blind denoising. Also, the FEB is integrated with the AB to improve the efficiency and reduce the complexity of training a denoising model. Finally, the RB aims to construct the clean image through the obtained noise mapping and the given noisy image. Additionally, comprehensive experiments show that the proposed ADNet performs very well in three tasks (i.e. synthetic and real noisy images, and blind denoising) in terms of both quantitative and qualitative evaluations. The code of ADNet is accessible at http://www.yongxu.org/lunwen.html.
Article
Deep convolutional neural networks (CNNs) have attracted great attention in the field of image denoising. However, there are two drawbacks: (1) it is very difficult to train a deeper CNN for denoising tasks, and (2) most deeper CNNs suffer from performance saturation. In this paper, we report the design of a novel network called a batch-renormalization denoising network (BRDNet). Specifically, we combine two networks to increase the width of the network, and thus obtain more features. Because batch renormalization is fused into BRDNet, we can address the internal covariate shift and small mini-batch problems. Residual learning is also adopted in a holistic way to facilitate the network training. Dilated convolutions are exploited to extract more information for denoising tasks. Extensive experimental results show that BRDNet outperforms state-of-the-art image-denoising methods. The code of BRDNet is accessible at http://www.yongxu.org/lunwen.html.
Article
Correlation filters (CF) have demonstrated good performance in visual tracking. However, the base training sample region is larger than the object region and therefore includes an interference region (IR). IRs in the training samples generated by cyclic shifts of the base training sample severely degrade the quality of the tracking model. In this paper, a region-filtering correlation tracking (RFCT) algorithm is proposed to address this problem. In this algorithm, we filter training samples by introducing a spatial map into the standard CF formulation. Compared with existing correlation filter trackers, the proposed tracker has the following advantages. (1) Using a spatial map, the correlation filter can be learned on a larger search region without interference from the IR. (2) Processing training samples with a spatial map is a more general way to control the background and target information in the training samples. In addition, because the values of the spatial map are not restricted, better spatial maps can be explored. Quantitative evaluations are performed on four benchmark datasets: OTB-2013, OTB-2015, VOT2015, and VOT2016. Experimental results demonstrate that the proposed RFCT algorithm performs favorably against several state-of-the-art methods.
Article
The performance of a tracker depends directly on the appearance features of the target object. Therefore, a robust method for constructing appearance features is crucial for avoiding tracking failure. Tracking methods based on Convolutional Neural Networks (CNNs) have exhibited excellent performance in recent years. However, the features from each original convolutional layer are not robust to changes in the size of the target object. Once the size of the target object changes significantly, the tracker drifts away from it. In this paper, we present a novel tracker based on multi-scale features, spatiotemporal features and a deep residual network to accurately estimate the size of the target object. Our tracker can successfully locate the target object in consecutive video frames. To solve the multi-scale change issue in visual object tracking, we sample each input image with templates of 67 different sizes and resize the samples to a fixed size. These samples are then used to train, offline, the deep residual network model with multi-scale features that we have built. After that, spatial and temporal features are fused into this model, yielding a deep multi-scale spatiotemporal feature model, which we name the MSST-ResNet feature model. Finally, the MSST-ResNet feature model is transferred to the tracking task and combined with three different Kernelized Correlation Filters (KCFs) to accurately locate the target object in consecutive video frames. Unlike previous trackers, we directly learn the various changes in target appearance by building the MSST-ResNet feature model. The experimental results demonstrate that the proposed tracking method outperforms state-of-the-art tracking methods.
Article
Psychological and cognitive findings indicate that human visual perception is attentive and selective, and may process spatial and appearance selective attention in parallel. Reflecting some aspects of these attention mechanisms, this paper presents a novel correlation filter (CF) based tracking approach that processes a local domain and a semi-local background domain, respectively. In the local domain, inspired by the Gestalt principle of figure-ground segregation, we leverage an efficient Boolean map representation, which characterizes an image by a set of Boolean maps obtained by randomly thresholding its color channels, yielding a location response map as a weighted sum of all Boolean maps. The Boolean maps capture the topological structures of the target and its scene at different granularities, thereby effectively improving the tracking of non-rectangular objects. In the semi-local domain, we introduce a novel distractor-resilient metric regularization into the CF, which acts as a force that pushes distractors into negative space. Consequently, the unwanted boundary effects of the CF can be effectively alleviated. Finally, the models associated with the local and the semi-local domains are seamlessly integrated into a Bayesian framework, and the tracked location is determined by maximizing its likelihood function. Extensive evaluations on the OTB50, OTB100, VOT2016 and VOT2017 tracking benchmarks demonstrate that the proposed method achieves favorable performance against a variety of state-of-the-art trackers, at a speed of 45 fps on a single CPU.
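The Boolean map representation described above can be sketched in a few lines of plain Python: each map picks a color channel and a random threshold, and the response map is a weighted sum of the resulting binary maps. The uniform weights, the number of maps, and the toy 2x2 image are assumptions for illustration. The abstract does not say how the weights are obtained, so this is not the authors' procedure.

```python
import random


def boolean_maps(image, num_maps, rng):
    """image: rows of (r, g, b) tuples with values in [0, 255]."""
    maps = []
    for _ in range(num_maps):
        channel = rng.randrange(3)       # pick one color channel at random
        threshold = rng.uniform(0, 255)  # pick a random threshold
        maps.append([[1 if pixel[channel] > threshold else 0 for pixel in row]
                     for row in image])
    return maps


def response_map(maps, weights):
    """Weighted sum of all Boolean maps, pixel by pixel."""
    height, width = len(maps[0]), len(maps[0][0])
    return [[sum(w * m[y][x] for w, m in zip(weights, maps))
             for x in range(width)]
            for y in range(height)]


# Toy image: one bright pixel (top-left) on a black background.
image = [[(255, 255, 255), (0, 0, 0)],
         [(0, 0, 0), (0, 0, 0)]]
rng = random.Random(0)
maps = boolean_maps(image, num_maps=8, rng=rng)
weights = [1.0 / len(maps)] * len(maps)  # uniform weights, an assumption
response = response_map(maps, weights)
```

In this toy case the black pixels never exceed a threshold, so their response is zero, while the bright pixel accumulates weight from the maps in which it is above the threshold.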
Article
Online object tracking is a fundamental problem in computer vision and is crucial to applications in numerous fields such as guided missiles, video surveillance and unmanned aerial vehicles. Despite many studies on visual tracking, many challenges remain during the tracking process, including illumination variation, rotation, scale change, deformation, occlusion and camera motion. To give a clear understanding of visual tracking, this paper summarizes visual tracking algorithms. Firstly, the meaning of visual tracking and the related work are briefly introduced. Secondly, the typical algorithms are classified, summarized and analyzed from two aspects: traditional algorithms and deep learning algorithms. Finally, open problems and the future prospects of visual tracking are discussed.
Article
Due to factors such as fast motion, cluttered backgrounds, arbitrary object appearance variation and shape deformation, an effective target representation plays a key role in robust visual tracking. Existing methods often employ bounding boxes for target representation, which are easily polluted by cluttered backgrounds; this may cause drifting when the target undergoes large-scale non-rigid or articulated motions. To address this issue, motivated by the spatio-temporal nonlocality of target appearance reoccurrence in a video, we explore nonlocal information to accurately represent and segment the target, yielding an object likelihood map that regularizes a correlation filter (CF) for visual tracking. Specifically, given a set of tracked target bounding boxes, we first generate a set of superpixels to represent the foreground and background, and then update the appearance of each superpixel with its long-term spatio-temporally nonlocal counterparts. Then, with the updated appearances, we formulate a spatio-temporal graphical model comprising the superpixel label consistency potentials. Afterwards, we generate the segmentation by optimizing the graphical model, iteratively updating the appearance model and estimating the labels. Finally, with the segmentation mask, we obtain an object likelihood map that is employed to adaptively regularize the CF learning by suppressing background clutter while making full use of the long-term stable target appearance information. Extensive evaluations on the OTB50, SegTrack, and Youtube-Objects datasets demonstrate the effectiveness of the proposed method, which performs favorably against some state-of-the-art methods.
Article
Thermal infrared (TIR) pedestrian tracking is one of the most important components in numerous applications of computer vision, and it has a major advantage: it can track pedestrians in total darkness. Evaluating TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field. However, no such benchmark dataset exists. In this paper, we develop a TIR pedestrian tracking dataset for evaluating TIR pedestrian trackers. The dataset includes 60 thermal sequences with manual annotations. Each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carry out large-scale evaluation experiments on our benchmark dataset using nine publicly available trackers. The experimental results help us to understand the strengths and weaknesses of these trackers. Moreover, to gain further insight into TIR pedestrian trackers, we divide a tracker into three components: feature extractor, motion model, and observation model. Then, we conduct three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
Article
This paper improves state-of-the-art on-line trackers that use deep learning. Such trackers train a deep network to pick a specified object out from the background in an initial frame (initialization) and then keep training the model as tracking proceeds (updates). Our core contribution is a meta-learning-based method to adjust deep networks for tracking using off-line training. First, we learn initial parameters and per-parameter coefficients for fast online adaptation. Second, we use the training signal from future frames for robustness to target appearance variations and environment changes. The resulting networks train significantly faster during initialization, while improving robustness and accuracy. We demonstrate this approach on top of the current highest-accuracy tracking approach, the tracking-by-detection based MDNet, and on its close competitor, the correlation-based CREST. Experimental results on two standard benchmarks, OTB and VOT2016, show improvements in speed, accuracy, and robustness for both trackers.
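One plausible reading of "initial parameters and per-parameter coefficients" is an adaptation step in which every parameter has its own learned step size. The sketch below shows only that online step on a toy quadratic loss. The names `theta0`, `alpha`, `grad_fn` and the loss itself are hypothetical, and the off-line meta-training that would learn `theta0` and `alpha` is not shown.

```python
def adapt(theta0, alpha, grad_fn, steps=1):
    """Gradient steps from learned initial parameters with per-parameter step sizes."""
    theta = list(theta0)
    for _ in range(steps):
        grad = grad_fn(theta)
        theta = [t - a * g for t, a, g in zip(theta, alpha, grad)]
    return theta


# Toy loss: 0.5 * sum((theta - target)^2), so the gradient is theta - target.
target = [2.0, -1.0]


def grad_fn(theta):
    return [t - y for t, y in zip(theta, target)]


theta0 = [0.0, 0.0]  # stands in for meta-learned initial parameters
alpha = [1.0, 0.5]   # stands in for meta-learned per-parameter coefficients
adapted = adapt(theta0, alpha, grad_fn)  # -> [2.0, -0.5]
```

Here the first parameter reaches its target in a single step because its coefficient suits the loss, while the second moves only halfway. This illustrates why well-chosen coefficients can shorten initialization.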
Article
Most thermal infrared (TIR) tracking methods are discriminative, treating the tracking problem as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation). The classification task focuses on the between-class differences of arbitrary objects, while the tracking task mainly deals with the within-class differences of the same object. In this paper, we cast the TIR tracking problem as a similarity verification task, which is well coupled to the objective of the tracking task. We propose a TIR tracker based on a hierarchical Siamese convolutional neural network (CNN), named HSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN that coalesces multiple hierarchical convolutional layers. Then, we train this network end to end on a large visible video detection dataset to learn the similarity between paired objects before transferring the network to the TIR domain. Next, this pre-trained Siamese network is used to evaluate the similarity between the target template and the target candidates. Finally, we take the most similar candidate as the tracked target. Extensive experimental results on the VOT-TIR 2015 and VOT-TIR 2016 benchmarks show that our proposed method achieves favorable performance against state-of-the-art methods.
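The similarity-verification step (score each candidate against the template with a shared embedding, then pick the best) can be sketched as follows. The `embed` function is a hypothetical placeholder for the shared Siamese CNN branch, and cosine similarity is an assumed scoring function. Neither is taken from HSNet.

```python
import math


def embed(patch):
    """Placeholder for the shared CNN branch: L2-normalize the raw values."""
    norm = math.sqrt(sum(v * v for v in patch)) or 1.0
    return [v / norm for v in patch]


def similarity(a, b):
    """Cosine similarity between the embeddings of two patches."""
    return sum(x * y for x, y in zip(embed(a), embed(b)))


def locate(template, candidates):
    """Return the index of the most similar candidate and all scores."""
    scores = [similarity(template, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


template = [1.0, 2.0, 3.0]
candidates = [[3.0, 2.0, 1.0],   # same values, different layout
              [2.0, 4.0, 6.0],   # same pattern, different intensity
              [0.0, 1.0, 0.0]]
best, scores = locate(template, candidates)
```

Because both inputs pass through the same `embed` function, the score depends only on how alike the two patches are, which is the coupling to location estimation that the abstract argues for.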