Article

Visual object tracking with adaptive structural convolutional network


Abstract

Convolutional Neural Networks (CNNs) have been demonstrated to achieve state-of-the-art performance on visual object tracking tasks. However, existing CNN-based trackers usually use holistic target samples to train their networks. When the target undergoes complicated situations (e.g., occlusion, background clutter, and deformation), the tracking performance degrades badly. In this paper, we propose an adaptive structural convolutional filter model to enhance the robustness of deep regression trackers (named ASCT). Specifically, we first design a mask set to generate local filters that capture local structures of the target. Meanwhile, we adopt an adaptive weighting fusion strategy for these local filters to adapt to changes in the target appearance, which effectively enhances the robustness of the tracker. Besides, we develop an end-to-end trainable network comprising feature extraction, decision making, and model updating modules for effective training. Extensive experimental results on large benchmark datasets demonstrate that the proposed ASCT tracker performs favorably against state-of-the-art trackers.
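The mask-and-fuse idea in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the five-part mask layout and the peak-based confidence weights are assumptions for the sketch.

```python
import numpy as np

def make_part_masks(h, w):
    """Binary masks selecting overlapping local parts of an h x w target patch.
    A hypothetical five-mask layout: whole, top, bottom, left, right."""
    masks = [np.ones((h, w))]
    top = np.zeros((h, w)); top[: h // 2, :] = 1
    bottom = np.zeros((h, w)); bottom[h // 2 :, :] = 1
    left = np.zeros((h, w)); left[:, : w // 2] = 1
    right = np.zeros((h, w)); right[:, w // 2 :] = 1
    masks += [top, bottom, left, right]
    return masks

def fuse_responses(responses):
    """Adaptively weight each local filter's response map by its peak score
    and fuse them, so occluded parts (low peaks) are down-weighted."""
    peaks = np.array([r.max() for r in responses])
    weights = peaks / peaks.sum()
    return sum(w * r for w, r in zip(weights, responses))
```

In use, each mask would be applied to the target features to train one local filter, and `fuse_responses` would combine the per-part correlation outputs into a single localization map.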


... and its goal is to estimate the position and size of the target in subsequent frames given its state in the initial frame. Despite the great advances in recent years, which can be summarized in three aspects, namely 1) regressors [1], [2]; 2) classifiers [3], [4]; and 3) deep convolutional networks [5]-[8], tracking remains challenging due to several issues, such as illumination variations, occlusions, scale variations, background clutters, etc. ...
... , K, is introduced to simplify the model, where F is an L × L constant matrix that is employed to map any L-dimensional vectorized signal to the Fourier domain. Therefore, we re-express (8) in the frequency domain as follows: ...
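The constant mapping matrix F described in this excerpt is, by the common convention in correlation filter papers, the L × L DFT matrix; the sketch below assumes that convention.

```python
import numpy as np

def dft_matrix(L):
    """The L x L DFT matrix: multiplying by it maps an L-dimensional
    vectorized signal to the Fourier domain (the role of F in the excerpt)."""
    return np.fft.fft(np.eye(L))
```

For any length-L signal `x`, `dft_matrix(L) @ x` agrees with `np.fft.fft(x)`, which is why the matrix can be treated as a constant in the derivation.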
... SiamATLwithCC_fix is the same method as SiamATLwithCC except that it uses (3) to update the template. SiamATLwithITDCF_fix is the same method as SiamATLwithITDCF except that it ignores the last term in (8) and updates the filter as in BACF [37]. SiamATLnonATL represents the method that is not equipped with the ATL framework and utilizes both ITDCF and CC as decision-making layers. ...
Article
Full-text available
Visual object tracking with semantic deep features has recently attracted much attention in computer vision. In particular, Siamese trackers, which aim to learn a decision-making-based similarity evaluation, are widely utilized in the tracking community. However, online updating in the Siamese fashion remains a tricky issue due to the tradeoff between model adaptation and degradation. To address this issue, in this article, we propose a novel attentional transfer learning-based Siamese network (SiamATL), which fully exploits previous knowledge to inspire the current tracker learning in the decision-making module. First, we explicitly model the template and surroundings by using an attentional online update strategy to avoid template pollution. Then, we introduce an instance-transfer discriminative correlation filter (ITDCF) to enhance the distinguishing ability of the tracker. Finally, we suggest a mutual compensation mechanism that integrates cross-correlation matching and ITDCF detection into the decision-making subnetwork to achieve online tracking. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art tracking algorithms on multiple large-scale tracking datasets.
... Compared with the thermal infrared target tracking task, the general target tracking task has been studied extensively. A large number of excellent tracking methods have emerged in the general target tracking task, such as the discriminative correlation filters based tracking methods [10,11,12,13,14,15,16] and the deep learning based tracking methods [17,18,19,20,21,22]. The discriminative correlation filters (DCFs) based trackers attempt to train filters to learn the correlation between features and Gaussian-shaped response maps, which can improve the computational efficiency [11,23]. ...
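The excerpt's description of DCF training — filters learned to map features to Gaussian-shaped response maps, computed efficiently in the Fourier domain — can be illustrated with a minimal single-channel, MOSSE/ridge-regression-style sketch. This is a simplification for illustration, not any cited tracker's exact formulation.

```python
import numpy as np

def gaussian_label(h, w, sigma=2.0):
    """Gaussian-shaped response map peaked at the patch centre."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))

def train_filter(x, y, lam=1e-2):
    """Closed-form single-channel correlation filter in the Fourier domain
    (ridge regression): H = (Y . conj(X)) / (X . conj(X) + lam)."""
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return (Y * np.conj(X)) / (X * np.conj(X) + lam)

def respond(H, z):
    """Correlate a search patch z with the learned filter."""
    return np.real(np.fft.ifft2(H * np.fft.fft2(z)))
```

Training on a patch and correlating that same patch reproduces (approximately) the Gaussian label, with its peak marking the target centre; at test time the peak of `respond(H, z)` gives the new target location.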
... There are a lot of reviews that describe the target tracking task from different aspects in detail [30,31,32,33]. In this section, we mainly discuss the works most relevant to our tracker, which include tracking methods based on the correlation filter framework [11,12,13,15] and tracking methods based on the deep learning framework [19,20,21,22,34]. ...
... We determine the trustworthiness of each search patch based on the PSR (peak-to-sidelobe ratio) [21,53,54], and then determine the weight η P of the corresponding patch in the target searching area. The PSR can be calculated as: ...
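The snippet cuts off before the formula. The standard PSR definition used in the tracking literature — peak minus sidelobe mean, divided by sidelobe standard deviation, with a small window around the peak excluded from the sidelobe — can be sketched as:

```python
import numpy as np

def psr(response, exclude=5):
    """Peak-to-sidelobe ratio of a correlation response map.
    The sidelobe is the map with an exclude x exclude window around the
    peak removed; a higher PSR indicates a sharper, more reliable peak."""
    r = np.asarray(response, dtype=float)
    py, px = np.unravel_index(np.argmax(r), r.shape)
    mask = np.ones_like(r, dtype=bool)
    half = exclude // 2
    mask[max(0, py - half): py + half + 1,
         max(0, px - half): px + half + 1] = False
    side = r[mask]
    return (r.max() - side.mean()) / (side.std() + 1e-12)
```

A clean single peak over a quiet sidelobe gives a large PSR; distractor peaks or occlusion flatten the map and drive the PSR down, which is what makes it a usable trustworthiness cue.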
Article
Thermal InfraRed (TIR) target trackers are easily interfered with by similar objects and are susceptible to target occlusion. To solve these problems, we propose a structural target-aware model (STAMT) for the thermal infrared target tracking task. Specifically, the proposed STAMT tracker can learn a target-aware model, which pays more attention to the target area to accurately identify the target among similar objects. In addition, considering that the target may be partially occluded during tracking, a structural weight model is proposed to locate the target through the unoccluded, reliable target parts. Ablation studies show the effectiveness of each component in the proposed tracker. Without bells and whistles, the experimental results demonstrate that our STAMT tracker performs favorably against state-of-the-art trackers on the PTB-TIR and LSOTB-TIR datasets.
... A multi-domain network (MDNet) [9] is trained by adding a target classification layer to a convolutional neural network, enabling it to achieve significant improvement in tracking performance. Yuan et al. [10] propose an adaptive structural convolutional filter model to enhance the robustness of deep regression trackers. ...
Article
Full-text available
With the introduction of deep learning technology into the field of visual tracking, the accuracy and robustness of visual tracking have greatly improved. Recently, trackers based on Siamese region proposal networks have attracted a great deal of attention because of their favorable performance, especially when faced with challenges of heavy target deformation and out-of-plane rotation. However, strategies of local-area search and using a fixed template degrade the performance of such networks in the long-term tracking. In this paper, we focus on a wide-area search tracking approach with adaptive template updating for a Siamese region proposal network. By embedding a correlation filter module into the Siamese region proposal networks, a structure of moving anchors distribution is designed to make the anchor regression centered around the target. Through adaptive reliability evaluation and online template update, the discriminative performance of the model improves greatly. Furthermore, in order to address the problem of being misled by false positives, a multi-trajectory tracking mechanism is joined to consistently boost the classification ability. Experiments on OTB100 and VOT short-term benchmarks and UAV123 long-term benchmark show that our tracker significantly outperforms the original algorithm and achieves comparable performance compared with other state-of-the-art trackers.
... To enhance the quality of the response map in correlation filter-based algorithms, a metric learning strategy is given in [20]. Convolutional neural network-based tracking strategies have also been proposed by numerous researchers in recent years; for recent examples, see [21], [22]. These neural network-based algorithms require a lot of training data and large computational time. ...
Article
Full-text available
In recent years, correlation tracking has been considered fast and effective by virtue of the circulant structure of the sampling data in the filter learning phase and the Fourier-domain calculation of correlation. Under occlusion, motion blur, and out-of-view movement of the target, most correlation filter-based trackers start to learn from erroneous samples, and the tracker starts drifting. Currently, adaptive correlation filter-based tracking algorithms are being combined with redetection modules. This hybridization helps redetect the target in long-term tracking. The redetection modules are mostly classifiers, which identify the true object after a tracking failure occurs. These methods perform favorably during short-term or partial occlusion. To further increase tracking efficiency, specifically during long-term occlusion, while maintaining real-time processing speed, this study proposes a tracking failure avoidance method. We first propose a strategy to detect occlusion using two cues from the response map, i.e., the peak correlation score and the peak-to-sidelobe ratio. After successful detection of tracking failure, a second strategy is proposed to keep the model from becoming more erroneous. A Kalman filter-based predictor continuously predicts the location during occlusion and passes its result to a Support Vector Machine (SVM). When the target reappears in the frame, the SVM-based classifier identifies the correct object using the location predicted by the Kalman filter. This decreases the chance of tracking failure, as the Kalman filter continuously updates itself during occlusion and predicts the next location from its own previous prediction. Once the true object is detected by the classifier after the occlusion clears, the result is forwarded to the correlation filter tracker, which resumes tracking and updating its parameters.
Together, these two proposed schemes show significant improvement in tracking efficiency. Furthermore, this collaboration in the redetection phase shows significant improvement in tracking accuracy on videos containing six challenging aspects of visual object tracking, as mentioned in the literature.
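The prediction stage described above can be sketched with a minimal constant-velocity Kalman filter. This is an illustrative model, not the paper's exact parameterization, and the SVM redetection stage is omitted: during occlusion only `predict()` is called, so the motion model keeps extrapolating the last reliable trajectory.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal constant-velocity Kalman filter for a 2-D target position.
    State vector: [x, y, vx, vy], with time step dt = 1."""

    def __init__(self, x, y, q=1e-2, r=1.0):
        self.s = np.array([x, y, 0.0, 0.0])       # state estimate
        self.P = np.eye(4)                         # state covariance
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0          # x += vx, y += vy
        self.H = np.eye(2, 4)                      # we observe position only
        self.Q, self.R = q * np.eye(4), r * np.eye(2)

    def predict(self):
        self.s = self.F @ self.s
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.s[:2]

    def update(self, zx, zy):
        z = np.array([zx, zy])
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.s = self.s + K @ (z - self.H @ self.s)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.s[:2]
```

While the tracker reports reliable positions, each frame runs predict-then-update; once occlusion is detected, update is skipped and the predicted positions seed the classifier's search when the target reappears.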
... Inspired by convolutional neural network (CNN) technology [35], [36], several studies have sought to improve supervised learning using deep network structures [37]- [40]. Typical approaches applying a structured autoencoder or attention-guided CNN network are known as block-wise CNN methods, which include the DCNN [41], GoogleNet [42], CrackNet [43], and CrackNet II [44]. ...
Article
Full-text available
Tiny cracks in steel beams have poor continuity and low contrast in images, posing a huge challenge to crack detection using image-based approaches. When complex backgrounds exist, existing deep learning methods are usually unable to perform effective feature transfer and fusion for crack feature mapping, and they cannot accurately distinguish crack features from similar backgrounds. In this article, we propose a fusion segmentation algorithm, using a fully convolutional network (FCN) and structured forests with wavelet transform (SFW), to detect tiny cracks in steel beams. First, five neural networks based on the FCN framework are constructed to extend the global characteristics of tiny cracks. Second, a fine edge detection approach using multi-scale structured forests and wavelet maximum-modulus edge detection is proposed to refine the characteristics of tiny cracks. Here, a competitive training strategy is used to address the SFW parameter optimization problem. Finally, we fuse the multiple probability maps acquired from the optimal FCN model and the SFW classifier into a merged map, which segments tiny cracks more robustly than the comparison approaches. The experimental results show that, compared with state-of-the-art algorithms and other segmentation approaches, the proposed algorithm achieves better segmentation in terms of quantitative metrics.
... However, this method has a poor effect when the target and background information are similar. Yuan et al. proposed an adaptive structural convolutional filter model (ASCT) to enhance the robustness of deep regression trackers [4]. ASCT can effectively improve the robustness of the tracker through adaptive weighted fusion of local filters. ...
Article
Full-text available
In the field of correlation filter object tracking, the traditional template-update method easily causes template drift, so it performs poorly in complex scenes. To enhance the robustness of the template, a novel incremental multi-template update strategy is proposed in this paper. We find that reliability varies among all historical filters and that highly reliable filters are key to achieving accurate tracking. The incremental multi-template update strategy combines the local maximum-reliability filter template with the historical filter template incrementally, which is obviously different from the traditional update method. We apply this strategy to two trackers with superior performance. The experimental results of three test benchmarks, including the VOT2016, OTB100 and UAV123 datasets, show that the performance of our trackers is superior to that of the state-of-the-art trackers.
... The second reason is that most deep convolutional network-based trackers require a network with multiple layers to extract features and fine-tune their pre-trained networks in the online tracking phase, which results in high computational complexity. Some deep CNN-based trackers are unable to achieve real-time tracking speed because of the high dimensionality of the feature extraction network [7], [9], [13]. For example, the MDNet [9] tracker needs to pre-train a deep CNN architecture for the similarity-matching task. ...
Article
Full-text available
The training of a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named: self-SDCT). Motivated by the forward-backward tracking consistency of a robust tracker, we propose a multi-cycle consistency loss as self-supervised information for learning feature extraction network from adjacent video frames. At the training stage, we generate pseudo-labels of consecutive video frames by forward-backward prediction under a Siamese correlation tracking framework and utilize the proposed multi-cycle consistency loss to learn a feature extraction network. Furthermore, we propose a similarity dropout strategy to enable some low-quality training sample pairs to be dropped and also adopt a cycle trajectory consistency loss in each sample pair to improve the training loss function. At the tracking stage, we employ the pre-trained feature extraction network to extract features and utilize a Siamese correlation tracking framework to locate the target using forward tracking alone. Extensive experimental results indicate that the proposed self-supervised deep correlation tracker (self-SDCT) achieves competitive tracking performance contrasted to state-of-the-art supervised and unsupervised tracking methods on standard evaluation benchmarks.
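The forward-backward consistency idea behind self-SDCT can be illustrated on tracker outputs. The sketch below is a hypothetical helper, with predicted target positions standing in for the tracker's outputs; in the actual method such a loss is back-propagated into the feature extraction network rather than computed on positions alone.

```python
import numpy as np

def cycle_consistency_loss(forward_track, backward_track):
    """Forward-backward consistency as self-supervision: track forward
    through a clip, then backward from the final prediction. A robust
    tracker should retrace its path, so the mean distance between
    corresponding forward and backward positions acts as a training loss
    that needs no manual labels."""
    f = np.asarray(forward_track, dtype=float)
    b = np.asarray(backward_track, dtype=float)[::-1]  # align time axes
    return float(np.mean(np.linalg.norm(f - b, axis=1)))
```

A perfectly consistent forward/backward pair scores zero; any drift on the backward pass contributes a positive penalty, which is what the pseudo-labeling stage minimizes.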
... Therefore, to enhance the robustness of deep regression trackers in complicated situations (e.g., occlusion, background clutter, and deformation), Yuan et al. [52] further proposed an adaptive structural convolutional filter model. Considering that the appearance model is easily disturbed by noise in tracking algorithms with a single feature, Yuan et al. [53] proposed fusing multiple features within a correlation filter framework for object tracking. ...
Article
Full-text available
In recent years, trackers based on correlation filters have attracted more and more attention due to their impressive tracking accuracy and real-time performance. However, in real scenarios, the tracking results are often interfered with by occlusion, illumination variation, appearance variation, and background clutter. To find a tracker with better tracking performance, this paper proposes a multi-information fusion correlation filter tracker, which uses channel and spatial reliabilities and time-regularization information on samples for filter training; it not only extends the target search area but also has a stronger ability to track targets with significant appearance variations. Results from extensive experiments conducted on the OTB100, VOT2016, TC128, and UAV123 datasets show that our tracker, with only histogram of oriented gradients (HOG) and color name (CN) features, performs favorably against state-of-the-art trackers in terms of tracking precision, success rate, accuracy, and A-R rank.
... In recent years, deep learning has achieved good results in the area of visual tracking. The convolutional neural network (CNN) [27,28,29,30,31,32,33,34,35] is the most popular deep learning model for visual tracking tasks because of its powerful feature extraction ability. Chen et al. [36] proposed learning a linear regression model via a single convolutional layer for visual object tracking. ...
Article
Full-text available
Correlation filter-based trackers (CFTs) have recently shown remarkable performance in the field of visual object tracking. The advantage of these trackers originates from their ability to convert time-domain calculations into frequency-domain calculations. However, a significant problem of these CFTs is that the model is insufficiently robust when the tracking scenario is too complicated, so ideal tracking performance cannot be achieved. Recent work has attempted to resolve this problem by reducing boundary effects through effectively modeling the foreground and background of the target (e.g., CFLB, BACF, and CACF). Although these methods have demonstrated reasonable performance, they are often affected by occlusion, deformation, scale variation, and other challenging scenes. In this study, considering the relationship between the current and previous frames of a moving target in a time series, we propose a temporal regularization strategy to improve the BACF tracker (denoted TRBACF), a typical representative of the aforementioned trackers. The TRBACF tracker can efficiently adjust the model to adapt to changes in the tracking scene, thereby enhancing its robustness and accuracy. Moreover, the objective function of our TRBACF tracker can be solved by an improved alternating direction method of multipliers, which speeds up the calculation in the Fourier domain. Extensive experimental results demonstrate that the proposed TRBACF tracker achieves competitive tracking performance compared with state-of-the-art trackers.
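Schematically, adding temporal regularization to a BACF-style objective can be written as follows. This is a sketch under common conventions, not necessarily TRBACF's exact formulation: h_t is the filter at frame t, P is BACF's cropping operator, x_t^d the d-th feature channel, y the desired Gaussian response, and λ, μ regularization weights.

```latex
E(\mathbf{h}_t) = \frac{1}{2}\Big\| \mathbf{y} - \sum_{d=1}^{D} \mathbf{x}_t^{d} \star \big(\mathbf{P}^{\top}\mathbf{h}_t^{d}\big) \Big\|_2^2
 + \frac{\lambda}{2}\sum_{d=1}^{D}\big\|\mathbf{h}_t^{d}\big\|_2^2
 + \frac{\mu}{2}\big\|\mathbf{h}_t - \mathbf{h}_{t-1}\big\|_2^2
```

The last term penalizes deviation from the previous frame's filter, which is what lets the model adapt smoothly to scene changes; as the abstract notes, such objectives are typically solved with ADMM in the Fourier domain.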
... However, that changed in 2012 with AlexNet in that year's ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) [109]. After that, deep network architectures (e.g., VGG [186] and GoogLeNet [193]) were widely applied in the fields of image [219,212] and video [135,242] processing, natural language processing [52], and speech processing [261], especially low-level computer vision [166,201]. ...
Preprint
Full-text available
Deep learning techniques have attracted much attention in image denoising. However, deep learning methods of different types differ enormously in how they deal with noise. Specifically, discriminative learning based on deep learning can address Gaussian noise well, while optimization-model methods based on deep learning are effective at estimating real noise. So far, there has been little related research summarizing the different deep learning techniques for image denoising. In this paper, we make such a comparative study of different deep techniques in image denoising. We first classify (1) deep convolutional neural networks (CNNs) for additive white noisy images, (2) deep CNNs for real noisy images, (3) deep CNNs for blind denoising, and (4) deep CNNs for hybrid noisy images, i.e., the combination of noisy, blurred, and low-resolution images. Then, we analyze the motivations and principles of the different types of deep learning methods. Next, we compare and verify the state-of-the-art methods on public denoising datasets in terms of quantitative and qualitative analysis. Finally, we point out some potential challenges and directions for future research.
... In this section, we compare our tracker with 8 tracking methods, including SiamRPN [3], SiamFC [43], TFCR [40], ASCT [44], ECO_HC [11], ECO_DEEP [11], SAMF [45] and CFnet [46], on UAV123. The results are shown in Table 2. Our tracker with deep features achieves a precision score of 0.760 and an AUC score of 0.532, improving on the baseline by 1.9% and 0.7%, respectively. ...
Article
Full-text available
Correlation filter-based trackers have gained more and more attention because of their great performance and relatively high tracking speeds. However, this kind of tracker may suffer from model drift due to learning limited background information during filter training. This may lead to tracking failures in some complex scenes, such as background clutter, deformation, illumination variation, and so on. In this article, we propose an adaptive and complementary correlation filter with dynamic contextual constraints. First, we introduce contextual information around the target as a dynamic constraint term to alleviate model drift in complex scenes; the resulting objective function can be solved by an iterative method. Then, we integrate a color histogram-based tracker to compensate for the inaccurate tracking of correlation filtering. In addition, we present metrics to combine the two complementary trackers with adaptive fusion coefficients. Finally, extensive experiments on the OTB2013, OTB2015, VOT2016 and UAV123 benchmark datasets demonstrate that our tracker improves on our baseline and performs favorably against some state-of-the-art trackers.
... It is known that differences in the textures and edges of different LR images have a great influence on the SR model. To address this problem, image augmentation has obtained good performance in image [47,51] and video applications [52]. Based on this idea, a two-step mechanism [19,35] is used to enlarge the training dataset to improve the generalization ability of the SR model. ...
Article
Full-text available
Deep convolutional neural networks (CNNs) have been widely applied to low-level vision over the past five years. Appropriate CNN architectures have been designed according to the nature of different applications. However, customized architectures gather different features by treating all pixel points as equal to improve the performance of a given application, which ignores the effects of locally salient pixel points and results in low training efficiency. In this article, we propose an asymmetric CNN (ACNet) comprising an asymmetric block (AB), a memory enhancement block (MEB), and a high-frequency feature enhancement block (HFFEB) for image super-resolution (SR). The AB utilizes one-dimensional (1-D) asymmetric convolutions to intensify the square convolution kernels in the horizontal and vertical directions, promoting the influence of locally salient features for single-image SR (SISR). The MEB fuses all hierarchical low-frequency features from the AB via a residual learning technique to resolve the long-term dependency problem and transforms the obtained low-frequency features into high-frequency features. The HFFEB exploits low- and high-frequency features to obtain more robust SR features and address the excessive feature enhancement problem; additionally, it is responsible for reconstructing a high-resolution image. Extensive experiments show that our ACNet can effectively address SISR, blind SISR, and blind SISR with unknown noise. The code of the ACNet is available at https://github.com/hellloxiaotian/ACNet.
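The appeal of 1-D asymmetric convolutions is easiest to see in the separable case: a 3×1 vertical pass followed by a 1×3 horizontal pass reproduces convolution with their outer-product 3×3 kernel at a lower cost. The numpy check below illustrates that general principle only; it is not ACNet's AB block.

```python
import numpy as np

def conv2d_valid(img, k):
    """Plain 'valid' 2-D cross-correlation (no kernel flipping), one channel."""
    kh, kw = k.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

# A separable 3x3 kernel equals a 3x1 pass followed by a 1x3 pass:
v = np.array([[1.0], [2.0], [1.0]])   # 3x1 vertical kernel
h = np.array([[1.0, 0.0, -1.0]])      # 1x3 horizontal kernel
k = v @ h                             # outer product: the full 3x3 kernel
```

Two 1-D passes cost 6 multiplies per output pixel instead of 9 for the 3×3 kernel, and the gap widens with kernel size, which is the efficiency argument for asymmetric convolutions.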
... Plug-and-play architectures enlarge the flexibility of deep CNNs on different computer vision tasks, such as video [35,36], text-to-image synthesis [37], image denoising [38], image deraining [39], low-light image enhancement [40], image dehazing [41] and image super-resolution [42]. Specifically, deep CNN-based blocks can better cooperate with each component to facilitate more useful information, which is popular in real applications. ...
Article
Deep convolutional neural networks (CNNs) with strong expressive ability have achieved impressive performance on single-image super-resolution (SISR). However, their excessive numbers of convolutions and parameters usually consume high computational cost and more memory storage for training an SR model, which limits their application to SR on resource-constrained devices in the real world. To resolve these problems, we propose a lightweight enhanced SR CNN (LESRCNN) with three successive sub-blocks: an information extraction and enhancement block (IEEB), a reconstruction block (RB), and an information refinement block (IRB). Specifically, the IEEB extracts hierarchical low-resolution (LR) features and aggregates the obtained features step by step to increase the memory ability of the shallow layers for the deep layers in SISR. To remove redundant obtained information, a heterogeneous architecture is adopted in the IEEB. After that, the RB converts low-frequency features into high-frequency features by fusing global and local features, which is complementary to the IEEB in tackling the long-term dependency problem. Finally, the IRB uses the coarse high-frequency features from the RB to learn more accurate SR features and construct an SR image. The proposed LESRCNN can obtain a high-quality image with one model for different scales. Extensive experiments demonstrate that the proposed LESRCNN outperforms state-of-the-art methods on SISR in terms of qualitative and quantitative evaluation.
... Plug-and-play architectures enlarge the flexibility of deep CNNs on different computer vision tasks, such as video [66,28], text-to-image synthesis [30], image denoising [49], image deraining [36], low-light image enhancement [39], image dehazing [40] and image super-resolution [70]. Specifically, deep CNN-based blocks can better cooperate with each component to facilitate more useful information, which is popular in real applications. ...
Preprint
Full-text available
Deep convolutional neural networks (CNNs) with strong expressive ability have achieved impressive performance on single-image super-resolution (SISR). However, their excessive numbers of convolutions and parameters usually consume high computational cost and more memory storage for training an SR model, which limits their application to SR on resource-constrained devices in the real world. To resolve these problems, we propose a lightweight enhanced SR CNN (LESRCNN) with three successive sub-blocks: an information extraction and enhancement block (IEEB), a reconstruction block (RB), and an information refinement block (IRB). Specifically, the IEEB extracts hierarchical low-resolution (LR) features and aggregates the obtained features step by step to increase the memory ability of the shallow layers for the deep layers in SISR. To remove redundant obtained information, a heterogeneous architecture is adopted in the IEEB. After that, the RB converts low-frequency features into high-frequency features by fusing global and local features, which is complementary to the IEEB in tackling the long-term dependency problem. Finally, the IRB uses the coarse high-frequency features from the RB to learn more accurate SR features and construct an SR image. The proposed LESRCNN can obtain a high-quality image with one model for different scales. Extensive experiments demonstrate that the proposed LESRCNN outperforms state-of-the-art methods on SISR in terms of qualitative and quantitative evaluation.
... The scope of computer vision research is quite extensive, including face recognition [3], vehicle or pedestrian detection [4,5], target tracking [6,7], and image generation [8][9][10][11]. Visual target tracking has become one of the most influential research fields in computer vision because it is widely used in video surveillance [12], intelligent transportation, and military guidance [13,14]. With the successful application of visual tracking technology in human life, visual object tracking is applied in increasingly complex environments, with illumination variation, occlusions, fast motion, deformation, and background clutter, and these complex factors bring great challenges to the stable tracking of targets [15][16][17][18]. ...
Article
Full-text available
Context-aware correlation filter trackers are among the most advanced target trackers, with significant improvements in tracking accuracy and success rate compared with traditional trackers. However, because background complexity during tracking can lead to inaccurate output responses, an accurate tracking model is difficult to establish. Moreover, drift easily occurs during tracking because of the imprecise tracking model, especially when the target undergoes large-area occlusion, fast motion, or deformation. Aiming at the drift problem in the target tracking process, a novel algorithm is proposed in this paper. The developed method first derives a specific representation of the constrained output by assuming that the output response follows a Gaussian distribution, and a variable update parameter is obtained based on the output constraint relationship; then the tracking filter is selectively updated with either the variable update parameter or a fixed update parameter; finally, the target scale is updated by maximizing the posterior probability distribution. The effectiveness of the developed algorithm is verified by comparison with other trackers on the OTB-50 and OTB-100 evaluation benchmark datasets, and the experimental results show that the suggested tracker achieves higher overall tracking performance than the other trackers.
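The selective update scheme described above can be sketched with a simple reliability-gated learning rate. The threshold, the learning rates, and the peak-to-mean score standing in for the paper's Gaussian output-constraint test are all illustrative assumptions.

```python
import numpy as np

def update_model(model, new_model, response, tau=8.0, lr_hi=0.02, lr_lo=0.002):
    """Selective template update: use the larger learning rate only when the
    response map looks reliable (judged here by a peak-to-mean score);
    otherwise fall back to a small fixed rate. Returns the updated model
    and the learning rate that was applied."""
    score = response.max() / (response.mean() + 1e-12)
    lr = lr_hi if score > tau else lr_lo
    return (1 - lr) * model + lr * new_model, lr
```

Gating the update this way keeps a sharp, trustworthy detection from being diluted, while a flat (likely occluded or cluttered) response barely changes the model, which is exactly the drift-avoidance behavior the abstract describes.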
... Therefore, some variants of the SRDCF tracker have been proposed to improve it from different aspects [13][14][15]. There are also some deep learning-based trackers in the correlation filter framework, which use CNN features or adopt a deep network structure for the tracking task [16][17][18][19][20]. However, there are two main reasons that the tracking performance of DCF-based trackers falls short of actual demands: (1) the response map is prone to multiple peaks when disturbed, which prevents the tracker from accurately locating the target; (2) the tracker uses a fixed scale factor to adjust the target scale, which does not fit the actual change of the moving target. ...
Article
Discriminative correlation filters (DCFs) have been widely used in the visual tracking community in recent years. DCF-based trackers determine the target location through a response map generated by the correlation filters and determine the target scale by a fixed scale factor. However, the response map is vulnerable to noise interference, and the fixed scale factor cannot reflect the real scale change of the target, which can obviously reduce tracking performance. In this paper, to solve the aforementioned drawbacks, we propose to learn a metric learning model in a correlation filter framework for visual tracking (called CFML). This model uses a metric learning function to solve the target scale problem. In particular, we adopt a hard negative mining strategy to alleviate the influence of noise on the response map, which can effectively improve tracking accuracy. Extensive experimental results demonstrate that the proposed CFML tracker achieves competitive performance compared with state-of-the-art trackers.
... Among them, tracking methods based on deep learning show the most outstanding performance. By exploiting the powerful feature representation capabilities of deep convolutional neural networks, these trackers [26,44] can achieve state-of-the-art performance. Although extracting deep features with complex networks can improve tracking performance, it requires substantial computing resources and is difficult to deploy on UAVs. ...
Article
Full-text available
Real-time object tracking for unmanned aerial vehicles (UAVs) is an essential and challenging research topic in computer vision. However, the scenarios that UAVs deal with are complicated, and UAV tracking targets are small, so general trackers often fail to exploit their full performance in UAV scenarios. In this paper, we propose a tracking method for UAV scenes that utilizes background cues and an aberrance response suppression mechanism to track in 4 degrees of freedom. Firstly, we regard the tracking task as a similarity measurement problem, which we decompose into two subproblems for optimization. Secondly, to alleviate the problem of small targets in UAV scenes, we make full use of background cues; to reduce interference from background information, we employ an aberrance response suppression mechanism. Then, to obtain accurate target state information, we introduce a log-polar coordinate system and perform phase correlation calculations in log-polar coordinates to obtain the rotation and scale changes of the target. Finally, the target state, comprising displacement, scale, and rotation angle, is obtained through response fusion. We evaluate our approach through extensive experiments on various UAV datasets, such as UAV123, DTB70, and UAVDT2019. Compared with current advanced trackers, our method is superior for small-target tracking in UAV scenarios.
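The log-polar trick used above for recovering scale and rotation can be illustrated concisely: resampling an image onto a log-polar grid turns scaling about the centre into a shift along the radial axis and rotation into a shift along the angular axis, and phase correlation then recovers that shift. A minimal NumPy sketch, where the nearest-neighbour sampling and grid sizes are simplifying assumptions rather than the paper's implementation:

```python
import numpy as np

def log_polar(img, n_r=64, n_t=64):
    """Resample an image onto a log-polar grid (nearest-neighbour).

    Scaling about the centre becomes a shift along axis 0 (log-radius);
    rotation becomes a shift along axis 1 (angle).
    """
    h, w = img.shape
    cy, cx = h / 2.0, w / 2.0
    rs = np.exp(np.linspace(0.0, np.log(min(cy, cx)), n_r))
    ts = np.linspace(0.0, 2 * np.pi, n_t, endpoint=False)
    ys = np.clip((cy + rs[:, None] * np.sin(ts)[None, :]).astype(int), 0, h - 1)
    xs = np.clip((cx + rs[:, None] * np.cos(ts)[None, :]).astype(int), 0, w - 1)
    return img[ys, xs]

def phase_corr_shift(a, b):
    """Translation d such that b ~= roll(a, d), via the phase-correlation peak."""
    A, B = np.fft.fft2(a), np.fft.fft2(b)
    cross = B * np.conj(A)
    r = np.real(np.fft.ifft2(cross / (np.abs(cross) + 1e-12)))
    return np.unravel_index(np.argmax(r), r.shape)
```

Applying `phase_corr_shift` to log-polar images of two frames would yield a radial shift (a scale factor, after exponentiating) and an angular shift (a rotation), which is the essence of the 4-DoF estimation described above.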
... It is known that differences in the textures and edges of different LR images have a great influence on the SR model. To address this problem, image augmentation has achieved good performance in image [47,51] and video applications [52]. Based on this idea, a two-step mechanism [19,35] is used to enlarge the training dataset and improve the generalization ability of the SR model. ...
Preprint
Full-text available
Deep convolutional neural networks (CNNs) have been widely applied to low-level vision over the past five years, with CNN architectures designed to suit the nature of each application. However, customized architectures gather features by treating all pixel points as equal, which ignores the effect of locally salient pixel points and results in low training efficiency. In this paper, we propose an asymmetric CNN (ACNet) comprising an asymmetric block (AB), a memory enhancement block (MEB) and a high-frequency feature enhancement block (HFFEB) for image super-resolution. The AB utilizes one-dimensional asymmetric convolutions to intensify the square convolution kernels in the horizontal and vertical directions, promoting the influence of locally salient features for SISR. The MEB fuses all hierarchical low-frequency features from the AB via a residual learning (RL) technique to resolve the long-term dependency problem, and transforms the obtained low-frequency features into high-frequency features. The HFFEB exploits low- and high-frequency features to obtain more robust super-resolution features and to address the excessive feature enhancement problem; it is also in charge of reconstructing the high-resolution (HR) image. Extensive experiments show that our ACNet can effectively address single image super-resolution (SISR), blind SISR, and blind SISR with blind noise. The code of the ACNet is available at https://github.com/hellloxiaotian/ACNet.
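The effect of such an asymmetric block follows from the additivity of convolution: running 3×3, 1×3, and 3×1 convolutions in parallel and summing their outputs is equivalent to a single 3×3 convolution whose middle row and column have been strengthened. A small NumPy check of this identity; the naive loop implementation and kernel sizes are illustrative, not ACNet's code:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 2D correlation with zero padding ('same' size; odd kernel sides assumed)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def fuse_asymmetric(k_sq, k_h, k_v):
    """Fold parallel 3x3, 1x3 and 3x1 kernels into one equivalent 3x3 kernel."""
    fused = k_sq.copy()
    fused[1:2, :] += k_h   # the 1x3 kernel lands on the middle row
    fused[:, 1:2] += k_v   # the 3x1 kernel lands on the middle column
    return fused
```

The fused form is why asymmetric branches add training-time capacity in specific directions without changing inference-time cost: after training, the three kernels can be collapsed into one.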
... Owing to their end-to-end architectures, CNNs with flexible plugins are widely used in many tasks, e.g., image [23,24], video [25,26] and text applications [27]. Specifically, modules or blocks in CNNs are used in low-level computer vision, especially image super-resolution [28][29][30] and denoising [16]. ...
Article
Deep convolutional neural networks (CNNs) for image denoising have recently attracted increasing research interest. However, plain networks cannot recover fine details for complex tasks, such as real noisy images. In this paper, we propose a Dual denoising Network (DudeNet) to recover a clean image. Specifically, DudeNet consists of four modules: a feature extraction block, an enhancement block, a compression block, and a reconstruction block. The feature extraction block with a sparse mechanism extracts global and local features via two sub-networks. The enhancement block gathers and fuses the global and local features to provide complementary information for the latter network. The compression block refines the extracted information and compresses the network. Finally, the reconstruction block is utilized to reconstruct a denoised image. DudeNet has the following advantages: (1) the dual networks with a sparse mechanism extract complementary features that enhance the generalization ability of the denoiser; (2) fusing global and local features extracts salient features that recover fine details in complex noisy images; (3) a small-size filter is used to reduce the complexity of the denoiser. Extensive experiments demonstrate the superiority of DudeNet over current state-of-the-art denoising methods.
... To improve tracking accuracy, a group feature selection strategy has been proposed under the DCF-based tracking framework; it selects group features across channels and spatial dimensions to determine the structural correlation between the feature channels and the filter system [1]. The DCF-based trackers mentioned above can only determine the target center location, and most of them use a multi-scale search strategy to predict the target state, which usually yields relatively inaccurate tracking results [34,35]. The recently proposed ATOM [14] tracker incorporates IoU modulation and IoU prediction to improve tracking performance. ...
Article
Full-text available
Most existing trackers use a classifier and multi-scale estimation to estimate the target state. Consequently, and as expected, trackers have become more stable while tracking accuracy has stagnated. Although trackers adopt a maximum-overlap method based on an intersection-over-union (IoU) loss to mitigate this problem, the IoU loss itself has a defect: it cannot continue to optimize the objective function when one bounding box is completely contained within, or lies entirely outside, the other, which makes it very challenging to accurately estimate the target state. Accordingly, in this paper we address this problem by proposing a novel tracking method based on a distance-IoU (DIoU) loss, such that the proposed tracker consists of target estimation and target classification. The target estimation part is trained to predict the DIoU score between the target's ground-truth bounding box and the estimated bounding box. The DIoU loss retains the advantage of the IoU loss while minimizing the distance between the center points of the two bounding boxes, thereby making target estimation more accurate. Moreover, we introduce a classification part that is trained online and optimized with a conjugate-gradient-based strategy to guarantee real-time tracking speed. Comprehensive experimental results demonstrate that the proposed method achieves competitive tracking accuracy compared with state-of-the-art trackers while maintaining real-time speed.
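The DIoU score described above is simple to state: it keeps the IoU term but subtracts the normalized squared distance between box centers, so the loss 1 − DIoU still provides a useful gradient when the boxes do not overlap. A minimal sketch with corner-format boxes (the (x1, y1, x2, y2) format is an assumption for illustration):

```python
def diou(box_a, box_b):
    """Distance-IoU between two boxes given as (x1, y1, x2, y2).

    DIoU = IoU - d^2 / c^2, where d is the distance between box centers
    and c is the diagonal of the smallest box enclosing both.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection and union areas.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union
    # Squared center distance.
    d2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
       + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the enclosing box.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2
    return iou - d2 / c2
```

For disjoint boxes IoU is 0 but the d²/c² term still varies with the center distance, which is exactly what restores the gradient the plain IoU loss lacks.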
... Yuan et al. [23] proposed a target-focusing convolutional regression model to handle complicated situations and achieved better performance. Moreover, Yuan et al. [24] used multiple features to build the target model for object tracking, addressing the accuracy problems caused by relying on a single feature. Zhang et al. [25] combined texture features and color features to perform optimal similarity matching. ...
Article
Full-text available
Aiming at the problems of illumination change, target deformation and background clutter in the target tracking field, a visual tracking algorithm based on the peak-to-sidelobe ratio is proposed. An object-interference model is used to represent the target appearance, and context information is added to the correlation filtering framework; the filter is trained with this context to enhance its discriminative ability. During model update, samples that cannot characterize the target are easily introduced, so the peak-to-sidelobe ratio is used to decide when to update the tracking parameters, which enhances the generalization ability of the model. Tested against classic and recent algorithms on the OTB50, OTB100, UAV123 and TC128 video datasets, the experimental results show that the proposed visual tracking algorithm tracks the target more accurately and has important research value for the development of intelligent video surveillance.
... By introducing an online learning scale filter [2], a segmentation-based regularized module [3], an adaptive spatial feature selection module [4] and a hierarchical attentional module with a contextual attentional correlation filter [5], the performance of discriminant correlation filters has been greatly improved. Another class of trackers transforms the tracking problem into a target matching problem; we call these Siamese trackers [6][7][8][9][10][11][12][13][14][15][16]. The network structure of a Siamese tracker consists of two branches that share the same network structure and parameters. ...
Article
Full-text available
Most current tracking methods use a bounding box to describe the object, which only provides a rough outline and cannot accurately capture the shape and posture of the target. Instead of using a bounding box directly, we use points adaptively positioned on the target to describe it and then transform these points into bounding boxes. In this way, we can use a bounding-box-based training strategy while describing the target more accurately. Furthermore, a Coarse-Fine classification module is employed to improve robustness, which is important under scale variation and deformation. Combining the above modules, we propose SiamPCF, an anchor-free tracking method that avoids the carefully selected hyperparameters needed to design anchors. Extensive experiments conducted on five benchmarks show that our SiamPCF achieves state-of-the-art results. In the analysis of video attributes, SiamPCF ranks first in scale variation, which demonstrates its effectiveness. SiamPCF runs at more than 45 frames per second.
... The network not only serves as a feature extraction tool but also evaluates candidate target positions to obtain the target location. Methods such as [15] generate a series of candidate samples through a particle filter or sliding window and score these candidates to obtain the final tracking result. The method of [16] uses a sliding window to obtain a series of candidate samples and then uses a convolutional neural network to evaluate the maximum likelihood estimate over the samples to obtain the final tracking result. ...
Article
Full-text available
With the advent of the artificial intelligence era, target adaptive tracking technology has developed rapidly in the fields of human-computer interaction, intelligent monitoring, and autonomous driving. Aiming at the low tracking accuracy and poor robustness of the current Generic Object Tracking Using Regression Network (GOTURN) tracking algorithm, this paper takes the convolutional neural network most popular in the target-tracking field as the basic network structure and proposes an improved GOTURN target-tracking algorithm based on a residual attention mechanism and the fusion of spatiotemporal context information. The algorithm feeds the target template, prediction area, and search area to the network at the same time to extract general feature maps, and predicts the location of the tracking target in the current frame through a fully connected layer. At the same time, a residual attention mechanism network is added to the target template network structure to enhance the feature expression ability of the network and improve the overall performance of the algorithm. A large number of experiments conducted on mainstream target-tracking test datasets show that the proposed tracking algorithm significantly improves the overall performance of the original tracking algorithm.
... Therefore, whether it is image processing or medical diagnosis, accurate image segmentation is particularly important. Image segmentation plays an important role not only in lesion segmentation and organ segmentation but also in pattern recognition [1,2] and object tracking [3,4]. ...
Article
Medical image segmentation poses a huge challenge due to intensity inhomogeneity and the similarity between the background and the object. To meet this challenge, we propose an improved active contour model that combines the level set method and the split Bregman method, and we provide two-phase, multi-phase and 3D formulations. The proposed model is presented in a level set framework that includes neighbor-region information for segmenting medical images, with an energy functional containing a data-fitting term and a length term. The neighbor region and the local intensity variances in the data-fitting term are designed to optimize the minimization process. To minimize the energy functional, we apply the split Bregman method, which contributes to faster convergence. Besides, we extend our model to multi-phase and 3D segmentation for cardiac MR images, both of which achieve good results. Experimental results show that the new model not only is strongly robust to other cardiac tissue effects and image intensity inhomogeneity, but also better facilitates the extraction of effective tissues. As expected, our model attains higher segmentation accuracy and efficiency for medical image segmentation.
Article
Tracking in the unmanned aerial vehicle (UAV) scenarios is one of the main components of target tracking tasks. Different from the target tracking task in the general scenarios, the target tracking task in the UAV scenarios is very challenging because of factors such as small scale and aerial view. Although the DCFs-based tracker has achieved good results in tracking tasks in general scenarios, the boundary effect caused by the dense sampling method will reduce the tracking accuracy, especially in UAV tracking scenarios. In this work, we propose learning an adaptive spatial-temporal context-aware (ASTCA) model in the DCFs-based tracking framework to improve the tracking accuracy and reduce the influence of boundary effect, thereby enabling our tracker to more appropriately handle UAV tracking tasks. Specifically, our ASTCA model can learn a spatial-temporal context weight, which can precisely distinguish the target and background in the UAV tracking scenarios. Besides, considering the small target scale and the aerial view in UAV tracking scenarios, our ASTCA model incorporates spatial context information within the DCFs-based tracker, which could effectively alleviate background interference. Extensive experiments demonstrate that our ASTCA method performs favorably against state-of-the-art tracking methods on some standard UAV datasets.
Article
Despite the great success achieved in visual tracking, it is still hard for most trackers to handle scenes with large-scale target changes and similar objects. First, existing methods lack the capacity to efficiently extract multi-scale features. Second, convolutional neural networks focus primarily on local characteristics while easily ignoring global characteristics, which are essential for visual tracking. Furthermore, the recently popular tracking methods based on Siamese-like networks perform image matching between two branches through simple cross-correlation operations and cannot effectively establish their connection. An improved Siamese tracking network, called GSiamMS, is proposed to address these challenges via the integration of Res2Net blocks and transformer modules. Within this network, a feature extraction module based on Res2Net blocks is constructed to obtain multi-scale information at the granular level without relying on multi-layer outputs. Then, a cross-attention mechanism is utilized to learn the connection between template features and search features, while a self-attention mechanism focusing on global information establishes long-range dependencies between the object and the background. Finally, numerous experiments on visual tracking benchmarks including TrackingNet, GOT-10k, LaSOT, NFS, UAV123, and TNL2K verify that the developed method, running at 38 fps, achieves superior performance compared with several state-of-the-art methods.
Article
Full-text available
In recent years, correlation filter based trackers have seen widespread success because of their high efficiency and robustness. However, a single-feature tracker cannot deal with complex scenes such as serious occlusion, motion blur and illumination variation. In this paper, we develop a novel tracking method combining color, HOG and motion features, where the motion feature is estimated between adjacent frames by large-displacement optical flow. Besides, to cope with the boundary effect in traditional correlation filter based trackers, an adaptive cosine window is introduced, which highlights the target region, suppresses the background region and enlarges the search region. Meanwhile, a novel judging scheme combining the HOG correlation response and the color response is adopted to evaluate the reliability of the tracking result. Finally, inverse sparse representation is presented to locate coarse positions of the target in case of tracking failures. Extensive experiments on five famous tracking benchmarks, OTB100, TColor-128, UAVDT, UAV123 and VOT2016, demonstrate that our proposed method outperforms other state-of-the-art methods in terms of robustness and accuracy.
Article
Visual attention has recently achieved great success and wide application in deep neural networks. Existing methods based on Siamese networks have achieved a good accuracy-efficiency trade-off in visual tracking. However, the training time of Siamese trackers grows with deeper networks and larger training data. Further, Siamese trackers cannot predict the target location well under fast motion, full occlusion, camera motion, and similar-object scenarios. To address these difficulties, we develop an end-to-end Siamese attention network for visual tracking. Our approach introduces an attention branch into the region proposal network, which contains a classification branch and a regression branch. We perform foreground-background classification by combining the scores of the classification branch and the attention branch, and the regression branch predicts the bounding boxes of the candidate regions based on the classification results. The proposed tracker achieves experimental results comparable to state-of-the-art trackers on six tracking benchmarks. In particular, it achieves an AUC score of 0.503 on LaSOT while running at 40 frames per second (FPS).
Article
Full-text available
While part-based methods have been shown to be effective in the person re-identification task, it is unreasonable for most of them to treat each part equally, because the retrieved image may be affected by deformation, occlusion and other factors that make the feature information of some parts unreliable. Instead of using the same weight for each part in the final person re-ID, we consider using an adaptive weight based on the image information of each part for precise person retrieval. Specifically, we aim at learning discriminative part-informed features and propose an adaptive weight part-based convolutional network (AWPCN) for the person re-ID task. The core component of our AWPCN framework is an adaptive weight model, in which the part-based convolutional network and the adaptive weight model are used for feature refinement and feature-pair alignment, respectively. Given an input image, the part-based convolutional network first outputs a convolutional descriptor consisting of several part-level features; the corresponding weight of each part is then determined by the adaptive weight model. Finally, the adaptive weight part-based convolutional network jointly trains the loss of each part and simultaneously optimizes its feature representations. We evaluate the proposed AWPCN model on the Market-1501, DukeMTMC-reID and CUHK03 datasets. In extensive experiments, the AWPCN model outperforms most of the state-of-the-art methods on these representative datasets, which clearly demonstrates the effectiveness of our proposed method. Our code will be released at https://github.com/deasonyuan/AWPCN.
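Adaptive part weighting of this kind can be sketched as a softmax over per-part quality estimates, so that occluded or deformed parts contribute less to the final matching score. The softmax form and the temperature parameter below are illustrative assumptions, not AWPCN's exact formulation:

```python
import numpy as np

def fuse_part_scores(part_scores, part_quality, tau=1.0):
    """Fuse per-part matching scores with adaptive weights.

    Weights come from a softmax over per-part quality estimates, so
    unreliable parts (e.g. occluded ones) contribute less to the
    final similarity score.
    """
    q = np.asarray(part_quality, dtype=float) / tau
    w = np.exp(q - q.max())   # subtract max for numerical stability
    w /= w.sum()
    return float(np.dot(w, part_scores))
```

With equal quality estimates this reduces to a plain average, which is exactly the uniform-weight baseline the abstract argues against; lowering one part's quality smoothly pulls its score out of the fusion.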
Article
Moving object tracking is one of the applied fields of artificial intelligence and robotics. The main objective of object tracking is to detect and locate targets in video frames of real scenes. Although various methods have been proposed for object tracking, tracking under challenging conditions remains an open issue. Recently, evolutionary and heuristic algorithms such as swarm intelligence have been used to address these challenges and have shown promising performance. In this paper, a new approach based on modified biogeography-based optimization (mBBO) is introduced. The BBO algorithm includes migration and mutation steps. In the migration phase, the search space is properly explored by sharing information between habitats so that weaker solutions improve their positions; the mutation phase introduces diversity and change into the solutions. An elitist method is also used to keep the better solutions. The performance of the modified BBO tracker has been evaluated on benchmark video datasets and compared with several other tracking methods. Experimental results demonstrate that the proposed method estimates target locations with high accuracy and achieves better performance and robustness than the other trackers.
Article
Full-text available
Visual ship tracking provides crucial kinematic traffic information to maritime traffic participants, which helps to accurately predict ship traveling behaviors in the near future. Traditional ship tracking models obtain satisfactory performance by exploiting distinct features from maritime images, but may fail when the ship scale varies across image sequences. Moreover, previous frameworks have not paid much attention to interference from weather conditions (e.g., visibility). To address this challenge, we propose a scale-adaptive ship tracking framework built on a kernelized correlation filter (KCF) and a log-polar transformation. First, the proposed ship tracker employs a conventional KCF model to obtain the raw ship position in the current maritime image. Second, both the output of the previous step and the ship training sample are transformed into a log-polar coordinate system and further processed with the correlation filter to determine the ship scale factor and to suppress the negative influence of weather conditions. We verify the proposed tracker on three typical maritime scenarios under typical navigational weather conditions (e.g., sunny, foggy). The findings of the study can help traffic participants efficiently obtain maritime situational awareness from maritime videos, in real time, under different visibility conditions.
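The KCF component of such a framework rests on one identity: the Gaussian kernel between a patch and every cyclic shift of another patch can be evaluated all at once via the FFT. A minimal single-channel sketch (feature extraction, the cosine window, and the log-polar scale stage are omitted, and σ is an illustrative choice):

```python
import numpy as np

def gaussian_correlation(z, x, sigma=0.5):
    """Gaussian kernel k(z, shift_n(x)) for all cyclic shifts n, via the FFT.

    Uses ||z - shift_n(x)||^2 = ||z||^2 + ||x||^2 - 2 * (z corr x)[n], where
    the circular cross-correlation (z corr x) is computed in the Fourier domain.
    """
    cross = np.real(np.fft.ifft2(np.fft.fft2(z) * np.conj(np.fft.fft2(x))))
    d2 = np.maximum(0.0, (np.sum(z ** 2) + np.sum(x ** 2) - 2 * cross) / x.size)
    return np.exp(-d2 / sigma ** 2)
```

KCF then trains dual coefficients per frequency as α̂ = ŷ / (k̂ₓₓ + λ) and detects by applying the same kernel map to the new frame; the log-polar variant described above reuses this correlation machinery, reading the scale factor off a shift along the radial axis.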
Article
Compressed image quality enhancement has attracted a large amount of attention in recent years. In general, the primary goal of compression is artifact reduction to produce a higher-quality output from a low-quality input. Information loss and compression artifacts are mostly due to quantization. The quantization matrix is determined by the compression quality factor (QF). However, there has thus far been little related research to estimate the compression quality factor for JPEG images. To address this issue, in this paper, we propose a deep dual-domain semi-blind network (D³SN) that combines compression quality factor detection and compressed image quality enhancement. Specifically, a quality factor detection (QFD) module is designed to capture contextual information of the space and frequency domains. Furthermore, we build a novel deep dual-domain compressed image quality enhancement network to remove the compression artifacts by using the prior in terms of both the discrete cosine transform (DCT) and pixel domains. Different from previous algorithms, our proposed approach can remove compression artifacts generated at different quality factors by inferring the image quality. Experimental results demonstrate the superiority of our deep dual-domain semi-blind network over state-of-the-art methods in terms of objective quality and visual results.
Article
Full-text available
Trackers based on Siamese networks show great potential in tracking accuracy and speed. However, it is still challenging to adapt an offline-trained model to online tracking. In this paper, a Siamese-based tracker (SCRPN-CISA) is proposed, which integrates three attention mechanisms and a novel cascaded Region Proposal Network (RPN) architecture to improve feature extraction, adaptability and discrimination in complex scenes. Firstly, the deep network VGG-Net-D is adopted as the backbone of the Siamese framework to increase feature extraction capability. Then, a Channel-Interconnection-Spatial Attention module is constructed to enhance the adaptive and discriminative capability of the model. Next, a Deconvolution Adjust Block is built to fuse cross-layer features. Finally, a Three-Layer Cascaded RPN is conceived to perform foreground-background classification and bounding-box regression via correlation calculation; moreover, a proposal-region screening strategy is presented to obtain more accurate tracking results. Experiments on the OTB-2015, UAV123, VOT2016, and VOT2019 benchmarks demonstrate that the proposed tracker (SCRPN-CISA) achieves competitive performance compared with state-of-the-art trackers.
Article
Deep learning techniques have received much attention in the area of image denoising. However, there are substantial differences in the various types of deep learning methods dealing with image denoising. Specifically, discriminative learning based on deep learning can ably address the issue of Gaussian noise. Optimization models based on deep learning are effective in estimating the real noise. However, there has thus far been little related research to summarize the different deep learning techniques for image denoising. In this paper, we offer a comparative study of deep techniques in image denoising. We first classify the deep convolutional neural networks (CNNs) for additive white noisy images; the deep CNNs for real noisy images; the deep CNNs for blind denoising and the deep CNNs for hybrid noisy images, which represents the combination of noisy, blurred and low-resolution images. Then, we analyze the motivations and principles of the different types of deep learning methods. Next, we compare the state-of-the-art methods on public denoising datasets in terms of quantitative and qualitative analysis. Finally, we point out some potential challenges and directions of future research.
Article
Object tracking is challenging, and correlation filter methods have recently been proposed for this task. Most of these methods focus on the central portion of the target and are negatively affected by changes in the target's size and shape. This work proposes a collaborative scheme that combines several local correlation filters with a global correlation filter to improve the performance of correlation-filter-based object tracking. The correlation filter used in this scheme is based on features extracted from multiple layers of deep convolutional neural networks, and a strategy to identify when these models should be updated is also presented. Experiments show that the proposed scheme tends to be consistent and to achieve better results than comparable tracking approaches. The proposed collaborative approach can be applied to other correlation filters, which tends to further improve tracker performance.
Article
Most video analytics applications rely on object detectors to localize objects in frames. However, when real-time is a requirement, running the detector at all the frames is usually not possible. This is somewhat circumvented by instantiating visual object trackers between detector calls, but this does not scale with the number of objects. To tackle this problem, we present SiamMT, a new deep learning multiple visual object tracking solution that applies single-object tracking principles to multiple arbitrary objects in real-time. To achieve this, SiamMT reuses feature computations, implements a novel crop-and-resize operator, and defines a new and efficient pairwise similarity operator. SiamMT naturally scales up to several dozens of targets, reaching 25 fps with 122 simultaneous objects for VGA videos, or up to 100 simultaneous objects in HD720 video. SiamMT has been validated on five large real-time benchmarks, achieving leading performance against current state-of-the-art trackers.
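The pairwise similarity at the heart of SiamMT and other Siamese trackers is a cross-correlation: the exemplar's feature map is slid over the search-region feature map and a dot product is taken at every offset. A naive NumPy version for one exemplar (real implementations batch this as a grouped convolution on GPU; the shapes are illustrative):

```python
import numpy as np

def cross_correlation(search, template):
    """Siamese-style similarity: slide the template feature over the
    search feature and take a dot product at every offset.

    search:   (C, H, W) feature map of the search region
    template: (C, h, w) feature map of the exemplar, with h <= H, w <= W
    returns:  (H - h + 1, W - w + 1) response map
    """
    C, H, W = search.shape
    _, h, w = template.shape
    out = np.empty((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(search[:, i:i + h, j:j + w] * template)
    return out
```

SiamMT's contribution, as the abstract describes, is making this operator efficient for many exemplars at once by sharing the search-frame feature computation, rather than changing the operator itself.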
Article
Full-text available
To tackle the problem that traditional particle-filter- or correlation-filter-based trackers are prone to low tracking accuracy and poor robustness when the target faces challenges such as occlusion, rotation and scale variation in the case of complex scenes, an accurate reliable-patch-based tracker is proposed through exploiting and complementing the advantages of particle filter and correlation filter. Specifically, to cope with the challenge of continuous full occlusion, the target is divided into numerous patches by combining random with hand-crafted partition methods, and then, an effective target position estimation strategy is presented. Subsequently, according to the motion law between the patch and global target in the particle filter framework, two effective resampling rules are designed to remove unreliable particles to avoid tracking drift, and then, the target position can be estimated by the most reliable patches identified. Finally, an effective scale estimation approach is presented, in which the Manhattan distance between the reliable patches is utilized to estimate the target scale, including the target width and height, respectively. Experimental results illustrate that our tracker can not only be robust against the challenges of occlusion, rotation and scale variation, but also outperform state-of-the-art trackers for comparison in overall performance.
Article
As a typical ill-posed problem, JPEG compressed image restoration (CIR) aims to recover a high-quality (HQ) image from the compressed version. Although many model-based and learning-based methods have been proposed for conventional image restoration (IR), proposing a general and effective framework for various CIR tasks is still a challenging work. The model-based methods are flexible for handling different IR tasks, but they suffer from high complexity and the difficulty in designing sophisticated priors. The learning-based methods have shown promising results in various IR tasks. However, most of them need to retrain their models for each IR task separately, which sacrifices methods’ flexibility. In this paper, we propose a novel and high-performance deep deblocker driven unified framework to flexibly address various CIR tasks without retraining. First, a novel fidelity (NF) is introduced into CIR, and then the CIR problem is divided into inversion and deblocking subproblems by our improved split Bregman iteration (ISBI) algorithm. Next, we design a set of compact yet effective deep deblockers. Since simultaneously modelling the data fidelity term and implicit priors via deblockers is necessary, these deblockers are used as implicit priors and also used for NF in the CIR problem. The convergence of our method is proved as well. To the best of our knowledge, our method is the first work to use deblockers as implicit priors, and it could also contribute to other deblocking methods to obtain better flexibility. The effectiveness of the proposed method is demonstrated both visually and quantitatively.
Article
Deep convolutional neural networks have shown great potential in medical image segmentation. However, automatic cardiac segmentation is still challenging due to the heterogeneous intensity distributions and indistinct boundaries in the images. In this paper, we propose a multiscale dual-path feature aggregation network (MDFA-Net) to solve misclassification and shape discontinuity problems. The proposed network aims to maintain a realistic shape of the segmentation results and is divided into two parts: the first part is a non-downsampling multiscale nested network (MN-Net), which restrains the continuous cardiac shape and maintains shallow information, and the second part is a non-symmetric encoding and decoding network (nSED-Net) that can retain deep details and overcome misclassification. We conducted four-fold cross-validation experiments on balanced steady-state free precession cine cardiac magnetic resonance (bSSFP cine CMR) sequences, edema-sensitive T2-weighted black blood spectral presaturation attenuated inversion-recovery (T2-SPAIR) CMR sequences, and late gadolinium enhancement (LGE) CMR sequences, which include 45 cases per sequence. The data are provided by the organizer of the Multi-sequence Cardiac MR Segmentation Challenge (MS-CMRSeg 2019), held in conjunction with the 2019 Medical Image Computing and Computer Assisted Interventions (MICCAI) conference. We also conducted external validation experiments on the data of the 2020 MICCAI myocardial pathology segmentation challenge (MyoPS 2020). Whether in the four-fold cross-validation experiments or the external validation experiments, the proposed method ranks first or second in the segmentation tasks of multi-sequence CMR images. The subjective evaluation also agrees with the objective evaluation metrics. The code will be posted at https://github.com/fly1995/MDFA-Net/.
Article
Recently, ConvNeXt, constructed from standard ConvNet modules, has produced competitive performance in various image applications. In this paper, an efficient model based on the classical UNet, which achieves promising results with a small number of parameters, is proposed for medical image segmentation. Inspired by ConvNeXt, the designed model is called ConvUNeXt and aims to reduce the number of parameters while retaining outstanding segmentation ability. Specifically, we first improve the convolution block of UNet by using large convolution kernels and depth-wise separable convolution to considerably decrease the number of parameters; then residual connections are added in both encoder and decoder, and pooling is abandoned in favor of convolution for down-sampling; for the skip connection, a lightweight attention mechanism is designed to filter out noise in low-level semantic information and suppress irrelevant features, so that the network pays more attention to the target area. Compared to the standard UNet, our model has 20% fewer parameters; meanwhile, experimental results on different datasets show that it exhibits superior segmentation performance whether the amount of data is scarce or sufficient. Code will be available at https://github.com/1914669687/ConvUNeXt.
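The parameter saving from replacing a standard convolution with a depth-wise separable one, as used in ConvUNeXt's convolution block, can be illustrated with a simple count (a back-of-the-envelope sketch; biases and the model's other layers are ignored):

```python
def conv_params(c_in, c_out, k):
    # Standard convolution: one k x k kernel per (input, output) channel pair.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    # Depth-wise: one k x k kernel per input channel,
    # followed by a 1x1 point-wise convolution mixing channels.
    return c_in * k * k + c_in * c_out
```

For a 3x3 convolution from 64 to 128 channels, the standard version needs 73,728 weights while the separable version needs 8,768, a reduction of roughly 8x, which is where much of the model's parameter budget is saved.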
Article
Deep learning technology has greatly improved the performance of target tracking, but most recently developed tracking algorithms are short-term tracking algorithms, which cannot meet the actual engineering needs. Based on the Siamese network structure, this paper proposes a long-term tracking framework with a persistent tracking capability. The global proposal module extends the search area globally through the construction of a feature pyramid. The local regression module is mainly responsible for the confidence evaluation of the candidate regions and for performing more accurate bounding box regression. To improve the discriminative ability of the regression network, the error samples are eliminated by synthesizing the temporal information and are then classified through a verification module in advance. Experiments on the VOT long-term tracking dataset and the UAV20L aerial dataset show that the proposed algorithm achieves state-of-the-art performance.
Article
Nighttime construction is widely conducted in many construction scenarios, but it is also much riskier due to low lighting conditions and fatiguing environments. Therefore, this study proposes a vision-based method specifically for automatic tracking of construction machines at nighttime by integrating deep-learning-based illumination enhancement. Five main modules are involved in the proposed method: illumination enhancement, machine detection, Kalman filter tracking, machine association, and linear assignment. A testing experiment based on nine nighttime videos is then conducted to evaluate the tracking performance of this approach. The results show that the method developed in this study achieved 95.1% in MOTA and 75.9% in MOTP. Compared with the baseline method SORT, the proposed method improves tracking robustness by 21.7% in nighttime construction scenarios. The proposed methodology can also help accomplish automated surveillance tasks in nighttime construction to improve productivity and safety performance.
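The Kalman filter tracking module mentioned above follows the standard predict/update cycle; below is a minimal 1-D constant-velocity sketch (the actual method presumably tracks 2-D bounding boxes, and all parameter values here are illustrative assumptions, not the paper's):

```python
import numpy as np

def kalman_step(x, P, z, dt=1.0, q=1e-2, r=1.0):
    """One predict/update cycle of a constant-velocity Kalman filter
    tracking a 1-D position; x = [pos, vel], z = measured position."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # state transition
    H = np.array([[1.0, 0.0]])              # we observe position only
    Q = q * np.eye(2)                       # process noise covariance
    R = np.array([[r]])                     # measurement noise covariance
    # Predict
    x = F @ x
    P = F @ P @ F.T + Q
    # Update
    y = z - H @ x                           # innovation
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)          # Kalman gain
    x = x + (K @ y).ravel()
    P = (np.eye(2) - K @ H) @ P
    return x, P
```

Fed a steady stream of detections, the filter converges on both the position and the (unobserved) velocity, which is what lets the tracker predict a machine's location when a detection is briefly missed.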
Article
Local features have been widely used in visual tracking to improve robustness in the presence of partial occlusion, deformation, and rotation. In this paper, a structured object tracking algorithm, which uses local discriminative color (LoDC) patch representation and discriminative patch attributed relational graph (DPARG) matching, is proposed. Unlike several existing local feature-based algorithms that divide an object into some rectangular patches of fixed sizes while separately locating each patch to track the object, the proposed algorithm relies on a discriminative color model to distinguish the outstanding colors of the given object. Thus, the multimodal color object is represented by multiple unimodal, homogeneous, and discriminative patches. Moreover, these patches are assembled in a structured DPARG, in which vertexes describe the object’s local discriminative patches while encoding the appearance information, and edges express the relations between vertexes while encoding inner geometric structure information. The object tracking is then formulated as inexact matching of the dynamic undirected graph. The changes of DPARG, along with dynamic environments, are used to filter out invalid patches at the current frame, which usually correspond to those abnormal patches emerging from partial occlusion, similar color disturbances, etc. Finally, the valid patches are assembled to locate the object. The experimental results on the popular tracking benchmark datasets exhibit that the proposed algorithm is reliable enough in tracking even in the presence of serious appearance changes, partial occlusion, and background clutter.
Article
Full-text available
Although numerous recent tracking approaches have made tremendous advances in the last decade, achieving high-performance visual tracking remains a challenge. In this paper, we propose an end-to-end network model to learn reinforced attentional representation for accurate target object discrimination and localization. We utilize a novel hierarchical attentional module with long short-term memory and multi-layer perceptrons to leverage both inter- and intra-frame attention to effectively facilitate visual pattern emphasis. Moreover, we incorporate a contextual attentional correlation filter into the backbone network to make our model trainable in an end-to-end fashion. Our proposed approach not only takes full advantage of informative geometries and semantics but also updates correlation filters online without fine-tuning the backbone network to enable the adaptation of variations in the target object's appearance. Extensive experiments conducted on several popular benchmark datasets demonstrate that our proposed approach is effective and computationally efficient.
Article
Full-text available
Visual tracking is one of the most fundamental topics in computer vision. Numerous tracking approaches based on discriminative correlation filters or Siamese convolutional networks have attained remarkable performance over the past decade. However, developing robust and effective trackers that achieve satisfying performance with high computational and memory efficiency in real-world scenarios is still commonly recognized as an open research problem. In this paper, we investigate the impacts of three main aspects of visual tracking, i.e., the backbone network, the attentional mechanism, and the detection component, and propose a Siamese Attentional Keypoint Network, dubbed SATIN, for efficient tracking and accurate localization. Firstly, a new Siamese lightweight hourglass network is specially designed for visual tracking. It takes advantage of the repeated bottom-up and top-down inference to capture more global and local contextual information at multiple scales. Secondly, a novel cross-attentional module is utilized to leverage both channel-wise and spatial intermediate attentional information, which can enhance both the discriminative and localization capabilities of feature maps. Thirdly, a keypoint detection approach is introduced to trace any target object by detecting the top-left corner point, the centroid point, and the bottom-right corner point of its bounding box. Therefore, our SATIN tracker not only has a strong capability to learn effective object representations, but is also computationally and memory efficient during both the training and testing stages. To the best of our knowledge, we are the first to propose this approach. Without bells and whistles, experimental results demonstrate that our approach achieves state-of-the-art performance on several recent benchmark datasets, at a speed far exceeding 27 frames per second.
Article
Full-text available
Thermal infrared (TIR) pedestrian tracking is one of the important components among the numerous applications of computer vision, and it has a major advantage: it can track pedestrians in total darkness. The ability to evaluate TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field. However, no such benchmark dataset has been available. In this paper, we develop a TIR pedestrian tracking dataset for TIR pedestrian tracker evaluation. The dataset includes 60 thermal sequences with manual annotations. Each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carried out large-scale evaluation experiments on our benchmark dataset using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. In addition, to gain more insight into TIR pedestrian trackers, we divided their functions into three components: feature extractor, motion model, and observation model. We then conducted three comparison experiments on our benchmark dataset to validate how each component affects the tracker's performance. The findings of these experiments provide some guidelines for future research.
Article
Full-text available
Common tracking algorithms use only a single feature to describe the target appearance, which makes the appearance model easily disturbed by noise and clearly limits the tracking performance and robustness of these trackers. In this paper, we propose a novel multiple-feature fusion model within a correlation filter framework for visual tracking to improve the tracking performance and robustness of the tracker. In different tracking scenarios, the response maps generated by the correlation filter framework differ for each feature. Based on these response maps, different features can use an adaptive weighting function to eliminate noise interference and maintain their respective advantages, which enhances the tracking performance and robustness of the tracker efficiently. Meanwhile, the correlation filter framework provides a fast training and accurate locating mechanism. In addition, we give a simple yet effective scale variation detection method, which can appropriately handle scale variation of the target in the tracking sequences. We evaluate our tracker on the OTB2013/OTB50/OTB2015 benchmarks, which together include more than 100 video sequences. Extensive experiments on these benchmark datasets demonstrate that the proposed MFFT tracker performs favorably against the state-of-the-art trackers.
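An adaptive weighting of per-feature response maps can be sketched with the peak-to-sidelobe ratio (PSR) as the confidence measure; PSR is a common confidence score in correlation filter tracking, but it is an assumption here, not necessarily the weighting function this paper uses:

```python
import numpy as np

def psr(response):
    """Peak-to-sidelobe ratio: how sharp/confident a response map is."""
    peak = response.max()
    py, px = np.unravel_index(response.argmax(), response.shape)
    mask = np.ones_like(response, dtype=bool)
    mask[max(py - 2, 0):py + 3, max(px - 2, 0):px + 3] = False  # exclude peak area
    side = response[mask]
    return (peak - side.mean()) / (side.std() + 1e-8)

def fuse_responses(responses):
    """Weight each feature's response map by its PSR, then sum."""
    w = np.array([psr(r) for r in responses])
    w = w / w.sum()
    return sum(wi * r for wi, r in zip(w, responses))
```

A sharply peaked map gets a large PSR and dominates the fusion, while a noisy, ambiguous map is effectively down-weighted, which is the intended noise-suppression behaviour.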
Article
Full-text available
Most correlation filter based tracking algorithms achieve good performance while maintaining fast computational speed. However, in some complicated tracking scenes, there is a critical defect that causes the object to be located inaccurately: these trackers depend excessively on the maximum response value to determine the object location. To address this problem, we propose a particle filter redetection based tracking approach for accurate object localization. During the tracking process, the kernelized correlation filter (KCF) based tracker locates the object by relying on the maximum response value of the response map; when the response map becomes ambiguous, the tracking result becomes correspondingly unreliable. Our redetection model can provide abundant object candidates through a particle resampling strategy to detect the object accordingly. Additionally, for the target scale variation problem, we give a new object scale evaluation mechanism, which merely considers the differences between the maximum response values in consecutive frames to determine the scale change of the target. Extensive experiments on the OTB2013 and OTB2015 datasets demonstrate that the proposed tracker performs favorably against the state-of-the-art methods.
Conference Paper
Full-text available
We propose a new context-aware correlation filter based tracking framework to achieve both high computational speed and state-of-the-art performance among real-time trackers. The major contribution to the high computational speed lies in the proposed deep feature compression that is achieved by a context-aware scheme utilizing multiple expert auto-encoders; a context in our framework refers to the coarse category of the tracking target according to appearance patterns. In the pre-training phase, one expert auto-encoder is trained per category. In the tracking phase, the best expert auto-encoder is selected for a given target, and only this auto-encoder is used. To achieve high tracking performance with the compressed feature map, we introduce extrinsic denoising processes and a new orthogonality loss term for pre-training and fine-tuning of the expert auto-encoders. We validate the proposed context-aware framework through a number of experiments, where our method achieves a comparable performance to state-of-the-art trackers which cannot run in real-time, while running at a significantly fast speed of over 100 fps.
Article
Full-text available
We propose a novel visual tracking algorithm based on the representations from a discriminatively trained Convolutional Neural Network (CNN). Our algorithm pretrains a CNN using a large set of videos with tracking ground-truths to obtain a generic target representation. Our network is composed of shared layers and multiple branches of domain-specific layers, where domains correspond to individual training sequences and each branch is responsible for binary classification to identify target in each domain. We train each domain in the network iteratively to obtain generic target representations in the shared layers. When tracking a target in a new sequence, we construct a new network by combining the shared layers in the pretrained CNN with a new binary classification layer, which is updated online. Online tracking is performed by evaluating the candidate windows randomly sampled around the previous target state. The proposed algorithm illustrates outstanding performance in existing tracking benchmarks.
Conference Paper
Full-text available
We propose a new tracking framework with an attentional mechanism that chooses a subset of the associated correlation filters for increased robustness and computational efficiency. The subset of filters is adaptively selected by a deep attentional network according to the dynamic properties of the tracking target. Our contributions are manifold, and are summarised as follows: (i) Introducing the Attentional Correlation Filter Network which allows adaptive tracking of dynamic targets. (ii) Utilising an attentional network which shifts the attention to the best candidate modules, as well as predicting the estimated accuracy of currently inactive modules. (iii) Enlarging the variety of correlation filters which cover target drift, blurriness, occlusion, scale changes, and flexible aspect ratio. (iv) Validating the robustness and efficiency of the attentional mechanism for visual tracking through a number of experiments. Our method achieves similar performance to non real-time trackers, and state-of-the-art performance amongst real-time trackers.
Article
Full-text available
Unlike the visual object tracking, thermal infrared object tracking can track a target object in total darkness. Therefore, it has broad applications, such as in rescue and video surveillance at night. However, there are few studies in this field mainly because thermal infrared images have several unwanted attributes, which make it difficult to obtain the discriminative features of the target. Considering the powerful representational ability of convolutional neural networks and their successful application in visual tracking, we transfer the pre-trained convolutional neural networks based on visible images to thermal infrared tracking. We observe that the features from the fully-connected layer are not suitable for thermal infrared tracking due to the lack of spatial information of the target, while the features from the convolution layers are. Besides, the features from a single convolution layer are not robust to various challenges. Based on this observation, we propose a correlation filter based ensemble tracker with multi-layer convolutional features for thermal infrared tracking (MCFTS). Firstly, we use pre-trained convolutional neural networks to extract the features of the multiple convolution layers of the thermal infrared target. Then, a correlation filter is used to construct multiple weak trackers with the corresponding convolution layer features. These weak trackers give the response maps of the target’s location. Finally, we propose an ensemble method that coalesces these response maps to get a stronger one. Furthermore, a simple but effective scale estimation strategy is exploited to boost the tracking accuracy. To evaluate the performance of the proposed tracker, we carry out experiments on two thermal infrared tracking benchmarks: VOT-TIR 2015 and VOT-TIR 2016. The experimental results demonstrate that our tracker is effective and achieves promising performance.
Article
Full-text available
In recent years, Discriminative Correlation Filter (DCF) based methods have significantly advanced the state-of-the-art in tracking. However, in the pursuit of ever increasing tracking performance, their characteristic speed and real-time capability have gradually faded. Further, the increasingly complex models, with massive number of trainable parameters, have introduced the risk of severe over-fitting. In this work, we tackle the key causes behind the problems of computational complexity and over-fitting, with the aim of simultaneously improving both speed and performance. We revisit the core DCF formulation and introduce: (i) a factorized convolution operator, which drastically reduces the number of parameters in the model; (ii) a compact generative model of the training sample distribution, that significantly reduces memory and time complexity, while providing better diversity of samples; (iii) a conservative model update strategy with improved robustness and reduced complexity. We perform comprehensive experiments on four benchmarks: VOT2016, UAV123, OTB-2015, and TempleColor. When using expensive deep features, our tracker provides a 20-fold speedup and achieves a 13.3% relative gain in Expected Average Overlap compared to the top ranked method in the VOT2016 challenge. Moreover, our fast variant, using hand-crafted features, operates at 60 Hz on a single CPU, while obtaining 64.8% AUC on OTB-2015.
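The factorized convolution operator above reduces the model by projecting the D-channel feature map onto C ≪ D channels before filtering; here is a minimal sketch of just that projection step (in ECO the projection matrix is learned jointly with the filter, which is omitted here, and the shapes are illustrative):

```python
import numpy as np

def factorized_features(feat, P):
    """Project a D-channel feature map down to C channels (C << D),
    the dimensionality-reduction half of the factorized convolution idea.

    feat: (H, W, D) feature map; P: (D, C) projection matrix.
    """
    return feat @ P  # (H, W, D) @ (D, C) -> (H, W, C)
```

Because the correlation filter is then learned over C channels instead of D, the number of filter parameters drops by the same factor, which is the main source of the reported speedup.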
Conference Paper
Full-text available
The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object’s appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
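The core operation of a fully-convolutional Siamese tracker is cross-correlating an exemplar (template) embedding with a search-region embedding to produce a score map; a naive single-channel sketch follows (real trackers correlate deep feature maps rather than raw pixels, and use an efficient convolution implementation):

```python
import numpy as np

def cross_correlate(template, search):
    """Slide the template over the search region and score every offset
    (valid cross-correlation), as in a fully-convolutional Siamese tracker."""
    th, tw = template.shape
    sh, sw = search.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template * search[i:i + th, j:j + tw])
    return out
```

The argmax of the score map gives the predicted target offset within the search region, which is all the localization the basic tracker needs per frame.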
Conference Paper
Full-text available
The Visual Object Tracking challenge VOT2016 aims at comparing short-term single-object visual trackers that do not apply pre-learned models of object appearance. Results of 70 trackers are presented, with a large number of trackers being published at major computer vision conferences and journals in the recent years. The number of tested state-of-the-art trackers makes the VOT 2016 the largest and most challenging benchmark on short-term tracking to date. For each participating tracker, a short description is provided in the Appendix. The VOT2016 goes beyond its predecessors by (i) introducing a new semi-automatic ground truth bounding box annotation methodology and (ii) extending the evaluation system with the no-reset experiment. The dataset, the evaluation kit as well as the results are publicly available at the challenge website (http://votchallenge.net).
Article
Full-text available
Robustness and efficiency are the two main goals of existing trackers. Most robust trackers are implemented with combined features or models accompanied by a high computational cost. To achieve robust and efficient tracking performance, we propose a multi-view correlation tracker. On one hand, the robustness of the tracker is enhanced by the multi-view model, which fuses several features and selects the more discriminative ones for tracking. On the other hand, the correlation filter framework provides fast training and efficient target locating. The multiple features are fused at the model level of the correlation filter, which is effective and efficient. In addition, we propose a simple but effective scale-variation detection mechanism, which strengthens the stability of scale variation tracking. We evaluate our tracker on the online tracking benchmark (OTB) and two visual object tracking benchmarks (VOT2014, VOT2015). These three datasets contain more than 100 video sequences in total. On all three datasets, the proposed approach achieves promising performance.
Conference Paper
Full-text available
In this paper, we propose a new aerial video dataset and benchmark for low altitude UAV target tracking, as well as a photo-realistic UAV simulator that can be coupled with tracking methods. Our benchmark provides the first evaluation of many state-of-the-art and popular trackers on 123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective. Among the compared trackers, we determine which ones are the most suitable for UAV tracking in terms of both tracking accuracy and run-time. The simulator can be used to evaluate tracking algorithms in real-time scenarios before they are deployed on a UAV "in the field", as well as to generate synthetic but photo-realistic tracking datasets with automatic ground truth annotations to easily extend existing real-world datasets. Both the benchmark and simulator are made publicly available to the vision community on our website to further research in the area of object tracking from UAVs (https://ivul.kaust.edu.sa/Pages/pub-benchmark-simulator-uav.aspx).
Article
Full-text available
Robust and accurate visual tracking is one of the most challenging computer vision problems. Due to the inherent lack of training data, a robust approach for constructing a target appearance model is crucial. Recently, discriminatively learned correlation filters (DCF) have been successfully applied to address this problem for tracking. These methods utilize a periodic assumption of the training samples to efficiently learn a classifier on all patches in the target neighborhood. However, the periodic assumption also introduces unwanted boundary effects, which severely degrade the quality of the tracking model. We propose Spatially Regularized Discriminative Correlation Filters (SRDCF) for tracking. A spatial regularization component is introduced in the learning to penalize correlation filter coefficients depending on their spatial location. Our SRDCF formulation allows the correlation filters to be learned on a significantly larger set of negative training samples, without corrupting the positive samples. We further propose an optimization strategy, based on the iterative Gauss-Seidel method, for efficient online learning of our SRDCF. Experiments are performed on four benchmark datasets: OTB-2013, ALOV++, OTB-2015, and VOT2014. Our approach achieves state-of-the-art results on all four datasets. On OTB-2013 and OTB-2015, we obtain an absolute gain of 8.0% and 8.2% respectively, in mean overlap precision, compared to the best existing trackers.
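SRDCF's spatial regularization penalizes correlation filter coefficients according to their distance from the target centre; a sketch of such a quadratically growing weight map is below (the constants are illustrative assumptions, not the paper's values):

```python
import numpy as np

def spatial_reg_weights(h, w, base=0.1, growth=3.0):
    """Penalty weights that grow quadratically away from the target centre,
    so filter energy assigned to background regions is suppressed."""
    ys = (np.arange(h) - (h - 1) / 2) / (h / 2)  # normalized vertical offsets
    xs = (np.arange(w) - (w - 1) / 2) / (w / 2)  # normalized horizontal offsets
    return base + growth * (ys[:, None] ** 2 + xs[None, :] ** 2)
```

Multiplying the filter by these weights inside the regularization term keeps the learned coefficients concentrated on the target, which is what allows training on a much larger (background-rich) search region without corrupting the model.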
Article
Full-text available
Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual object tracking. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. However, the underlying DCF formulation is restricted to single-resolution feature maps, significantly limiting its potential. In this paper, we go beyond the conventional DCF framework and introduce a novel formulation for training continuous convolution filters. We employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. Our proposed formulation enables efficient integration of multi-resolution deep feature maps, leading to superior results on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color (+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate). Additionally, our approach is capable of sub-pixel localization, crucial for the task of accurate feature point tracking. We also demonstrate the effectiveness of our learning formulation in extensive feature point tracking experiments. Code and supplementary material are available at http://www.cvl.isy.liu.se/research/objrec/visualtracking/conttrack/index.html.
Article
Discriminative correlation filters (DCFs) have been widely used in the tracking community recently. DCFs-based trackers utilize samples generated by circularly shifting from an image patch to train a ridge regression model, and estimate target location using a response map generated by the correlation filters. However, the generated samples produce some negative effects and the response map is vulnerable to noise interference, which degrades tracking performance. In this paper, to solve the aforementioned drawbacks, we propose a target-focusing convolutional regression (CR) model for visual object tracking tasks (called TFCR). This model uses a target-focusing loss function to alleviate the influence of background noise on the response map of the current tracking image frame, which effectively improves the tracking accuracy. In particular, it can effectively balance the disequilibrium of positive and negative samples by reducing some effects of the negative samples that act on the object appearance model. Extensive experimental results illustrate that our TFCR tracker achieves competitive performance compared with state-of-the-art trackers.
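The circularly shifted training samples that DCF-based trackers such as the one above rely on can be generated explicitly as follows; in practice the shifts are never materialized, since the Fourier domain handles them implicitly, but the explicit form clarifies what the ridge regression model is actually trained on:

```python
import numpy as np

def cyclic_samples(patch):
    """All 2-D circular shifts of a base patch -- the virtual training
    set used by correlation-filter trackers such as KCF."""
    h, w = patch.shape
    return np.stack([np.roll(np.roll(patch, dy, axis=0), dx, axis=1)
                     for dy in range(h) for dx in range(w)])
```

An h x w patch thus yields h*w training samples; the wrap-around artifacts at the sample borders are exactly the "negative effects" of the generated samples that the paper's target-focusing loss is designed to suppress.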
Article
Deep convolutional neural networks (CNNs) have attracted considerable interest in low-level computer vision. Research is usually devoted to improving performance via very deep CNNs. However, as the depth increases, the influence of the shallow layers on the deep layers is weakened. Inspired by this fact, we propose an attention-guided denoising convolutional neural network (ADNet), mainly comprising a sparse block (SB), a feature enhancement block (FEB), an attention block (AB), and a reconstruction block (RB) for image denoising. Specifically, the SB makes a tradeoff between performance and efficiency by using dilated and common convolutions to remove the noise. The FEB integrates global and local feature information via a long path to enhance the expressive ability of the denoising model. The AB is used to finely extract the noise information hidden in the complex background, which is very effective for complex noisy images, especially real noisy images, and for blind denoising. The FEB is also integrated with the AB to improve efficiency and reduce the complexity of training a denoising model. Finally, the RB constructs the clean image from the obtained noise mapping and the given noisy image. Additionally, comprehensive experiments show that the proposed ADNet performs very well on three tasks (i.e., synthetic noisy images, real noisy images, and blind denoising) in terms of both quantitative and qualitative evaluations. The code of ADNet is accessible at http://www.yongxu.org/lunwen.html.
Article
Deep convolutional neural networks (CNNs) have attracted great attention in the field of image denoising. However, there are two drawbacks: (1) it is very difficult to train a deeper CNN for denoising tasks, and (2) most of deeper CNNs suffer from performance saturation. In this paper, we report the design of a novel network called a batch-renormalization denoising network (BRDNet). Specifically, we combine two networks to increase the width of the network, and thus obtain more features. Because batch renormalization is fused into BRDNet, we can address the internal covariate shift and small mini-batch problems. Residual learning is also adopted in a holistic way to facilitate the network training. Dilated convolutions are exploited to extract more information for denoising tasks. Extensive experimental results show that BRDNet outperforms state-of-the-art image-denoising methods. The code of BRDNet is accessible at http://www.yongxu.org/lunwen.html.
Article
Correlation filters (CF) have demonstrated a good performance in visual tracking. However, the base training sample region is larger than the object region, including the interference region (IR). IRs in training samples from cyclic shifts of the base training sample severely degrade the quality of the tracking model. In this paper, a region-filtering correlation tracking (RFCT) algorithm is proposed to address this problem. In this algorithm, we filter training samples by introducing a spatial map into the standard CF formulation. Compared with the existing correlation filter trackers, the proposed tracker has the following advantages. (1) Using a spatial map, the correlation filter can be learned on a larger search region without the interference of IR. (2) Due to processing training samples by a spatial map, it is a more general way to control background information and target information in training samples. In addition, a better spatial map can be explored, the values of which are not restricted. Quantitative evaluations are performed on four benchmark datasets: OTB-2013, OTB-2015, VOT2015, and VOT2016. Experimental results demonstrate that the proposed RFCT algorithm performs favorably against several state-of-the-art methods.
Article
The performance of the tracking task directly depends on the appearance features of the target object, so a robust method for constructing appearance features is crucial for avoiding tracking failure. Tracking methods based on Convolutional Neural Networks (CNNs) have exhibited excellent performance in the past years. However, the features from each original convolutional layer are not robust to changes in the size of the target object; once the target size changes significantly, the tracker drifts away from the target. In this paper, we present a novel tracker based on multi-scale features, spatiotemporal features, and a deep residual network to accurately estimate the size of the target object, allowing the tracker to locate the target in consecutive video frames. To handle multi-scale changes in visual object tracking, we sample each input image with 67 templates of different sizes and resize the samples to a fixed size. These samples are then used to train, offline, a deep residual network model with the multi-scale features we have built. Spatial and temporal features are subsequently fused into this model, yielding a deep multi-scale spatiotemporal feature model named MSST-ResNet. Finally, the MSST-ResNet feature model is transferred to the tracking task and combined with three different Kernelized Correlation Filters (KCFs) to accurately locate the target in consecutive video frames. Unlike previous trackers, we directly learn the various changes of the target appearance by building the MSST-ResNet feature model. The experimental results demonstrate that the proposed tracking method outperforms state-of-the-art tracking methods.
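The multi-scale sampling step can be sketched as cropping the target at several scales and resizing every crop to a fixed template size. The three scales and the nearest-neighbour resize below are illustrative simplifications (the paper uses 67 size templates and a full training pipeline).

```python
import numpy as np

def multiscale_samples(image, box, scales, out_size=64):
    """Crop the target box at several scales, then resize every crop
    to a fixed out_size x out_size template (nearest-neighbour)."""
    x, y, w, h = box
    samples = []
    for s in scales:
        sw, sh = max(1, int(w * s)), max(1, int(h * s))
        crop = image[y:y + sh, x:x + sw]
        # Nearest-neighbour index maps for a dependency-free resize.
        ry = np.arange(out_size) * crop.shape[0] // out_size
        rx = np.arange(out_size) * crop.shape[1] // out_size
        samples.append(crop[np.ix_(ry, rx)])
    return samples

image = np.random.rand(100, 100)
samples = multiscale_samples(image, (10, 10, 20, 20), [0.5, 1.0, 2.0])
```

Feeding fixed-size crops taken at many scales is what lets the offline-trained network see, and therefore become robust to, large size changes of the same target.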
Article
Psychological and cognitive findings indicate that human visual perception is attentive and selective, and may process spatial and appearance-selective attentions in parallel. Reflecting some aspects of these attentions, this paper presents a novel correlation filter (CF) based tracking approach that processes a local and a semi-local background domain, respectively. In the local domain, inspired by the Gestalt principle of figure-ground segregation, we leverage an efficient Boolean map representation, which characterizes an image by a set of Boolean maps obtained by randomly thresholding its color channels and yields a location response map as a weighted sum of all Boolean maps. The Boolean maps capture the topological structures of the target and its scene at different granularities, thereby effectively improving the tracking of non-rectangular objects. In the semi-local domain, we introduce a novel distractor-resilient metric regularization into the CF, which acts as a force pushing distractors into the negative space. Consequently, the unwanted boundary effects of the CF can be effectively alleviated. Finally, the models associated with the local and semi-local domains are seamlessly integrated into a Bayesian framework, and the tracked location is determined by maximizing its likelihood function. Extensive evaluations on the OTB50, OTB100, VOT2016, and VOT2017 tracking benchmarks demonstrate that the proposed method achieves favorable performance against a variety of state-of-the-art trackers at a speed of 45 fps on a single CPU.
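The Boolean map construction is simple enough to sketch: threshold each colour channel at several levels, treat each binarization as one Boolean map, and sum the maps into a location response. This sketch uses fixed thresholds and uniform weights, whereas the paper uses random thresholds and a weighted sum.

```python
import numpy as np

def boolean_map_response(channels, thresholds):
    """Location response as the (uniformly weighted) average of Boolean
    maps obtained by thresholding each colour channel at several levels."""
    response = np.zeros_like(channels[0], dtype=float)
    for ch in channels:
        for t in thresholds:
            response += (ch > t)
    return response / (len(channels) * len(thresholds))

# A single toy channel with one bright "figure" pixel at the centre:
# the pixel survives every threshold, so its response saturates at 1.
ch = np.zeros((5, 5))
ch[2, 2] = 1.0
resp = boolean_map_response([ch], [0.25, 0.5, 0.75])
```

Pixels that survive many thresholds across many channels accumulate high response, which is why the summed maps trace the topological structure of the figure rather than a rectangular box.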
Article
Online object tracking is a fundamental problem in computer vision and is crucial to applications in numerous fields such as guided missiles, video surveillance, and unmanned aerial vehicles. Despite many studies on visual tracking, the tracking process still faces many challenges, including illumination variation, rotation, scale change, deformation, occlusion, and camera motion. To provide a clear understanding of visual tracking, this paper summarizes visual tracking algorithms. First, the meaning of the task and related work are briefly introduced. Second, typical algorithms are classified, summarized, and analyzed from two perspectives: traditional algorithms and deep learning algorithms. Finally, open problems and predictions for the future of visual tracking are discussed.
Article
Due to factors such as fast motion, cluttered backgrounds, arbitrary appearance variation, and shape deformation, an effective target representation plays a key role in robust visual tracking. Existing methods often employ bounding boxes for target representation, which are easily polluted by noisy background clutter and may cause drift when the target undergoes large-scale non-rigid or articulated motion. To address this issue, motivated by the spatio-temporal nonlocality of target appearance reoccurrence in a video, we explore nonlocal information to accurately represent and segment the target, yielding an object likelihood map that regularizes a correlation filter (CF) for visual tracking. Specifically, given a set of tracked target bounding boxes, we first generate a set of superpixels to represent the foreground and background, and then update the appearance of each superpixel with its long-term spatio-temporally nonlocal counterparts. With the updated appearances, we formulate a spatio-temporal graphical model comprising superpixel label-consistency potentials. We then generate a segmentation by optimizing the graphical model, iteratively updating the appearance model and estimating the labels. Finally, from the segmentation mask we obtain an object likelihood map that adaptively regularizes the CF learning, suppressing clutter background noise while making full use of long-term stable target appearance information. Extensive evaluations on the OTB50, SegTrack, and Youtube-Objects datasets demonstrate the effectiveness of the proposed method, which performs favorably against state-of-the-art methods.
Article
Thermal infrared (TIR) pedestrian tracking is one of the most important components in numerous computer vision applications, with a major advantage: it can track pedestrians in total darkness. Evaluating TIR pedestrian trackers fairly on a benchmark dataset is significant for the development of this field; however, no such benchmark dataset has existed. In this paper, we develop a TIR pedestrian tracking dataset for evaluating TIR pedestrian trackers. The dataset includes 60 thermal sequences with manual annotations, and each sequence has nine attribute labels for attribute-based evaluation. In addition to the dataset, we carry out large-scale evaluation experiments on our benchmark using nine publicly available trackers. The experimental results help us understand the strengths and weaknesses of these trackers. Moreover, to gain deeper insight into TIR pedestrian trackers, we divide a tracker into three components: feature extractor, motion model, and observation model. We then conduct three comparison experiments on our benchmark to validate how each component affects the tracker's performance. The findings of these experiments provide guidelines for future research.
Article
This paper improves state-of-the-art online trackers that use deep learning. Such trackers train a deep network to pick a specified object out from the background in an initial frame (initialization) and then keep training the model as tracking proceeds (updates). Our core contribution is a meta-learning-based method to adjust deep networks for tracking using offline training. First, we learn initial parameters and per-parameter coefficients for fast online adaptation. Second, we use a training signal from future frames for robustness to target appearance variations and environment changes. The resulting networks train significantly faster during initialization while improving robustness and accuracy. We demonstrate this approach on top of the current highest-accuracy tracking approach, the tracking-by-detection-based MDNet, and its close competitor, the correlation-based CREST. Experimental results on the standard OTB and VOT2016 benchmarks show improvements in speed, accuracy, and robustness for both trackers.
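The per-parameter adaptation idea can be sketched in MAML style: offline meta-training produces an initialization `theta0` and element-wise coefficients `alphas`, and online initialization is then just a few scaled gradient steps. The quadratic loss and the concrete values below are illustrative assumptions, not the actual Meta-Tracker training setup.

```python
import numpy as np

def meta_adapt(theta0, alphas, grad_fn, steps=1):
    """Fast online adaptation: start from meta-learned initial parameters
    `theta0` and take gradient steps scaled element-wise by the
    meta-learned per-parameter coefficients `alphas`."""
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - alphas * grad_fn(theta)
    return theta

# Toy per-parameter quadratic loss 0.5*(theta - target)^2,
# whose gradient is simply (theta - target).
target = np.array([1.0, -2.0])
theta = meta_adapt(np.zeros(2), np.array([1.0, 0.5]), lambda t: t - target)
```

Because each parameter gets its own step size, one or two updates can move fast-adapting parameters a long way while leaving stable ones nearly untouched, which is what shortens the initialization phase.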
Article
Most thermal infrared (TIR) tracking methods are discriminative, treating tracking as a classification task. However, the objective of the classifier (label prediction) is not coupled to the objective of the tracker (location estimation): the classification task focuses on the between-class difference of arbitrary objects, while the tracking task mainly deals with the within-class difference of the same object. In this paper, we cast TIR tracking as a similarity verification task, which is well coupled to the objective of tracking. We propose a TIR tracker based on a hierarchical Siamese convolutional neural network (CNN), named HSNet. To obtain both spatial and semantic features of the TIR object, we design a Siamese CNN coalescing multiple hierarchical convolutional layers. We train this network end to end on a large visible-video detection dataset to learn the similarity between paired objects before transferring the network to the TIR domain. The pre-trained Siamese network is then used to evaluate the similarity between the target template and target candidates, and the most similar candidate is located as the tracked target. Extensive experimental results on the VOT-TIR 2015 and VOT-TIR 2016 benchmarks show that the proposed method achieves favorable performance against state-of-the-art methods.
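The similarity-verification step reduces to scoring each candidate against the template embedding and keeping the best match. The 2-D vectors and cosine similarity below are illustrative stand-ins for the hierarchical Siamese CNN features and the learned similarity of HSNet.

```python
import numpy as np

def best_candidate(template, candidates):
    """Similarity verification: score each candidate embedding against
    the target template with cosine similarity, return the best index."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = [cos(template, c) for c in candidates]
    return int(np.argmax(scores)), scores

idx, scores = best_candidate(np.array([1.0, 0.0]),
                             [np.array([0.0, 1.0]), np.array([1.0, 0.2])])
```

Unlike a classifier's label objective, this score directly ranks candidate locations by how much they look like *this* target, which is exactly what the localization step needs.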
Article
Recently, deep learning has achieved great success in visual tracking. The goal of this paper is to review the state-of-the-art tracking methods based on deep learning. First, we introduce the background of deep visual tracking, including the fundamental concepts of visual tracking and related deep learning algorithms. Second, we categorize existing deep-learning-based trackers into three classes according to network structure, network function, and network training; for each category, we analyze it from the network perspective and discuss its representative papers. Then, we conduct extensive experiments to compare representative methods on the popular OTB-100, TC-128, and VOT2015 benchmarks. Based on our observations, we conclude that: (1) using a convolutional neural network (CNN) model can significantly improve tracking performance; (2) trackers that use a CNN model to distinguish the tracked object from its surrounding background obtain more accurate results, while using the CNN model for template matching is usually faster; (3) trackers with deep features perform much better than those with low-level hand-crafted features; (4) deep features from different convolutional layers have different characteristics, and their effective combination usually results in a more robust tracker; (5) deep visual trackers using end-to-end networks usually perform better than trackers that merely use feature extraction networks; and (6) for visual tracking, the most suitable network training method is to pre-train networks with video information and fine-tune them online with subsequent observations. Finally, we summarize the manuscript, highlight our insights, and point out further trends for deep visual tracking.
Article
Correlation Filters (CFs) have recently demonstrated excellent performance in rapidly tracking objects under challenging photometric and geometric variations. The strength of the approach comes from its ability to efficiently learn, on the fly, how the object changes over time. A fundamental drawback of CFs, however, is that the background of the object is not modelled over time, which can lead to suboptimal results. In this paper we propose a Background-Aware CF that models how both the foreground and the background of the object vary over time. Our approach, like conventional CFs, is extremely computationally efficient, and extensive experiments over multiple tracking benchmarks demonstrate the superior accuracy and real-time performance of our method compared to state-of-the-art trackers, including those based on a deep learning paradigm.
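The background-aware mechanism can be sketched with the cropping operator at its core: the filter is learned over a large search window (so real background content, rather than cyclic-shift artifacts, supplies negative samples), but its support is constrained to the target region. A centred binary mask below stands in for that cropping operator; the actual BACF solver alternates this constraint with Fourier-domain updates.

```python
import numpy as np

def crop_filter(filter_large, target_size):
    """Constrain a filter defined over a large search window to
    target-sized support via a centred binary cropping mask, as in
    background-aware correlation filtering."""
    H, W = filter_large.shape
    th, tw = target_size
    mask = np.zeros((H, W))
    y0, x0 = (H - th) // 2, (W - tw) // 2
    mask[y0:y0 + th, x0:x0 + tw] = 1.0
    return filter_large * mask

# An 8x8 filter cropped to 4x4 support: only the central window survives.
cropped = crop_filter(np.ones((8, 8)), (4, 4))
```

The large window lets the model see how the background evolves, while the crop keeps the learned filter focused on the target, avoiding the boundary effects of plain CFs.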