Recent years have witnessed significant improvements in ensemble trackers built on independent models. However, existing ensemble trackers only combine the responses of independent models and pay little attention to the learning process, which hinders further performance gains. To this end, we propose an interactive learning framework that strengthens the independent models during learning. Specifically, in the interactive network, we force the convolutional filter models to interact with each other by sharing their responses during learning. This interaction mines hard samples and prevents easy samples from overwhelming the models, which improves their discriminative capacity. In addition, to achieve more accurate target localization, we develop a fusion mechanism based on the confidences of the independent predictions. We evaluate the proposed method on six public datasets: OTB-2013, OTB-2015, VOT-2016, VOT-2017, Temple-Color-128, and LaSOT. The comprehensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods.
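As a hedged illustration of the confidence-based fusion described above, the following minimal NumPy sketch weights each model's response map by its peak-to-sidelobe ratio (PSR); the PSR confidence measure, window size, and function names are illustrative assumptions, not the authors' exact mechanism.

```python
import numpy as np

def psr(response, margin=5):
    """Peak-to-sidelobe ratio as a simple confidence proxy (illustrative choice)."""
    peak = response.max()
    py, px = np.unravel_index(response.argmax(), response.shape)
    mask = np.ones_like(response, dtype=bool)
    mask[max(0, py - margin):py + margin + 1, max(0, px - margin):px + margin + 1] = False
    sidelobe = response[mask]
    return (peak - sidelobe.mean()) / (sidelobe.std() + 1e-8)

def fuse_responses(responses):
    """Fuse independent response maps, weighting each map by its confidence."""
    confs = np.array([psr(r) for r in responses])
    weights = confs / confs.sum()
    fused = sum(w * r for w, r in zip(weights, responses))
    return fused, np.unravel_index(fused.argmax(), fused.shape)

# toy usage with two synthetic 50x50 response maps
rng = np.random.default_rng(0)
r1 = 0.1 * rng.random((50, 50)); r1[20, 30] = 1.0
r2 = 0.1 * rng.random((50, 50)); r2[21, 31] = 0.6
fused, peak_loc = fuse_responses([r1, r2])
print("fused peak at", peak_loc)
```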
... Li et al. [30] proposed a dual-regression framework for visual tracking, which combines a discriminative fully convolutional module (for discriminative ability) and a fine-grained correlation filter (for accurate localization). Fan et al. [31] introduced a novel interactive learning framework for visual tracking, in which multiple convolutional filter models interact with each other and their responses are fused based on confidence scores. Liu et al. developed robust visual trackers for thermal infrared objects based on multi-level similarity models under the Siamese framework [32], via a multi-task framework [33], and using pretrained convolutional neural networks [34]. ...
In this study, we propose a novel Wasserstein distributional tracking method that can balance approximation with accuracy in terms of Monte Carlo estimation. To achieve this goal, we present three different systems: sliced Wasserstein-based (SWT), projected Wasserstein-based (PWT), and orthogonal coupled Wasserstein-based (OCWT) visual tracking systems. Sliced Wasserstein-based visual trackers can find accurate target configurations using the optimal transport plan, which minimizes the discrepancy between appearance distributions described by the estimated and ground truth configurations. Because this plan involves a finite number of probability distributions, the computation costs can be considerably reduced. Projected Wasserstein-based and orthogonal coupled Wasserstein-based visual trackers further enhance the accuracy of visual trackers using bijective mapping functions and orthogonal Monte Carlo, respectively. Experimental results demonstrate that our approach can balance computational efficiency with accuracy, and the proposed visual trackers outperform other state-of-the-art visual trackers on several benchmark visual tracking datasets.
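For concreteness, here is a minimal NumPy sketch of a Monte Carlo estimate of the sliced 1-Wasserstein distance between two empirical appearance distributions; the uniform random projections, equal sample sizes, and function names are illustrative assumptions rather than the exact SWT/PWT/OCWT formulations (an orthogonal-coupling variant could, for instance, draw orthonormal directions via np.linalg.qr).

```python
import numpy as np

def sliced_wasserstein_1(x, y, num_projections=128, seed=0):
    """Monte Carlo estimate of the sliced 1-Wasserstein distance between two
    empirical distributions x and y of shape (n, d); equal sample sizes assumed."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    dirs = rng.normal(size=(num_projections, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit directions
    total = 0.0
    for theta in dirs:
        px, py = np.sort(x @ theta), np.sort(y @ theta)   # 1-D projections
        total += np.abs(px - py).mean()                   # closed-form 1-D W1 on sorted samples
    return total / num_projections

# toy usage: two sets of 256 feature vectors in R^32
rng = np.random.default_rng(1)
a = rng.normal(size=(256, 32))
b = rng.normal(loc=0.5, size=(256, 32))
print(sliced_wasserstein_1(a, b))
```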
Siamese-based trackers suffer from over-parameterization and online learning limitations, making it difficult to balance high performance and real-time execution when deployed on resource-constrained devices. In this paper, we propose a unified tracking framework that integrates a lightweight Siamese network and template-guided learning. Specifically, we propose a two-step pruning method for compressing the Siamese network, which examines both the statistical distribution and correlation patterns of filters. By removing filters with low importance and replaceable contributions, as well as their connected feature maps, the network is optimized without custom architectures. Furthermore, we construct a template-guided learning model to capture target appearance information and suppress distractors. This effectively maintains tracking performance in specific scenarios. Extensive experiments on OTB50, OTB-2013, OTB-2015, DTB70, UAV123, UAV20L, VOT2019, GOT-10K, LaSOT and TrackingNet indicate that the proposed method outperforms several state-of-the-art trackers while running faster. In particular, our lightweight Siamese network reduces the model size by 3.4× and FLOPs by 3.7× without significantly sacrificing performance while running at 116 fps.
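The two-step pruning idea (statistical importance followed by a correlation/redundancy check) can be sketched as follows; the L1-norm importance score, keep ratio, and correlation threshold are assumed stand-ins, not the paper's exact criteria.

```python
import numpy as np

def select_filters_to_prune(weights, keep_ratio=0.7, corr_thresh=0.9):
    """Two-step filter selection (illustrative sketch):
    1) rank filters by L1 norm (statistical importance);
    2) among kept filters, drop the weaker one of any highly correlated pair (redundancy).
    `weights` has shape (out_channels, in_channels, k, k)."""
    n = weights.shape[0]
    flat = weights.reshape(n, -1)
    # step 1: importance ranking by L1 norm, keep the top fraction
    l1 = np.abs(flat).sum(axis=1)
    keep = set(np.argsort(l1)[::-1][: int(np.ceil(keep_ratio * n))])
    # step 2: redundancy check via cosine similarity between flattened filters
    normed = flat / (np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8)
    corr = normed @ normed.T
    pruned = set(range(n)) - keep
    kept_sorted = sorted(keep, key=lambda i: -l1[i])
    for idx, i in enumerate(kept_sorted):
        if i in pruned:
            continue
        for j in kept_sorted[idx + 1:]:
            if j not in pruned and corr[i, j] > corr_thresh:
                pruned.add(j)  # j is replaceable: highly correlated with a stronger filter
    return sorted(pruned)

# toy usage on 16 random 3x3x3 filters, with filter 5 made nearly a copy of filter 2
w = np.random.default_rng(0).normal(size=(16, 3, 3, 3))
w[5] = 0.9 * w[2] + 0.01
print(select_filters_to_prune(w))
```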
Triplet loss is widely used to learn descriptors and achieves promising performance. However, it fails to fully consider the influence of adjacent descriptors from the same type of sample, which is one of the main causes of image mismatching. To solve this problem, we propose a descriptor network based on triplet loss with a similar-triangle constraint, named STCDesc. This network considers not only the correlation between descriptors from different types of samples but also the relevance of descriptors from the same type of sample. Furthermore, we propose a normalized exponential algorithm to reduce the impact of negative samples and improve calculation speed. The proposed method effectively improves the stability of learned descriptors using the proposed triangle constraint and normalized exponential algorithm. To verify the effectiveness of the proposed descriptor network, extensive experiments were conducted on four benchmarks. The experimental results demonstrate that the proposed descriptor network achieves favorable performance compared with state-of-the-art methods.
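Below is a hedged PyTorch sketch of a triplet margin loss whose negatives are weighted by a softmax (normalized exponential) of their hardness; the weighting scheme is an illustrative stand-in for the paper's normalized exponential algorithm, and the similar-triangle constraint itself is not reproduced.

```python
import torch
import torch.nn.functional as F

def triplet_loss_softmax_negatives(anchor, positive, negatives, margin=1.0):
    """Triplet margin loss with softmax-weighted negatives (illustrative sketch).
    anchor, positive: (B, D); negatives: (B, K, D) L2-normalized descriptors."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)                # (B,)  anchor-positive distances
    d_an = (anchor.unsqueeze(1) - negatives).pow(2).sum(dim=2)  # (B, K) anchor-negative distances
    w = F.softmax(-d_an, dim=1)                                 # harder (closer) negatives weigh more
    loss = F.relu(d_ap.unsqueeze(1) - d_an + margin)            # (B, K) margin violations
    return (w * loss).sum(dim=1).mean()

# toy usage with batches of 128-D descriptors and 5 negatives per anchor
a = F.normalize(torch.randn(8, 128), dim=1)
p = F.normalize(torch.randn(8, 128), dim=1)
n = F.normalize(torch.randn(8, 5, 128), dim=2)
print(triplet_loss_softmax_negatives(a, p, n))
```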
Most prevalent trackers rely on increasingly powerful appearance models, aiming to learn more discriminative deep representations for a reliable response map. Nevertheless, obtaining an accurate response map is difficult owing to diverse challenges. Moreover, discriminative appearance-model-based trackers with an online update component require dynamic samples to update the target classifier, which inevitably learns imprecise tracking results into the model, thus reducing its discriminative ability. To alleviate this problem, we propose a new verification mechanism via a Target Embedding Network, whose intention is to learn general target embedding features offline that increase the similarity of the same target while decreasing the similarity of different targets. In particular, we devise a simpler selection strategy for negative samples and adopt a multiple triplet loss to effectively train the network. Furthermore, we adopt cosine similarity to measure the target embedding features between the initial-frame and current-frame targets. By comparing the similarity score against piecewise thresholds, this method retains the discriminative ability of the tracker by controlling the update of the sample memory and the learning rate. We replace the hard-sample mining strategy used by the recent SuperDiMP and conduct comprehensive experimental tests and analyses of our approach on six public datasets. Extensive experiments demonstrate that the discriminative ability of the model is effectively maintained by the proposed method, achieving superior performance against state-of-the-art trackers. The code, raw tracking results, and trained models will be released at https://github.com/hexdjx/VisTrack.
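To make the piecewise-threshold verification concrete, here is a minimal sketch in which the cosine similarity between the initial and current target embeddings gates the sample-memory update and the learning rate; the threshold values and learning rates are hypothetical, not those used with SuperDiMP.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def verification_policy(init_embed, cur_embed, high=0.8, low=0.5):
    """Piecewise-threshold verification (illustrative; thresholds/learning rates are assumed).
    Returns whether to store the current sample and which learning rate to use."""
    s = cosine_similarity(init_embed, cur_embed)
    if s >= high:    # confident match: update memory with the normal learning rate
        return {"similarity": s, "store_sample": True, "lr": 0.01}
    elif s >= low:   # uncertain: keep the sample but learn conservatively
        return {"similarity": s, "store_sample": True, "lr": 0.001}
    else:            # likely drift or occlusion: skip the update
        return {"similarity": s, "store_sample": False, "lr": 0.0}

# toy usage with random 128-D embeddings
rng = np.random.default_rng(0)
print(verification_policy(rng.normal(size=128), rng.normal(size=128)))
```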
Deep neural networks have contributed to significant progress in modeling complex biological systems. However, existing computational methods cannot extract discriminative features for the translation initiation site (TIS) in complex biological systems, and feature representation methods rely heavily on statistical information, which is not informative enough, leading to unsatisfactory performance. To address these problems, we first pre-train and generate co-occurrence embeddings of genomic sequences and then propose a competitive neural framework, TISNet, that combines gated convolutional and attention mechanisms to identify TIS from genomic sequences alone. Specifically, we devise a novel gated residual network that contains a gated convolutional residual unit and a gated scaled exponential unit. The gating mechanism not only promotes the propagation of features but also alleviates gradient vanishing problems. In addition, to extract features at different scales, we further introduce a multiscale convolutional block and an attention block to directly learn local and long-distance patterns; a special fusion block then combines information at the local and global levels. Extensive experiments indicate the superiority of TISNet over previous methods and show that k-mer co-occurrence information can improve performance, which provides some biological insights into the regulatory mechanism of TIS in complex biological systems.
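As an illustration of the gating idea, the following PyTorch sketch implements a GLU-style gated convolutional residual unit over an embedded genomic sequence; the channel count, kernel size, and class name are assumptions and do not reproduce TISNet's exact gated convolutional residual unit or gated scaled exponential unit.

```python
import torch
import torch.nn as nn

class GatedConvResidualUnit(nn.Module):
    """A minimal gated convolutional residual unit (sketch, assuming GLU-style gating)."""
    def __init__(self, channels, kernel_size=9):
        super().__init__()
        pad = kernel_size // 2
        self.conv_feat = nn.Conv1d(channels, channels, kernel_size, padding=pad)
        self.conv_gate = nn.Conv1d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):                        # x: (batch, channels, seq_len)
        h = self.conv_feat(x)
        g = torch.sigmoid(self.conv_gate(x))     # gate controls feature propagation
        return x + h * g                         # residual connection eases gradient flow

# toy usage on an embedded genomic window of length 203 with 64 channels
x = torch.randn(2, 64, 203)
print(GatedConvResidualUnit(64)(x).shape)        # torch.Size([2, 64, 203])
```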
Existing regression-based tracking methods built on a correlation filter model or a convolution model do not take both accuracy and robustness into account at the same time. In this paper, we propose a dual-regression framework comprising a discriminative fully convolutional module and a fine-grained correlation filter component for visual tracking. The convolutional module, trained in a classification manner with hard negative mining, ensures the discriminative ability of the proposed tracker, which facilitates the handling of several challenging problems, such as drastic deformation, distracters, and complicated backgrounds. The correlation filter component, built on shallow features with fine-grained features, enables accurate localization. By fusing these two branches in a coarse-to-fine manner, the proposed dual-regression tracking framework achieves robust and accurate tracking performance. Extensive experiments on the OTB2013, OTB2015, and VOT2015 datasets demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
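The coarse-to-fine fusion of the two branches can be sketched as follows: the discriminative (classification) response selects a coarse peak, and the fine-grained correlation-filter response is searched only within a local window around it; the window radius and the equal map resolution are illustrative assumptions.

```python
import numpy as np

def coarse_to_fine_localize(cls_response, cf_response, radius=8):
    """Coarse-to-fine localization sketch: coarse peak from the classification map,
    refined by the correlation-filter map inside a local window."""
    assert cls_response.shape == cf_response.shape
    cy, cx = np.unravel_index(cls_response.argmax(), cls_response.shape)  # coarse peak
    h, w = cf_response.shape
    y0, y1 = max(0, cy - radius), min(h, cy + radius + 1)
    x0, x1 = max(0, cx - radius), min(w, cx + radius + 1)
    local = cf_response[y0:y1, x0:x1]                                     # restricted search
    ly, lx = np.unravel_index(local.argmax(), local.shape)
    return y0 + ly, x0 + lx                                               # refined location

# toy usage on 61x61 response maps
rng = np.random.default_rng(0)
cls_r = rng.random((61, 61)); cls_r[30, 40] = 2.0
cf_r = rng.random((61, 61));  cf_r[32, 41] = 2.0
print(coarse_to_fine_localize(cls_r, cf_r))   # (32, 41): refined within the coarse window
```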
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is only trained on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framework for robust TIR object tracking. Specifically, we compute different pattern similarities using the proposed multi-level similarity network. One of them focuses on the global semantic similarity and the other computes the local structural similarity of the TIR object. These two similarities complement each other and hence enhance the discriminative capacity of the network for handling distractors. In addition, we design a simple yet effective relative-entropy-based ensemble subnetwork to integrate the semantic and structural similarities. This subnetwork can adaptively learn the weights of the semantic and structural similarities at the training stage. To further enhance the discriminative capacity of the tracker, we propose a large-scale TIR video sequence dataset for training the proposed model. To the best of our knowledge, this is the first and largest TIR object tracking training dataset to date. The proposed TIR dataset not only benefits training for TIR object tracking but can also be applied to numerous TIR visual tasks. Extensive experimental results on three benchmarks demonstrate that the proposed algorithm performs favorably against the state-of-the-art methods.
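A hedged sketch of adaptively fusing the semantic and structural similarity maps with weights learned during training is given below; the softmax-normalized scalar weights are an illustrative stand-in, not the paper's relative-entropy-based ensemble subnetwork.

```python
import torch
import torch.nn as nn

class SimilarityEnsemble(nn.Module):
    """Fuse semantic and structural similarity maps with learnable, softmax-normalized
    scalar weights (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))   # one logit per similarity map

    def forward(self, semantic_sim, structural_sim):
        w = torch.softmax(self.logits, dim=0)        # weights learned at training time
        return w[0] * semantic_sim + w[1] * structural_sim

# toy usage on 17x17 response maps
fuse = SimilarityEnsemble()
out = fuse(torch.rand(1, 1, 17, 17), torch.rand(1, 1, 17, 17))
print(out.shape)   # torch.Size([1, 1, 17, 17])
```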
In this paper, we propose to tackle egocentric action recognition by suppressing background distractors and enhancing action-relevant interactions. Existing approaches usually utilize two independent branches to recognize egocentric actions, i.e., a verb branch and a noun branch. However, a mechanism to suppress distracting objects and exploit local human-object correlations is missing. To this end, we introduce two extra sources of information, i.e., the candidate objects' spatial locations and their discriminative features, to enable concentration on the occurring interactions. We design a Symbiotic Attention with Object-centric feature Alignment framework (SAOA) to provide meticulous reasoning between the actor and the environment. First, we introduce an object-centric feature alignment method to inject the local object features into the verb branch and the noun branch. Second, we propose a symbiotic attention mechanism to encourage mutual interaction between the two branches and select the most action-relevant candidates for classification. The framework benefits from the communication among the verb branch, the noun branch, and the local object information. Experiments based on different backbones and modalities demonstrate the effectiveness of our method. Notably, our framework achieves state-of-the-art performance on the largest egocentric video dataset.
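As a rough illustration of how one branch can attend over candidate object features, the following single-head cross-attention sketch enhances a branch feature with action-relevant object context; the dimensions, class name, and single-head formulation are assumptions and not the SAOA implementation.

```python
import torch
import torch.nn as nn

class SymbioticAttentionSketch(nn.Module):
    """A branch feature (verb or noun) attends over candidate object features to select
    action-relevant context (illustrative single-head cross-attention sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.q, self.k, self.v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

    def forward(self, branch_feat, object_feats):
        # branch_feat: (B, D); object_feats: (B, N, D) for N candidate objects
        q = self.q(branch_feat).unsqueeze(1)                               # (B, 1, D)
        k, v = self.k(object_feats), self.v(object_feats)                  # (B, N, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (B, 1, N)
        context = (attn @ v).squeeze(1)                                    # (B, D)
        return branch_feat + context                                       # enhanced branch feature

# toy usage: a verb-branch feature attending over 5 candidate object features
x = torch.randn(2, 256)
objs = torch.randn(2, 5, 256)
print(SymbioticAttentionSketch()(x, objs).shape)   # torch.Size([2, 256])
```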