# Zhenyu He's research while affiliated with Harbin Institute of Technology Shenzhen Graduate School and other places

## Publications (127)

Article
Thermal infrared (TIR) target tracking is susceptible to occlusion and similarity interference, which obviously affects the tracking results. To resolve this problem, we develop an Aligned Spatial-Temporal Memory network-based Tracking method (ASTMT) for the TIR target tracking task. Specifically, we model the scene information in the TIR target tr...
Preprint
Unlike indirect methods that usually require time-consuming post-processing, recent deep learning-based direct methods for 6D pose estimation try to predict the 3D rotation and 3D translation from RGB-D data directly. However, direct methods, regressing the absolute translation of the pose, suffer from diverse object translation distribution betwee...
Article
Full-text available
When dealing with complex thermal infrared (TIR) tracking scenarios, the single category feature is not sufficient to portray the appearance of the target, which drastically affects the accuracy of the TIR target tracking method. In order to address these problems, we propose an adaptively multi-feature fusion model (AMFT) for the TIR tracking task...
Preprint
Compared to visible-to-visible (V2V) person re-identification (ReID), the visible-to-infrared (V2I) person ReID task is more challenging due to the lack of sufficient training samples and the large cross-modality discrepancy. To this end, we propose Flow2Flow, a unified framework that could jointly achieve training sample expansion and cross-modali...
Preprint
The model-based gait recognition methods usually adopt the pedestrian walking postures to identify human beings. However, existing methods did not explicitly resolve the large intra-class variance of human pose due to camera views changing. In this paper, we propose to generate multi-view pose sequences for each single-view pose sample by learning...
Preprint
Video-based person re-identification (ReID) aims to identify a given pedestrian video sequence across multiple non-overlapping cameras. To aggregate the temporal and spatial features of the video samples, graph neural networks (GNNs) have been introduced. However, existing graph-based models, like STGCN, perform the mean/max po...
Preprint
Existing methods for video-based person re-identification (ReID) mainly learn the appearance feature of a given pedestrian via a feature extractor and a feature aggregator. However, the appearance models would fail when different pedestrians have similar appearances. Considering that different pedestrians have different walking postures and body pr...
Preprint
The traditional homography estimation pipeline consists of four main steps: feature detection, feature matching, outlier removal and transformation estimation. Recent deep learning models intend to address the homography estimation problem using a single convolutional network. While these models are trained in an end-to-end fashion to simplify the...
Article
Semi-supervised learning is a challenging problem which aims to construct a model by learning from limited labeled examples. Numerous methods for this task focus on utilizing the prediction consistency of unlabeled instances alone to regularize networks. However, treating labeled and unlabeled data separately often leads to the discarding of mass...
Article
The triplet loss is widely used in learning the local descriptors for image matching. However, existing triplet loss-based methods, like HardNet and DSM, employ the point-to-point distance metric, which neglects the neighborhood information of descriptors. Considering the fact that local neighborhood structures of matching descriptors would be simi...
Article
Trackers based on the IoU prediction network (IoU-Net) have shown superior performance, which refines a coarse bounding box to an accurate one by maximizing the IoU between the target and the coarse box. However, the traditional IoU-Net is less effective in exploiting the limited but crucial supervision information contained in the initial frame, i...
Article
Triplet loss is widely used for learning descriptors and achieves promising performance. However, triplet loss fails to fully consider the influence of adjacent descriptors from the same type of sample, which is one of the main reasons for image mismatching. To solve this problem, we propose a descriptor network based on triplet loss with a simil...
Preprint
Full-text available
The crux of long-term tracking lies in the difficulty of tracking the target with discontinuous moving caused by out-of-view or occlusion. Existing long-term tracking methods follow two typical strategies. The first strategy employs a local tracker to perform smooth tracking and uses another re-detector to detect the target when the target is lost....
Article
Thermal InfraRed (TIR) target trackers are easily disturbed by similar objects and are susceptible to target occlusion. To solve these problems, we propose a structural target-aware model (STAMT) for the thermal infrared target tracking tasks. Specifically, the proposed STAMT tracker can learn a target-aware model, which ca...
Article
Full-text available
Most existing trackers are based on using a classifier and multi-scale estimation to estimate the target state. Consequently, and as expected, trackers have become more stable while tracking accuracy has stagnated. While trackers adopt a maximum overlap method based on an intersection-over-union (IoU) loss to mitigate this problem, there are defect...
Article
Full-text available
The feature models used by existing Thermal InfraRed (TIR) tracking methods are usually learned from RGB images due to the lack of a large-scale TIR image training dataset. However, these feature models are less effective in representing TIR objects and they are difficult to effectively distinguish distractors because they do not contain fine-grain...
Article
The current Siamese network based on region proposal network (RPN) has attracted great attention in visual tracking due to its excellent accuracy and high efficiency. However, the design of the RPN involves the selection of the number, scale, and aspect ratios of anchor boxes, which will affect the applicability and convenience of the model. Furthe...
Preprint
Semi-supervised learning is a challenging problem which aims to construct a model by learning from limited labeled examples. Numerous methods for this task focus on utilizing the prediction consistency of unlabeled instances alone to regularize networks. However, treating labeled and unlabeled data separately often leads to the discarding of mass...
Preprint
Full-text available
Convolutional neural networks (CNNs) have been successfully applied to the single target tracking task in recent years. Generally, training a deep CNN model requires numerous labeled training samples, and the number and quality of these samples directly affect the representational capability of the trained model. However, this approach is restricti...
Conference Paper
Article
Full-text available
In deep neural network compression, channel/filter pruning is widely used for compressing the pre-trained network by judging the redundant channels/filters. In this paper, we propose a two-step filter pruning method to judge the redundant channels/filters layer by layer. The first step is to design a filter selection scheme based on $\ell_{2,1}$...
Article
Tracking in the unmanned aerial vehicle (UAV) scenarios is one of the main components of target tracking tasks. Different from the target tracking task in the general scenarios, the target tracking task in the UAV scenarios is very challenging because of factors such as small scale and aerial view. Although the DCFs-based tracker has achieved good...
Preprint
Full-text available
Generating conversational gestures from speech audio is challenging due to the inherent one-to-many mapping between audio and body motions. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of all possible target motions, resulting in plain/boring motions during inference. In order to overcome this problem, we p...
Preprint
Full-text available
Most existing trackers based on deep learning perform tracking in a holistic strategy, which aims to learn deep representations of the whole target for localizing the target. It is arduous for such methods to track targets with various appearance variations. To address this limitation, another type of methods adopts a part-based tracking strategy w...
Article
Temporal and spatial contexts, characterizing target appearance variations and target-background differences, respectively, are crucial for improving the online adaptive ability and instance-level discriminative ability of object tracking. However, most existing trackers focus on either the temporal context or the spatial context during tracking an...
Article
Person re-identification (ReID) is an important topic of computer vision. Existing works in this field focus primarily on learning a feature extractor that maps the pedestrian images into a feature space, in which feature vectors corresponding to the same identity are close to each other. In this paper, we propose the adjacency-aware Graph Convolut...
Preprint
Full-text available
While deep-learning based methods for visual tracking have achieved substantial progress, these schemes entail large-scale and high-quality annotated data for sufficient training. To eliminate expensive and exhaustive annotation, we study self-supervised learning for visual tracking. In this work, we develop the Crop-Transform-Paste operation, whic...
Article
In visual tracking, it is challenging to distinguish the target from similar objects called noises in the background. As deep trackers use convolutional neural networks for image classification as feature extractors, the extracted features are insensitive to different instances in the same class, which is prone to make prediction models confuse the...
Article
Full-text available
To improve the performance of local learned descriptors, many researchers pay primary attention to the triplet loss network. As expected, it is useful to achieve state-of-the-art performance on various datasets. However, these local learned descriptors suffer from the inconsistency problem without considering the relationship between two descriptor...
Article
Recently, tracking models based on bounding box regression (such as region proposal networks), built on the Siamese network, have attracted much attention. Despite their promising performance, these trackers are less effective in perceiving the target information in the following two aspects. First, existing regression models cannot take a global v...
Preprint
Full-text available
The current Siamese network based on region proposal network (RPN) has attracted great attention in visual tracking due to its excellent accuracy and high efficiency. However, the design of the RPN involves the selection of the number, scale, and aspect ratios of anchor boxes, which will affect the applicability and convenience of the model. Furthe...
Article
Full-text available
Existing trackers usually exploit robust features or online updating mechanisms to deal with target variations, which are a key challenge in visual tracking. However, features that are robust to variations retain little spatial information, and existing online updating methods are prone to overfitting. In this paper, we propose a dual-margin model f...
Article
Full-text available
Recent years have witnessed significant improvements of ensemble trackers based on independent models. However, existing ensemble trackers only combine the responses of independent models and pay less attention to the learning process, which hinders their performance from further improvements. To this end, we propose an interactive learning framewo...
Article
Inspired by the mean calculation of RPCA_OM and inductiveness of IRPCA, we first propose an inductive robust principal component analysis method that removes the optimal mean automatically, abbreviated as IRPCA_OM. Furthermore, IRPCA_OM is extended to the Schatten-$p$ norm and a more general framework (i.e., EIRPCA_OM) is presented. The objecti...
Article
Full-text available
The training of a feature extraction network typically requires abundant manually annotated training samples, making this a time-consuming and costly process. Accordingly, we propose an effective self-supervised learning-based tracker in a deep correlation framework (named: self-SDCT). Motivated by the forward-backward tracking consistency of a rob...
Article
Full-text available
Existing regression based tracking methods built on correlation filter model or convolution model do not take both accuracy and robustness into account at the same time. In this paper, we propose a dual regression framework comprising a discriminative fully convolutional module and a fine-grained correlation filter component for visual tracking. Th...
Preprint
The constraint of neighborhood consistency or local consistency is widely used for robust image matching. In this paper, we focus on learning neighborhood topology consistent descriptors (TCDesc), while former works of learning descriptors, such as HardNet and DSM, only consider point-to-point Euclidean distance among descriptors and totally neglec...
Preprint
Full-text available
In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over...
Conference Paper
Full-text available
In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. We annotate the bounding box of objects in every frame of all sequences and generate over...
Article
Full-text available
Correlation filter-based trackers (CFTs) have recently shown remarkable performance in the field of visual object tracking. The advantage of these trackers originates from their ability to convert time-domain calculations into frequency domain calculations. However, a significant problem of these CFTs is that the model is insufficiently robust when...
Article
Full-text available
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to represent the TIR object, which lack the sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is only trained on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese frame...
Preprint
Full-text available
Most existing tracking methods are based on using a classifier and multi-scale estimation to estimate the state of the target. Consequently, and as expected, trackers have become more stable while tracking accuracy has stagnated. While the ATOM tracker adopts a maximum overlap method based on an intersection-over-union (IoU) loss to mit...
Preprint
Triplet loss is widely used for learning local descriptors from image patch. However, triplet loss only minimizes the Euclidean distance between matching descriptors and maximizes that between the non-matching descriptors, which neglects the topology similarity between two descriptor sets. In this paper, we propose topology measure besides Euclidea...
Article
In this paper, we propose a robust subspace learning method, based on RPCA, named Robust Principal Component Analysis with Projection Learning (RPCAPL), which further improves the performance of feature extraction by projecting data samples into a suitable subspace. For Subspace Learning (SL) methods in clustering and classification tasks, it is al...
Article
Visual attention has recently achieved great success and wide application in deep neural networks. Existing methods based on Siamese network have achieved a good accuracy-efficiency trade-off in visual tracking. However, the training time of Siamese trackers becomes longer for the deeper network and larger training data. Further, Siamese trackers c...
Conference Paper
Full-text available
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor taking fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific di...
Article
Discriminative correlation filters (DCFs) have been widely used in the tracking community recently. DCFs-based trackers utilize samples generated by circularly shifting from an image patch to train a ridge regression model, and estimate target location using a response map generated by the correlation filters. However, the generated samples produce...
Article
Convolutional Neural Networks (CNN) have been demonstrated to achieve state-of-the-art performance in visual object tracking task. However, existing CNN-based trackers usually use holistic target samples to train their networks. Once the target undergoes complicated situations (e.g., occlusion, background clutter, and deformation), the tracking per...
Article
Discriminative correlation filters (DCFs) have been widely used in the visual tracking community in recent years. The DCFs-based trackers determine the target location through a response map generated by the correlation filters and determine the target scale by a fixed scale factor. However, the response map is vulnerable to noise interference and...
Article
Learning an expressive representation from multi-view data is a key step in various real-world applications. In this paper, we propose a Semi-supervised Multi-view Deep Discriminant Representation Learning (SMDDRL) approach. Unlike existing joint or alignment multi-view representation learning methods that cannot simultaneously utilize the consensu...
Article
Full-text available
Correlation filter (CF) theory has gained sustained attention in the tracking field by virtue of its efficiency in the training and decision stages. The seminal idea of cyclic shift operations on an image patch lets CF-based trackers accomplish the dense sampling scheme in an efficient manner. Because all shifted samples merely provide the sin...
Article
Sparse additive models have shown competitive performance for high-dimensional variable selection and prediction due to their representation flexibility and interpretability. Although their theoretical properties have been studied extensively, few works have addressed the robustness of sparse additive models. In this paper, we employ the robust...
Chapter
The Vision Meets Drone (VisDrone2020) Single Object Tracking is the third annual UAV tracking evaluation activity organized by the VisDrone team, in conjunction with European Conference on Computer Vision (ECCV 2020). The VisDrone-SOT2020 Challenge presents and discusses the results of 13 participating algorithms in detail. By using ensemble of dif...
Data
Multi-Task Driven Feature Models for Thermal Infrared Tracking--Supplementary Materials
Conference Paper
Full-text available
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor taking fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific d...
Preprint
Full-text available
Existing deep Thermal InfraRed (TIR) trackers usually use the feature models of RGB trackers for representation. However, these feature models learned on RGB images are neither effective in representing TIR objects nor taking fine-grained TIR information into consideration. To this end, we develop a multi-task framework to learn the TIR-specific di...
Conference Paper
Full-text available
Article
Current unsupervised feature selection methods cannot well select the effective features from the corrupted data. To this end, we propose a robust unsupervised feature selection method under the robust principal component analysis (PCA) reconstruction criterion, which is named the adaptive weighted sparse PCA (AW-SPCA). In the proposed method, both...
Article
Full-text available
Thermal infrared (TIR) pedestrian tracking is one of the important components among the numerous applications of computer vision, which has a major advantage: it can track pedestrians in total darkness. The ability to evaluate the TIR pedestrian tracker fairly, on a benchmark dataset, is significant for the development of this field. However, there...
Preprint
Full-text available
Existing deep Thermal InfraRed (TIR) trackers only use semantic features to describe the TIR object, which lack the sufficient discriminative capacity for handling distractors. This becomes worse when the feature extraction network is only trained on RGB images. To address this issue, we propose a multi-level similarity model under a Siamese framewo...
Article
Full-text available
This paper proposes a novel network flow model for multi-target tracking, which uses short and highly reliable detection responses as the basic unit, namely the tracklet, in the model. Our model exploits the local information of the tracklet and deploys the global strategy of data association in tracking. The method is divided into two phases: a lo...
Conference Paper
Preprint
Existing deep trackers mainly use convolutional neural networks pre-trained for generic object recognition task for representations. Despite demonstrated successes for numerous vision tasks, the contributions of using pre-trained deep features for visual tracking are not as significant as that for object recognition. The key issue is that in visual...
Article
A novel network flow model is proposed for multiple pedestrian tracking in this paper. Based on tracklets, only a short and reliable detection sequence is needed for effective tracking. Our model fuses the local and global data association strategies to compensate for their respective shortcomings, and can be divided into two stages: a local stage...
Article
Correlation filters (CF) have demonstrated a good performance in visual tracking. However, the base training sample region is larger than the object region, including the interference region (IR). IRs in training samples from cyclic shifts of the base training sample severely degrade the quality of the tracking model. In this paper, a region-filter...
Article
With robustness to various corruptions, it is the local geometrical relationship among data that plays an important role in the recognition/clustering task of subspace learning (SL). However, a lot of previous SL methods cannot take into consideration both of the local neighborhood and the robustness, which results in poor performance in image clas...
Article
Visual object tracking is an attractive issue in the field of computer vision. Recently, correlation filter (CF) based trackers formulate the training process by solving the regression in the Fourier domain and show great efficiency for the tracking task. However, their efficiency collapses when distractors appear in the background. To improve the...
Article
In general, Low-Rank Representation (LRR) aims to find the lowest-rank representation with respect to a dictionary. In fact, the dictionary is a key aspect of low-rank representation. However, a lot of low-rank representation methods usually use the data itself as a dictionary (i.e., a fixed dictionary), which may degrade their performances due to...
Article
Recently, sparse representation has been widely introduced into tracking methods to improve their performance. However, these methods only focus on reconstructing the candidate samples while ignoring the discriminative information of the background, which greatly limits their performance, especially when the target undergoes heavy occlusion. To tac...
Article
Web service recommendation plays an important role in building service-oriented systems. QoS-based Web service recommendation has recently gained much attention for providing a promising way to help users find high-quality services. To accurately predict the QoS values of candidate Web services, Web service recommendation systems usually need to co...
Article
Recently, correlation filters have demonstrated excellent performance in visual tracking. However, the base training sample region is larger than the object region, including the Interference Region (IR). The IRs in training samples from cyclic shifts of the base training sample severely degrade the quality of a tracking model. In this paper, we...
Conference Paper
Full-text available
The general tracking algorithm is vulnerable to noise because it uses a single feature, which greatly limits the performance and robustness of such algorithms. In this paper, in order to achieve robust and strong performance, we propose a novel multiple feature fused model in correlation filter framework for visual tracking. The adoption...
Article
Full-text available
Dimensionality reduction is an important topic in the machine learning community, which is widely used in the areas of face recognition, visual detection and tracking. Preserving local and global structures simultaneously is crucial for dimensionality reduction. In this paper, local and global approaches are generalized, respectively, and then a unifie...