
Xiaopeng Hong- Ph.D
- Professor at Harbin Institute of Technology
Xiaopeng Hong
- Ph.D
- Professor at Harbin Institute of Technology
About
207
Publications
53,822
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
8,061
Citations
Introduction
I am a professor at Harbin Institute of Technology (HIT), P. R. China. I had been a distinguished investigator at Xi’an Jiaotong University, P. R. China until Oct. 2021, and an adjunct professor with University of Oulu, Finland until Feb. 2019. I have authored over 50 articles in top-tier publications and conferences such as IEEE T-PAMI, CVPR, ICCV, and AAAI. I have served as an area chair/SPC for ACM MM, AAAI, and IJCAI, a guest editor for PRL and SIVP.
Current institution
Additional affiliations
December 2006 - September 2010
September 2004 - November 2010
Publications
Publications (207)
Given an image region of pixels, second order statistics can be used to construct a descriptor for object representation. One example is the covariance matrix descriptor, which shows high discriminative power and good robustness in many computer vision applications. However, operations for the covariance matrix on Riemannian manifolds are usually c...
This paper focuses on the emerging Infrared-Visible crossmodal person re-identification task (IV-ReID), which takes infrared images as input and matches with visible color images. IV-ReID is important yet challenging, as there is a significant gap between the visible and infrared images. To reduce this ‘gap’, we introduce an auxiliary X modality as...
In this paper, we propose a novel single-task continual learning framework named Bi-Objective Continual Learning (BOCL). BOCL aims at both consolidating historical knowledge and learning from new data. On one hand, we propose to preserve the old knowledge using a small set of pillars and develop the pillar consolidation (PLC) loss to preserve the o...
Human action recognition from skeleton data, fueled by the Graph Convolutional Network (GCN), has attracted lots of attention, due to its powerful capability of modeling non-Euclidean structure data. However, many existing GCN methods provide a pre-defined graph and fix it through the entire network, which can lose implicit joint correlations. Besi...
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual unde...
Micro-expression recognition plays a pivotal role in understanding hidden emotions and has applications across various fields. Traditional recognition methods assume access to all training data at once, but real-world scenarios involve continuously evolving data streams. To respond to the requirement of adapting to new data while retaining previous...
This paper focuses on semi-supervised crowd counting, where only a small portion of the training data are labeled. We formulate the pixel-wise density value to regress as a probability distribution, instead of a single deterministic value. On this basis, we propose a semi-supervised crowd counting model. Firstly, we design a pixel-wise distribution...
Contemporary continual learning approaches typically select prompts from a pool, which function as supplementary inputs to a pre-trained model. However, this strategy is hindered by the inherent noise of its selection approach when handling increasing tasks. In response to these challenges, we reformulate the prompting approach for continual learni...
Continual learning, focusing on sequential knowledge acquisition and retention, necessitates efficient memory management. This paper introduces a holistic approach, diverging from traditional methods that separately optimize neural network and replay buffer memory. We aim to enhance overall memory efficiency, addressing neural network parameters an...
Large multimodal language models (MLLMs) have revolutionized natural language processing and visual understanding, but often contain outdated or inaccurate information. Current multimodal knowledge editing evaluations are limited in scope and potentially biased, focusing on narrow tasks and failing to assess the impact on in-domain samples. To addr...
To alleviate the heavy annotation burden for training a reliable crowd counting model and thus make the model more practicable and accurate by being able to benefit from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When there is a scarcity of labeled data available, the model is prone to overfit l...
This article addresses the challenge of scale variations in crowd-counting problems from a multidimensional measure-theoretic perspective. We start by formulating crowd counting as a measure-matching problem, based on the assumption that discrete measures can express the scattered ground truth and the predicted density map. In this context, we intr...
This article explores a novel dynamic network for vision and language (V&L) tasks, where the inferring structure is customized on the fly for different inputs. Most previous state-of-the-art (SOTA) approaches are static and handcrafted networks, which not only heavily rely on expert knowledge but also ignore the semantic diversity of input samples,...
Multi-modal crowd counting is a crucial task that uses multi-modal cues to estimate the number of people in crowded scenes. To overcome the gap between different modalities, we propose a modal emulation-based two-pass multi-modal crowd-counting framework that enables efficient modal emulation, alignment, and fusion. The framework consists of two ke...
Continual Test-Time Adaptation (CTTA) involves adapting a pre-trained source model to continually changing unsupervised target domains. In this paper, we systematically analyze the challenges of this task: online environment, unsupervised nature, and the risks of error accumulation and catastrophic forgetting under continual domain shifts. To addre...
Diffusion models have shown superior performance on unsupervised anomaly detection tasks. Since trained with normal data only, diffusion models tend to reconstruct normal counterparts of test images with certain noises added. However, these methods treat all potential anomalies equally, which may cause two main problems. From the global perspective...
This paper explores a novel dynamic network for vision and language tasks, where the inferring structure is customized on the fly for different inputs. Most previous state-of-the-art approaches are static and hand-crafted networks, which not only heavily rely on expert knowledge, but also ignore the semantic diversity of input samples, therefore re...
Transformer has been popular in recent crowd counting work since it breaks the limited receptive field of traditional CNNs. However, since crowd images always contain a large number of similar patches, the self-attention mechanism in Transformer tends to find a homogenized solution where the attention maps of almost all patches are identical. In th...
Food recognition applications in human health have recently garnered significant attention in the field of computer vision. With the advancement of mobile devices, robust food recognition in wireless communication has become a practical and challenging application scenario. We propose a novel Multi-scale Spiking Pyramid Transmission Network (MSPTN)...
Emotion and sentiment play a central role in various human activities, such as perception, decision-making, social interaction, and logical reasoning. Developing artificial emotional intelligence (AEI) for machines is becoming a bottleneck in human–computer interaction. The first step of AEI is to recognize the emotion and sentiment that are convey...
There has been a growing interest in counting crowds through computer vision and machine learning techniques in recent years. Despite that significant progress has been made, most existing methods heavily rely on fully-supervised learning and require a lot of labeled data. To alleviate the reliance, we focus on the semi-supervised learning paradigm...
Modern object detectors are ill-equipped to incrementally learn new emerging object classes over time due to the well-known phenomenon of catastrophic forgetting. Due to data privacy or limited storage, few or no images of the old data can be stored for replay. In this paper, we design a novel One-Shot Replay (OSR) method for incremental object det...
This paper focuses on the prevalent stage interference and stage performance imbalance of incremental learning. To avoid obvious stage learning bottlenecks, we propose a new incremental learning framework, which leverages a series of stage-isolated classifiers to perform the learning task at each stage, without interference from others. To be concr...
Meta AI recently released the Segment Anything model (SAM), which has garnered attention due to its impressive performance in class-agnostic segmenting. In this study, we explore the use of SAM for the challenging task of few-shot object counting, which involves counting objects of an unseen category by providing a few bounding boxes of examples. W...
Although data-free incremental learning methods are memory-friendly, accurately estimating and counteracting representation shifts is challenging in the absence of historical data. This paper addresses this thorny problem by proposing a novel incremental learning method inspired by human analogy capabilities. Specifically, we design an analogy-maki...
Deepfake technologies have been blurring the boundaries between the real and unreal, likely resulting in malicious events. By leveraging newly emerged deepfake technologies, deepfake researchers have been making a great upending to create deepfake artworks (deeparts), which are further closing the gap between reality and fantasy. To address potenti...
Contrastive learning (CL) methods achieve great success by learning the invariant representation from various transformations. However, rotation transformations are considered harmful to CL and are rarely used, which results in failure when the objects show unseen orientations. This article proposes a representation focus shift network (RefosNet),...
Inspired by the global–local information processing mechanism in the human visual system, we propose a novel convolutional neural network (CNN) architecture named cognition-inspired network (CogNet) that consists of a global pathway, a local pathway, and a top-down modulator. We first use a common CNN block to form the local pathway that aims to ex...
Recently, deep learning technique has been widely employed to deal with face super-resolution (FSR) problem. It aims to predict the nonlinear relationship between the low-resolution (LR) face images and corresponding high-resolution (HR) ones, which could recover the high-frequency details from the LR degraded textures. However, either CNN-based or...
To meet the demands in terms of energy-efficient and fast production and delivery of goods, robotic fleets began to populate warehouses and industrial environments. To maximize the profitability of the operations, multi-robot systems are required to coordinate agents and avoid downtime efficiently. In this paper, agent coordination is formulated as...
This paper focuses on the prevalent performance imbalance in the stages of incremental learning. To avoid obvious stage learning bottlenecks, we propose a brand-new stage-isolation based incremental learning framework, which leverages a series of stage-isolated classifiers to perform the learning task of each stage without the interference of other...
In this article, we focus on a new and challenging decentralized machine learning paradigm in which there are continuous inflows of data to be addressed and the data are stored in multiple repositories. We initiate the study of data-decentralized class-incremental learning (DCIL) by making the following contributions. First, we formulate the DCIL p...
Being spontaneous, micro-expressions are useful in the inference of a person's true emotions even if an attempt is made to conceal them. Due to their short duration and low intensity, the recognition of micro-expressions is a difficult task in affective computing. The early work based on handcrafted spatio-temporal features which showed some promis...
Vision crowd counting task has made remarkable process in recent years thanks to the development of CNNs. However, this field has run into bottleneck since CNNs, by their nature, are limited by locally attentive receptive fields and incapable to model long-term dependencies. To address this problem, we introduce a multi-scale transformer based crow...
In this paper, we propose a new agency-guided semi-supervised counting approach. First, we build a learnable auxiliary structure, namely the density agency to bring the recognized foreground regional features close to corresponding density sub-classes (agents) and push away background ones. Second, we propose a density-guided contrastive learning l...
State-of-the-art deep neural networks are still struggling to address the catastrophic forgetting problem in continual learning. In this paper, we propose one simple paradigm (named as S-Prompting) and two concrete approaches to highly reduce the forgetting degree in one of the most typical continual learning scenarios, i.e., domain increment learn...
Face Super-resolution (FSR) task works for generating high-resolution (HR) face images from the corresponding low-resolution (LR) inputs, which has received a lot of attentions because of the wide application prospects. However, due to the diversity of facial texture and the difficulty of reconstructing detailed content from degraded images, FSR te...
There have been emerging a number of benchmarks and techniques for the detection of deepfakes. However, very few works study the detection of incrementally appearing deepfakes in the real-world scenarios. To simulate the wild scenes, this paper suggests a continual deepfake detection benchmark (CDDB) over a new collection of deepfakes from both kno...
In this paper, we focus on a new and challenging decentralized machine learning paradigm in which there are continuous inflows of data to be addressed and the data are stored in multiple repositories. We initiate the study of data decentralized class-incremental learning (DCIL) by making the following contributions. Firstly, we formulate the DCIL p...
This paper focuses on the challenging crowd counting task. As large-scale variations often exist within crowd images, neither fixed-size convolution kernel of CNN nor fixed-size attention of recent vision transformers can well handle this kind of variation. To address this problem, we propose a Multifaceted Attention Network (MAN) to improve transf...
The
data association
problem of multi-object tracking (MOT) aims to assign IDentity (ID) labels to detections and infer a complete trajectory for each target. Most existing methods assume that each detection corresponds to a unique target and thus cannot handle situations when multiple targets occur in a single detection due to detection failure...
Deep models have shown to be vulnerable to catastrophic forgetting, a phenomenon that the recognition performance on old data degrades when a pre-trained model is fine-tuned on new data. Knowledge distillation (KD) is a popular incremental approach to alleviate catastrophic forgetting. However, it usually fixes the absolute values of neural respons...
Recent solutions to crowd counting problems have already achieved promising performance across various benchmarks. However, applying these approaches to real-world applications is still challenging, because they are computation intensive and lack the flexibility to meet various resource budgets. In this article, we propose an efficient crowd counti...
Unsupervised domain adaptation (UDA) aims to address the domain-shift problem between a labeled source domain and an unlabeled target domain. Many efforts have been made to eliminate the mismatch between the distributions of training and testing data by learning domain-invariant representations. However, the learned representations are usually not...
This paper aims to tackle the challenging task of one-shot object counting. Given an image containing novel, previously unseen category objects, the goal of the task is to count all instances in the desired category with only one supporting bounding box example. To this end, we propose a counting model by which you only need to Look At One instance...
Being unconscious and spontaneous, micro-expressions are useful in the inference of a person's true emotions even if an attempt is made to conceal them. Due to their short duration and low intensity, the recognition of micro-expressions is a difficult task in affective computing. The early work based on handcrafted spatio-temporal features which sh...
3D visual tracking is significant to deep space exploration programs, which can guarantee spacecraft to flexibly approach the target. In this paper, we focus on the studied accurate and real-time method for 3D tracking. Considering the fact that there are almost no public dataset for this topic, A new large-scale 3D asteroid tracking dataset is pre...
Weakly supervised object detection (WSOD) is a challenging task that requires simultaneously learn object classifiers and estimate object locations under the supervision of image category labels. A major line of WSOD methods roots in multiple instance learning which regards images as bags of instance and selects positive instances from each bag to...
Micro-expressions describe unconscious facial movements which reflect a person's psychological state even when there is an attempt to conceal it. Often used in psychological and forensic applications, their manual recognition requires professional training and is time consuming. Therefore, achieving automatic recognition by means of computer vision...
Traditional crowd counting approaches usually use Gaussian assumption to generate pseudo density ground truth, which suffers from problems like inaccurate estimation of the Gaussian kernel sizes. In this paper, we propose a new measure-based counting approach to regress the predicted density maps to the scattered point-annotated ground truth direct...
Anomaly detection plays a key role in industrial manufacturing for product quality control. Traditional methods for anomaly detection are rule-based with limited generalization ability. Recent methods based on supervised deep learning are more powerful but require large-scale annotated datasets for training. In practice, abnormal products are rare...
Traditional crowd counting approaches usually use Gaussian assumption to generate pseudo density ground truth, which suffers from problems like inaccurate estimation of the Gaussian kernel sizes. In this paper, we propose a new measure-based counting approach to regress the predicted density maps to the scattered point-annotated ground truth direct...
We formulate crowd counting as a measure regression problem for the first time and propose a novel method to minimize the distance between two measures with different supports. Our method achieves state-of-the-art counting performance and superior localization performance with sharp density predictions.
In this paper, we focus on the multi-object tracking (MOT) problem of automatic driving and robot navigation. Most existing MOT methods track multiple objects using a singular RGB camera, which are prone to camera field-of-view and suffer tracking failures in complex scenarios due to background clutters and poor light conditions. To meet these chal...
In this paper, we focus on the multi-object tracking (MOT) problem of automatic driving and robot navigation. Most existing MOT methods track multiple objects using a singular RGB camera, which are prone to camera field-of-view and suffer tracking failures in complex scenarios due to background clutters and poor light conditions. To meet these chal...
Counting dense crowds through computer vision technology has attracted widespread attention. Most crowd counting datasets use point annotations. In this paper, we formulate crowd counting as a measure regression problem to minimize the distance between two measures with different supports and unequal total mass. Specifically, we adopt the unbalance...
This paper focuses on the unsupervised domain adaptation problem for video-based crowd counting, in which we use labeled data as source domain and unlabelled video data as target domain. It is challenging as there is a huge gap between the source and the target domain and no annotations of samples are available in the target domain. The key issue i...
In this paper, we focus on the challenging few-shot class incremental learning (FSCIL) problem, which requires to transfer knowledge from old tasks to new ones and solves catastrophic forgetting. We propose the exemplar relation distillation incremental learning framework to balance the tasks of old-knowledge preserving and new-knowledge adaptation...
Deep learning-based person re-identification (Re-ID) has made great progress and achieved high performance recently. In this paper, we make the first attempt to examine the vulnerability of current person Re-ID models against a dangerous attack method,
i.e.
, the universal adversarial perturbation (UAP) attack, which has been shown to fool classi...
Recently, image-to-image translation has made significant progress in achieving both multi-label (\ie, translation conditioned on different labels) and multi-style (\ie, generation with diverse styles) tasks. However, due to the unexplored independence and exclusiveness in the labels, existing endeavors are defeated by involving uncontrolled manipu...
Graph Convolutional Network (GCN) has already been successfully applied to skeleton-based action recognition. However, current GCNs in this task are lack of pooling operations such that the architectures are inherently flat, which not only increases the computational complexity but also requires larger memory space to keep the entire graph embeddin...
Micro-expressions (MEs) are brief and involuntary facial expressions that occur when people are trying to hide their true feelings or conceal their emotions. Based on psychology research, MEs play an important role in understanding genuine emotions, which leads to many potential applications. Therefore, ME analysis has become an attractive topic fo...
Multimodal sentiment analysis has increasingly attracted attention since with the arrival of complementary data streams, it has great potential to improve and go beyond unimodal sentiment analysis. In this paper, we present an efficient separable multimodal learning method to deal with the tasks with modality missing issue. In this method, the mult...
This paper focuses on the unsupervised domain adaptation problem for video-based crowd counting, in which we use labeled data as source domain and unlabelled video data as target domain. It is challenging as there is a huge gap between the source and the target domain and no annotations of samples are available in the target domain. The key issue i...
The human visual system can recognize object categories accurately and efficiently and is robust to complex textures and noises. To mimic the analogy-detail dual-pathway human visual cognitive mechanism revealed in recent cognitive science studies, in this article, we propose a novel convolutional neural network (CNN) architecture named
analogy-de...
A well-known issue for class-incremental learning is the catastrophic forgetting phenomenon, where the network’s recognition performance on old classes degrades severely when incrementally learning new classes. To alleviate forgetting, we put forward to preserve the old class knowledge by maintaining the topology of the network’s feature space. On...
Questions
Question (1)
I would like to know the price of some typical device from Tobii or SMI, since we may need to buy one for further research.