Xian-Sheng Hua

Xian-Sheng Hua
  • PhD
  • Senior Researcher at Microsoft

About

515
Publications
57,604
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
20,252
Citations
Current institution
Microsoft
Current position
  • Senior Researcher

Publications

Publications (515)
Article
With the burst of big data, 2D-3D cross-modal retrieval has received increasing attention, which aims to retrieve relevant data from one modality given the query from the other modality. In this paper, we study an underexplored yet practical problem of semi-supervised 2D-3D cross-modal retrieval, which could suffer from serious label scarcity in re...
Article
Achieving a balance between accuracy and efficiency is a critical challenge in facial landmark detection (FLD). This paper introduces Parallel Optimal Position Search (POPoS), a high-precision encoding-decoding framework designed to address the limitations of traditional FLD methods. POPoS employs three key contributions: (1) Pseudo-range multilate...
Preprint
Pioneering text-to-image (T2I) diffusion models have ushered in a new era of real-world image super-resolution (Real-ISR), significantly enhancing the visual perception of reconstructed images. However, existing methods typically integrate uniform abstract textual semantics across all blocks, overlooking the distinct semantic requirements at differ...
Article
This paper studies the problem of semi-supervised learning on graphs, which has recently aroused widespread interest in relational data mining. The focal point of exploration in this area has been the utilization of graph neural networks (GNNs), which stand out for excellent performance. Previous methods, however, typically rely on the limited labe...
Article
With the emergence of AI generated content, cross-modal retrieval of 2D and 3D data has obtained increasing research attention. In practical applications, massive amounts of 2D and 3D data need expensive annotation, which would make labels scarce. Even worse, complicated heterogeneous relationships between 2D and 3D data make the problem more chall...
Article
Hashing aims to compress raw data into compact binary descriptors, which has drawn increasing interest for efficient large-scale image retrieval. Current deep hashing often employs evaluation protocols where usually query data and training data are from similar distributions. However, more realistic evaluations should take into account a broad spec...
Article
Graph neural networks (GNNs) have achieved great success recently on graph classification tasks using supervised end-to-end training. Unfortunately, extensive noisy graph labels could exist in the real world because of the complicated processes of manual graph data annotations, which may significantly degrade the performance of GNNs. Therefore, we...
Preprint
Large-scale "pre-train and prompt learning" paradigms have demonstrated remarkable adaptability, enabling broad applications across diverse domains such as question answering, image recognition, and multimodal retrieval. This approach fully leverages the potential of large-scale pre-trained models, reducing downstream data requirements and computat...
Article
Graph classification is a critical task in numerous multimedia applications, where graphs are employed to represent diverse types of multimedia data, including images, videos, and social networks. Nevertheless, in the real world, labeled graph data are always limited or scarce. To address this issue, we focus on the semi-supervised graph classifica...
Preprint
Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer fro...
Article
Hashing has received significant interest in large-scale data retrieval due to its outstanding computational efficiency. Of late, numerous deep hashing approaches have emerged, which have obtained impressive performance. However, these approaches can contain ethical risks during image retrieval. To address this, we are the first to study the proble...
Article
Loss functions and sample mining strategies are essential components in deep metric learning algorithms. However, the existing loss function or mining strategy often necessitate the incorporation of additional hyperparameters, notably the threshold, which defines whether the sample pair is informative. The threshold provides a stable numerical stan...
Article
Learning with noisy labels has gained increasing attention because the inevitable imperfect labels in real-world scenarios can substantially hurt the deep model performance. Recent studies tend to regard low-loss samples as clean ones and discard high-loss ones to alleviate the negative impact of noisy labels. However, real-world datasets contain n...
Article
Domain adaptive hashing has received increasing attention since it is capable of enhancing the performance of retrieval if the target domain for testing meets domain shift. However, owing to data security and transmission constraints nowadays, abundant source data is often not available. Towards this end, this paper investigates a novel yet practic...
Article
Graph neural networks (GNNs) have emerged as powerful tools for graph classification tasks. However, contemporary graph classification methods are predominantly studied in fully supervised scenarios, while there could be label ambiguity and noise in real-world applications. In this work, we explore the weakly supervised problem of partial label lea...
Article
As an important problem in searching system development, domain adaptive retrieval seeks to train a retrieval model with both labeled source samples and unlabeled target samples. Although several domain adaptive hashing algorithms have been proposed to handle the problem with high efficiency, they often presume that source and target domains share...
Article
This paper delves into the problem of correlated time-series forecasting in practical applications, an area of growing interest in a multitude of fields such as stock price prediction and traffic demand analysis. Current methodologies primarily represent data using conventional graph structures, yet these fail to capture intricate structures with n...
Article
The optical flow guidance strategy is ideal for obtaining motion information of objects in the video. It is widely utilized in video segmentation tasks. However, existing optical flow-based methods have a significant dependency on optical flow, which results in poor performance when the optical flow estimation fails for a particular scene. The temp...
Article
This paper studies deep unsupervised hashing, which has attracted increasing attention in large-scale image retrieval. The majority of recent approaches usually reconstruct semantic similarity information, which then guides the hash code learning. However, they still fail to achieve satisfactory performance in reality due to two reasons. On the one...
Preprint
We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-trained 2D model. Surprisingly, such an ensemble, though seems trivial, has hardly been shown effective in recent 2D-3D models. We find out the crux is the less effective training for the ''join...
Article
Recent years have witnessed the explosive growth of interaction behaviors in multimedia information systems, where multi-behavior recommender systems have received increasing attention by leveraging data from various auxiliary behaviors such as tip and collect. Among various multi-behavior recommendation methods, non-sampling methods have shown sup...
Preprint
We show that classifiers trained with random region proposals achieve state-of-the-art Open-world Object Detection (OWOD): they can not only maintain the accuracy of the known objects (w/ training labels), but also considerably improve the recall of unknown ones (w/o training labels). Specifically, we propose RandBox, a Fast R-CNN based architectur...
Preprint
Full-text available
Although graph neural networks (GNNs) have achieved impressive achievements in graph classification, they often need abundant task-specific labels, which could be extensively costly to acquire. A credible solution is to explore additional labeled graphs to enhance unsupervised learning on the target domain. However, how to apply GNNs to domain adap...
Preprint
Full-text available
In this paper, we investigate the challenge of spatio-temporal video prediction, which involves generating future videos based on historical data streams. Existing approaches typically utilize external information such as semantic maps to enhance video prediction, which often neglect the inherent physical knowledge embedded within videos. Furthermo...
Preprint
Existing knowledge distillation works for semantic segmentation mainly focus on transferring high-level contextual knowledge from teacher to student. However, low-level texture knowledge is also of vital importance for characterizing the local structural pattern and global statistical property, such as boundary, smoothness, regularity and color con...
Article
Full-text available
With the increasing scale of urbanization, traffic congestion has caused a severe negative impact on the efficiency of social development. To this end, a series of intelligent traffic light control methods based on reinforcement learning are proposed. They get superior performance compared with conventional control methods under certain conditions....
Preprint
Full-text available
This paper studies semi-supervised graph classification, a crucial task with a wide range of applications in social network analysis and bioinformatics. Recent works typically adopt graph neural networks to learn graph-level representations for classification, failing to explicitly leverage features derived from graph topology (e.g., paths). Moreov...
Chapter
Video frame interpolation (VFI) aims to improve the temporal resolution of a video sequence. Most of the existing deep learning based VFI methods adopt off-the-shelf optical flow algorithms to estimate the bidirectional flows and interpolate the missing frames accordingly. Though having achieved a great success, these methods require much human exp...
Chapter
In this paper, we propose an embarrassingly simple yet highly effective adversarial domain adaptation (ADA) method. We view ADA problem primarily from an optimization perspective and point out a fundamental dilemma, in that the real-world data often exhibits an imbalanced distribution where the large data clusters typically dominate and bias the ad...
Article
Full-text available
This paper studies the problem of unsupervised domain adaptive hashing, which is less-explored but emerging for efficient image retrieval, particularly for cross-domain retrieval. This problem is typically tackled by learning hashing networks with pseudo-labeling and domain alignment techniques. Nevertheless, these approaches usually suffer from ov...
Article
Person re-identification (Re-ID) aims to retrieve person images from a large gallery given a query image of a person of interest. Global information and fine-grained local features are both essential for the representation. However, global embedding learned by naive classification model tends to be trapped in the most discriminative local region, l...
Preprint
Few-shot semantic segmentation is the task of learning to locate each pixel of the novel class in the query image with only a few annotated support images. The current correlation-based methods construct pair-wise feature correlations to establish the many-to-many matching because the typical prototype-based approaches cannot learn fine-grained cor...
Article
This paper studies the problem of semi-supervised learning on graphs, which aims to incorporate ubiquitous unlabeled knowledge (e.g., graph topology, node attributes) with few-available labeled knowledge (e.g., node class) to alleviate the scarcity issue of supervised information on node classification. While promising results are achieved, existin...
Article
Few-shot semantic segmentation is the task of learning to locate each pixel of the novel class in the query image with only a few annotated support images. The current correlation-based methods construct pair-wise feature correlations to establish the many-to-many matching because the typical prototype-based approaches cannot learn fine-grained cor...
Article
Due to the excellent computing efficiency, learning to hash has acquired broad popularity for Big Data retrieval. Although supervised hashing methods have achieved promising performance recently, they presume that all training samples are appropriately annotated. Unfortunately, label noise is ubiquitous owing to erroneous annotations in real-world...
Chapter
Transformer-based methods have recently achieved great advancement on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers on video data will bring heavy computation and memory burdens due to the largely increased number of patches and the quadratic complexity of se...
Chapter
Since Intersection-over-Union (IoU) based optimization maintains the consistency of the final IoU prediction metric and losses, it has been widely used in both regression and classification branches of single-stage 2D object detectors. Recently, several 3D object detection methods adopt IoU-based optimization and directly replace the 2D IoU with 3D...
Chapter
In contrast to the fully supervised methods using pixel-wise mask labels, box-supervised instance segmentation takes advantage of the simple box annotations, which has recently attracted a lot of research attentions. In this paper, we propose a novel single-shot box-supervised instance segmentation approach, which integrates the classical level set...
Chapter
Out-Of-Distribution generalization (OOD) is all about learning invariance against environmental changes. If the context (In this paper, the word “context” denotes any class-agnostic attributes such as color, texture and background. The formal definition can be found in Appendix, A.2.) in every class is evenly distributed, OOD would be trivial becau...
Article
Recently, unsupervised person re-identification (Re-ID) has received increasing research attention due to its potential for label-free applications. A promising way to address unsupervised Re-ID is clustering-based, which generates pseudo labels by clustering and uses the pseudo labels to train a Re-ID model iteratively. However, most clustering-ba...
Conference Paper
Hashing, which encodes raw data into compact binary codes, has grown in popularity for large-scale image retrieval due to its storage and computation efficiency. Although deep supervised hashing has lately shown promising performance, they mostly assume that the semantic labels of training data are ideally noise-free, which is often unrealistic in...
Chapter
In this paper, we explore the details in video recognition with the aim to improve the accuracy. It is observed that most failure cases in recent works fall on the mis-classifications among very similar actions (such as high kick vs. side kick) that need a capturing of fine-grained discriminative details. To solve this problem, we propose synopsis-...
Preprint
Out-Of-Distribution generalization (OOD) is all about learning invariance against environmental changes. If the context in every class is evenly distributed, OOD would be trivial because the context can be easily removed due to an underlying principle: class is invariant to context. However, collecting such a balanced dataset is impractical. Learni...
Preprint
Conventional de-noising methods rely on the assumption that all samples are independent and identically distributed, so the resultant classifier, though disturbed by noise, can still easily identify the noises as the outliers of training distribution. However, the assumption is unrealistic in large-scale data that is inevitably long-tailed. Such im...
Preprint
Transformer-based methods have recently achieved great advancement on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers on video data will bring heavy computation and memory burdens due to the largely increased number of patches and the quadratic complexity of se...
Preprint
Full-text available
Since Intersection-over-Union (IoU) based optimization maintains the consistency of the final IoU prediction metric and losses, it has been widely used in both regression and classification branches of single-stage 2D object detectors. Recently, several 3D object detection methods adopt IoU-based optimization and directly replace the 2D IoU with 3D...
Preprint
Full-text available
Leveraging StyleGAN's expressivity and its disentangled latent codes, existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images. An intriguing yet challenging problem arises: Can generative models achieve counterfactual editing against their learnt priors? Due to the lack of counterfactual...
Conference Paper
Full-text available
This paper studies semi-supervised graph classification, a crucial task with a wide range of applications in social network analysis and bioinformatics. Recent works typically adopt graph neural networks to learn graph-level representations for classification, failing to explicitly leverage features derived from graph topology (e.g., paths). Moreov...
Preprint
Semi-Supervised Learning (SSL) is fundamentally a missing label problem, in which the label Missing Not At Random (MNAR) problem is more realistic and challenging, compared to the widely-adopted yet naive Missing Completely At Random assumption where both labeled and unlabeled data share the same class distribution. Different from existing SSL solu...
Article
We address the overlooked unbiasedness in existing long-tailed classification methods: we find that their overall improvement is mostly attributed to the biased preference of "tail" over "head", as the test distribution is assumed to be balanced; however, when the test is as imbalanced as the long-tailed training data---let the test respect Zipf's...
Conference Paper
Full-text available
This paper proposes a novel active boundary loss for semantic segmentation. It can progressively encourage the alignment between predicted boundaries and ground-truth boundaries during end-to-end training, which is not explicitly enforced in commonly used cross-entropy loss. Based on the predicted boundaries detected from the segmentation results u...
Article
This article studies self-supervised graph representation learning, which is critical to various tasks, such as protein property prediction. Existing methods typically aggregate representations of each individual node as graph representations, but fail to comprehensively explore local substructures (i.e., motifs and subgraphs), which also play impo...
Preprint
Full-text available
This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of $\times$4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single ima...
Article
Full-text available
Nearest neighbor search aims to obtain the samples in the database with the smallest distances from them to the queries, which is a basic task in a range of fields, including computer vision and data mining. Hashing is one of the most widely used methods for its computational and storage efficiency. With the development of deep learning, deep hashi...
Preprint
Monocular 3D object detection is an essential task in autonomous driving. However, most current methods consider each 3D object in the scene as an independent training sample, while ignoring their inherent geometric relations, thus inevitably resulting in a lack of leveraging spatial constraints. In this paper, we propose a novel method that takes...
Preprint
Full-text available
The application of cross-dataset training in object detection tasks is complicated because the inconsistency in the category range across datasets transforms fully supervised learning into semi-supervised learning. To address this problem, recent studies focus on the generation of high-quality missing annotations. In this study, we first point out...
Preprint
Full-text available
Extracting class activation maps (CAM) is arguably the most standard step of generating pseudo masks for weakly-supervised semantic segmentation (WSSS). Yet, we find that the crux of the unsatisfactory pseudo masks is the binary cross-entropy loss (BCE) widely used in CAM. Specifically, due to the sum-over-class pooling nature of BCE, each pixel in...
Preprint
Recently, unsupervised person re-identification (Re-ID) has received increasing research attention due to its potential for label-free applications. A promising way to address unsupervised Re-ID is clustering-based, which generates pseudo labels by clustering and uses the pseudo labels to train a Re-ID model iteratively. However, most clustering-ba...
Article
Model fine-tuning is a widely used transfer learning approach in person Re-identification (ReID) applications, which fine-tuning a pre-trained feature extraction model into the target scenario instead of training a model from scratch. It is challenging due to the significant variations inside the target scenario, e.g., different camera viewpoint, i...
Article
Webly-supervised fine-grained visual classification (FGVC) has attracted increasing attention in recent years because of the unaffordable cost of obtaining correctly-labeled large-scale fine-grained datasets. However, due to the existence of label noise in web images and the high memorization capacity of deep neural networks, training deep fine-gra...
Article
Temporal action proposal generation aims at localizing the temporal segments containing human actions in a video. This work proposes a centerness-aware network (CAN), which is a novel one-stage approach intended to generate action proposals as keypoint triplets. A keypoint triplet contains two boundary points (starting and ending) and one center po...
Preprint
Full-text available
We address the overlooked unbiasedness in existing long-tailed classification methods: we find that their overall improvement is mostly attributed to the biased preference of tail over head, as the test distribution is assumed to be balanced; however, when the test is as imbalanced as the long-tailed training data -- let the test respect Zipf's law...
Article
We study the task of single person dense pose estimation. Specifically, given a human-centric image, we learn to map all human pixels onto a 3D, surface-based human body model. Existing methods approach this problem by fitting deep convolutional networks on sparse annotated points where the regression on both surface coordinate components for each...
Preprint
Full-text available
Unsupervised Person Re-identification (U-ReID) with pseudo labeling recently reaches a competitive performance compared to fully-supervised ReID methods based on modern clustering algorithms. However, such clustering-based scheme becomes computationally prohibitive for large-scale datasets. How to efficiently leverage endless unlabeled data with li...
Preprint
Finding a suitable density function is essential for density-based clustering algorithms such as DBSCAN and DPC. A naive density corresponding to the indicator function of a unit $d$-dimensional Euclidean ball is commonly used in these algorithms. Such density suffers from capturing local features in complex datasets. To tackle this issue, we propo...
Article
With the rise of deep learning methods, person Re-Identification (ReID) performance has been improved tremendously in many public datasets. However, most public ReID datasets are collected in a short time window in which persons' appearance rarely changes. In real-world applications such as in a shopping mall, the same person may change their weari...
Article
The application of cross-dataset training in object detection tasks is complicated because the inconsistency in the category range across datasets transforms fully supervised learning into semi-supervised learning. To address this problem, recent studies focus on the generation of high-quality missing annotations. In this study, we first specify th...
Article
Weakly supervised object detection (WSOD), which is an effective way to train an object detection model using only image-level annotations, has attracted considerable attention from researchers. However, most of the existing methods, which are based on multiple instance learning (MIL), tend to localize instances to the discriminative parts of salie...

Network

Cited By