
Xian-Sheng Hua
- PhD
- Senior Researcher at Microsoft
About
Publications: 515
Reads: 57,604
Citations: 20,252
Publications (515)
With the burst of big data, 2D-3D cross-modal retrieval has received increasing attention, which aims to retrieve relevant data from one modality given the query from the other modality. In this paper, we study an underexplored yet practical problem of semi-supervised 2D-3D cross-modal retrieval, which could suffer from serious label scarcity in re...
Achieving a balance between accuracy and efficiency is a critical challenge in facial landmark detection (FLD). This paper introduces Parallel Optimal Position Search (POPoS), a high-precision encoding-decoding framework designed to address the limitations of traditional FLD methods. POPoS employs three key contributions: (1) Pseudo-range multilate...
Pioneering text-to-image (T2I) diffusion models have ushered in a new era of real-world image super-resolution (Real-ISR), significantly enhancing the visual perception of reconstructed images. However, existing methods typically integrate uniform abstract textual semantics across all blocks, overlooking the distinct semantic requirements at differ...
This paper studies the problem of semi-supervised learning on graphs, which has recently aroused widespread interest in relational data mining. The focal point of exploration in this area has been the utilization of graph neural networks (GNNs), which stand out for excellent performance. Previous methods, however, typically rely on the limited labe...
With the emergence of AI generated content, cross-modal retrieval of 2D and 3D data has obtained increasing research attention. In practical applications, massive amounts of 2D and 3D data need expensive annotation, which would make labels scarce. Even worse, complicated heterogeneous relationships between 2D and 3D data make the problem more chall...
Hashing aims to compress raw data into compact binary descriptors, which has drawn increasing interest for efficient large-scale image retrieval. Current deep hashing often employs evaluation protocols where usually query data and training data are from similar distributions. However, more realistic evaluations should take into account a broad spec...
Nan Yin, Li Shen, Chong Chen, [...], Xiao Luo
Graph neural networks (GNNs) have achieved great success recently on graph classification tasks using supervised end-to-end training. Unfortunately, extensive noisy graph labels could exist in the real world because of the complicated processes of manual graph data annotations, which may significantly degrade the performance of GNNs. Therefore, we...
Large-scale "pre-train and prompt learning" paradigms have demonstrated remarkable adaptability, enabling broad applications across diverse domains such as question answering, image recognition, and multimodal retrieval. This approach fully leverages the potential of large-scale pre-trained models, reducing downstream data requirements and computat...
Graph classification is a critical task in numerous multimedia applications, where graphs are employed to represent diverse types of multimedia data, including images, videos, and social networks. Nevertheless, in the real world, labeled graph data are always limited or scarce. To address this issue, we focus on the semi-supervised graph classifica...
Deep metric learning (DML) aims to learn a discriminative high-dimensional embedding space for downstream tasks like classification, clustering, and retrieval. Prior literature predominantly focuses on pair-based and proxy-based methods to maximize inter-class discrepancy and minimize intra-class diversity. However, these methods tend to suffer fro...
Hashing has received significant interest in large-scale data retrieval due to its outstanding computational efficiency. Of late, numerous deep hashing approaches have emerged, which have obtained impressive performance. However, these approaches can contain ethical risks during image retrieval. To address this, we are the first to study the proble...
Loss functions and sample mining strategies are essential components in deep metric learning algorithms. However, existing loss functions and mining strategies often require additional hyperparameters, notably the threshold, which defines whether a sample pair is informative. The threshold provides a stable numerical stan...
Learning with noisy labels has gained increasing attention because the inevitable imperfect labels in real-world scenarios can substantially hurt the deep model performance. Recent studies tend to regard low-loss samples as clean ones and discard high-loss ones to alleviate the negative impact of noisy labels. However, real-world datasets contain n...
Domain adaptive hashing has received increasing attention because it can improve retrieval performance when the target test domain exhibits domain shift. However, owing to data security and transmission constraints, abundant source data is often not available. To this end, this paper investigates a novel yet practic...
Graph neural networks (GNNs) have emerged as powerful tools for graph classification tasks. However, contemporary graph classification methods are predominantly studied in fully supervised scenarios, while there could be label ambiguity and noise in real-world applications. In this work, we explore the weakly supervised problem of partial label lea...
As an important problem in searching system development, domain adaptive retrieval seeks to train a retrieval model with both labeled source samples and unlabeled target samples. Although several domain adaptive hashing algorithms have been proposed to handle the problem with high efficiency, they often presume that source and target domains share...
Nan Yin, Li Shen, Huan Xiong, [...], Xiao Luo
This paper delves into the problem of correlated time-series forecasting in practical applications, an area of growing interest in a multitude of fields such as stock price prediction and traffic demand analysis. Current methodologies primarily represent data using conventional graph structures, yet these fail to capture intricate structures with n...
The optical flow guidance strategy is ideal for obtaining motion information of objects in the video. It is widely utilized in video segmentation tasks. However, existing optical flow-based methods have a significant dependency on optical flow, which results in poor performance when the optical flow estimation fails for a particular scene. The temp...
This paper studies deep unsupervised hashing, which has attracted increasing attention in large-scale image retrieval. The majority of recent approaches usually reconstruct semantic similarity information, which then guides the hash code learning. However, they still fail to achieve satisfactory performance in reality due to two reasons. On the one...
We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-trained 2D model. Surprisingly, such an ensemble, though seemingly trivial, has hardly been shown effective in recent 2D-3D models. We find that the crux is the less effective training for the "join...
Recent years have witnessed the explosive growth of interaction behaviors in multimedia information systems, where multi-behavior recommender systems have received increasing attention by leveraging data from various auxiliary behaviors such as tip and collect. Among various multi-behavior recommendation methods, non-sampling methods have shown sup...
We show that classifiers trained with random region proposals achieve state-of-the-art Open-world Object Detection (OWOD): they can not only maintain the accuracy of the known objects (w/ training labels), but also considerably improve the recall of unknown ones (w/o training labels). Specifically, we propose RandBox, a Fast R-CNN based architectur...
Although graph neural networks (GNNs) have achieved impressive results in graph classification, they often need abundant task-specific labels, which can be extremely costly to acquire. A credible solution is to explore additional labeled graphs to enhance unsupervised learning on the target domain. However, how to apply GNNs to domain adap...
Hao Wu, Wei Xion, Fan Xu, [...], Haixin Wang
In this paper, we investigate the challenge of spatio-temporal video prediction, which involves generating future videos based on historical data streams. Existing approaches typically utilize external information such as semantic maps to enhance video prediction, which often neglect the inherent physical knowledge embedded within videos. Furthermo...
Existing knowledge distillation works for semantic segmentation mainly focus on transferring high-level contextual knowledge from teacher to student. However, low-level texture knowledge is also of vital importance for characterizing the local structural pattern and global statistical property, such as boundary, smoothness, regularity and color con...
With the increasing scale of urbanization, traffic congestion has severely harmed the efficiency of social development. To this end, a series of intelligent traffic light control methods based on reinforcement learning have been proposed. They achieve superior performance compared with conventional control methods under certain conditions....
Wei Ju, Xiao Luo, Meng Qu, [...], Ming Zhang
This paper studies semi-supervised graph classification, a crucial task with a wide range of applications in social network analysis and bioinformatics. Recent works typically adopt graph neural networks to learn graph-level representations for classification, failing to explicitly leverage features derived from graph topology (e.g., paths). Moreov...
Video frame interpolation (VFI) aims to improve the temporal resolution of a video sequence. Most existing deep learning based VFI methods adopt off-the-shelf optical flow algorithms to estimate the bidirectional flows and interpolate the missing frames accordingly. Although they have achieved great success, these methods require much human exp...
In this paper, we propose an embarrassingly simple yet highly effective adversarial domain adaptation (ADA) method. We view the ADA problem primarily from an optimization perspective and point out a fundamental dilemma: real-world data often exhibits an imbalanced distribution where the large data clusters typically dominate and bias the ad...
This paper studies the problem of unsupervised domain adaptive hashing, which is less-explored but emerging for efficient image retrieval, particularly for cross-domain retrieval. This problem is typically tackled by learning hashing networks with pseudo-labeling and domain alignment techniques. Nevertheless, these approaches usually suffer from ov...
Person re-identification (Re-ID) aims to retrieve person images from a large gallery given a query image of a person of interest. Global information and fine-grained local features are both essential for the representation. However, the global embedding learned by a naive classification model tends to be trapped in the most discriminative local region, l...
Few-shot semantic segmentation is the task of learning to locate each pixel of the novel class in the query image with only a few annotated support images. The current correlation-based methods construct pair-wise feature correlations to establish the many-to-many matching because the typical prototype-based approaches cannot learn fine-grained cor...
This paper studies the problem of semi-supervised learning on graphs, which aims to incorporate ubiquitous unlabeled knowledge (e.g., graph topology, node attributes) with few-available labeled knowledge (e.g., node class) to alleviate the scarcity issue of supervised information on node classification. While promising results are achieved, existin...
Due to the excellent computing efficiency, learning to hash has acquired broad popularity for Big Data retrieval. Although supervised hashing methods have achieved promising performance recently, they presume that all training samples are appropriately annotated. Unfortunately, label noise is ubiquitous owing to erroneous annotations in real-world...
Transformer-based methods have recently achieved great advancement on 2D image-based vision tasks. For 3D video-based tasks such as action recognition, however, directly applying spatiotemporal transformers on video data will bring heavy computation and memory burdens due to the largely increased number of patches and the quadratic complexity of se...
Since Intersection-over-Union (IoU) based optimization maintains the consistency of the final IoU prediction metric and losses, it has been widely used in both regression and classification branches of single-stage 2D object detectors. Recently, several 3D object detection methods adopt IoU-based optimization and directly replace the 2D IoU with 3D...
In contrast to fully supervised methods that use pixel-wise mask labels, box-supervised instance segmentation takes advantage of simple box annotations and has recently attracted considerable research attention. In this paper, we propose a novel single-shot box-supervised instance segmentation approach, which integrates the classical level set...
Out-Of-Distribution generalization (OOD) is all about learning invariance against environmental changes. If the context (here, “context” denotes any class-agnostic attribute such as color, texture, and background; the formal definition can be found in Appendix A.2) in every class is evenly distributed, OOD would be trivial becau...
Recently, unsupervised person re-identification (Re-ID) has received increasing research attention due to its potential for label-free applications. A promising way to address unsupervised Re-ID is clustering-based, which generates pseudo labels by clustering and uses the pseudo labels to train a Re-ID model iteratively. However, most clustering-ba...
Hashing, which encodes raw data into compact binary codes, has grown in popularity for large-scale image retrieval due to its storage and computation efficiency. Although deep supervised hashing methods have lately shown promising performance, they mostly assume that the semantic labels of training data are ideally noise-free, which is often unrealistic in...
In this paper, we explore the details in video recognition with the aim of improving accuracy. We observe that most failure cases in recent works are misclassifications among very similar actions (such as high kick vs. side kick) that require capturing fine-grained discriminative details. To solve this problem, we propose synopsis-...
Out-Of-Distribution generalization (OOD) is all about learning invariance against environmental changes. If the context in every class is evenly distributed, OOD would be trivial because the context can be easily removed due to an underlying principle: class is invariant to context. However, collecting such a balanced dataset is impractical. Learni...
Conventional de-noising methods rely on the assumption that all samples are independent and identically distributed, so the resultant classifier, though disturbed by noise, can still easily identify the noises as the outliers of training distribution. However, the assumption is unrealistic in large-scale data that is inevitably long-tailed. Such im...
Leveraging StyleGAN's expressivity and its disentangled latent codes, existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images. An intriguing yet challenging problem arises: Can generative models achieve counterfactual editing against their learnt priors? Due to the lack of counterfactual...
Semi-Supervised Learning (SSL) is fundamentally a missing label problem, in which the label Missing Not At Random (MNAR) problem is more realistic and challenging, compared to the widely-adopted yet naive Missing Completely At Random assumption where both labeled and unlabeled data share the same class distribution. Different from existing SSL solu...
We address the overlooked unbiasedness in existing long-tailed classification methods: we find that their overall improvement is mostly attributed to the biased preference of "tail" over "head", as the test distribution is assumed to be balanced; however, when the test is as imbalanced as the long-tailed training data -- let the test respect Zipf's...
This paper proposes a novel active boundary loss for semantic segmentation. It can progressively encourage the alignment between predicted boundaries and ground-truth boundaries during end-to-end training, which is not explicitly enforced in commonly used cross-entropy loss. Based on the predicted boundaries detected from the segmentation results u...
Xiao Luo, Wei Ju, Meng Qu, [...], Ming Zhang
This article studies self-supervised graph representation learning, which is critical to various tasks, such as protein property prediction. Existing methods typically aggregate representations of each individual node as graph representations, but fail to comprehensively explore local substructures (i.e., motifs and subgraphs), which also play impo...
This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of ×4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single ima...
Nearest neighbor search aims to retrieve the database samples with the smallest distances to the queries and is a basic task in a range of fields, including computer vision and data mining. Hashing is one of the most widely used methods because of its computational and storage efficiency. With the development of deep learning, deep hashi...
Monocular 3D object detection is an essential task in autonomous driving. However, most current methods consider each 3D object in the scene as an independent training sample, while ignoring their inherent geometric relations, thus inevitably resulting in a lack of leveraging spatial constraints. In this paper, we propose a novel method that takes...
The application of cross-dataset training in object detection tasks is complicated because the inconsistency in the category range across datasets transforms fully supervised learning into semi-supervised learning. To address this problem, recent studies focus on the generation of high-quality missing annotations. In this study, we first point out...
Extracting class activation maps (CAM) is arguably the most standard step of generating pseudo masks for weakly-supervised semantic segmentation (WSSS). Yet, we find that the crux of the unsatisfactory pseudo masks is the binary cross-entropy loss (BCE) widely used in CAM. Specifically, due to the sum-over-class pooling nature of BCE, each pixel in...
Model fine-tuning is a widely used transfer learning approach in person re-identification (ReID) applications, which fine-tunes a pre-trained feature extraction model for the target scenario instead of training a model from scratch. It is challenging due to the significant variations inside the target scenario, e.g., different camera viewpoint, i...
Webly-supervised fine-grained visual classification (FGVC) has attracted increasing attention in recent years because of the unaffordable cost of obtaining correctly-labeled large-scale fine-grained datasets. However, due to the existence of label noise in web images and the high memorization capacity of deep neural networks, training deep fine-gra...
Temporal action proposal generation aims at localizing the temporal segments containing human actions in a video. This work proposes a centerness-aware network (CAN), which is a novel one-stage approach intended to generate action proposals as keypoint triplets. A keypoint triplet contains two boundary points (starting and ending) and one center po...
We address the overlooked unbiasedness in existing long-tailed classification methods: we find that their overall improvement is mostly attributed to the biased preference of tail over head, as the test distribution is assumed to be balanced; however, when the test is as imbalanced as the long-tailed training data -- let the test respect Zipf's law...
We study the task of single person dense pose estimation. Specifically, given a human-centric image, we learn to map all human pixels onto a 3D, surface-based human body model. Existing methods approach this problem by fitting deep convolutional networks on sparse annotated points where the regression on both surface coordinate components for each...
Unsupervised person re-identification (U-ReID) with pseudo labeling, based on modern clustering algorithms, has recently reached performance competitive with fully supervised ReID methods. However, such clustering-based schemes become computationally prohibitive for large-scale datasets. How to efficiently leverage endless unlabeled data with li...
Finding a suitable density function is essential for density-based clustering algorithms such as DBSCAN and DPC. A naive density corresponding to the indicator function of a unit $d$-dimensional Euclidean ball is commonly used in these algorithms. Such a density struggles to capture local features in complex datasets. To tackle this issue, we propo...
With the rise of deep learning methods, person Re-Identification (ReID) performance has been improved tremendously in many public datasets. However, most public ReID datasets are collected in a short time window in which persons' appearance rarely changes. In real-world applications such as in a shopping mall, the same person may change their weari...
The application of cross-dataset training in object detection tasks is complicated because the inconsistency in the category range across datasets transforms fully supervised learning into semi-supervised learning. To address this problem, recent studies focus on the generation of high-quality missing annotations. In this study, we first specify th...
Weakly supervised object detection (WSOD), which is an effective way to train an object detection model using only image-level annotations, has attracted considerable attention from researchers. However, most of the existing methods, which are based on multiple instance learning (MIL), tend to localize instances to the discriminative parts of salie...