Xuelong Li’s research while affiliated with Machine Intelligence Research Institute and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (803)


Fig. 1. Different levels of IQA indicator research. Levels 1 and 2 focus primarily on indicator-level evaluations, while level 3 (ours) extends them to a system-level perspective.
Fig. 3. Examples of color denoised images from the proposed dataset.
Fig. 4. Overall architecture and construction process of our SEM-based indicator system.
Fig. 5. IQA indicator system for gray-scale image denoising.
Fig. 7. Scatter plots of subjective scores against predicted scores of our indicator system on TID2008, TID2013, KADID, and PIPAL datasets.


Structural-Equation-Modeling-Based Indicator Systems for Image Quality Assessment
  • Article
  • Full-text available

May 2025

·

34 Reads

IEEE Transactions on Pattern Analysis and Machine Intelligence

·

Junxi Lin

·

Xuelong Li

Recent advancements in image denoising algorithms have significantly improved visual performance. However, they also introduce new challenges for image quality assessment (IQA) indicators to provide evaluations that align with human visual perception. To address the limitations of current single-indicator methods, we propose a comprehensive IQA framework that integrates multiple indicators to achieve a holistic assessment of image quality. We first develop a large-scale denoised image dataset to show the diversity of distortions. Then, we employ structural equation modeling to establish correlations among three fundamental aspects of image quality, i.e., structural similarity, information loss, and perceptual gain. Through a series of regressions and iterative refinements, we eliminate indicators with low accuracy and high redundancy, thus resulting in a robust and optimal indicator system. Finally, we systematically validate the reliability and effectiveness of the proposed system through statistical analysis and evaluate its performance across three key tasks, i.e., image quality prediction, IQA indicator comparison, and denoising algorithm optimization. Experimental results demonstrate that the proposed system not only offers highly reliable and valid assessments but also provides valuable insights for the analysis and application of IQA indicators.
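
The abstract describes fusing several IQA indicators into one system through regressions and redundancy elimination. As a rough illustration only (the paper's structural equation modeling is not reproduced), the sketch below fuses indicator scores into a quality prediction with plain least-squares regression against subjective scores and prunes near-collinear indicators; the indicator names, data, and pruning rule are assumptions.

```python
# Minimal sketch (not the paper's SEM pipeline): fuse several IQA indicator
# scores into one quality prediction by regressing them against subjective
# mean opinion scores (MOS), then prune indicators that add little.
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)

# Toy data: rows = images, columns = indicator scores
# (e.g., a structural-similarity, an information-loss, a perceptual indicator).
X = rng.random((200, 3))
mos = 0.6 * X[:, 0] + 0.3 * X[:, 2] + 0.05 * rng.standard_normal(200)

def fit_weights(X, y):
    """Least-squares weights mapping indicator scores to predicted MOS."""
    A = np.column_stack([X, np.ones(len(X))])      # add intercept column
    w, *_ = lstsq(A, y, rcond=None)
    return w

def prune_redundant(X, threshold=0.95):
    """Drop indicators that are nearly collinear with an already kept one."""
    keep = []
    for j in range(X.shape[1]):
        redundant = any(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) > threshold
                        for k in keep)
        if not redundant:
            keep.append(j)
    return keep

cols = prune_redundant(X)
w = fit_weights(X[:, cols], mos)
pred = np.column_stack([X[:, cols], np.ones(len(X))]) @ w
print("kept indicators:", cols, "Pearson r:", np.corrcoef(pred, mos)[0, 1])
```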


On the Value of Myopic Behavior in Policy Reuse

April 2025

IEEE Transactions on Pattern Analysis and Machine Intelligence

Chenjia Bai

·

Kang Xu

·

Shuang Qiu

·

[...]

·

Xuelong Li

Leveraging learned strategies in unfamiliar scenarios is fundamental to human intelligence. In reinforcement learning, rationally reusing the policies acquired from other tasks or human experts is critical for tackling problems that are difficult to learn from scratch. In this work, we present a framework called Selective Myopic bEhavior Control (SMEC), which results from the insight that the short-term behaviors of prior policies are sharable across tasks. By evaluating the behaviors of prior policies via a hybrid value function architecture, SMEC adaptively aggregates the sharable short-term behaviors of prior policies and the long-term behaviors of the task policy, leading to coordinated decisions. Empirical results on a collection of manipulation and locomotion tasks demonstrate that SMEC outperforms existing methods, and validate the ability of SMEC to leverage related prior policies.
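
To make the reuse idea concrete, here is a minimal sketch of value-guided action selection between a task policy and prior policies, in the spirit of the abstract's hybrid value function; all function signatures are placeholders and this is not the SMEC implementation.

```python
# Sketch: reuse a prior policy's action only when a short-horizon value
# estimate says it beats the current task policy's action.
from typing import Callable, Sequence
import numpy as np

def select_action(
    state: np.ndarray,
    task_policy: Callable[[np.ndarray], np.ndarray],
    prior_policies: Sequence[Callable[[np.ndarray], np.ndarray]],
    q_value: Callable[[np.ndarray, np.ndarray], float],
    reuse_margin: float = 0.0,
) -> np.ndarray:
    """Return the action with the highest estimated value; prior-policy
    actions are reused only if they beat the task policy by `reuse_margin`."""
    task_action = task_policy(state)
    best_action, best_value = task_action, q_value(state, task_action)
    for prior in prior_policies:
        a = prior(state)
        v = q_value(state, a)
        if v > best_value + reuse_margin:
            best_action, best_value = a, v
    return best_action
```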


Adversarial Locomotion and Motion Imitation for Humanoid Policy Learning

April 2025

·

15 Reads

Humans exhibit diverse and expressive whole-body movements. However, attaining human-like whole-body coordination in humanoid robots remains challenging, as conventional approaches that mimic whole-body motions often neglect the distinct roles of upper and lower body. This oversight leads to computationally intensive policy learning and frequently causes robot instability and falls during real-world execution. To address these issues, we propose Adversarial Locomotion and Motion Imitation (ALMI), a novel framework that enables adversarial policy learning between upper and lower body. Specifically, the lower body aims to provide robust locomotion capabilities to follow velocity commands while the upper body tracks various motions. Conversely, the upper-body policy ensures effective motion tracking when the robot executes velocity-based movements. Through iterative updates, these policies achieve coordinated whole-body control, which can be extended to loco-manipulation tasks with teleoperation systems. Extensive experiments demonstrate that our method achieves robust locomotion and precise motion tracking in both simulation and on the full-size Unitree H1 robot. Additionally, we release a large-scale whole-body motion control dataset featuring high-quality episodic trajectories from MuJoCo simulations deployable on real robots. The project page is https://almi-humanoid.github.io.
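
A schematic of the alternating (adversarial-style) update scheme described in the abstract, where the lower-body and upper-body policies are trained in turn against each other's current behavior; the trainer and environment interfaces (`freeze`, `collect`, `update`) are assumed for illustration and are not the released ALMI code.

```python
# Alternating training sketch: each phase fixes one body-part policy and
# updates the other against it, then the roles swap.
def train_alternating(lower, upper, env, iterations=1000, steps_per_policy=10):
    for _ in range(iterations):
        # Phase 1: update the lower body to follow velocity commands while the
        # upper body executes (fixed) motion tracking.
        upper.freeze()
        lower.unfreeze()
        for _ in range(steps_per_policy):
            rollout = env.collect(lower=lower, upper=upper)
            lower.update(rollout, reward="velocity_tracking")

        # Phase 2: update the upper body to track reference motions while the
        # lower body provides (fixed) robust locomotion.
        lower.freeze()
        upper.unfreeze()
        for _ in range(steps_per_policy):
            rollout = env.collect(lower=lower, upper=upper)
            upper.update(rollout, reward="motion_tracking")
    return lower, upper
```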


Figure 3. Qualitative comparisons of depth predictions on the indoor dataset NYU. We show both depth maps and corresponding error maps. When dealing with large-scale and long-distance indoor scenes, our framework achieves better absolute depth recovery.
Metric-Solver: Sliding Anchored Metric Depth Estimation from a Single Image

April 2025

·

1 Read

Accurate and generalizable metric depth estimation is crucial for various computer vision applications but remains challenging due to the diverse depth scales encountered in indoor and outdoor environments. In this paper, we introduce Metric-Solver, a novel sliding anchor-based metric depth estimation method that dynamically adapts to varying scene scales. Our approach leverages an anchor-based representation, where a reference depth serves as an anchor to separate and normalize the scene depth into two components: scaled near-field depth and tapered far-field depth. The anchor acts as a normalization factor, enabling the near-field depth to be normalized within a consistent range while mapping far-field depth smoothly toward zero. Through this approach, any depth from zero to infinity in the scene can be represented within a unified representation, effectively eliminating the need to manually account for scene scale variations. More importantly, for the same scene, the anchor can slide along the depth axis, dynamically adjusting to different depth scales. A smaller anchor provides higher resolution in the near-field, improving depth precision for closer objects while a larger anchor improves depth estimation in far regions. This adaptability enables the model to handle depth predictions at varying distances and ensure strong generalization across datasets. Our design enables a unified and adaptive depth representation across diverse environments. Extensive experiments demonstrate that Metric-Solver outperforms existing methods in both accuracy and cross-dataset generalization.
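
To illustrate the anchor-based split of metric depth into a scaled near-field component and a far-field component that tapers toward zero, here is a small numerical sketch; the exact mappings (linear scaling below the anchor, reciprocal tapering above it) are assumptions made for illustration, not the paper's parameterization.

```python
# Illustrative anchor-based depth encoding: near-field depths normalize to
# [0, 1], far-field depths map to a value that goes to 0 as depth -> infinity.
import numpy as np

def encode(depth: np.ndarray, anchor: float):
    """Split metric depth into (near, far) components for a given anchor."""
    near = np.clip(depth / anchor, 0.0, 1.0)             # scaled near-field depth
    far = np.where(depth > anchor, anchor / depth, 1.0)  # tapers to 0 at infinity
    return near, far

def decode(near: np.ndarray, far: np.ndarray, anchor: float):
    """Invert the encoding back to metric depth."""
    return np.where(near < 1.0, near * anchor, anchor / np.maximum(far, 1e-6))

depth = np.array([0.5, 2.0, 10.0, 80.0])
near, far = encode(depth, anchor=5.0)
print(decode(near, far, anchor=5.0))  # recovers the original depths
```

Sliding the anchor then simply means re-encoding with a different anchor value: a small anchor devotes more of the near-field range to close objects, a large anchor spreads it over distant scenes.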


Figure 2. Method overview. (a) Given a video with four paired modalities, we first encode it into latents using a shared 3D-VAE encoder; (b) Then, concatenate them along the channel dimension and apply noise for video diffusion, where the denoised latents are then decoded into their respective modalities via modality-specific decoding heads; (c) Finally, each modality can be reconstructed into color space by the 3D-VAE decoder. During inference, the model enables various tasks by dynamically adjusting the role of each modality: (d) Text-to-video generation, where all modalities are denoised from pure noise and (e) X-conditioned generation, where the condition X is given and other modalities are denoised from pure noise. If X is RGB modality, the model will perform generative understanding.
Figure 3. Qualitative comparison of text-to-video generation. Our method generates more coherent and temporally consistent video sequences. Additionally, our method also produces multiple aligned visual modalities, which are not displayed here due to space constraints.
Figure 6. Visualization of multi-modal video understanding. Given a reference video (a), OmniVDiff can estimate aligned multiple visual understanding predictions in one diffusion process (b,c,d).
Figure 7. Video2Video Style Control. Given a reference video (a), OmniVDiff first estimates the corresponding depth (b) and uses it as a bridge to control the scene structure, enabling the generation of videos with diverse scene styles through text-based control (c,d,e).
OmniVDiff: Omni Controllable Video Diffusion for Generation and Understanding

April 2025

·

4 Reads

In this paper, we propose a novel framework for controllable video diffusion, OmniVDiff, aiming to synthesize and comprehend multiple video visual content in a single diffusion model. To achieve this, OmniVDiff treats all video visual modalities in the color space to learn a joint distribution, while employing an adaptive control strategy that dynamically adjusts the role of each visual modality during the diffusion process, either as a generation modality or a conditioning modality. This allows flexible manipulation of each modality's role, enabling support for a wide range of tasks. Consequently, our model supports three key functionalities: (1) Text-conditioned video generation: multi-modal visual video sequences (i.e., rgb, depth, canny, segmentation) are generated based on the text conditions in one diffusion process; (2) Video understanding: OmniVDiff can estimate the depth, canny map, and semantic segmentation across the input rgb frames while ensuring coherence with the rgb input; and (3) X-conditioned video generation: OmniVDiff generates videos conditioned on fine-grained attributes (e.g., depth maps or segmentation maps). By integrating these diverse tasks into a unified video diffusion framework, OmniVDiff enhances the flexibility and scalability for controllable video diffusion, making it an effective tool for a variety of downstream applications, such as video-to-video translation. Extensive experiments demonstrate the effectiveness of our approach, highlighting its potential for various video-related applications.
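
The core of the adaptive control strategy, as described in the abstract and Figure 2, is that every modality's latent is concatenated along the channel dimension, with conditioning modalities kept clean and generation modalities initialized from noise. The sketch below shows only that role-switching step; tensor shapes, modality names, and the role flag are illustrative assumptions rather than the paper's code.

```python
# Sketch of "adaptive modality roles": conditioning modalities keep their
# clean latents, generation modalities start from pure noise, and everything
# is concatenated channel-wise before denoising.
import torch

def build_diffusion_input(latents: dict, roles: dict) -> torch.Tensor:
    """latents: modality name -> latent tensor of shape (B, C, T, H, W)
    roles: modality name -> "condition" or "generate"."""
    parts = []
    for name, z in latents.items():
        if roles[name] == "condition":
            parts.append(z)                    # keep the clean latent
        else:
            parts.append(torch.randn_like(z))  # start generation from noise
    return torch.cat(parts, dim=1)             # channel-wise concatenation

# Example: depth-conditioned generation of rgb / canny / segmentation.
shape = (1, 4, 8, 32, 32)
latents = {m: torch.randn(shape) for m in ("rgb", "depth", "canny", "seg")}
roles = {"rgb": "generate", "depth": "condition",
         "canny": "generate", "seg": "generate"}
x = build_diffusion_input(latents, roles)
print(x.shape)  # (1, 16, 8, 32, 32)
```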


Why Does Dropping Edges Usually Outperform Adding Edges in Graph Contrastive Learning?

April 2025

·

1 Read

Proceedings of the AAAI Conference on Artificial Intelligence

Graph contrastive learning (GCL) has been widely used as an effective self-supervised learning method for graph representation learning. However, how to apply adequate and stable graph augmentation to generate proper views for contrastive learning remains an essential problem. Dropping edges is a primary augmentation in GCL, while adding edges is not a common method due to its unstable performance. To the best of our knowledge, there is no theoretical analysis of why dropping edges usually outperforms adding edges. To answer this question, we introduce a new metric, namely the Error Passing Rate (EPR), to quantify how well a graph fits the network. Inspired by the theoretical conclusions and the idea of positive-incentive noise, we propose a novel GCL algorithm, Error-PAssing-based Graph Contrastive Learning (EPAGCL), which uses both edge adding and edge dropping as its augmentations. To be specific, we generate views by adding and dropping edges based on the weights derived from EPR. Extensive experiments on various real-world datasets are conducted to validate the correctness of our theoretical analysis and the effectiveness of our proposed algorithm.
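
The view-generation step described in the abstract amounts to weight-guided edge dropping and edge adding. A minimal sketch of that step follows; the per-edge weights are taken as given (computing the paper's Error Passing Rate is not reproduced), and the sampling scheme here is an assumption.

```python
# Weight-guided view generation for graph contrastive learning: existing edges
# are dropped, and candidate edges are added, with probabilities proportional
# to supplied per-edge weights.
import numpy as np

def augment_edges(edges, drop_weights, candidate_edges, add_weights,
                  drop_rate=0.2, add_rate=0.1, rng=None):
    """edges, candidate_edges: integer arrays of shape (E, 2) and (C, 2).
    Larger drop_weights -> more likely removed; larger add_weights -> more
    likely inserted."""
    rng = rng or np.random.default_rng()
    p_drop = np.clip(drop_rate * drop_weights / drop_weights.mean(), 0.0, 1.0)
    keep_mask = rng.random(len(edges)) > p_drop
    p_add = np.clip(add_rate * add_weights / add_weights.mean(), 0.0, 1.0)
    add_mask = rng.random(len(candidate_edges)) < p_add
    return np.vstack([edges[keep_mask], candidate_edges[add_mask]])
```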


Towards Learnable Anchor for Deep Multi-View Clustering

April 2025

·

1 Read

Proceedings of the AAAI Conference on Artificial Intelligence

Deep multi-view clustering incorporating graph learning has presented tremendous potential. Most methods incur costly quadratic time complexity w.r.t. data size. Theoretically, anchor-based graph learning can alleviate this limitation, but related deep models mainly rely on manual discretization approaches to select anchors, which indicates that 1) the anchors are fixed during model training and 2) they may deviate from the true cluster distribution. Consequently, the unreliable anchors may corrupt clustering results. In this paper, we propose the Deep Multi-view Anchor Clustering (DMAC) model that performs clustering in linear time. Concretely, the initial anchors are perturbed by positive-incentive noise sampled from a Gaussian distribution, such that they can be optimized with a newly designed anchor learning loss, which promotes a clear relationship between samples and anchors. Afterwards, anchor graph convolution is devised to model the cluster structure formed by the anchors, and a mutual information maximization loss is built to provide cross-view clustering guidance. In this way, the learned anchors can better represent clusters. With the optimal anchors, the full sample graph is calculated to derive a discriminative embedding for clustering. Extensive experiments on several datasets demonstrate the superior performance and efficiency of DMAC compared to state-of-the-art competitors.
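
The linear-time claim rests on the anchor-graph construction: similarities are computed between n samples and m << n anchors instead of over an n x n sample graph. The sketch below shows noise-perturbed anchor initialization and a sample-to-anchor affinity matrix; the losses and anchor graph convolution are omitted, and the specific kernel and noise scale are assumptions.

```python
# Anchor-graph sketch: O(n*m) affinities instead of an O(n^2) sample graph.
import numpy as np

def init_anchors(X: np.ndarray, m: int, noise_scale: float = 0.1, seed: int = 0):
    """Pick m samples as initial anchors and perturb them with Gaussian noise."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    return X[idx] + noise_scale * rng.standard_normal((m, X.shape[1]))

def anchor_graph(X: np.ndarray, anchors: np.ndarray, sigma: float = 1.0):
    """Row-normalized sample-to-anchor affinity matrix of shape (n, m)."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)   # (n, m)
    Z = np.exp(-d2 / (2.0 * sigma ** 2))
    return Z / Z.sum(axis=1, keepdims=True)

X = np.random.default_rng(1).random((1000, 16))
A = init_anchors(X, m=32)
Z = anchor_graph(X, A)
print(Z.shape)  # (1000, 32): linear in the number of samples
```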


Enhance Vision-Language Alignment with Noise

April 2025

·

1 Citation

Proceedings of the AAAI Conference on Artificial Intelligence

With the advancement of pre-trained vision-language (VL) models, enhancing the alignment between visual and linguistic modalities in downstream tasks has emerged as a critical challenge. Different from existing fine-tuning methods that add extra modules to these two modalities, we investigate whether the frozen model can be fine-tuned by customized noise. Our approach is motivated by the scientific study of beneficial noise, namely Positive-incentive Noise (Pi-noise), which quantitatively analyzes the impact of noise. It therefore implies a new scheme to learn beneficial noise distribution that can be employed to fine-tune VL models. Focusing on few-shot classification tasks based on CLIP, we reformulate the inference process of CLIP and apply variational inference, demonstrating how to generate Pi-noise towards visual and linguistic modalities. Then, we propose Positive-incentive Noise Injector (PiNI), which can fine-tune CLIP via injecting noise into both visual and text encoders. Since the proposed method can learn the distribution of beneficial noise, we can obtain more diverse embeddings of vision and language to better align these two modalities for specific downstream tasks within limited computational resources. We evaluate different noise incorporation approaches and network architectures of PiNI. The evaluation across 11 datasets demonstrates its effectiveness.
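
Since the abstract describes learning a noise distribution and injecting samples into frozen encoders via variational inference, a small reparameterization-style sketch may help; the module layout, feature dimension, and injection point are assumptions for illustration, not the paper's PiNI code.

```python
# Learned-noise injection into frozen encoder features via the
# reparameterization trick: only the noise distribution's parameters train.
import torch
import torch.nn as nn

class NoiseInjector(nn.Module):
    """Learns a Gaussian noise distribution and adds samples to frozen features."""
    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        self.log_var = nn.Parameter(torch.zeros(dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        eps = torch.randn_like(feats)
        noise = self.mu + eps * torch.exp(0.5 * self.log_var)
        return feats + noise   # frozen features stay intact; only noise is learned

# Example: perturb frozen CLIP-style image embeddings of dimension 512.
injector = NoiseInjector(512)
image_feats = torch.randn(8, 512)   # stand-in for frozen encoder output
print(injector(image_feats).shape)
```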


AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

April 2025

·

1 Read

Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. Particularly, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model especially for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.


Bridging the Gap between Continuous and Informative Discrete Representations by Random Product Quantization

April 2025

·

2 Reads

Self-supervised learning has become a core technique in speech processing, but the high dimensionality of its representations makes discretization essential for improving efficiency. However, existing discretization methods still suffer from significant information loss, resulting in a notable performance gap compared to continuous representations. To overcome these limitations, we propose two quantization-based discretization methods: Product Quantization (PQ) and Random Product Quantization (RPQ). PQ partitions the original feature space into multiple subspaces and independently quantizes each sub-vector, producing a fused set of discrete units that retain diverse information from different subspaces, thus mitigating the loss associated with single-cluster quantization. RPQ further enhances representation diversity by randomly sampling a fixed proportion of feature dimensions multiple times to construct sub-vectors, thereby better capturing the variability in the data distribution. Theoretical analysis shows that RPQ reduces the correlation coefficient ρ (0 ≤ ρ ≤ 1) between sub-quantizers; its quantization error is lower-bounded by the product of ρ and ε_kms, where ε_kms denotes the quantization error of a single K-means quantizer. Experimental results on a combined dataset built from LibriSpeech and ML-SUPERB show that PQ and RPQ outperform standard K-means discretization, achieving relative improvements of 21.8% and 20.0% in WER on LibriSpeech, and 24.1% and 19.6% in CER on ML-SUPERB, respectively. Moreover, their performance is competitive with, and in some cases even surpasses, that of continuous SSL representations.
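
The PQ and RPQ constructions described in the abstract reduce to a simple recipe: split (or randomly sample) feature dimensions, then run an independent K-means quantizer on each sub-vector. The sketch below shows both variants using scikit-learn's K-means as the sub-quantizer; the feature matrix, codebook sizes, and sampling ratio are illustrative, and this is not the paper's implementation.

```python
# Product quantization (PQ) and random product quantization (RPQ) over a
# feature matrix, with K-means as the per-subspace quantizer.
import numpy as np
from sklearn.cluster import KMeans

def pq_codes(X: np.ndarray, n_subspaces: int = 4, n_clusters: int = 64):
    """Split the feature dimensions into contiguous subspaces and quantize each."""
    codes = []
    for sub in np.array_split(np.arange(X.shape[1]), n_subspaces):
        km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(X[:, sub])
        codes.append(km.predict(X[:, sub]))
    return np.stack(codes, axis=1)   # (n_frames, n_subspaces) discrete units

def rpq_codes(X: np.ndarray, n_quantizers: int = 4, dim_ratio: float = 0.5,
              n_clusters: int = 64, seed: int = 0):
    """Each sub-quantizer sees a random subset of the feature dimensions."""
    rng = np.random.default_rng(seed)
    codes = []
    for _ in range(n_quantizers):
        sub = rng.choice(X.shape[1], size=int(dim_ratio * X.shape[1]), replace=False)
        km = KMeans(n_clusters=n_clusters, n_init=4, random_state=0).fit(X[:, sub])
        codes.append(km.predict(X[:, sub]))
    return np.stack(codes, axis=1)

X = np.random.default_rng(0).random((500, 32))   # stand-in for SSL features
print(pq_codes(X).shape, rpq_codes(X).shape)
```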


Citations (24)


... Advancement in diffusion models [35,38,46] has significantly propelled recent progress in text-conditional image and video generation [8,10,26,34]. Current research is dedicated to improving the performance of these models in multiple ways, such as exploring high-quality large-scale text image datasets [41,42,58], upgrading the base model [38], and improving the controllability of the model [20,39,49,57]. Lately, the Stable Diffusion XL (SDXL) [35] has been widely pursued due to its relatively low computational cost and impressive capability of portrait generation. ...

Reference:

Adv-CPG: A Customized Portrait Generation Framework with Facial Adversarial Attacks
Enhance Vision-Language Alignment with Noise
  • Citing Article
  • April 2025

Proceedings of the AAAI Conference on Artificial Intelligence

... to ensure temporal alignment of actions, while the Diffusion Policy [5] addresses multi-modal action distributions to enhance robust manipulation. Driven by the development of foundation models [13,28], recent works [4,7,14,22,36,37] leverage the world knowledge of MLLMs to facilitate task decomposition and planning for robotic manipulation. Concurrently, other works [3,11,21,27] collect large-scale robotic manipulation demonstrations to train generalizable language-conditioned manipulation policy across different robots and tasks. ...

Depth Helps: Improving Pre-trained RGB-based Policy with Depth Information Injection
  • Citing Conference Paper
  • October 2024

... Similarly, methods like USER [18] expand negative sample pairs in mini-batches, improving the model's ability to distinguish between positive and negative samples. Hybrid architectures, including CLIP-Adapter [19], ModeX [20], and LLaVA [21], enhance fine-grained interactions through lightweight adapters or visual encoders, further advancing state-of-the-art results. However, current multimodal alignment approaches often rely on large amounts of data and simple strategies to achieve excellent performance on natural images, often neglecting their adaptability to downstream tasks, particularly in challenging domains such as medical imaging and remote sensing. ...

Modality-experts coordinated adaptation for large multimodal models
  • Citing Article
  • December 2024

Science China Information Sciences

... Adaptive Regulation (RQ4): With the continuous advancement of deep learning (Chen et al. 2024a; Yan et al. 2024b; Chen et al. 2024b; Ma et al. 2024; Huang et al. 2024; Yan et al. 2024a; Shu et al. 2024; Wu et al. 2024), the data-hungry paradigm of fully supervised learning has increasingly revealed certain limitations. Unlike the extensively studied fully-supervised setting, semi-supervised learning typically operates with a label rate ranging from 1% to 10%, making it particularly suitable for tasks like fine-grained action recognition that require high-quality data. ...

CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning
  • Citing Conference Paper
  • October 2024

... Recently, PPT [63] demonstrates the effectiveness of finetuning position encodings, and PointGST [32] proposes an adapter to extract spectral-domain features for efficient 3D PEFT. However, these adapter-based and prompt tuning methods [12,13,[47][48][49] introduce inference overhead and are specifically designed for Transformers [53], limiting adaptability to other architectures like Mamba [16] and U-Net [42]. In contrast, our MoST overcomes these limitations from the perspective of reparameterization, avoiding inference overhead while maintaining high generalizability. ...

Any2Point: Empowering Any-Modality Large Models for Efficient 3D Understanding
  • Citing Chapter
  • October 2024

... 3D stereoscopic vision advances precise object detection, gaining significant traction in computer vision through improved pixel density. The TS3D model, utilizing a Transformer-based architecture, significantly betters stereo feature depiction by integrating image correspondence data [19]. Furthermore, the SRDL framework pioneers by merging semantic and spatial information from stereo imagery and point clouds, thus propelling 3D object detection [20]. ...

Transformer-Based Stereo-Aware 3D Object Detection From Binocular Images

IEEE Transactions on Intelligent Transportation Systems

... In addition, the chameleon swarm algorithm was improved to increase the convergence of the Otsu's method in [31]. To improve the accuracy of the fuzzy C-means method, the gradient descent was used to solve an unconstrained fuzzy C-means algorithm and accordingly a novel deep fuzzy clustering model was obtained for clustering [39]. The second category is the Gaussian-mixture-model-based method that uses the EM methods [57][58][59][60][61][62][63][64] to calculate the mixture model's parameters. ...

Unsupervised Deep Embedding for Fuzzy Clustering
  • Citing Article
  • December 2024

IEEE Transactions on Fuzzy Systems

... III. MAIN STAGES OF SPECTRAL CLUSTERING: Spectral clustering has emerged as a sophisticated graph-cut methodology in machine learning, distinguished by its ability to capture intrinsic manifold structures through the eigenspectrum analysis of graph Laplacian matrices [1]. The mathematical foundation of spectral clustering integrates graph theory and manifold learning principles, enabling effective handling of complex, nonlinear data distributions while maintaining theoretical guarantees and computational efficiency in the clustering process. ...

A comprehensive survey of fast graph clustering

Vicinagearth

... Further, some studies use the two-step strategy to get final label matrices (Wang et al. 2021; Kang et al. 2020), whose solutions are suboptimal, as they are far away from the solutions obtained by directly solving the original problem. Therefore, (1) the rank constraint is introduced to ensure c connected components and avoid post-processing (Xia et al. 2023; Li et al. 2020; Fang et al. 2023; Xue et al. 2024); (2) some novel optimization strategies (Qiang et al. 2023; Nie et al. 2022) or Nonnegative Matrix Factorization (NMF) models (Yang et al. 2020) are proposed, which can obtain the final solution directly. The former has high requirements on parameters, and in some cases it may not be possible to obtain bipartite graphs with clear connected components. ...

Co-Clustering by Directly Solving Bipartite Spectral Graph Partitioning
  • Citing Article
  • September 2024

IEEE Transactions on Cybernetics

... Baselines. For the LIBERO-LONG benchmark, we implement the multi-task policy MTACT (Zhao et al., 2023), the general image-based pre-trained policy MVP (Xiao et al., 2022), the interaction-oriented representation learning method MPI (Zeng et al., 2024), large-scale pretrained vision-language-action policy OpenVLA (Kim et al., 2024), an image-editing based subgoal planner SuSIE (Black et al., 2024), and the end-to-end predictive inverse dynamics model Seer (Tian et al., 2025). For real-world experiments, we compare LBP with SuSIE, one of the most competitive methods against LBP in the LIBERO-LONG benchmark. ...

Learning Manipulation by Predicting Interaction
  • Citing Conference Paper
  • July 2024