Xiaohao Xu’s research while affiliated with Concordia University Ann Arbor and other places


Publications (42)


Figure 5: Real-world experimental results. Left: two single-stage tasks; right: a two-stage long-horizon task. Quantitative Results and Analysis. The real-world experimental results are presented in Fig. 5. For the single-stage tasks MoveContainer and PickEggplant, ViSA-Flow significantly outperforms the GR-MG model across 12 trials, while DP achieves a comparable success rate of 75.0% on the PickEggplant task. For the long-horizon task, which sequentially combines MoveContainer and PickEggplant, our method demonstrates superior performance, achieving 9/12 successful trials for each subtask and an overall success rate of 56.3% for the full sequence. By comparison, GR-MG and DP attain success rates of only 8.3% and 13.8%, respectively. Notably, DP suffers a significant performance drop when moving from single-stage to long-horizon tasks, whereas ViSA-Flow maintains robust and consistent performance.
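As a quick sanity check on these numbers (assuming, purely for illustration, that the two stages succeed independently), the reported full-sequence rate is roughly the product of the per-stage rates:

```python
# Hedged sanity check of the Figure 5 numbers: with 9/12 successes per stage and
# (assumed) independent stages, the expected full-sequence success rate is the
# product of the per-stage rates.
per_stage = 9 / 12              # 75.0% success on each subtask
full_sequence = per_stage ** 2  # two-stage long-horizon task
print(f"{per_stage:.1%}, {full_sequence * 100:.2f}%")  # 75.0%, 56.25% (reported as 56.3%)
```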
Comparative evaluation on the CALVIN ABC→D benchmark. Performance metrics include success rates for completing 1-5 consecutive tasks and average sequence length (Avg. Len). Methods in the top section use 100% of the training data, while methods in the bottom section use only 10%. The robot executed 1,000 test sequences with five tasks each. Bold indicates best performance.
Ablation study evaluating the contribution of key components in ViSA-Flow.
ViSA-Flow: Accelerating Robot Skill Learning via Large-Scale Video Semantic Action Flow
  • Preprint
  • File available

May 2025 · 4 Reads

Changhe Chen · Xiaohao Xu · [...] · Olov Andersson

One of the central challenges preventing robots from acquiring complex manipulation skills is the prohibitive cost of collecting large-scale robot demonstrations. In contrast, humans are able to learn efficiently by watching others interact with their environment. To bridge this gap, we introduce semantic action flow as a core intermediate representation capturing the essential spatio-temporal manipulator-object interactions, invariant to superficial visual differences. We present ViSA-Flow, a framework that learns this representation self-supervised from unlabeled large-scale video data. First, a generative model is pre-trained on semantic action flows automatically extracted from large-scale human-object interaction video data, learning a robust prior over manipulation structure. Second, this prior is efficiently adapted to a target robot by fine-tuning on a small set of robot demonstrations processed through the same semantic abstraction pipeline. We demonstrate through extensive experiments on the CALVIN benchmark and real-world tasks that ViSA-Flow achieves state-of-the-art performance, particularly in low-data regimes, outperforming prior methods by effectively transferring knowledge from human video observation to robotic execution. Videos are available at https://visaflow-web.github.io/ViSAFLOW.
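A minimal sketch of the two-stage recipe described above. All names below (SemanticActionFlow, extract_semantic_action_flow, Policy) are placeholders standing in for the paper's actual components, not the ViSA-Flow implementation:

```python
# Illustrative sketch only: the real ViSA-Flow pipeline, model, and data formats
# are not reproduced here; all names are hypothetical placeholders.
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

@dataclass
class SemanticActionFlow:
    """Stand-in for the spatio-temporal manipulator-object interaction features."""
    frames: List[object]

def extract_semantic_action_flow(video: Iterable[object]) -> SemanticActionFlow:
    # In the paper this is an automatic extraction pipeline over raw video;
    # here it simply wraps the frames.
    return SemanticActionFlow(frames=list(video))

class Policy:
    """Toy policy that only counts updates, standing in for the generative model."""
    def __init__(self) -> None:
        self.num_updates = 0

    def update(self, flow: SemanticActionFlow, actions: Optional[List[object]] = None) -> None:
        # Stage 1 passes actions=None (video-only prior); Stage 2 passes robot actions.
        self.num_updates += 1

def pretrain(policy: Policy, human_videos: Iterable[Iterable[object]]) -> None:
    # Stage 1: learn a prior over manipulation structure from human-object interaction videos.
    for video in human_videos:
        policy.update(extract_semantic_action_flow(video))

def finetune(policy: Policy, robot_demos: Iterable[Tuple[Iterable[object], List[object]]]) -> None:
    # Stage 2: adapt the prior with a small set of robot demonstrations processed
    # through the same semantic abstraction.
    for video, actions in robot_demos:
        policy.update(extract_semantic_action_flow(video), actions)
```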


Robust Latent Matters: Boosting Image Generation with Sampling Error

March 2025 · 1 Read

Recent image generation schemes typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. Though the performance of the tokenizer plays an essential role in successful generation, its current evaluation metrics (e.g., rFID) fail to precisely assess the tokenizer and correlate its performance to generation quality (e.g., gFID). In this paper, we comprehensively analyze the reasons for the discrepancy between reconstruction and generation quality in a discrete latent space and, from this analysis, propose a novel plug-and-play tokenizer training scheme to facilitate latent space construction. Specifically, a latent perturbation approach is proposed to simulate sampling noises, i.e., the unexpected tokens sampled from the generative process. With the latent perturbation, we further propose (1) a novel tokenizer evaluation metric, i.e., pFID, which successfully correlates tokenizer performance to generation quality, and (2) a plug-and-play tokenizer training scheme, which significantly enhances the robustness of the tokenizer, thus boosting generation quality and convergence speed. Extensive benchmarking is conducted with 11 advanced discrete image tokenizers and 2 autoregressive generation models to validate our approach. The tokenizer trained with our proposed latent perturbation achieves a notable 1.60 gFID with classifier-free guidance (CFG) and 3.45 gFID without CFG with a ~400M generator. Code: https://github.com/lxa9867/ImageFolder.
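A minimal sketch of the latent-perturbation idea as described in the abstract: replace a fraction of the discrete token indices with random codebook entries to mimic sampling errors from the generator. The function name and signature are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def perturb_token_indices(indices: np.ndarray, codebook_size: int,
                          p: float = 0.1, seed: int | None = None) -> np.ndarray:
    """Replace a fraction `p` of discrete token indices with random codebook
    entries, simulating the 'unexpected tokens' sampled by the generator."""
    rng = np.random.default_rng(seed)
    mask = rng.random(indices.shape) < p
    noise = rng.integers(0, codebook_size, size=indices.shape)
    return np.where(mask, noise, indices)

# Example: perturb a 16x16 grid of tokens drawn from a 1024-entry codebook.
tokens = np.random.default_rng(0).integers(0, 1024, size=(16, 16))
noisy_tokens = perturb_token_indices(tokens, codebook_size=1024, p=0.1, seed=0)
```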


Towards Ambiguity-Free Spatial Foundation Model: Rethinking and Decoupling Depth Ambiguity

March 2025 · 3 Reads

Depth ambiguity is a fundamental challenge in spatial scene understanding, especially in transparent scenes where single-depth estimates fail to capture full 3D structure. Existing models, limited to deterministic predictions, overlook real-world multi-layer depth. To address this, we introduce a paradigm shift from single-prediction to multi-hypothesis spatial foundation models. We first present MD-3k, a benchmark exposing depth biases in expert and foundational models through multi-layer spatial relationship labels and new metrics. To resolve depth ambiguity, we propose Laplacian Visual Prompting (LVP), a training-free spectral prompting technique that extracts hidden depth from pre-trained models via Laplacian-transformed RGB inputs. By integrating LVP-inferred depth with standard RGB-based estimates, our approach elicits multi-layer depth without model retraining. Extensive experiments validate the effectiveness of LVP in zero-shot multi-layer depth estimation, unlocking more robust and comprehensive geometry-conditioned visual generation, 3D-grounded spatial reasoning, and temporally consistent video-level depth inference. Our benchmark and code will be available at https://github.com/Xiaohao-Xu/Ambiguity-in-Space.
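A rough sketch of the Laplacian-transformed input described above: apply a per-channel Laplacian to the RGB image and rescale it before feeding it to a frozen depth model. The exact transform used in the paper may differ; this is an assumption-laden illustration:

```python
import numpy as np
from scipy.ndimage import laplace

def laplacian_visual_prompt(rgb: np.ndarray) -> np.ndarray:
    """Apply a per-channel Laplacian to an HxWx3 RGB image and rescale to [0, 1].
    Illustrative stand-in for Laplacian Visual Prompting, not the paper's code."""
    lap = np.stack(
        [laplace(rgb[..., c].astype(np.float32)) for c in range(rgb.shape[-1])],
        axis=-1,
    )
    lo, hi = lap.min(), lap.max()
    return (lap - lo) / (hi - lo + 1e-8)

# Hypothetical usage: prompt = laplacian_visual_prompt(image); depth = depth_model(prompt)
```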


Towards Visual Discrimination and Reasoning of Real-World Physical Dynamics: Physics-Grounded Anomaly Detection

March 2025 · 8 Reads

Humans detect real-world object anomalies by perceiving, interacting, and reasoning based on object-conditioned physical knowledge. The long-term goal of Industrial Anomaly Detection (IAD) is to enable machines to autonomously replicate this skill. However, current IAD algorithms are largely developed and tested on static, semantically simple datasets, which diverge from real-world scenarios where physical understanding and reasoning are essential. To bridge this gap, we introduce the Physics Anomaly Detection (Phys-AD) dataset, the first large-scale, real-world, physics-grounded video dataset for industrial anomaly detection. Collected using a real robot arm and motor, Phys-AD provides a diverse set of dynamic, semantically rich scenarios. The dataset includes more than 6400 videos across 22 real-world object categories, interacting with robot arms and motors, and exhibits 47 types of anomalies. Anomaly detection in Phys-AD requires visual reasoning, combining both physical knowledge and video content to determine object abnormality. We benchmark state-of-the-art anomaly detection methods under three settings: unsupervised AD, weakly-supervised AD, and video-understanding AD, highlighting their limitations in handling physics-grounded anomalies. Additionally, we introduce the Physics Anomaly Explanation (PAEval) metric, designed to assess the ability of visual-language foundation models to not only detect anomalies but also provide accurate explanations for their underlying physical causes. Our dataset and benchmark will be publicly available.


Fig. 1: Conceptual comparison of creature/robot evolution paradigms. (a) Evolution-driven emergence of creatures in nature through selective pressures; (b) Traditional human-engineered robot design guided by intuition and expertise; (c) AI-driven robot design, where large language models (LLMs) act as a 'natural selector' in the evolution of soft robotics, e.g., the modular robot shown in the figure. This shift highlights the transition from biological evolution to human-driven engineering and finally to AI-mediated selection.
Fig. 6: Error distribution across difficulty levels for LLMs.
Large Language Models as Natural Selector for Embodied Soft Robot Design

March 2025 · 11 Reads

Designing soft robots is a complex and iterative process that demands cross-disciplinary expertise in materials science, mechanics, and control, often relying on intuition and extensive experimentation. While Large Language Models (LLMs) have demonstrated impressive reasoning abilities, their capacity to learn and apply embodied design principles, crucial for creating functional robotic systems, remains largely unexplored. This paper introduces RoboCrafter-QA, a novel benchmark to evaluate whether LLMs can learn representations of soft robot designs that effectively bridge the gap between high-level task descriptions and low-level morphological and material choices. RoboCrafter-QA leverages the EvoGym simulator to generate a diverse set of soft robot design challenges, spanning robotic locomotion, manipulation, and balancing tasks. Our experiments with state-of-the-art multi-modal LLMs reveal that while these models exhibit promising capabilities in learning design representations, they struggle with fine-grained distinctions between designs with subtle performance differences. We further demonstrate the practical utility of LLMs for robot design initialization. Our code and benchmark will be available to encourage the community to foster this exciting research direction.


Personalizing Vision-Language Models With Hybrid Prompts for Zero-Shot Anomaly Detection

February 2025 · 13 Reads · 3 Citations

IEEE Transactions on Cybernetics

Zero-shot anomaly detection (ZSAD) aims to develop a foundational model capable of detecting anomalies across arbitrary categories without relying on reference images. However, since “abnormality” is inherently defined in relation to “normality” within specific categories, detecting anomalies without reference images describing the corresponding normal context remains a significant challenge. As an alternative to reference images, this study explores the use of widely available product standards to characterize normal contexts and potential abnormal states. Specifically, this study introduces AnomalyVLM, which leverages generalized pretrained vision-language models (VLMs) to interpret these standards and detect anomalies. Given the current limitations of VLMs in comprehending complex textual information, AnomalyVLM generates hybrid prompts—comprising prompts for abnormal regions, symbolic rules, and region numbers—from the standards to facilitate more effective understanding. These hybrid prompts are incorporated into various stages of the anomaly detection process within the selected VLMs, including an anomaly region generator and an anomaly region refiner. By utilizing hybrid prompts, VLMs are personalized as anomaly detectors for specific categories, offering users flexibility and control in detecting anomalies across novel categories without the need for training data. Experimental results on four public industrial anomaly detection datasets, as well as a practical automotive part inspection task, highlight the superior performance and enhanced generalization capability of AnomalyVLM, especially in texture categories. An online demo of AnomalyVLM is available at https://github.com/caoyunkang/Segment-Any-Anomaly.
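A minimal structural sketch of how the hybrid prompts might be organized and passed through a two-stage generator/refiner pipeline. The field names and stage interfaces below are assumptions for illustration, not AnomalyVLM's actual API:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class HybridPrompt:
    """Prompts derived from a product standard (illustrative fields only)."""
    abnormal_regions: List[str] = field(default_factory=list)  # e.g. "scratch on the housing"
    symbolic_rules: List[str] = field(default_factory=list)    # e.g. "exactly four screws present"
    region_count: Optional[int] = None                          # expected number of regions

def detect_anomalies(image, prompt: HybridPrompt,
                     region_generator: Callable, region_refiner: Callable):
    # Stage 1: propose candidate anomaly regions conditioned on the text prompts.
    candidates = region_generator(image, prompt.abnormal_regions)
    # Stage 2: refine the candidates using the symbolic rules and expected region count.
    return region_refiner(candidates, prompt.symbolic_rules, prompt.region_count)
```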


Scalable Benchmarking and Robust Learning for Noise-Free Ego-Motion and 3D Reconstruction from Noisy Video

January 2025 · 19 Reads

We aim to redefine robust ego-motion estimation and photorealistic 3D reconstruction by addressing a critical limitation: the reliance on noise-free data in existing models. While such sanitized conditions simplify evaluation, they fail to capture the unpredictable, noisy complexities of real-world environments. Dynamic motion, sensor imperfections, and synchronization perturbations lead to sharp performance declines when these models are deployed in practice, revealing an urgent need for frameworks that embrace and excel under real-world noise. To bridge this gap, we tackle three core challenges: scalable data generation, comprehensive benchmarking, and model robustness enhancement. First, we introduce a scalable noisy data synthesis pipeline that generates diverse datasets simulating complex motion, sensor imperfections, and synchronization errors. Second, we leverage this pipeline to create Robust-Ego3D, a benchmark rigorously designed to expose noise-induced performance degradation, highlighting the limitations of current learning-based methods in ego-motion accuracy and 3D reconstruction quality. Third, we propose Correspondence-guided Gaussian Splatting (CorrGS), a novel test-time adaptation method that progressively refines an internal clean 3D representation by aligning noisy observations with rendered RGB-D frames from the clean 3D map, enhancing geometric alignment and appearance restoration through visual correspondence. Extensive experiments on synthetic and real-world data demonstrate that CorrGS consistently outperforms prior state-of-the-art methods, particularly in scenarios involving rapid motion and dynamic illumination.
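A high-level sketch of the test-time adaptation loop implied by the description: render from the current clean map, find correspondences with the noisy observation, and update the map. All function and attribute names are placeholders, not the CorrGS code:

```python
from typing import Callable

def test_time_adapt(clean_map, noisy_frames, render: Callable,
                    find_correspondences: Callable, update_map: Callable,
                    steps_per_frame: int = 10):
    """Illustrative CorrGS-style loop: progressively refine an internal clean 3D
    representation by aligning each noisy observation with a rendered RGB-D frame."""
    for frame in noisy_frames:
        for _ in range(steps_per_frame):
            rendered_rgbd = render(clean_map, frame.pose_estimate)  # render from the clean map
            matches = find_correspondences(frame, rendered_rgbd)    # visual correspondence
            clean_map = update_map(clean_map, matches)              # geometry + appearance update
    return clean_map
```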


Multi-Sensor Object Anomaly Detection: Unifying Appearance, Geometry, and Internal Properties

December 2024 · 9 Reads

Object anomaly detection is essential for industrial quality inspection, yet traditional single-sensor methods face critical limitations. They fail to capture the wide range of anomaly types, as single sensors are often constrained to either external appearance, geometric structure, or internal properties. To overcome these challenges, we introduce MulSen-AD, the first high-resolution, multi-sensor anomaly detection dataset tailored for industrial applications. MulSen-AD unifies data from RGB cameras, laser scanners, and lock-in infrared thermography, effectively capturing external appearance, geometric deformations, and internal defects. The dataset spans 15 industrial products with diverse, real-world anomalies. We also present MulSen-AD Bench, a benchmark designed to evaluate multi-sensor methods, and propose MulSen-TripleAD, a decision-level fusion algorithm that integrates these three modalities for robust, unsupervised object anomaly detection. Our experiments demonstrate that multi-sensor fusion substantially outperforms single-sensor approaches, achieving 96.1% AUROC in object-level detection accuracy. These results highlight the importance of integrating multi-sensor data for comprehensive industrial anomaly detection.
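A toy example of decision-level fusion across the three modalities (RGB appearance, geometry, infrared). The abstract does not specify MulSen-TripleAD's fusion rule, so a simple max-over-sensors rule is shown purely for illustration:

```python
def fuse_anomaly_scores(rgb_score: float, geometry_score: float, infrared_score: float,
                        threshold: float = 0.5) -> tuple[float, bool]:
    """Decision-level fusion sketch: take the strongest per-sensor anomaly score
    (each assumed normalized to [0, 1]) and threshold it."""
    fused = max(rgb_score, geometry_score, infrared_score)
    return fused, fused > threshold

# Example: an internal defect visible only to the infrared sensor still triggers detection.
score, is_anomalous = fuse_anomaly_scores(0.1, 0.2, 0.9)  # (0.9, True)
```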


MAC-Ego3D: Multi-Agent Gaussian Consensus for Real-Time Collaborative Ego-Motion and Photorealistic 3D Reconstruction

December 2024 · 9 Reads

Real-time multi-agent collaboration for ego-motion estimation and high-fidelity 3D reconstruction is vital for scalable spatial intelligence. However, traditional methods produce sparse, low-detail maps, while recent dense mapping approaches struggle with high latency. To overcome these challenges, we present MAC-Ego3D, a novel framework for real-time collaborative photorealistic 3D reconstruction via Multi-Agent Gaussian Consensus. MAC-Ego3D enables agents to independently construct, align, and iteratively refine local maps using a unified Gaussian splat representation. Through Intra-Agent Gaussian Consensus, it enforces spatial coherence among neighboring Gaussian splats within an agent. For global alignment, parallelized Inter-Agent Gaussian Consensus, which asynchronously aligns and optimizes local maps by regularizing multi-agent Gaussian splats, seamlessly integrates them into a high-fidelity 3D model. Leveraging Gaussian primitives, MAC-Ego3D supports efficient RGB-D rendering, enabling rapid inter-agent Gaussian association and alignment. MAC-Ego3D bridges local precision and global coherence, delivering higher efficiency, largely reducing localization error, and improving mapping fidelity. It establishes a new SOTA on synthetic and real-world benchmarks, achieving a 15x increase in inference speed, order-of-magnitude reductions in ego-motion estimation error for partial cases, and RGB PSNR gains of 4 to 10 dB. Our code will be made publicly available at https://github.com/Xiaohao-Xu/MAC-Ego3D .


Holmes-VAU: Towards Long-term Video Anomaly Understanding at Any Granularity

December 2024 · 7 Reads

How can we enable models to comprehend video anomalies occurring over varying temporal scales and contexts? Traditional Video Anomaly Understanding (VAU) methods focus on frame-level anomaly prediction, often missing the interpretability of complex and diverse real-world anomalies. Recent multimodal approaches leverage visual and textual data but lack hierarchical annotations that capture both short-term and long-term anomalies. To address this challenge, we introduce HIVAU-70k, a large-scale benchmark for hierarchical video anomaly understanding across any granularity. We develop a semi-automated annotation engine that efficiently scales high-quality annotations by combining manual video segmentation with recursive free-text annotation using large language models (LLMs). This results in over 70,000 multi-granular annotations organized at clip-level, event-level, and video-level segments. For efficient anomaly detection in long videos, we propose the Anomaly-focused Temporal Sampler (ATS). ATS integrates an anomaly scorer with a density-aware sampler to adaptively select frames based on anomaly scores, ensuring that the multimodal LLM concentrates on anomaly-rich regions, which significantly enhances both efficiency and accuracy. Extensive experiments demonstrate that our hierarchical instruction data markedly improves anomaly comprehension. The integrated ATS and visual-language model outperform traditional methods in processing long videos. Our benchmark and model are publicly available at https://github.com/pipixin321/HolmesVAU.
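A small sketch in the spirit of the Anomaly-focused Temporal Sampler: frames are drawn with probability proportional to their anomaly scores so that anomaly-rich regions are over-represented. The paper's density-aware sampler is more sophisticated; this is only an assumed simplification:

```python
import numpy as np

def anomaly_focused_sample(anomaly_scores: np.ndarray, num_frames: int,
                           seed: int | None = None) -> np.ndarray:
    """Sample frame indices with probability proportional to anomaly scores
    (illustrative only; Holmes-VAU couples a scorer with a density-aware sampler)."""
    rng = np.random.default_rng(seed)
    probs = anomaly_scores / anomaly_scores.sum()
    picks = rng.choice(len(anomaly_scores), size=num_frames, replace=False, p=probs)
    return np.sort(picks)

# Example: in a 10,000-frame video where frames 4000-4199 score high, most of the
# 64 sampled frames come from that region.
scores = np.full(10_000, 0.01)
scores[4_000:4_200] = 1.0
selected = anomaly_focused_sample(scores, num_frames=64, seed=0)
```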


Citations (15)


... The investigation of unsupervised AD methods begins in the image field [13], [14], which can be divided into three categories, including flow-based methods [15], [16], knowledge-distillation-based methods [17], [18] and memory-bank-based methods [19], [20], [21]. Following image AD, point cloud AD methods also develop knowledge-distillation-based methods and memory-bank-based methods. ...

Reference:

Boosting Global-Local Feature Matching via Anomaly Synthesis for Multi-Class Point Cloud Anomaly Detection
Personalizing Vision-Language Models With Hybrid Prompts for Zero-Shot Anomaly Detection
  • Citing Article
  • February 2025

IEEE Transactions on Cybernetics

... The authors design 27 types of corruptions covering weather conditions, sensor noises, motion distortions, object deformations, and sensor misalignment, and apply them to available datasets to create three corruption robustness benchmarks. Further, Li et al., [39] provide a systematic evaluation framework for assessing the robustness of perception module against diverse types of perturbations. Similarly, Li et al., [40] propose CODA dataset by providing a realistic and diverse collection of corner cases for evaluating object detections in autonomous driving. ...

R²-Bench: Benchmarking the Robustness of Referring Perception Models Under Perturbations
  • Citing Chapter
  • October 2024

... However, Myriad's reliance on carefully curated vision expert models introduces complexity, potentially limiting scalability in diverse industrial applications. Meanwhile, Echo [31] introduced a collaborative framework where specialized MLLMs work together, enhancing detection through system-level synergy, though it avoids full fine-tuning for IAD tasks. In contrast, LogicAD [32] and LogiCode [33] approached anomaly detection through logical reasoning, offering a unique perspective that excels when anomalies are defined in rational terms. ...

LogiCode: An LLM-Driven Framework for Logical Anomaly Detection
  • Citing Article
  • January 2024

IEEE Transactions on Automation Science and Engineering

... This makes synthesizing locally continuous and smooth anomaly data particularly challenging. Although Anomaly-ShapeNet [43] acquires point cloud anomaly data by editing the ShapeNet [9] dataset, such editing cannot be applied to discrete points and can only be done on 3D models that are often not available. Therefore, to obtain a substantial and diverse amount of anomaly data for the point cloud feature extractor's self-supervised learning, we propose an automatic anomaly synthesis pipeline that stretches along the normal direction at any position of point clouds to create protrusion or depression defects. ...

Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network
  • Citing Conference Paper
  • June 2024

... We identify two key challenges posed by audio signals. The first is feature confusion due to overlapping audio signals, where simultaneous sounds combine across time. Figure 2 of the citing paper illustrates three methods for achieving precise audio-visual alignment: (a) Audio Semantic Decomposition [24]: models the multi-source semantic space as a Cartesian product of single-source subspaces, employing product quantization and a shared codebook to decompose audio features into compact semantic tokens; (b) Audio Separation [6]: devises a branch to decode audio-visual fused features into separated audio signals; (c) Audio Semantic Derivation and Elimination (Ours): derives distinct semantic representations for each source from a mixed audio signal by exploring inter-class relationships. ...

QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition
  • Citing Conference Paper
  • June 2024

... However, the overall performance of knowledge-distillation-based methods still has significant shortcomings. Memory-bank-based methods [25], [26], [27], [28] store the extracted normal data features in a bank and determine anomalies by computing the distance between the extracted features and the features of the bank during testing. The difference between these methods lies in the way they extract features: BTF [25] [27] projects point clouds into multi-view images and extracts features using a pre-trained image encoder; Shape-guided [28] considers PointNet [29] and NIF [30] to learn local representation of surface geometry. ...

Complementary pseudo multimodal feature for point cloud anomaly detection
  • Citing Article
  • July 2024

Pattern Recognition

... When the example is not annotated, producing a pseudo label through confidence score thresholding is a commonly used approach in traditional methods [20,37,49,53]. Between the WTAL and TAL settings, for point-supervised (PTAL) [22,43,52] and semi-supervised (SSTAL) [41,42,55] settings, several works focus on learning from the generated pseudo labels. For PTAL, based on the point annotation distribution, TSP-Net [43] proposes a center score learning method to dynamically predict the alignment from the saliency information. ...

HR-Pro: Point-Supervised Temporal Action Localization via Hierarchical Reliability Propagation
  • Citing Article
  • March 2024

Proceedings of the AAAI Conference on Artificial Intelligence

... Anomaly detection methods aim to accurately pinpoint irregular patterns that deviate from normal patterns in given scenarios/categories. Existing anomaly detection methods can be categorized based on the combinations of training data [1], [2] into semi-supervised [18], unsupervised [7], [19], [20], and few-shot methods [21], [22]. 1) Semi-supervised anomaly detection methods require both normal and abnormal samples from target categories for training [23], [24]. As abnormal samples are typically fewer than normal ones, these methods focus on modeling the normal data distribution, using abnormal samples to refine the decision boundary [25]. ...

BiaS: Incorporating Biased Knowledge to Boost Unsupervised Image Anomaly Localization
  • Citing Article
  • April 2024

IEEE Transactions on Systems Man and Cybernetics Systems

... Transformer (Vaswani et al., 2017) has been widely applied and achieved great success in many computer vision tasks, such as object detection and tracking (Carion et al., 2020;Zhu et al., 2020;Wu et al., 2023;Zhang et al., 2024), image segmentation (Zheng et al., 2021;Cheng et al., 2021), and image generation (Liu et al., 2024a). Since DETR (Carion et al., 2020) introduces a new query-based paradigm, the latest works (Botach et al., 2022;Wu et al., 2022b;Li et al., 2023b) prefer to apply the DETR-like framework to the RVOS task. Specifically, they utilize Transformer structures to interact visual images with linguistic data and thereby are able to attain SOTA performance in accuracy and efficiency. ...

Robust Referring Video Object Segmentation with Cyclic Structural Consensus
  • Citing Conference Paper
  • October 2023

... On the other hand, recent work [6,7,25,86] leverages semantic information to mitigate information loss in high-compression scenarios. In this paper, we provide a comprehensive analysis of image tokenizers from the perspective of perturbation robustness [5,33,34,38,39,78]. ...

Towards Noise-Tolerant Speech-Referring Video Object Segmentation: Bridging Speech and Text
  • Citing Conference Paper
  • January 2023