Xiaolong Wang’s research while affiliated with Ursinus College and other places


Publications (198)


AMO: Adaptive Motion Optimization for Hyper-Dexterous Humanoid Whole-Body Control
  • Preprint

May 2025 · 2 Reads

Jialong Li · Xuxin Cheng · Tianshu Huang · [...] · Xiaolong Wang

Humanoid robots derive much of their dexterity from hyper-dexterous whole-body movements, enabling tasks that require a large operational workspace, such as picking objects off the ground. However, achieving these capabilities on real humanoids remains challenging due to their high degrees of freedom (DoF) and nonlinear dynamics. We propose Adaptive Motion Optimization (AMO), a framework that integrates sim-to-real reinforcement learning (RL) with trajectory optimization for real-time, adaptive whole-body control. To mitigate distribution bias in motion imitation RL, we construct a hybrid AMO dataset and train a network capable of robust, on-demand adaptation to potentially out-of-distribution (OOD) commands. We validate AMO in simulation and on a 29-DoF Unitree G1 humanoid robot, demonstrating superior stability and an expanded workspace compared to strong baselines. Finally, we show that AMO's consistent performance supports autonomous task execution via imitation learning, underscoring the system's versatility and robustness.


Figure 5. Video frames comparing TTT-MLP against Gated DeltaNet and sliding-window attention, the leading baselines in our human evaluation. TTT-MLP demonstrates better scene consistency by preserving details across transitions and better motion naturalness by accurately depicting complex actions.
Figure 7. Artifacts in videos generated by TTT-MLP. Temporal consistency: Objects sometimes morph at the boundaries of 3-second segments, potentially because the diffusion model samples from different modes across the segments. Motion naturalness: Objects sometimes float unnaturally because gravitational effects are not properly modeled. Aesthetics: Lighting changes do not consistently align with actions unless explicitly prompted. Complex camera movements, such as parallax, are sometimes depicted inaccurately.
One-Minute Video Generation with Test-Time Training
  • Preprint
  • File available

April 2025 · 14 Reads

Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks and are therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit
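To make the mechanism concrete, here is a minimal toy sketch (hypothetical class name and inner loss, not the paper's implementation) of a TTT layer whose hidden state is itself a small MLP, updated by one gradient step per token on a self-supervised reconstruction objective before producing that token's output:

# Toy sketch of a Test-Time Training (TTT) layer: the hidden "state" is a small
# MLP whose weights take one gradient step per token on a self-supervised
# reconstruction loss. All names and details here are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTTTLayer(nn.Module):
    def __init__(self, dim: int, hidden: int = 64, inner_lr: float = 0.1):
        super().__init__()
        # Learned projections defining the self-supervised inner task.
        self.to_key = nn.Linear(dim, dim, bias=False)    # corrupted view fed to the state MLP
        self.to_value = nn.Linear(dim, dim, bias=False)  # reconstruction target
        self.to_query = nn.Linear(dim, dim, bias=False)  # view used to read the output
        self.hidden, self.inner_lr = hidden, inner_lr

    def init_state(self, dim, device):
        # The hidden state is itself a tiny 2-layer MLP, stored as raw tensors.
        w1 = torch.empty(dim, self.hidden, device=device, requires_grad=True)
        w2 = torch.empty(self.hidden, dim, device=device, requires_grad=True)
        nn.init.normal_(w1, std=0.02); nn.init.normal_(w2, std=0.02)
        return [w1, w2]

    def forward(self, x):  # x: (seq_len, dim); batch dimension omitted for clarity
        state = self.init_state(x.shape[-1], x.device)
        outputs = []
        for t in range(x.shape[0]):
            k, v, q = self.to_key(x[t]), self.to_value(x[t]), self.to_query(x[t])
            # Inner-loop update: make the state MLP reconstruct v from k.
            pred = F.gelu(k @ state[0]) @ state[1]
            grads = torch.autograd.grad(F.mse_loss(pred, v), state)
            state = [(w - self.inner_lr * g).detach().requires_grad_(True)
                     for w, g in zip(state, grads)]
            # Output token: query the freshly updated state MLP.
            outputs.append(F.gelu(q @ state[0]) @ state[1])
        return torch.stack(outputs)

layer = ToyTTTLayer(dim=32)
print(layer(torch.randn(16, 32)).shape)  # torch.Size([16, 32])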


Visual Acoustic Fields

March 2025 · 1 Read

Objects produce different sounds when hit, and humans can intuitively infer how an object might sound based on its appearance and material properties. Inspired by this intuition, we propose Visual Acoustic Fields, a framework that bridges hitting sounds and visual signals within a 3D space using 3D Gaussian Splatting (3DGS). Our approach features two key modules: sound generation and sound localization. The sound generation module leverages a conditional diffusion model, which takes multiscale features rendered from a feature-augmented 3DGS to generate realistic hitting sounds. Meanwhile, the sound localization module enables querying the 3D scene, represented by the feature-augmented 3DGS, to localize hitting positions based on the sound sources. To support this framework, we introduce a novel pipeline for collecting scene-level visual-sound sample pairs, achieving alignment between captured images, impact locations, and corresponding sounds. To the best of our knowledge, this is the first dataset to connect visual and acoustic signals in a 3D context. Extensive experiments on our dataset demonstrate the effectiveness of Visual Acoustic Fields in generating plausible impact sounds and accurately localizing impact sources. Our project page is at https://yuelei0428.github.io/projects/Visual-Acoustic-Fields/.
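As a minimal illustration of the sound-localization direction described above (hypothetical shapes and pre-computed embeddings; not the authors' pipeline), localization can be framed as matching an embedding of the query sound against per-point features queried from the scene representation:

# Minimal sketch of sound localization as feature matching: compare an
# embedding of a query hitting sound against per-point features queried from
# the 3D scene representation and return the best-matching location.
# Shapes, encoders, and features are hypothetical placeholders.
import torch
import torch.nn.functional as F

num_points, feat_dim = 5000, 128
point_positions = torch.rand(num_points, 3)          # candidate hit locations in the scene
point_features = torch.randn(num_points, feat_dim)   # features queried from the scene representation
sound_embedding = torch.randn(feat_dim)              # embedding of the query hitting sound

similarity = F.cosine_similarity(point_features, sound_embedding.unsqueeze(0), dim=-1)
best = similarity.argmax()
print("predicted hit location:", point_positions[best])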


M3: 3D-Spatial MultiModal Memory

March 2025 · 2 Reads

We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rendering feature representations across granularities, encompassing a wide range of knowledge. In our exploration, we identify two key challenges in previous works on feature splatting: (1) computational constraints in storing high-dimensional features for each Gaussian primitive, and (2) misalignment or information loss between distilled features and foundation model features. To address these challenges, we propose M3 with key components of principal scene components and Gaussian memory attention, enabling efficient training and inference. To validate M3, we conduct comprehensive quantitative evaluations of feature similarity and downstream tasks, as well as qualitative visualizations to highlight the pixel trace of Gaussian memory attention. Our approach encompasses a diverse range of foundation models, including vision-language models (VLMs), perception models, and large multimodal and language models (LMMs/LLMs). Furthermore, to demonstrate real-world applicability, we deploy M3's feature field in indoor scenes on a quadruped robot. Notably, we claim that M3 is the first work to address the core compression challenges in 3D feature distillation.
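A rough sketch of the compression idea, with hypothetical shapes and random tensors standing in for learned quantities (not the released code): each Gaussian stores only a low-dimensional code, and attention over a small shared bank of high-dimensional components reconstructs a full foundation-model feature on demand.

# Illustrative sketch of Gaussian memory attention over principal scene
# components: per-Gaussian low-dimensional codes are cheap to store, and a
# shared bank of high-dimensional components is queried by attention to
# recover full-dimensional features. All shapes and tensors are placeholders.
import torch
import torch.nn.functional as F

num_gaussians, low_dim, high_dim, num_components = 10_000, 16, 768, 256

codes = torch.randn(num_gaussians, low_dim)               # stored per Gaussian
component_keys = torch.randn(num_components, low_dim)     # shared, small
component_values = torch.randn(num_components, high_dim)  # shared high-dim components

def gaussian_memory_attention(codes, keys, values):
    # Soft attention from each Gaussian's code to the shared bank reconstructs
    # a high-dimensional feature without storing one per Gaussian.
    attn = F.softmax(codes @ keys.t() / keys.shape[-1] ** 0.5, dim=-1)
    return attn @ values  # (num_gaussians, high_dim)

features = gaussian_memory_attention(codes, component_keys, component_values)
print(features.shape)  # torch.Size([10000, 768]); only 16 floats stored per Gaussian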


Fig. 2: Adapting Consumer-grade Devices for Data Collection. To avoid relying on specialized hardware for data collection and make our method more accessible, we design our data collection process using consumer-grade VR devices.
Fig. 5: Object Placement Generalization. Performance comparisons of models trained with and without human data on vertical grasping (picking). Each cell in the 3×3 grid represents a 10cm × 10cm region where the robot attempts to pick up a box, with numbers indicating successful attempts out of 10. The real-robot data is collected in two cells inside the dashed lines. Notably, our teleoperation data is intentionally imbalanced.
Humanoid Policy ~ Human Policy

March 2025 · 17 Reads

Training manipulation policies for humanoid robots with diverse data enhances their robustness and generalization across tasks and platforms. However, learning solely from robot demonstrations is labor-intensive, requiring expensive tele-operated data collection which is difficult to scale. This paper investigates a more scalable data source, egocentric human demonstrations, to serve as cross-embodiment training data for robot learning. We mitigate the embodiment gap between humanoids and humans from both the data and modeling perspectives. We collect an egocentric task-oriented dataset (PH2D) that is directly aligned with humanoid manipulation demonstrations. We then train a human-humanoid behavior policy, which we term Human Action Transformer (HAT). The state-action space of HAT is unified for both humans and humanoid robots and can be differentiably retargeted to robot actions. Co-trained with smaller-scale robot data, HAT directly models humanoid robots and humans as different embodiments without additional supervision. We show that human data improves both generalization and robustness of HAT with significantly better data collection efficiency. Code and data: https://human-as-robot.github.io/
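As a tiny illustration of the differentiable retargeting mentioned above (placeholder dimensions; a linear map stands in for a real retargeting function), gradients from any downstream loss on robot actions can flow back through the retargeting step into the shared policy:

# Tiny sketch of differentiable retargeting: a policy predicts actions in a
# unified human/robot space, and a differentiable mapping converts them to
# robot joint commands so gradients reach the policy. The linear map and all
# dimensions below are illustrative placeholders.
import torch
import torch.nn as nn

unified_dim, robot_dof = 48, 29                  # hypothetical action dimensions
policy = nn.Linear(64, unified_dim)              # predicts actions in the unified space
retarget = nn.Linear(unified_dim, robot_dof)     # stand-in differentiable retargeting

obs = torch.randn(1, 64)
robot_action = retarget(policy(obs))             # joint targets for the robot
loss = robot_action.pow(2).mean()                # any downstream imitation loss
loss.backward()                                  # gradients flow through retargeting into the policy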


Figure 3. WorldModelBench consists of 7 domains and 56 subdomains, totaling 350 image and text conditions.
Figure 5. We enhance video generation models by leveraging sparse rewards from our fine-tuned judger. Solid arrows indicate the forward process, while dashed lines are gradient directions.
Correlation coefficient of VBench Dimensions with Physics Adherence
WorldModelBench: Judging Video Generation Models As World Models

February 2025 · 12 Reads · 1 Citation

Video generation models have rapidly progressed, positioning themselves as video world models capable of supporting decision-making applications like robotics and autonomous driving. However, current benchmarks fail to rigorously evaluate these claims, focusing only on general video quality and ignoring factors important to world models, such as physics adherence. To bridge this gap, we propose WorldModelBench, a benchmark designed to evaluate the world modeling capabilities of video generation models in application-driven domains. WorldModelBench offers two key advantages: (1) Sensitivity to nuanced world modeling violations: By incorporating instruction-following and physics-adherence dimensions, WorldModelBench detects subtle violations, such as irregular changes in object size that breach the mass conservation law - issues overlooked by prior benchmarks. (2) Alignment with large-scale human preferences: We crowd-source 67K human labels to accurately measure 14 frontier models. Using our high-quality human labels, we further fine-tune an accurate judger to automate the evaluation procedure; with only 2B parameters, it achieves 8.6% higher average accuracy than GPT-4o in predicting world modeling violations. In addition, we demonstrate that training to align with human annotations by maximizing the rewards from the judger noticeably improves the world modeling capability. The website is available at https://worldmodelbench-team.github.io.
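A very rough sketch of the reward-driven fine-tuning mentioned in the last sentence, with stand-in modules (the actual recipe may differ substantially): a frozen judger scores generated samples, and the generator is updated to maximize that score by backpropagating through the judger.

# Rough sketch of fine-tuning a generator against rewards from a frozen judger.
# Both modules are one-layer stand-ins for a video generator and the
# fine-tuned judger; only the gradient flow is meant to be illustrative.
import torch
import torch.nn as nn

generator = nn.Linear(32, 32)      # stand-in for a video generation model
judger = nn.Linear(32, 1)          # stand-in for the fine-tuned judger
for p in judger.parameters():
    p.requires_grad_(False)        # the judger only provides a reward signal

optimizer = torch.optim.AdamW(generator.parameters(), lr=1e-4)
condition = torch.randn(8, 32)                 # text/image conditions
sample = generator(condition)                  # "generated video" features
reward = judger(sample).mean()                 # scalar reward from the judger
(-reward).backward()                           # maximize reward via the frozen judger
optimizer.step(); optimizer.zero_grad()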


RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets

February 2025 · 20 Reads

We present RigAnything, a novel autoregressive transformer-based model, which makes 3D assets rig-ready by probabilistically generating joints, skeleton topologies, and assigning skinning weights in a template-free manner. Unlike most existing auto-rigging methods, which rely on predefined skeleton templates and are limited to specific categories such as humanoids, RigAnything approaches the rigging problem in an autoregressive manner, iteratively predicting the next joint based on the global input shape and the previous prediction. While autoregressive models are typically used to generate sequential data, RigAnything extends their application to effectively learn and represent skeletons, which are inherently tree structures. To achieve this, we organize the joints in a breadth-first search (BFS) order, enabling the skeleton to be defined as a sequence of 3D locations and parent indices. Furthermore, our model improves the accuracy of position prediction by leveraging diffusion modeling, ensuring precise and consistent placement of joints within the hierarchy. This formulation allows the autoregressive model to efficiently capture both spatial and hierarchical relationships within the skeleton. Trained end-to-end on both the RigNet and Objaverse datasets, RigAnything demonstrates state-of-the-art performance across diverse object types, including humanoids, quadrupeds, marine creatures, insects, and many more, surpassing prior methods in quality, robustness, generalizability, and efficiency. Please check our website for more details: https://www.liuisabella.com/RigAnything.
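The breadth-first serialization described above can be illustrated with a small, hypothetical data structure: a skeleton tree is flattened into a sequence of (3D position, parent index) tokens in BFS order, the form an autoregressive model can consume.

# Minimal sketch of serializing a skeleton tree into a BFS-ordered sequence of
# (3D joint position, parent index) tokens. The data structure is illustrative,
# not the representation used in the released model.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Joint:
    position: tuple              # (x, y, z) joint location
    children: list = field(default_factory=list)

def serialize_bfs(root: Joint):
    """Return (position, parent_index) tokens in breadth-first order; root's parent is -1."""
    tokens, queue = [], deque([(root, -1)])
    while queue:
        joint, parent_idx = queue.popleft()
        tokens.append((joint.position, parent_idx))
        my_idx = len(tokens) - 1
        for child in joint.children:
            queue.append((child, my_idx))
    return tokens

# Tiny example: a root joint with a spine joint and two hip joints.
root = Joint((0.0, 0.0, 0.0), [
    Joint((0.0, 0.3, 0.0)),      # spine
    Joint((-0.1, -0.1, 0.0)),    # left hip
    Joint((0.1, -0.1, 0.0)),     # right hip
])
for pos, parent in serialize_bfs(root):
    print(pos, parent)           # root first (parent -1), then its children (parent 0)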


LoRA-TTT: Low-Rank Test-Time Training for Vision-Language Models

February 2025 · 42 Reads

The rapid advancements in vision-language models (VLMs), such as CLIP, have intensified the need to address distribution shifts between training and testing datasets. Although prior Test-Time Training (TTT) techniques for VLMs have demonstrated robust performance, they predominantly rely on tuning text prompts, a process that demands substantial computational resources and is heavily dependent on entropy-based losses. In this paper, we propose LoRA-TTT, a novel TTT method that leverages Low-Rank Adaptation (LoRA), applied exclusively to the image encoder of VLMs. By introducing LoRA and updating only its parameters during test time, our method offers a simple yet effective TTT approach, retaining the model's initial generalization capability while achieving substantial performance gains with minimal memory and runtime overhead. Additionally, we introduce a highly efficient reconstruction loss tailored for TTT. Our method can adapt to diverse domains by combining these two losses, without increasing memory consumption or runtime. Extensive experiments on two benchmarks, covering 15 datasets, demonstrate that our method improves the zero-shot top-1 accuracy of CLIP-ViT-B/16 by an average of 5.79% on the OOD benchmark and 1.36% on the fine-grained benchmark, efficiently surpassing test-time prompt tuning without relying on any external models or cache.
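A minimal sketch of the mechanism, assuming a one-layer stand-in for a CLIP-like image encoder and a generic entropy-style test-time loss (the paper's reconstruction loss is not reproduced here): LoRA adapters are injected into the encoder's linear layers, and only the low-rank matrices are updated at test time while the pretrained weights stay frozen.

# Minimal sketch of LoRA applied to a frozen linear layer and updated at test
# time. The "encoder" is a single linear layer standing in for a CLIP-like
# image encoder, and the entropy loss over random prototypes is a placeholder
# for the paper's actual objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 4.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)              # pretrained weights stay frozen
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.01)        # B stays zero: no shift before adaptation
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

encoder = LoRALinear(nn.Linear(512, 512))        # stand-in image encoder
lora_params = [p for p in encoder.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(lora_params, lr=1e-3)

# One test-time adaptation step: only the LoRA matrices A and B are updated.
features, prototypes = torch.randn(8, 512), torch.randn(10, 512)
logits = F.normalize(encoder(features), dim=-1) @ F.normalize(prototypes, dim=-1).t()
probs = F.softmax(logits / 0.07, dim=-1)
entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
entropy.backward(); optimizer.step(); optimizer.zero_grad()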


Diffusion Autoencoders are Scalable Image Tokenizers

January 2025 · 10 Reads

Tokenizing images into compact visual representations is a key step in learning efficient and high-quality image generative models. We present a simple diffusion tokenizer (DiTo) that learns compact visual representations for image generation models. Our key insight is that a single learning objective, the diffusion L2 loss, can be used for training scalable image tokenizers. Since diffusion is already widely used for image generation, our insight greatly simplifies training such tokenizers. In contrast, current state-of-the-art tokenizers rely on an empirically found combination of heuristics and losses, thus requiring a complex training recipe that relies on non-trivially balancing different losses and pretrained supervised models. We present design decisions, along with theoretical grounding, that enable us to scale DiTo for learning competitive image representations. Our results show that DiTo is a simpler, scalable, and self-supervised alternative to the current state-of-the-art image tokenizer, which relies on supervision. DiTo achieves competitive or better quality than the state of the art in image reconstruction and downstream image generation tasks.
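A toy sketch of the single-objective idea, with placeholder architectures and a simple noising schedule (not the paper's design): an encoder produces a compact latent, a denoiser conditioned on that latent predicts the noise added to the image, and the diffusion L2 loss is the only training signal.

# Toy sketch of training an image tokenizer with only a diffusion L2 loss.
# Flattened vectors stand in for images; the MLPs and the linear noising
# schedule are illustrative placeholders, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

image_dim, latent_dim = 3 * 32 * 32, 64

encoder = nn.Sequential(nn.Linear(image_dim, 256), nn.SiLU(), nn.Linear(256, latent_dim))
denoiser = nn.Sequential(  # predicts noise from (noisy image, latent, noise level)
    nn.Linear(image_dim + latent_dim + 1, 512), nn.SiLU(), nn.Linear(512, image_dim))
optimizer = torch.optim.AdamW(list(encoder.parameters()) + list(denoiser.parameters()), lr=1e-4)

def diffusion_l2_step(x):
    z = encoder(x)                           # compact visual representation (the "token")
    t = torch.rand(x.shape[0], 1)            # per-sample noise level in [0, 1]
    noise = torch.randn_like(x)
    x_noisy = (1 - t) * x + t * noise        # simple interpolation-style noising
    pred = denoiser(torch.cat([x_noisy, z, t], dim=-1))
    loss = F.mse_loss(pred, noise)           # the single training objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

print(diffusion_l2_step(torch.randn(16, image_dim)))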


Integrating LMM Planners and 3D Skill Policies for Generalizable Manipulation

January 2025 · 12 Reads

The recent advancements in visual reasoning capabilities of large multimodal models (LMMs) and the semantic enrichment of 3D feature fields have expanded the horizons of robotic capabilities. These developments hold significant potential for bridging the gap between high-level reasoning from LMMs and low-level control policies utilizing 3D feature fields. In this work, we introduce LMM-3DP, a framework that integrates LMM planners and 3D skill Policies. Our approach consists of three key perspectives: high-level planning, low-level control, and effective integration. For high-level planning, LMM-3DP supports dynamic scene understanding under environment disturbances, a critic agent with self-feedback, history policy memorization, and reattempts after failures. For low-level control, LMM-3DP utilizes a semantic-aware 3D feature field for accurate manipulation. To align high-level planning with low-level control for robot actions, language embeddings representing the high-level policy are jointly attended with the 3D feature field in a 3D transformer for seamless integration. We extensively evaluate our approach across multiple skills and long-horizon tasks in a real-world kitchen environment. Our results show a significant 1.45x success rate increase in low-level control and an approximate 1.5x improvement in high-level planning accuracy compared to LLM-based baselines. Demo videos and an overview of LMM-3DP are available at https://lmm-3dp-release.github.io.
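A small sketch of the integration step, using illustrative dimensions and a generic cross-attention module in place of the 3D transformer: language embeddings for the current subgoal are attended with 3D scene features before an action head predicts the low-level command.

# Small sketch of fusing a high-level language instruction with 3D scene
# features via cross-attention. The module choice, dimensions, and action head
# are illustrative stand-ins for the integration described above.
import torch
import torch.nn as nn

num_points, feat_dim = 1024, 256
scene_features = torch.randn(1, num_points, feat_dim)    # tokens from a 3D feature field
language_embedding = torch.randn(1, 4, feat_dim)         # tokens for the current subgoal

cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
fused, _ = cross_attn(query=scene_features, key=language_embedding, value=language_embedding)
action_head = nn.Linear(feat_dim, 7)                     # e.g. end-effector pose + gripper
action = action_head(fused.mean(dim=1))
print(action.shape)  # torch.Size([1, 7])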


Citations (42)


... Affordance Diffusion (Ye et al. 2023) generates hand-object interaction images, conditioned on a hand orientation mask. HandBooster and HOIDiffusion (Zhang et al. 2024b) synthesize realistic hand-object images with diverse appearances, poses, views, and backgrounds. MANUS (Pokhariya et al. 2023) utilizes 3DGS to model the hand and object separately and combines them to form a dataset. ...

Reference:

HOGSA: Bimanual Hand-Object Interaction Understanding with 3D Gaussian Splatting Based Data Augmentation
HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data
  • Citing Conference Paper
  • June 2024

... End-to-end learning techniques for mobile manipulation are also rising in popularity. Recent works include learning to manipulate articulated objects through behavior cloning [19], and learning to jointly optimize navigation and manipulation for hybrid tasks such as door opening and table wiping [20]. While modular systems like SHOPPER are typically more robust and generalizable than end-to-end systems currently are, large-scale studies similar to the one presented in this work should also be conducted for end-to-end policies in the future. ...

Harmonic Mobile Manipulation
  • Citing Conference Paper
  • October 2024

... Our method essentially differs from previous feature-based methods [24,47] in its specific strategy for geometric enhancement. In our method, expressive MVS features are incorporated and the features of each Gaussian are fixed, both for geometry enhancement with sparse views. ...

Language-Driven Physics-Based Scene Synthesis and Editing via Feature Splatting
  • Citing Chapter
  • November 2024

... Recent advances in conditional image generation are predominantly driven by high-capacity text-to-image diffusion models [24,25,41,48,52,55,57,58]. These models serve as a strong image prior that can be specialized to individual tasks, facilitating remarkable progress in applications like inpainting [35,69,70], personalization [18,29,56], image editing [4,22,28,38,40], and image translation [30,74]. While these approaches excel at each individual task, the resulting variations in architectural designs and learning objectives make it challenging to integrate multiple tasks within a single framework. ...

Editable Image Elements for Controllable Synthesis
  • Citing Chapter
  • October 2024

... For the single unit sensor and the K feature sensor, three different measurement quantities are extracted as sensor readings: 1) Binary signal (denoted as B) returns 1 if the taxel is touched and 0 otherwise. Sensors used in [40], [41] belong to this class. 2) Magnitude signal (denoted as M) returns the magnitude of the gross force being applied to each taxel. ...

DexTouch: Learning to Seek and Manipulate Objects With Tactile Dexterity
  • Citing Article
  • December 2024

IEEE Robotics and Automation Letters

... NeRF−− [Wang et al. 2022], BARF [Lin et al. 2021], and SCNeRF [Jeong et al. 2021] jointly optimize NeRF and camera parameters, reducing the need for known camera parameters in static scene reconstruction. NoPe-NeRF [Bian et al. 2023] and CF-3DGS [Fu et al. 2024] leverage depth priors for more accurate pose estimation. RoDyNeRF [Liu et al. 2023] extends this approach by jointly optimizing dynamic NeRF and camera parameters to correct inaccurate camera poses. ...

COLMAP-Free 3D Gaussian Splatting
  • Citing Conference Paper
  • June 2024

... Given a query image and an optional prompt specifying the keypoint of interest, our goal is to generate textual descriptions and keypoint locations that convey fine-grained keypoint information within the image. Recognizing the exceptional ability of LLMs in handling multimodal tokens for different perception tasks [76,10,77,73,78], we further leverage an LLM for keypoint comprehension, which can effectively process various inputs: (1) the visual tokens z_q of the query image, (2) the prompt tokens z_p, and (3) a sequence of language tokens t, which depend on the three semantic keypoint comprehension scenarios. ...

Pixel Aligned Language Models
  • Citing Conference Paper
  • June 2024

... In contrast with single-shot models, iterative models, such as diffusion models [9,15,33], generate high-quality samples by inverting known degradation processes. These models are widely used for diverse image processing tasks, including image restoration and translation [6,8,9,16,33,34]. Furthermore, diffusion models have also been applied to 3D reconstruction tasks from text prompts or 2D images [22,27]. ...

Image Neural Field Diffusion Models
  • Citing Conference Paper
  • June 2024

... Finally, these re-rendered conditions are sent to PCDController and RealisDance-DiT to produce the generated outcomes, as shown in Figure 4. Datasets. To ensure the generalization of PCDController, we collect large-scale training data for camera control, including DL3DV [32], RE10K [74], ACID [34], Co3Dv2 [43], TartanAir [56], Map-Free-Reloc [1], WildRGBD [59], COP3D [49], and UCo3D [35]. This comprehensive dataset encompasses a variety of scenarios, featuring both static and dynamic scenes, as well as object-level and scene-level environments. ...

RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos
  • Citing Conference Paper
  • June 2024

... By leveraging this minimal human annotation regarding the order of subtasks, we can efficiently divide each source demo into contiguous object-centric manipulation segments {τ_i}_{i=1}^M (each of which corresponds to a subtask S_i(o_i)) using a simulator, and then generate extensive trajectory datasets for various task variants (in our case: variations in the initial and goal state distributions of objects (D) and robots (R)) using MimicGen [79]. This approach has been shown to significantly benefit generalization in imitation learning [79,50,121,31,85], particularly in scenarios where the number of source demonstrations is limited. For further details, please refer to the supplementary materials. ...

CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation
  • Citing Conference Paper
  • June 2024