Marco Pavone’s research while affiliated with NVIDIA and other places


Publications (595)


Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving
  • Preprint

June 2025 · 4 Reads · Yurong You · [...] · Marco Pavone

Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to 72% fewer tokens, resulting in up to 50% faster policy inference while achieving the same open-loop motion planning accuracy and improved offroad rates in closed-loop driving simulations.


SkillBlender: Towards Versatile Humanoid Whole-Body Loco-Manipulation via Skill Blending

June 2025 · 2 Reads

Humanoid robots hold significant potential in accomplishing daily tasks across diverse environments thanks to their flexibility and human-like morphology. Recent works have made significant progress in humanoid whole-body control and loco-manipulation leveraging optimal control or reinforcement learning. However, these methods require tedious task-specific tuning for each task to achieve satisfactory behaviors, limiting their versatility and scalability to diverse tasks in daily scenarios. To that end, we introduce SkillBlender, a novel hierarchical reinforcement learning framework for versatile humanoid loco-manipulation. SkillBlender first pretrains goal-conditioned task-agnostic primitive skills, and then dynamically blends these skills to accomplish complex loco-manipulation tasks with minimal task-specific reward engineering. We also introduce SkillBench, a parallel, cross-embodiment, and diverse simulated benchmark containing three embodiments, four primitive skills, and eight challenging loco-manipulation tasks, accompanied by a set of scientific evaluation metrics balancing accuracy and feasibility. Extensive simulated experiments show that our method significantly outperforms all baselines, while naturally regularizing behaviors to avoid reward hacking, resulting in more accurate and feasible movements for diverse loco-manipulation tasks in our daily scenarios. Our code and benchmark will be open-sourced to the community to facilitate future research. Project page: https://usc-gvl.github.io/SkillBlender-web/.


Fig. 1: The reproducibility scheme. The graph includes key components to improve AMoD control research reproducibility.
Fig. 4: Distribution of networks used in AMoD case studies.
Fig. 5: Different network granularity for NYC.
Reproducibility in the Control of Autonomous Mobility-on-Demand Systems
  • Preprint
  • File available

June 2025 · 20 Reads

Autonomous Mobility-on-Demand (AMoD) systems, powered by advances in robotics, control, and Machine Learning (ML), offer a promising paradigm for future urban transportation. AMoD offers fast and personalized travel services by leveraging centralized control of autonomous vehicle fleets to optimize operations and enhance service performance. However, the rapid growth of this field has outpaced the development of standardized practices for evaluating and reporting results, leading to significant challenges in reproducibility. As AMoD control algorithms become increasingly complex and data-driven, a lack of transparency in modeling assumptions, experimental setups, and algorithmic implementation hinders scientific progress and undermines confidence in the results. This paper presents a systematic study of reproducibility in AMoD research. We identify key components across the research pipeline, spanning system modeling, control problems, simulation design, algorithm specification, and evaluation, and analyze common sources of irreproducibility. We survey prevalent practices in the literature, highlight gaps, and propose a structured framework to assess and improve reproducibility. Specifically, concrete guidelines are offered, along with a "reproducibility checklist", to support future work in achieving replicable, comparable, and extensible results. While focused on AMoD, the principles and practices we advocate generalize to a broader class of cyber-physical systems that rely on networked autonomy and data-driven control. This work aims to lay the foundation for a more transparent and reproducible research culture in the design and deployment of intelligent mobility systems.


Pseudo-Simulation for Autonomous Driving

June 2025 · 7 Reads

Existing evaluation paradigms for Autonomous Vehicles (AVs) face critical limitations. Real-world evaluation is often challenging due to safety concerns and a lack of reproducibility, whereas closed-loop simulation can face insufficient realism or high computational costs. Open-loop evaluation, while being efficient and data-driven, relies on metrics that generally overlook compounding errors. In this paper, we propose pseudo-simulation, a novel paradigm that addresses these limitations. Pseudo-simulation operates on real datasets, similar to open-loop evaluation, but augments them with synthetic observations generated prior to evaluation using 3D Gaussian Splatting. Our key idea is to approximate potential future states the AV might encounter by generating a diverse set of observations that vary in position, heading, and speed. Our method then assigns a higher importance to synthetic observations that best match the AV's likely behavior using a novel proximity-based weighting scheme. This enables evaluating error recovery and the mitigation of causal confusion, as in closed-loop benchmarks, without requiring sequential interactive simulation. We show that pseudo-simulation is better correlated with closed-loop simulations (R^2=0.8) than the best existing open-loop approach (R^2=0.7). We also establish a public leaderboard for the community to benchmark new methodologies with pseudo-simulation. Our code is available at https://github.com/autonomousvision/navsim.


E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models

June 2025 · 3 Reads

Spatial intelligence, encompassing 3D reconstruction, perception, and reasoning, is fundamental to applications such as robotics, aerial imaging, and extended reality. A key enabler is the real-time, accurate estimation of core 3D attributes (camera parameters, point clouds, depth maps, and 3D point tracks) from unstructured or streaming imagery. Inspired by the success of large foundation models in language and 2D vision, a new class of end-to-end 3D geometric foundation models (GFMs) has emerged, directly predicting dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. Since late 2023, the field has exploded with diverse variants, but systematic evaluation is lacking. In this work, we present the first comprehensive benchmark for 3D GFMs, covering five core tasks (sparse-view depth estimation, video depth estimation, 3D reconstruction, multi-view pose estimation, and novel view synthesis) and spanning both standard and challenging out-of-distribution datasets. Our standardized toolkit automates dataset handling, evaluation protocols, and metric computation to ensure fair, reproducible comparisons. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains, and derive key insights to guide future model scaling and optimization. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.


Figure 8: More examples of the retrieval model. B.2 Comparison between w/ RAG and w/o RAG models on long-tail scenarios.
Figure 9: Top 50 samples with largest minADE of w/o RAG model.
Figure 10: More examples of the comparison between different model outputs.
Ablation study on Waymo dataset
Hyperparameters of experiments
RealDrive: Retrieval-Augmented Driving with Diffusion Models

May 2025 · 3 Reads

Learning-based planners generate natural human-like driving behaviors by learning to reason about nuanced interactions from data, overcoming the rigid behaviors that arise from rule-based planners. Nonetheless, data-driven approaches often struggle with rare, safety-critical scenarios and offer limited controllability over the generated trajectories. To address these challenges, we propose RealDrive, a Retrieval-Augmented Generation (RAG) framework that initializes a diffusion-based planning policy by retrieving the most relevant expert demonstrations from the training dataset. By interpolating between current observations and retrieved examples through a denoising process, our approach enables fine-grained control and safe behavior across diverse scenarios, leveraging the strong prior provided by the retrieved scenario. Another key insight is that a task-relevant retrieval model trained with planning-based objectives results in superior planning performance in our framework compared to a task-agnostic retriever. Experimental results demonstrate improved generalization to long-tail events and enhanced trajectory diversity compared to standard learning-based planners: we observe a 40% reduction in collision rate on the Waymo Open Motion dataset with RAG.


Scan, Materialize, Simulate: A Generalizable Framework for Physically Grounded Robot Planning

May 2025 · 7 Reads

Autonomous robots must reason about the physical consequences of their actions to operate effectively in unstructured, real-world environments. We present Scan, Materialize, Simulate (SMS), a unified framework that combines 3D Gaussian Splatting for accurate scene reconstruction, visual foundation models for semantic segmentation, vision-language models for material property inference, and physics simulation for reliable prediction of action outcomes. By integrating these components, SMS enables generalizable physical reasoning and object-centric planning without the need to re-learn foundational physical dynamics. We empirically validate SMS in a billiards-inspired manipulation task and a challenging quadrotor landing scenario, demonstrating robust performance on both simulated domain transfer and real-world experiments. Our results highlight the potential of bridging differentiable rendering for scene reconstruction, foundation models for semantic understanding, and physics-based simulation to achieve physically grounded robot planning across diverse settings.


Figure 4: FORTRESS employs foundation model reasoners to anticipate failure modes. It then calibrates thresholds in the embedding model space to determine if new state descriptions are more similar to failure modes than to safe data Ωs. During safety-critical moments, the semantic safety cost functions rapidly identify physically unsafe state regions during an ANYmal robot's deployment. FORTRESS differentiates the safety of a ladder from a person standing on one, anticipating worker injuries without encountering failures in Ωs.
Figure 6: Planning rates of FORTRESS versus AESOP [13] and Safe-Lang [18] for drone robot in CARLA simulation. We augment baselines with our VLM goal identification for fair comparison.
Figure 8: ROC curves using around 10 failure modes with varying percentile α thresholds on autonomous drones, boats, and vehicle environments using Mahalanobis distance calibrated on cosine similarity on 8 embedding models.
Figure 9: ROC curves using only the "Safe" Mode with varying percentile α thresholds on autonomous drones, boats, and vehicle environments using cosine similarity on 8 embedding models.
Figure 13: Examples of OOD failures detected by FORTRESS for deployment of ANYmal hardware in a room under construction. The green boxes indicate semantically safe concepts for the robot such as a ladder or a person. The other colors show potential hazards: in the image, the boxes are labeled with what objects are detected and on the legend we list their corresponding failure modes that have been identified by the semantic safety cost functions.
Real-Time Out-of-Distribution Failure Prevention via Multi-Modal Reasoning

May 2025 · 4 Reads

Foundation models can provide robust high-level reasoning on appropriate safety interventions in hazardous scenarios beyond a robot's training data, i.e. out-of-distribution (OOD) failures. However, due to the high inference latency of Large Vision and Language Models, current methods rely on manually defined intervention policies to enact fallbacks, thereby lacking the ability to plan generalizable, semantically safe motions. To overcome these challenges we present FORTRESS, a framework that generates and reasons about semantically safe fallback strategies in real time to prevent OOD failures. At a low frequency in nominal operations, FORTRESS uses multi-modal reasoners to identify goals and anticipate failure modes. When a runtime monitor triggers a fallback response, FORTRESS rapidly synthesizes plans to fallback goals while inferring and avoiding semantically unsafe regions in real time. By bridging open-world, multi-modal reasoning with dynamics-aware planning, we eliminate the need for hard-coded fallbacks and human safety interventions. FORTRESS outperforms on-the-fly prompting of slow reasoning models in safety classification accuracy on synthetic benchmarks and real-world ANYmal robot data, and further improves system safety and planning success in simulation and on quadrotor hardware for urban navigation.


Generative AI for Autonomous Driving: Frontiers and Opportunities

May 2025 · 193 Reads

Generative Artificial Intelligence (GenAI) constitutes a transformative technological wave that reconfigures industries through its unparalleled capabilities for content creation, reasoning, planning, and multimodal understanding. This revolutionary force offers the most promising path yet toward solving one of engineering's grandest challenges: achieving reliable, fully autonomous driving, particularly the pursuit of Level 5 autonomy. This survey delivers a comprehensive and critical synthesis of the emerging role of GenAI across the autonomous driving stack. We begin by distilling the principles and trade-offs of modern generative modeling, encompassing VAEs, GANs, Diffusion Models, and Large Language Models (LLMs). We then map their frontier applications in image, LiDAR, trajectory, occupancy, video generation as well as LLM-guided reasoning and decision making. We categorize practical applications, such as synthetic data workflows, end-to-end driving strategies, high-fidelity digital twin systems, smart transportation networks, and cross-domain transfer to embodied AI. We identify key obstacles and possibilities such as comprehensive generalization across rare cases, evaluation and safety checks, budget-limited implementation, regulatory compliance, ethical concerns, and environmental effects, while proposing research plans across theoretical assurances, trust metrics, transport integration, and socio-technical influence. By unifying these threads, the survey provides a forward-looking reference for researchers, engineers, and policymakers navigating the convergence of generative AI and advanced autonomous mobility. An actively maintained repository of cited works is available at https://github.com/taco-group/GenAI4AD.


Deployable and Generalizable Motion Prediction: Taxonomy, Open Challenges and Future Directions

May 2025 · 70 Reads

Motion prediction, the anticipation of future agent states or scene evolution, is rooted in human cognition, bridging perception and decision-making. It enables intelligent systems, such as robots and self-driving cars, to act safely in dynamic, human-involved environments, and informs broader time-series reasoning challenges. With advances in methods, representations, and datasets, the field has seen rapid progress, reflected in quickly evolving benchmark results. Yet, when state-of-the-art methods are deployed in the real world, they often struggle to generalize to open-world conditions and fall short of deployment standards. This reveals a gap between research benchmarks, which are often idealized or ill-posed, and real-world complexity. To address this gap, this survey revisits the generalization and deployability of motion prediction models, with an emphasis on the applications of robotics, autonomous driving, and human motion. We first offer a comprehensive taxonomy of motion prediction methods, covering representations, modeling strategies, application domains, and evaluation protocols. We then study two key challenges: (1) how to make motion prediction models meet realistic deployment standards, where motion prediction does not act in a vacuum but functions as one module of closed-loop autonomy stacks, taking input from localization and perception and informing downstream planning and control; and (2) how to generalize motion prediction models from limited seen scenarios and datasets to open-world settings. Throughout the paper, we highlight critical open challenges to guide future work, aiming to recalibrate the community's efforts, fostering progress that is not only measurable but also meaningful for real-world applications.


Citations (24)


... While LRMs continue to set new records in various domains, their success has not seamlessly transferred to the medical domain (Hoyt et al., 2025). Directly applying LRM training methods, which are designed for domains with clear objective verification like mathematics (Luong et al., 2024; Zhang et al., 2025b), has produced suboptimal results in medical scenarios (Chen et al., 2024). These observations underscore the limited transferability of reasoning training paradigms from mathematics to medicine, highlighting the urgent need for novel training schemes specifically tailored to the unique demands of the medical domain. ...

Reference:

Med-REFL: Medical Reasoning Enhancement via Self-Corrected Fine-grained Reflection
LLaMA-Berry: Pairwise Optimization for Olympiad-level Mathematical Reasoning via O1-like Monte Carlo Tree Search

... It leverages data-driven models and advanced techniques like rapidly exploring random trees (RRT) to improve task performance and safety [3]. By combining data-driven flexibility with MPC's optimization capabilities, DD-MPC balances adaptability and performance, enabling real-time optimization and constraint handling, making it suitable for complex robotic systems [4,5]. This paper reviews DD-MPC's application in robot control. ...

Discovering dominant dynamics for nonlinear continuum robot control

npj Robotics

... This approach has led to reductions in emissions and electricity costs; for example, in electric bus depots emissions have been reduced by 98%, and in autonomous depots electricity costs have been reduced by between $3612 and $4010. By optimising recovery procedures and energy usage, SORT testing can aid in evaluating the practical effectiveness of these systems [19]. Practical considerations for choosing electric buses, including power consumption, charging duration, and failure frequency, were brought to light in research conducted by the Chinese Nanjing Bus Company. ...

Optimal coordination of electric buses and battery storage for achieving a 24/7 carbon-free electrified fleet
  • Citing Article
  • January 2025

Applied Energy

... Moreover, many current object-centric learning approaches for videos fail to fully exploit the sequential structure of datasets with longer videos, primarily due to their reliance on RNN-based models (Singh et al., 2022; Elsayed et al., 2022) that struggle with learning from long-range temporal coherence. Employing slot-attention mechanisms to process entire video sequences, as proposed in (Singh et al., 2024), could help overcome some challenges but may struggle with scalability for longer videos in real-world datasets due to computational limitations. ...

Parallelized Spatiotemporal Slot Binding for Videos

... These capabilities have also raised interest in applying these models to end-to-end AD for open-loop and closed-loop planning. Early methods directly apply VLMs to the front-view camera images of an autonomous vehicle [3,41,53,67,73,76] to predict future trajectories or control signals in text form. However, a holistic understanding of the traffic scene is crucial for realistic driving scenes with highly dynamic scenarios. ...

Dolphins: Multimodal Language Model for Driving
  • Citing Chapter
  • November 2024

... Reasoning ability, particularly in multi-agent systems, is strengthened to enhance multistep logical problem solving with unstructured data in domains such as biomedicine [8], recommendation systems [9,10], and fake news detection [11,12] [13]. Furthermore, training efficiency and inference of large language models (LLMs) have seen significant improvement due to advancements in algorithms and hardware [14,15]. Recent work investigates robust deployment and management of large language models within cloud computing infrastructures. ...

Learning from Teaching Regularization: Generalizable Correlations Should be Easy to Imitate

... Two particular works, Gu et al. [6] and Gu et al. [7], evaluate the performance of DenseTNT [5], a popular Trajectory Prediction model, when trained on HD Maps and maps generated by Online Lane-Topology Prediction models. The lane-topology prediction models evaluated are variants of MapTRv2 [11] and StreamMapNet [22]. ...

Accelerating Online Mapping and Behavior Prediction via Direct BEV Feature Attention
  • Citing Chapter
  • November 2024

... As a result, many existing works treat samples from an entirely new dataset as OOD. However, this approach is coarse-grained: a new dataset may include samples that are well covered by the source training data, on which the model may perform well, and the original training domain may contain underrepresented data regions where the model still performs poorly. This highlights the need for more fine-grained and principled definitions of OOD in regression-based motion prediction. ...

[Table fragment interleaved in the excerpt: [127], CMTS [128], SceneGen [510], TrafficSim [507], TrafficGen [145], CTG [651], RealGen [133]. Adversarial Generation: L2C [129], MMG [130], AdvSim [550], Zhang et al. [642], AdvDO [66], KING [190], CAT [639]. Language Guided: CTG++ [650], LCTGen [511], InteractTraj [587], ProSim [512].]

RealGen: Retrieval Augmented Generation for Controllable Traffic Scenarios
  • Citing Chapter
  • October 2024

... Among the various methods that use Monte Carlo sampling, the sample average approximation (SAA) approach is a widely adopted practice (Kleywegt et al., 2002;Birge and Louveaux, 2011). This method replaces the uncertainty set, characterized by its distributional form, with an empirically approximated set generated from Monte Carlo samples drawn from the original distributions (Lew et al., 2022). ...

Sample Average Approximation for Stochastic Programming with Equality Constraints
  • Citing Article
  • October 2024

SIAM Journal on Optimization

... Methods such as UniAD [65] and VAD [66] explicitly integrate multiple driving tasks from perception to planning in a unified Transformer architecture, thereby enhancing planning performance. ParaDrive [67] discusses the necessary components within end-to-end driving architectures. Additionally, GenAD [68] and DiffusionDrive [69] adopt generative models to maintain trajectory continuity and produce multi-modal driving trajectories. ...

PARA-Drive: Parallelized Architecture for Real-Time Autonomous Driving
  • Citing Conference Paper
  • June 2024