Deva Ramanan’s research while affiliated with Carnegie Mellon University and other places


Publications (306)


Fig. 2: AV-21 sensors and their respective locations.
BETTY Dataset: A Multi-modal Dataset for Full-Stack Autonomy
  • Preprint
  • File available

May 2025 · 27 Reads

Micah Nye · Andrew Saba · [...]

We present the BETTY dataset, a large-scale, multi-modal dataset collected on several autonomous racing vehicles, targeting supervised and self-supervised state estimation, dynamics modeling, motion forecasting, perception, and more. Existing large-scale datasets, especially autonomous vehicle datasets, focus primarily on supervised perception, planning, and motion forecasting tasks. Our work enables multi-modal, data-driven methods by including all sensor inputs and the outputs from the software stack, along with semantic metadata and ground truth information. The dataset encompasses 4 years of data, currently comprising over 13 hours and 32TB, collected on autonomous racing vehicle platforms. This data spans 6 diverse racing environments, including high-speed oval courses, for single and multi-agent algorithm evaluation in feature-sparse scenarios, as well as high-speed road courses with high longitudinal and lateral accelerations and tight, GPS-denied environments. It captures highly dynamic states, such as 63 m/s crashes, loss of tire traction, and operation at the limit of stability. By offering a large breadth of cross-modal and dynamic data, the BETTY dataset enables the training and testing of full autonomy stack pipelines, pushing the performance of all algorithms to the limits. The current dataset is available at https://pitt-mit-iac.github.io/betty-dataset/.


Figure 1. Overview of LEGOGPT. (a) Our method generates physically stable LEGO structures from text descriptions through an end-to-end approach, showing intermediate brick-by-brick steps. (b) The generated designs are buildable both by hand and by automated robotic assembly. (c) We show example results with corresponding text prompts. Besides basic LEGO designs (top), our method can generate colored LEGO models (bottom right) and textured models (bottom left) with appearance descriptions. We highly recommend that readers check our website for step-by-step videos.
Figure 4. Force Model. (a) We consider all forces exerted on a single brick, including gravity (black), vertical forces with the top brick (red/blue) and bottom brick (green/purple), and horizontal (shear) forces due to knob connections (cyan), and adjacent bricks (yellow). (b) The structural force model F extends the individual force model to multiple bricks. Solving for static equilibrium in F determines each brick's stability score.
Figure 5. Result gallery and baseline comparisons. Our method can generate high-quality, diverse, and novel LEGO designs aligned with the given text prompts. Black bricks are colliding. For LLaMA-Mesh [78], LGM [68], XCube [60], and Hunyuan3D-2 [86], an inset of the generated mesh before legolization is shown.
Generating Physically Stable and Buildable LEGO Designs from Text

May 2025 · 33 Reads

We introduce LegoGPT, the first approach for generating physically stable LEGO brick models from text prompts. To achieve this, we construct a large-scale, physically stable dataset of LEGO designs, along with their associated captions, and train an autoregressive large language model to predict the next brick to add via next-token prediction. To improve the stability of the resulting designs, we employ an efficient validity check and physics-aware rollback during autoregressive inference, which prunes infeasible token predictions using physics laws and assembly constraints. Our experiments show that LegoGPT produces stable, diverse, and aesthetically pleasing LEGO designs that align closely with the input text prompts. We also develop a text-based LEGO texturing method to generate colored and textured designs. We show that our designs can be assembled manually by humans and automatically by robotic arms. We also release our new dataset, StableText2Lego, containing over 47,000 LEGO structures of over 28,000 unique 3D objects accompanied by detailed captions, along with our code and models at the project website: https://avalovelace1.github.io/LegoGPT/.
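The inference-time stability filtering described above can be illustrated with a minimal decoding loop. This is a sketch only, assuming hypothetical helpers propose_next_bricks (candidate bricks ranked by model probability) and is_feasible (a stand-in for the paper's collision, assembly, and physics checks); it is not the authors' implementation.

```python
# Sketch of autoregressive brick generation with a validity check and rollback.
# `propose_next_bricks` and `is_feasible` are hypothetical stand-ins.
def generate_structure(model, prompt, max_bricks=200, max_rollbacks=50):
    structure = []                  # sequence of placed bricks (the token history)
    rollbacks = 0
    while len(structure) < max_bricks:
        placed = False
        for brick in propose_next_bricks(model, prompt, structure):
            if is_feasible(structure, brick):   # prune infeasible predictions
                structure.append(brick)
                placed = True
                break
        if not placed:
            # No feasible continuation: roll back the last brick and retry.
            if not structure or rollbacks >= max_rollbacks:
                break
            structure.pop()
            rollbacks += 1
    return structure
```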


Figure 4. Additional Qualitative Results on Predicted Camera Poses. DiffusionSfM shows robustness to ambiguous patterns in inputs.
Figure 8. Qualitative Comparison of Sparse and Dense Model Outputs. The sparse model predicts the ray origin and endpoint for each image patch, limiting its ability to capture the fine-grained details of the scene.
DiffusionSfM: Predicting Structure and Motion via Ray Origin and Endpoint Diffusion

May 2025

·

2 Reads

Current Structure-from-Motion (SfM) methods typically follow a two-stage pipeline, combining learned or geometric pairwise reasoning with a subsequent global optimization step. In contrast, we propose a data-driven multi-view reasoning approach that directly infers 3D scene geometry and camera poses from multi-view images. Our framework, DiffusionSfM, parameterizes scene geometry and cameras as pixel-wise ray origins and endpoints in a global frame and employs a transformer-based denoising diffusion model to predict them from multi-view inputs. To address practical challenges in training diffusion models with missing data and unbounded scene coordinates, we introduce specialized mechanisms that ensure robust learning. We empirically validate DiffusionSfM on both synthetic and real datasets, demonstrating that it outperforms classical and learning-based approaches while naturally modeling uncertainty.
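The ray-based parameterization can be made concrete with standard pinhole geometry: each pixel stores the camera center (ray origin) and the 3D point it observes (ray endpoint), both in a global frame. A minimal sketch, assuming a world-to-camera convention x_cam = R x_world + t and a z-depth map; it mirrors the idea in the abstract rather than the paper's code.

```python
import numpy as np

def rays_from_camera(K, R, t, depth):
    """Per-pixel ray origins and endpoints in the world frame.
    K: 3x3 intrinsics; R, t: world-to-camera rotation/translation; depth: HxW z-depth."""
    H, W = depth.shape
    cam_center = -R.T @ t                              # camera center in world coordinates
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    dirs_cam = pix @ np.linalg.inv(K).T                # K^-1 [u, v, 1]^T per pixel
    dirs_world = dirs_cam @ R                          # rotate rays into the world frame
    origins = np.broadcast_to(cam_center, (H, W, 3))   # every pixel shares the camera center
    endpoints = origins + dirs_world * depth[..., None]
    return origins, endpoints
```

Inverting this map recovers camera pose and dense geometry, which is what makes the per-pixel origin/endpoint encoding a joint representation of structure and motion.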


Towards Understanding Camera Motions in Any Video

April 2025 · 1 Read

We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding. CameraBench consists of ~3,000 diverse internet videos, annotated by experts through a rigorous multi-stage quality control process. One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers. We find, for example, that some motions like "follow" (or tracking) require understanding scene content like moving subjects. We conduct a large-scale human study to quantify human annotation performance, revealing that domain expertise and tutorial-based training can significantly enhance accuracy. For example, a novice may confuse zoom-in (a change of intrinsics) with translating forward (a change of extrinsics), but can be trained to differentiate the two. Using CameraBench, we evaluate Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM models struggle to capture semantic primitives that depend on scene content, while VLMs struggle to capture geometric primitives that require precise estimation of trajectories. We then fine-tune a generative VLM on CameraBench to achieve the best of both worlds and showcase its applications, including motion-augmented captioning, video question answering, and video-text retrieval. We hope our taxonomy, benchmark, and tutorials will drive future efforts towards the ultimate goal of understanding camera motions in any video.
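The zoom-versus-dolly confusion mentioned in the abstract is easy to reproduce with pinhole camera matrices: a zoom changes the intrinsics (focal length), while a forward translation changes the extrinsics (camera position). A small illustrative sketch with made-up numbers; it is not part of CameraBench.

```python
import numpy as np

def project(K, R, t, X):
    """Project a 3D world point X with intrinsics K and extrinsics (R, t)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

X = np.array([1.0, 0.5, 10.0])                    # a point in front of the camera
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)

K_zoom = K.copy()
K_zoom[0, 0] *= 2                                 # double the focal length: intrinsics change
K_zoom[1, 1] *= 2
t_dolly = t + np.array([0.0, 0.0, -5.0])          # move the camera forward: extrinsics change

print(project(K, R, t, X))                        # baseline framing
print(project(K_zoom, R, t, X))                   # zoom-in: subject appears larger
print(project(K, R, t_dolly, X))                  # dolly-in: subject also appears larger
# For a single point the two can coincide; only the dolly changes relative parallax
# between points at different depths, which is why the taxonomy keeps them distinct.
```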


AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis

April 2025 · 8 Reads

We explore the task of geometric reconstruction of images captured from a mixture of ground and aerial views. Current state-of-the-art learning-based approaches fail to handle the extreme viewpoint variation between aerial-ground image pairs. Our hypothesis is that the lack of high-quality, co-registered aerial-ground datasets for training is a key reason for this failure. Such data is difficult to assemble precisely because it is difficult to reconstruct in a scalable way. To overcome this challenge, we propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes (e.g., Google Earth) with real, ground-level crowd-sourced images (e.g., MegaDepth). The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images where mesh-based renderings lack sufficient detail, effectively bridging the domain gap between real images and pseudo-synthetic renderings. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks. For example, we observe that baseline DUSt3R localizes fewer than 5% of aerial-ground pairs within 5 degrees of camera rotation error, while fine-tuning with our data raises accuracy to nearly 56%, addressing a major failure point in handling large viewpoint changes. Beyond camera estimation and scene reconstruction, our dataset also improves performance on downstream tasks like novel-view synthesis in challenging aerial-ground scenarios, demonstrating the practical value of our approach in real-world applications.
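The camera-rotation metric quoted above (fraction of pairs within 5 degrees of rotation error) corresponds to the standard geodesic angle between predicted and ground-truth rotations. A minimal sketch of that metric, not taken from the paper's evaluation code.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic distance (degrees) between two 3x3 rotation matrices."""
    R_rel = R_pred @ R_gt.T
    # trace(R) = 1 + 2*cos(theta); clipping guards against numerical drift.
    cos_theta = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))

def rotation_accuracy(preds, gts, threshold_deg=5.0):
    """Fraction of predicted rotations within `threshold_deg` of ground truth."""
    errs = [rotation_error_deg(Rp, Rg) for Rp, Rg in zip(preds, gts)]
    return float(np.mean([e <= threshold_deg for e in errs]))
```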


Ablation study on octree node ordering
Efficient Autoregressive Shape Generation via Octree-Based Adaptive Tokenization

April 2025 · 22 Reads

Many 3D generative models rely on variational autoencoders (VAEs) to learn compact shape representations. However, existing methods encode all shapes into a fixed-size token, disregarding the inherent variations in scale and complexity across 3D data. This leads to inefficient latent representations that can compromise downstream generation. We address this challenge by introducing Octree-based Adaptive Tokenization, a novel framework that adjusts the dimension of latent representations according to shape complexity. Our approach constructs an adaptive octree structure guided by a quadric-error-based subdivision criterion and allocates a shape latent vector to each octree cell using a query-based transformer. Building upon this tokenization, we develop an octree-based autoregressive generative model that effectively leverages these variable-sized representations in shape generation. Extensive experiments demonstrate that our approach reduces token counts by 50% compared to fixed-size methods while maintaining comparable visual quality. When using a similar token length, our method produces significantly higher-quality shapes. When incorporated with our downstream generative model, our method creates more detailed and diverse 3D content than existing approaches.
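The complexity-adaptive idea can be sketched as an octree that keeps splitting a cell only while a per-cell error exceeds a threshold, so simple shapes yield few leaf cells (and thus few tokens) while complex shapes yield more. The cell_error callback below is a hypothetical stand-in for the paper's quadric-error-based criterion.

```python
# Sketch of complexity-adaptive octree subdivision: split a cell only while an
# error measure exceeds a threshold. `cell_error(center, size)` is a stand-in
# for the paper's quadric-error criterion.
def build_adaptive_octree(center, size, cell_error, threshold, depth=0, max_depth=6):
    """Returns a list of leaf cells, each (center, size), to be tokenized."""
    if depth >= max_depth or cell_error(center, size) <= threshold:
        return [(center, size)]            # leaf: one latent token per cell
    leaves = []
    half = size / 2.0
    for dx in (-0.25, 0.25):
        for dy in (-0.25, 0.25):
            for dz in (-0.25, 0.25):
                child_center = (center[0] + dx * size,
                                center[1] + dy * size,
                                center[2] + dz * size)
                leaves += build_adaptive_octree(child_center, half, cell_error,
                                                threshold, depth + 1, max_depth)
    return leaves
```

Leaf cells can then be serialized in a fixed traversal order and each mapped to a latent vector by a query-based transformer, which is what yields variable-length token sequences.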


Accenture-NVS1: A Novel View Synthesis Dataset

March 2025 · 1 Read

This paper introduces ACC-NVS1, a specialized dataset designed for research on Novel View Synthesis specifically for airborne and ground imagery. Data for ACC-NVS1 was collected in Austin, TX and Pittsburgh, PA in 2023 and 2024. The collection encompasses six diverse real-world scenes captured from both airborne and ground cameras, resulting in a total of 148,000 images. ACC-NVS1 addresses challenges such as varying altitudes and transient objects. This dataset is intended to supplement existing datasets, providing additional resources for comprehensive research, rather than serving as a benchmark.


Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

February 2025 · 1 Read · 1 Citation

While recent Large Vision-Language Models (LVLMs) have shown remarkable performance in multi-modal tasks, they are prone to generating hallucinatory text responses that do not align with the given visual input, which restricts their practical applicability in real-world scenarios. In this work, inspired by the observation that the text-to-image generation process is the inverse of image-conditioned response generation in LVLMs, we explore the potential of leveraging text-to-image generative models to assist in mitigating hallucinations in LVLMs. We discover that generative models can offer valuable self-feedback for mitigating hallucinations at both the response and token levels. Building on this insight, we introduce self-correcting Decoding with Generative Feedback (DeGF), a novel training-free algorithm that incorporates feedback from text-to-image generative models into the decoding process to effectively mitigate hallucinations in LVLMs. Specifically, DeGF generates an image from the initial response produced by LVLMs, which acts as an auxiliary visual reference and provides self-feedback to verify and correct the initial response through complementary or contrastive decoding. Extensive experimental results validate the effectiveness of our approach in mitigating diverse types of hallucinations, consistently surpassing state-of-the-art methods across six benchmarks. Code is available at https://github.com/zhangce01/DeGF.
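The token-level use of generative feedback can be sketched as combining two logit distributions: one conditioned on the original image and one conditioned on the image re-generated from the initial response. The lvlm_logits interface and the alpha weight below are illustrative assumptions; see the released code for the actual formulation.

```python
import torch

# Sketch of token-level decoding with generative feedback, in the spirit of DeGF.
# `lvlm_logits(prompt, image, prefix_ids)` is a hypothetical interface returning
# next-token logits; `alpha` is an illustrative weight.
def self_correcting_step(lvlm_logits, prompt, image, generated_image,
                         prefix_ids, alpha=1.0):
    logits_orig = lvlm_logits(prompt, image, prefix_ids)          # (vocab,)
    logits_gen = lvlm_logits(prompt, generated_image, prefix_ids) # (vocab,)
    if logits_orig.argmax().item() == logits_gen.argmax().item():
        # The two views agree: complementary decoding, average the evidence.
        combined = (logits_orig + logits_gen) / 2.0
    else:
        # The views disagree: contrastive decoding, down-weight tokens favored
        # only under the re-generated image.
        combined = (1 + alpha) * logits_orig - alpha * logits_gen
    return torch.softmax(combined, dim=-1).argmax()
```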


Using Diffusion Priors for Video Amodal Segmentation

December 2024 · 11 Reads

Object permanence in humans is a fundamental cue that helps in understanding the persistence of objects, even when they are fully occluded in the scene. Present-day methods in object segmentation do not account for this amodal nature of the world and only work for segmentation of visible or modal objects. Few amodal methods exist; single-image segmentation methods cannot handle high levels of occlusion, which are better inferred using temporal information, and multi-frame methods have focused solely on segmenting rigid objects. To this end, we propose to tackle video amodal segmentation by formulating it as a conditional generation task, capitalizing on the foundational knowledge in video generative models. Our method is simple; we repurpose these models to condition on a sequence of modal mask frames of an object along with contextual pseudo-depth maps, to learn which object boundaries may be occluded and should therefore be extended to hallucinate the complete extent of the object. This is followed by a content completion stage which is able to inpaint the occluded regions of an object. We benchmark our approach alongside a wide array of state-of-the-art methods on four datasets and show a dramatic improvement of up to 13% for amodal segmentation in an object's occluded region.
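A rough sketch of how the conditioning described above might be assembled: per-frame modal masks and pseudo-depth maps are stacked channel-wise with the noisy target and passed to a video diffusion model. Shapes and the video_diffusion_model call are illustrative assumptions, not the authors' code.

```python
import torch

# Illustrative conditioning setup for video amodal segmentation as conditional
# generation: modal masks and pseudo-depth condition a video diffusion model.
T, H, W = 16, 256, 256
modal_masks = torch.rand(T, 1, H, W)      # visible-region (modal) masks per frame
pseudo_depth = torch.rand(T, 1, H, W)     # contextual pseudo-depth per frame
noisy_amodal = torch.randn(T, 1, H, W)    # noisy amodal-mask target being denoised

cond = torch.cat([noisy_amodal, modal_masks, pseudo_depth], dim=1)  # (T, 3, H, W)
# amodal_masks = video_diffusion_model(cond, timestep)   # hypothetical call
```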


Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers

November 2024 · 5 Reads

Generative Large Multimodal Models (LMMs) like LLaVA and Qwen-VL excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering. Despite strong performance, LMMs are not directly suited for foundational discriminative vision-language tasks (i.e., tasks requiring discrete label predictions) such as image classification and multiple-choice VQA. One key challenge in utilizing LMMs for discriminative tasks is the extraction of useful features from generative models. To overcome this issue, we propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks. Toward this end, we present Sparse Attention Vectors (SAVs) -- a finetuning-free method that leverages sparse attention head activations (fewer than 1% of the heads) in LMMs as strong features for VL tasks. With only a few examples, SAVs demonstrate state-of-the-art performance compared to a variety of few-shot and finetuned baselines on a collection of discriminative tasks. Our experiments also imply that SAVs can scale in performance with additional examples and generalize to similar tasks, establishing SAVs as both effective and robust multimodal feature representations.
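The core recipe — keep a sparse subset of attention-head features chosen from few-shot examples, then classify by nearest class centroid — can be sketched as follows. The feature shapes and the head-selection score are simplified stand-ins, not the authors' implementation.

```python
import torch

# Sketch of the Sparse Attention Vectors idea: select a small fraction of
# attention-head features that best separate the classes, then classify by
# nearest class centroid restricted to those heads.
def select_heads(feats, labels, keep_frac=0.01):
    """feats: (N, num_heads, dim) head activations for N labeled examples."""
    classes = labels.unique()
    centroids = torch.stack([feats[labels == c].mean(0) for c in classes])  # (C, H, D)
    spread = centroids.std(dim=0).mean(dim=-1)        # per-head class separation (H,)
    k = max(1, int(keep_frac * feats.shape[1]))       # keep ~1% of heads
    return spread.topk(k).indices, centroids

def classify(query_feats, head_idx, centroids):
    """query_feats: (num_heads, dim); returns the predicted class index."""
    q = query_feats[head_idx]                         # (k, D)
    c = centroids[:, head_idx]                        # (C, k, D)
    dists = (c - q).pow(2).sum(dim=(-1, -2))          # squared distance per class
    return dists.argmin()
```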


Citations (48)


... Recently, a series of training-free approaches have directly reduced language priors using contrastive decoding [27; 45], achieving remarkable performance. These methods construct an alternative logit distribution on top of the original one through techniques such as masking the image [21], perturbing the instruction [22], augmenting the vision input [23], or performing cross-modal conversion [26]. During decoding, the two logit distributions are contrasted to eliminate language priors. ...

Reference:

Mitigate Language Priors in Large Vision-Language Models by Cross-Images Contrastive Decoding
Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models

... Latent-NeRF [32] applies the SDS-loss [38] to rendered images to optimize a NeRF representation in the stable diffusion latent space. More recently, Paint-it [59], Fantasia3D [10], and FlashTex [17] have enhanced SDS-loss-based texturing by incorporating Physically Based Rendering (PBR), BRDF modeling, and illumination control, respectively. These optimization techniques require processing each object individually, leading to slower performance, whereas our method has a much shorter inference time. ...

FlashTex: Fast Relightable Mesh Texturing with LightControlNet
  • Citing Chapter
  • November 2024

... adoption of FID [22] for assessing visual quality (Vis. Qual.) and VQAScore [35] for Text-Vision Relevance (Txt-Vis Rel.), inclusion of an automated Judge Model, reproducibility support (Reprod.), incorporation of human alignment (Hum.-Align.), and number of evaluated models (#Models). ...

Evaluating Text-to-Visual Generation with Image-to-Text Generation
  • Citing Chapter
  • October 2024

... Liu et al. (2023) use contrastive pre-training to distill vision-foundation model features for label-efficient segmentation. Osep et al. (2024) distill vision foundation models into a zero-shot Lidar panoptic segmentation model. Peng et al. (2023); Xiao et al. (2024) similarly distill 2D foundational knowledge into 3D but rely on Lidar and camera inputs to classify Lidar/RGB-D points at test-time. ...

Better Call SAL: Towards Learning to Segment Anything in Lidar
  • Citing Chapter
  • October 2024

... The rapid advancement of large multimodal models (LMMs) has revolutionized the fields of both text-to-video (T2V) generation [1][2][3] and video-to-text (V2T) interpretation [4][5][6], leading to high-quality video generation and comprehensive multimodal video understanding capabilities. However, state-of-the-art T2V models may still produce videos with degraded perceptual quality and limited text-video correspondence, and thus may fail to meet human preferences [7][8][9]. Given the high cost and inefficiency of human evaluation, it is of great significance to develop a reliable and scalable evaluation metric that aligns well with human preferences for AI-generated videos (AIGVs) and corresponding T2V models. ...

Evaluating and Improving Compositional Text-to-Visual Generation
  • Citing Conference Paper
  • June 2024

... This task requires interpreting the question and generating an answer through natural language, given a specific document. Recently, DocVQA models have increasingly leveraged large vision-language models for their ability to process both textual and visual modalities at scale (Wang et al., 2024;Zhao et al., 2024;Rasheed et al., 2024;Parashar et al., 2024). ...

The Neglected Tails in Vision-Language Models
  • Citing Conference Paper
  • June 2024

... While traditionally a manual process, APE automates this refinement and has been widely applied in LLMs (Shin et al., 2020; Zhou et al., 2022; Pryzant et al., 2023) to improve text prompts. In the vision-language domain, research has also focused on optimizing textual prompts for CLIP (Liu et al., 2024a) or text-to-image diffusion models (Mañas et al., 2024; Liu et al., 2024b). As LLMs evolve into multimodal systems capable of handling both text and visual data, APE's application to visual inputs is still largely unexplored. ...

Language Models as Black-Box Optimizers for Vision-Language Models
  • Citing Conference Paper
  • June 2024

... Research on 3D scene reconstruction from multi-view images has gained momentum since NeRF [29], which introduced an implicit representation of 3D space using neural networks, enabling photo-realistic novel view synthesis. However, NeRF suffers from long training and inference times, which led to subsequent research incorporating voxel grids [10,14,44,60], triplanes [6,11,20], and hybrid representations [47,65]. ...

HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces
  • Citing Conference Paper
  • June 2024

... Meanwhile, techniques like VastGaussians [12], PyGS [13], and GS-LRM [14] explore the application of 3DGS for large-scale urban scene reconstruction. Other research efforts, such as SplaTAM [15], RTG-SLAM [16] and MonoGS [17], incorporate 3DGS into simultaneous localization and mapping (SLAM) frameworks, while DrivingGaussian [18], GaussianBEV [19] and GaussianFormer [20] investigate the use of 3DGS in autonomous driving scenarios. Despite these promising developments, unique challenges persist in extending 3DGS to outdoor, unconstrained datasets, particularly in terms of scalability and robustness. ...

SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
  • Citing Conference Paper
  • June 2024

... 3D Dynamics with Physics. Recent approaches such as dynamic NeRF [15,32,42,44] and dynamic 3D Gaussians [21,36] extend geometric reasoning to a dynamic domain by first learning the scene in a canonical space and then mapping this space into a deformed space at a particular timestep. While they develop a dynamic 3D geometric understanding of the scene at each timestep, their representations do not capture the underlying physics (for example, internal and external forces) and material-related behaviors that come into play and influence the trajectory of the objects in the scene. ...

Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis
  • Citing Conference Paper
  • March 2024