Li Fei-Fei’s research while affiliated with Stanford University and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (443)


Figure 3. Sampling weight dynamics over iterations for example videos. Ground truth frames are marked in red. Sampling weights progressively focus on ground truth frames across iterations (1, 11, and 21), indicating improved model alignment with keyframes over time. Notably, due to the efficient sampling in temporal search, our model can simultaneously zoom in and focus on distantly located key frames (e.g., around 50s and 100s in the top plot).
Figure 4. Performance improvement with increasing search frames. T* consistently enhances accuracy and reaches near-human oracle performance at 64 frames.
Re-thinking Temporal Search for Long-Form Video Understanding
  • Preprint
  • File available

April 2025 · 11 Reads

Jinhui Ye · Zihan Wang · Haosen Sun · [...] · Manling Li

Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding, studying a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). In particular, our contributions are two-fold: First, we formulate temporal search as a Long Video Haystack problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos given specific queries. To validate our formulation, we create LV-Haystack, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only 2.1% temporal F1 score on the LVBench subset. Next, inspired by visual search in images, we re-think temporal searching and propose a lightweight keyframe searching framework, T*, which casts the expensive temporal search as a spatial search problem. T* leverages superior visual localization capabilities typically used in images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that when integrated with existing methods, T* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T* improves GPT-4o's performance from 50.5% to 53.1% and LLaVA-OneVision-72B's performance from 56.5% to 62.4% on LongVideoBench XL subset. Our PyTorch code, benchmark dataset and models are included in the Supplementary material.
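The adaptive temporal zoom described above (and reflected in the sampling-weight dynamics of Figure 3) can be sketched as an iterative re-weighting loop over frame indices. This is a minimal sketch under assumed interfaces: score_frames stands in for whatever per-frame visual scorer is used, and the weight-update rule and hyperparameters are illustrative, not the released T* implementation.

import numpy as np

def temporal_search(num_frames, score_frames, iterations=20,
                    samples_per_iter=8, temperature=0.5, seed=0):
    """Keep per-frame sampling weights; each iteration samples a few frames,
    scores them, and shifts probability mass toward high-scoring frames."""
    rng = np.random.default_rng(seed)
    weights = np.full(num_frames, 1.0 / num_frames)        # start uniform over the video
    for _ in range(iterations):
        idx = rng.choice(num_frames, size=min(samples_per_iter, num_frames),
                         replace=False, p=weights)          # draw candidate frames
        scores = np.asarray(score_frames(idx), dtype=float) # query relevance per sampled frame
        boost = np.zeros(num_frames)
        boost[idx] = np.exp(scores / temperature)            # sharpen around relevant frames
        weights = 0.5 * weights + 0.5 * boost / boost.sum()
        weights /= weights.sum()
    return np.argsort(weights)[::-1]                         # frames ranked by final weight

Returning a ranking rather than a fixed top-k makes it easy to respect different inference budgets, e.g. keeping the top 32 frames as in the reported GPT-4o and LLaVA-OneVision experiments.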


WorldScore: A Unified Evaluation Benchmark for World Generation

April 2025 · 5 Reads

We introduce the WorldScore benchmark, the first unified benchmark for world generation. We decompose world generation into a sequence of next-scene generation tasks with explicit camera trajectory-based layout specifications, enabling unified evaluation of diverse approaches from 3D and 4D scene generation to video generation models. The WorldScore benchmark encompasses a curated dataset of 3,000 test examples that span diverse worlds: static and dynamic, indoor and outdoor, photorealistic and stylized. The WorldScore metrics evaluate generated worlds through three key aspects: controllability, quality, and dynamics. Through extensive evaluation of 19 representative models, including both open-source and closed-source ones, we reveal key insights and challenges for each category of models. Our dataset, evaluation code, and leaderboard can be found at https://haoyi-duan.github.io/WorldScore/
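Since the benchmark scores each generated world along three aspects (controllability, quality, and dynamics), a unified score is naturally an aggregate of per-example sub-scores. The snippet below is only an illustrative aggregation, assuming 0-1 sub-scores and equal weights; the official metric definitions and weighting live in the released evaluation code.

from dataclasses import dataclass
from statistics import mean

@dataclass
class SceneResult:
    controllability: float  # adherence to the camera-trajectory / layout specification
    quality: float          # visual fidelity of the generated scene
    dynamics: float         # plausibility of motion in dynamic worlds

def aggregate_worldscore(results, weights=(1/3, 1/3, 1/3)):
    """Average each aspect over the test set, then combine with fixed weights."""
    c = mean(r.controllability for r in results)
    q = mean(r.quality for r in results)
    d = mean(r.dynamics for r in results)
    wc, wq, wd = weights
    return {"controllability": c, "quality": q, "dynamics": d,
            "worldscore": wc * c + wq * q + wd * d}

print(aggregate_worldscore([SceneResult(0.8, 0.7, 0.6), SceneResult(0.9, 0.6, 0.5)]))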


Evaluating large language models in echocardiography reporting: opportunities and challenges

March 2025 · 2 Reads

European Heart Journal - Digital Health

Aims: The increasing need for diagnostic echocardiography tests presents challenges in preserving the quality and promptness of reports. While Large Language Models (LLMs) have proven effective in summarizing clinical texts, their application in echo remains underexplored.

Methods and results: Adult echocardiography studies, conducted at the Mayo Clinic from 1 January 2017 to 31 December 2017, were categorized into two groups: development (all Mayo locations except Arizona) and Arizona validation sets. We adapted open-source LLMs (Llama-2, MedAlpaca, Zephyr, and Flan-T5) using In-Context Learning and Quantized Low-Rank Adaptation fine-tuning (FT) for echo report summarization from ‘Findings’ to ‘Impressions.’ Against cardiologist-generated Impressions, the models’ performance was assessed both quantitatively with automatic metrics and qualitatively by cardiologists. The development dataset included 97 506 reports from 71 717 unique patients, predominantly male (55.4%), with an average age of 64.3 ± 15.8 years. EchoGPT, a fine-tuned Llama-2 model, outperformed other models with win rates ranging from 87% to 99% across automatic metrics, and produced reports comparable to cardiologists in qualitative review (significantly preferred for conciseness (P < 0.001), with no significant preference in completeness, correctness, or clinical utility). Correlations between automatic and human metrics were fair to modest, the best being RadGraph F1 score vs. clinical utility (r = 0.42), and the automatic metrics were insensitive (0–5% drop) to changes in measurement numbers.

Conclusion: EchoGPT can generate draft reports for human review and approval, helping to streamline the workflow. However, scalable evaluation approaches dedicated to echo reports remain necessary.
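For readers who want a concrete picture of the fine-tuning setup, Quantized Low-Rank Adaptation of a Llama-2 backbone is typically wired up as below. This is a minimal sketch, not the EchoGPT recipe: the checkpoint name, LoRA rank, target modules, and prompt template are all assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

BASE = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint; the paper's model size may differ

# 4-bit quantization of the frozen base weights (the "Quantized" part of QLoRA)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")

# Trainable low-rank adapters on the attention projections (the "LoRA" part)
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

def format_example(findings: str, impressions: str = "") -> str:
    # Hypothetical prompt template mapping echo "Findings" to "Impressions"
    return f"### Findings:\n{findings}\n\n### Impressions:\n{impressions}"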


Figure 11. Comparisons of 3DGS fitting methods. Our method achieves fitting quality comparable to the upper bound while using significantly fewer valid 3D Gaussians (those with positive opacity).
GaussianVerse contains large-scale 3DGS fittings with an adaptive number of Gaussians, enabling a variety of applications.
Repurposing 2D Diffusion Models with Gaussian Atlas for 3D Generation

March 2025 · 32 Reads

Recent advances in text-to-image diffusion models have been driven by the increasing availability of paired 2D data. However, the development of 3D diffusion models has been hindered by the scarcity of high-quality 3D data, resulting in less competitive performance compared to their 2D counterparts. To address this challenge, we propose repurposing pre-trained 2D diffusion models for 3D object generation. We introduce Gaussian Atlas, a novel representation that utilizes dense 2D grids, enabling the fine-tuning of 2D diffusion models to generate 3D Gaussians. Our approach demonstrates successful transfer learning from a pre-trained 2D diffusion model to a 2D manifold flattened from 3D structures. To support model training, we compile GaussianVerse, a large-scale dataset comprising 205K high-quality 3D Gaussian fittings of various 3D objects. Our experimental results show that text-to-image diffusion models can be effectively adapted for 3D content generation, bridging the gap between 2D and 3D modeling.
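The core idea, flattening per-Gaussian parameters onto a dense 2D grid so that a 2D diffusion backbone can consume them like an image, can be illustrated as follows. The channel layout, zero padding, and row-major placement are assumptions for illustration; the paper's actual atlas construction may differ.

import numpy as np

def pack_gaussians_to_atlas(gaussians: np.ndarray, grid_size: int = 64) -> np.ndarray:
    """Pack N Gaussian feature vectors (e.g. position 3 + scale 3 + rotation 4 +
    opacity 1 + color 3 = 14 channels) into a channels-first grid_size x grid_size image."""
    n, c = gaussians.shape
    cells = grid_size * grid_size
    assert n <= cells, "increase grid_size or subsample the Gaussians"
    atlas = np.zeros((cells, c), dtype=gaussians.dtype)
    atlas[:n] = gaussians                               # unused cells stay zero-padded
    return atlas.reshape(grid_size, grid_size, c).transpose(2, 0, 1)

# Example: 3,000 Gaussians with 14 attributes each -> a (14, 64, 64) pseudo-image
atlas = pack_gaussians_to_atlas(np.random.rand(3000, 14).astype(np.float32))
print(atlas.shape)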


Figure 2. Example reconstructions. Comparison of original and reconstructed images of faces and text. OpenMagViT-V2 and FlowMo-Lo are 0.07-bits-per-pixel tokenizers to be compared against each other; LlamaGen-32 and FlowMo-Hi are 0.22-bits-per-pixel tokenizers to be compared against each other. Best viewed zoomed in in the electronic version. More comparisons are available on our website.
Figure 8. Multimodal reconstruction. After post-training, FlowMo reconstruction remains multimodal, but biased towards preserving the perceptually relevant details of the image, which manifests here by the variance concentrating in the background.
Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization

March 2025 · 16 Reads

Since the advent of popular visual generation frameworks like VQGAN and latent diffusion models, state-of-the-art image generation systems have generally been two-stage systems that first tokenize or compress visual data into a lower-dimensional latent space before learning a generative model. Tokenizer training typically follows a standard recipe in which images are compressed and reconstructed subject to a combination of MSE, perceptual, and adversarial losses. Diffusion autoencoders have been proposed in prior work as a way to learn end-to-end perceptually-oriented image compression, but have not yet shown state-of-the-art performance on the competitive task of ImageNet-1K reconstruction. We propose FlowMo, a transformer-based diffusion autoencoder that achieves a new state-of-the-art for image tokenization at multiple compression rates without using convolutions, adversarial losses, spatially-aligned two-dimensional latent codes, or distilling from other tokenizers. Our key insight is that FlowMo training should be broken into a mode-matching pre-training stage and a mode-seeking post-training stage. In addition, we conduct extensive analyses and explore the training of generative models atop the FlowMo tokenizer. Our code and models will be available at http://kylesargent.github.io/flowmo .
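The key training insight, splitting tokenizer training into a mode-matching pre-training stage and a mode-seeking post-training stage, amounts to switching the objective partway through optimization. The skeleton below only conveys that schedule; the loss functions, optimizer, and model are placeholders, not the released FlowMo training code.

def train_two_stage(tokenizer, data_loader, optimizer,
                    mode_matching_loss, mode_seeking_loss,
                    pretrain_steps, posttrain_steps):
    """Run mode-matching pre-training, then mode-seeking post-training."""
    step = 0
    for images in data_loader:
        # Pick the objective for the current stage of the schedule.
        loss_fn = mode_matching_loss if step < pretrain_steps else mode_seeking_loss
        reconstruction = tokenizer(images)     # encode -> latent -> diffusion decode
        loss = loss_fn(reconstruction, images)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step >= pretrain_steps + posttrain_steps:
            break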


Figure 1. Visualizations of Sample Questions from MOMA-QA. We illustrate the three distinct types of questions in our dataset, each representing a different category for video question answering. All questions in our dataset are generated from a human-annotated spatio-temporal scene graph (shown on the right). The node of interest for the relationship and motion questions is colored red in the scene graph and outlined in the video.
Zero-Shot Performance Comparison of SGVLM with Baselines on the MOMA-QA Dataset. Our method outperforms, or performs on par with, existing methods in the zero-shot setting.
Fine-tuned Performance Comparison of SGVLM with Baselines on the MOMA-QA Dataset. SGVLM-NoLoc: an ablation of SGVLM in which the frame localizer is removed and replaced with uniform frame sampling. SGVLM-NoSG: an ablation of SGVLM in which the scene graph predictor is removed and the model performs inference solely on the frame embeddings.
Towards Fine-Grained Video Question Answering

March 2025 · 2 Reads

In the rapidly evolving domain of video understanding, Video Question Answering (VideoQA) remains a focal point. However, existing datasets exhibit gaps in temporal and spatial granularity, which consequently limits the capabilities of existing VideoQA methods. This paper introduces the Multi-Object Multi-Actor Question Answering (MOMA-QA) dataset, which is designed to address these shortcomings by emphasizing temporal localization, spatial relationship reasoning, and entity-centric queries. With ground truth scene graphs and temporal interval annotations, MOMA-QA is ideal for developing models for fine-grained video understanding. Furthermore, we present a novel video-language model, SGVLM, which incorporates a scene graph predictor, an efficient frame retriever, and a pre-trained large language model for temporal localization and fine-grained relationship understanding. Evaluations on MOMA-QA and other public datasets demonstrate the superior performance of our model, setting new benchmarks for VideoQA.
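The abstract names three components: a frame retriever for temporal localization, a scene graph predictor for fine-grained relations, and a pre-trained large language model that produces the answer. A minimal composition of such components might look like the sketch below; the interfaces and prompt format are assumptions, not the SGVLM API.

from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class VideoQAPipeline:
    retrieve_frames: Callable[[Any, str, int], List[Any]]  # (video, question, k) -> key frames
    predict_scene_graph: Callable[[Any], str]              # frame -> serialized scene graph
    answer: Callable[[str], str]                           # prompt -> answer text

    def __call__(self, video, question: str, k: int = 4) -> str:
        frames = self.retrieve_frames(video, question, k)           # temporal localization
        graphs = "\n".join(self.predict_scene_graph(f) for f in frames)
        prompt = f"Scene graphs:\n{graphs}\n\nQuestion: {question}\nAnswer:"
        return self.answer(prompt)                                   # LLM produces the answer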


BEHAVIOR Robot Suite: Streamlining Real-World Whole-Body Manipulation for Everyday Household Activities

March 2025 · 28 Reads

Real-world household tasks present significant challenges for mobile manipulation robots. An analysis of existing robotics benchmarks reveals that successful task performance hinges on three key whole-body control capabilities: bimanual coordination, stable and precise navigation, and extensive end-effector reachability. Achieving these capabilities requires careful hardware design, but the resulting system complexity further complicates visuomotor policy learning. To address these challenges, we introduce the BEHAVIOR Robot Suite (BRS), a comprehensive framework for whole-body manipulation in diverse household tasks. Built on a bimanual, wheeled robot with a 4-DoF torso, BRS integrates a cost-effective whole-body teleoperation interface for data collection and a novel algorithm for learning whole-body visuomotor policies. We evaluate BRS on five challenging household tasks that not only emphasize the three core capabilities but also introduce additional complexities, such as long-range navigation, interaction with articulated and deformable objects, and manipulation in confined spaces. We believe that BRS's integrated robotic embodiment, data collection interface, and learning framework mark a significant step toward enabling real-world whole-body manipulation for everyday household tasks. BRS is open-sourced at https://behavior-robot-suite.github.io/
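For intuition about what "whole-body" means for this embodiment (a wheeled base, a 4-DoF torso, and two arms), a policy's action at each step has to cover all of those degrees of freedom at once. The dataclass below is a hypothetical action layout, not the actual BRS action space; the 7-DoF arms and scalar grippers are assumptions.

from dataclasses import dataclass, field
from typing import List

@dataclass
class WholeBodyAction:
    base_velocity: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])  # vx, vy, yaw rate
    torso_joints: List[float] = field(default_factory=lambda: [0.0] * 4)         # 4-DoF torso
    left_arm_joints: List[float] = field(default_factory=lambda: [0.0] * 7)      # assumed 7-DoF arm
    right_arm_joints: List[float] = field(default_factory=lambda: [0.0] * 7)
    left_gripper: float = 0.0
    right_gripper: float = 0.0

    def to_vector(self) -> List[float]:
        # Flatten into the single vector a whole-body visuomotor policy would output.
        return (self.base_velocity + self.torso_joints + self.left_arm_joints
                + self.right_arm_joints + [self.left_gripper, self.right_gripper])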


A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards

February 2025 · 13 Reads

Task specification for robotic manipulation in open-world environments is challenging, requiring flexible and adaptive objectives that align with human intentions and can evolve through iterative feedback. We introduce Iterative Keypoint Reward (IKER), a visually grounded, Python-based reward function that serves as a dynamic task specification. Our framework leverages VLMs to generate and refine these reward functions for multi-step manipulation tasks. Given RGB-D observations and free-form language instructions, we sample keypoints in the scene and generate a reward function conditioned on these keypoints. IKER operates on the spatial relationships between keypoints, leveraging commonsense priors about the desired behaviors, and enabling precise SE(3) control. We reconstruct real-world scenes in simulation and use the generated rewards to train reinforcement learning (RL) policies, which are then deployed in the real world, forming a real-to-sim-to-real loop. Our approach demonstrates notable capabilities across diverse scenarios, including both prehensile and non-prehensile tasks, showcasing multi-step task execution, spontaneous error recovery, and on-the-fly strategy adjustments. The results highlight IKER's effectiveness in enabling robots to perform multi-step tasks in dynamic environments through iterative reward shaping.
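Because IKER rewards are expressed as Python functions over scene keypoints, it helps to see what such a function can look like. The example below is purely illustrative of the format, written by hand rather than generated by the system: the keypoint names, distance thresholds, and shaping constants are assumptions for a hypothetical "place the mug next to the plate" instruction.

import numpy as np

def keypoint_reward(keypoints: dict) -> float:
    """Reward for a hypothetical "place the mug next to the plate" task.
    `keypoints` maps names to 3D positions (metres) in the robot frame."""
    mug = np.asarray(keypoints["mug_handle"])
    plate = np.asarray(keypoints["plate_center"])
    table_z = keypoints["table_height"]

    lateral = np.linalg.norm(mug[:2] - plate[:2])   # horizontal mug-to-plate distance
    on_table = abs(mug[2] - table_z) < 0.02         # mug resting on the table surface

    reward = -lateral                               # dense shaping: move the mug toward the plate
    if lateral < 0.15 and on_table:                 # sparse bonus once "next to" is satisfied
        reward += 1.0
    return float(reward)

print(keypoint_reward({"mug_handle": [0.4, 0.1, 0.76],
                       "plate_center": [0.5, 0.2, 0.75],
                       "table_height": 0.75}))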


s1: Simple test-time scaling

January 2025 · 62 Reads · 2 Citations

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
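Budget forcing as described above is a decoding-time control: cut reasoning off once a maximum budget is spent, or, if the model tries to stop too early, suppress the end-of-thinking delimiter and append "Wait" so it keeps reasoning. The sketch below assumes a generic incremental generate(text, stop, max_new_tokens) API and a "</think>"-style delimiter; both are placeholders rather than the exact s1 tokens.

def budget_forced_decode(generate, prompt, min_thinking_tokens, max_thinking_tokens,
                         end_of_thinking="</think>", wait_token="Wait"):
    """Control test-time compute by capping or extending the thinking segment."""
    text, used = prompt, 0
    while used < max_thinking_tokens:
        chunk = generate(text, stop=end_of_thinking,
                         max_new_tokens=max_thinking_tokens - used)
        text += chunk
        used += len(chunk.split())          # rough token count, for illustration only
        if used >= min_thinking_tokens:     # enough thinking: let the model stop
            break
        text += wait_token                  # too little thinking: force it to continue
    text += end_of_thinking                 # close the thinking segment (forcefully if needed)
    return text + generate(text, stop=None, max_new_tokens=512)  # produce the final answer

Appending "Wait" often makes the model re-examine its previous steps, which is where the reported extrapolation on AIME24 from extra test-time compute comes from.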



Citations (39)


... learned features [13,32] or learned "doppelganger" detectors [8]. More recently, approaches have departed from this classical pipeline to directly learn multi-view tasks such as 2D correspondences [33,47,59], camera estimation [30,63,73], pointmap prediction [26,65] and novel-view synthesis [34,46,61] in an end-to-end manner. This shift towards learning-based components and approaches has led to impressive progress, particularly in challenging scenarios, e.g., sparsely sampled input, or varying illumination. ...

Reference:

AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis
ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image
  • Citing Conference Paper
  • June 2024

... Recent works, such as PointLLM and 3D-LLM (Hong et al. 2023), aim to align the latent representations of textual descriptions with 3D point clouds, allowing machine perception systems to interpret and interact with the physical world more effectively through text-based instructions. For instance, as illustrated in Figure 1, if a system can accurately identify a target in a scene based on text descriptions, it could significantly enhance the intelligence of robotics applications, enabling tasks such as completing household chores through verbal instructions (Li et al. 2023; Ge et al. 2024) or improving human-robot collaboration (Team et al. 2023; Huang et al. 2023c). ...

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
  • Citing Conference Paper
  • June 2024

... In terms of the type of robotic embodiment, most works use parallel grippers or simpler end-effectors. However, few methods perform dexterous manipulation using DMs (Si et al., 2024;Ma et al., 2024a;Ze et al., 2024;Chen et al., 2024;Wang et al., 2024a;Freiberg et al., 2025), to facilitate their stability and robustness, also in this high-dimensional setting. ...

DexCap: Scalable and Portable Mocap Data Collection System for Dexterous Manipulation
  • Citing Conference Paper
  • July 2024

... Laboratories and specialized clinics have clearly demonstrated that camera-based imaging systems can identify biomarkers of neurological and musculoskeletal pathology with increasing precision as camera and visual processing algorithms evolve [14]. These systems, however, continue to present challenges in effective and scalable deployment for remote monitoring in the world outside of the clinic or lab. ...

Data-driven discovery of movement-linked heterogeneity in neurodegenerative diseases

Nature Machine Intelligence

... Furthermore, some other work focuses on generalizing the policy to different camera views [69,46,63], scene appearance [30,51], and embodiments [12]. Some studies exploit the power of Large Language Models (LLMs) and Vision Language Models (VLMs) to endow robots with generalization abilities [23,7,39,14]. Instead of adopting generalizable policy architecture, auxiliary learning objectives and powerful foundation models, our work is concentrated on generating high-quality, diverse, and realistic data to instill generalization abilities to the learned policy. ...

Open X-Embodiment: Robotic Learning Datasets and RT-X Models
  • Citing Conference Paper
  • May 2024

... A recent advancement of embodied AI is cooperative embodied AI systems, where multiple autonomous agents collaborate to achieve shared goals [11,12,20,21,26,29,30,43,75,83,86-88,90]. These systems typically employ modular frameworks to perform long-horizon tasks, integrating high-level planning and reasoning with low-level action execution. ...

MindAgent: Emergent Gaming Interaction
  • Citing Conference Paper
  • January 2024

... Model-based approaches usually ensure privacy by leveraging differential privacy (DP) [8,16,17,33]. This method provides a theoretical and empirical guarantee of privacy by incorporating noisy mechanisms into the training algorithms, using the privacy parameters ϵ and δ [1,13,14,38,42]. However, its effectiveness is limited when it comes to post-training privacy analysis such as visual privacy. ...

Differentially Private Video Activity Recognition
  • Citing Conference Paper
  • January 2024

... On the other hand, implicit representations like Neural Radiance Fields (NeRFs) [5,19] and point-based representations like 3D Gaussian Splatting (3DGS) [20,21] offer high visual-fidelity human reconstruction from monocular videos. However, they struggle with occlusions as they often require pixel-level fine details for subject-specific optimization, which can be largely affected by occlusion noises, as discussed in [12,22]. ...

Rendering Humans from Object-Occluded Monocular Videos
  • Citing Conference Paper
  • October 2023

... Each subtask can then be mapped to a combination of skills. At this level, the granularity of skills is still relatively coarse and can be further broken down into primitive skills [10]. Real-life tasks are virtually limitless, and as the environment changes over time, they often require the acquisition of unforeseen skills. ...

Primitive Skill-Based Robot Learning from Human Evaluative Feedback
  • Citing Conference Paper
  • October 2023

... In this work, we consider skills that are continuously parameterized and we focus on parameter policy learning [1,15,20,40,45] as a mechanism for rapidly specializing skills. For example, a "pick" skill may be parameterized by a relative grasp and a "sweep" skill by a sweeping velocity ( Figure 1). ...

Active Task Randomization: Learning Robust Skills via Unsupervised Generation of Diverse and Feasible Tasks
  • Citing Conference Paper
  • October 2023