Alan Yuille’s research while affiliated with Johns Hopkins University and other places


Publications (931)


How Well Do Supervised 3D Models Transfer to Medical Imaging Tasks?
  • Preprint

January 2025

Wenxuan Li · Alan Yuille · Zongwei Zhou

The pre-training and fine-tuning paradigm has become prominent in transfer learning. For example, if the model is pre-trained on ImageNet and then fine-tuned to PASCAL, it can significantly outperform a model trained on PASCAL from scratch. While ImageNet pre-training has shown enormous success, it is performed in 2D, and the learned features are for classification tasks; when transferring to more diverse tasks, like 3D image segmentation, its performance is inevitably compromised due to the deviation from the original ImageNet context. A significant challenge lies in the lack of large, annotated 3D datasets rivaling the scale of ImageNet for model pre-training. To overcome this challenge, we make two contributions. Firstly, we construct AbdomenAtlas 1.1, which comprises 9,262 three-dimensional computed tomography (CT) volumes with high-quality, per-voxel annotations of 25 anatomical structures and pseudo annotations of seven tumor types. Secondly, we develop a suite of models that are pre-trained on our AbdomenAtlas 1.1 for transfer learning. Our preliminary analyses indicate that the model trained with only 21 CT volumes, 672 masks, and 40 GPU hours has a transfer learning ability similar to the model trained with 5,050 (unlabeled) CT volumes and 1,152 GPU hours. More importantly, the transfer learning ability of supervised models can further scale up with larger annotated datasets, achieving significantly better performance than preexisting pre-trained models, irrespective of their pre-training methodologies or data sources. We hope this study can facilitate collective efforts in constructing larger 3D medical datasets and more releases of supervised pre-trained models.
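As a rough illustration of the pre-train/fine-tune workflow the abstract describes, the PyTorch sketch below loads supervised pre-trained weights into a toy 3D segmenter, swaps the segmentation head for a downstream label set, and fine-tunes end to end. The network, the checkpoint filename, and the dummy data are placeholders, not the released AbdomenAtlas models.

```python
# Minimal sketch of the pre-train / fine-tune transfer-learning workflow.
# The network, checkpoint path, and data are placeholders, not the
# AbdomenAtlas release described in the abstract.
import os
import torch
import torch.nn as nn

def tiny_3d_segmenter(num_classes: int) -> nn.Module:
    """Stand-in 3D encoder-decoder; real work would use e.g. a 3D U-Net."""
    return nn.Sequential(
        nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv3d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv3d(16, num_classes, kernel_size=1),
    )

model = tiny_3d_segmenter(num_classes=25)

# 1) Load supervised pre-trained weights (hypothetical checkpoint file).
ckpt = "abdomenatlas_pretrained.pt"  # hypothetical path
if os.path.exists(ckpt):
    model.load_state_dict(torch.load(ckpt, map_location="cpu"), strict=False)

# 2) Replace the segmentation head for the downstream task's label set.
model[-1] = nn.Conv3d(16, 3, kernel_size=1)  # e.g. 3 classes in the target task

# 3) Fine-tune end-to-end on the (small) downstream dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()
volume = torch.randn(1, 1, 32, 32, 32)          # dummy CT patch
target = torch.randint(0, 3, (1, 32, 32, 32))   # dummy per-voxel labels
loss = loss_fn(model(volume), target)
loss.backward()
optimizer.step()
```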


VideoAuteur: Towards Long Narrative Video Generation
  • Preprint
  • File available

January 2025 · 9 Reads

Junfei Xiao · Feng Cheng · Lu Qi · [...] · Lu Jiang

Recent video generation models have shown promising results in producing high-quality video clips lasting several seconds. However, these models face challenges in generating long sequences that convey clear and informative events, limiting their ability to support coherent narrations. In this paper, we present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We validate the quality of our proposed dataset in terms of visual fidelity and textual caption accuracy using state-of-the-art Vision-Language Models (VLMs) and video generation models, respectively. We further introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos and emphasize the role of aligning visual embeddings to achieve improved overall video quality. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned keyframes, supported by finetuning techniques that integrate text and image embeddings within the video generation process. Project page: https://videoauteur.github.io/
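The "Long Narrative Video Director" is described only at a high level; the sketch below is one plausible reading of an interleaved caption-then-keyframe-then-clip loop, with every component stubbed out rather than taken from VideoAuteur.

```python
# Hypothetical sketch of a narrative-director loop: propose the next caption,
# generate an aligned keyframe, then condition a clip generator on both.
from dataclasses import dataclass, field

@dataclass
class NarrativeState:
    captions: list = field(default_factory=list)
    keyframes: list = field(default_factory=list)  # e.g. latent image embeddings

def propose_next_caption(state: NarrativeState, goal: str) -> str:
    """Stand-in for an LLM/VLM step that continues the narration."""
    return f"Step {len(state.captions) + 1} of {goal}"

def generate_keyframe(caption: str, prev_keyframe):
    """Stand-in for an image generator conditioned on text (+ previous frame)."""
    return {"caption": caption, "prev": prev_keyframe}

def generate_clip(keyframe, caption: str):
    """Stand-in for a video model that animates around the keyframe."""
    return f"clip[{caption}]"

def direct_long_video(goal: str, num_scenes: int = 4) -> list:
    state, clips = NarrativeState(), []
    for _ in range(num_scenes):
        caption = propose_next_caption(state, goal)
        prev = state.keyframes[-1] if state.keyframes else None
        keyframe = generate_keyframe(caption, prev)
        state.captions.append(caption)
        state.keyframes.append(keyframe)
        clips.append(generate_clip(keyframe, caption))
    return clips

print(direct_long_video("making a three-course pasta dinner"))
```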


RadGPT: Constructing 3D Image-Text Tumor Datasets

January 2025 · 21 Reads

With over 85 million CT scans performed annually in the United States, creating tumor-related reports is a challenging and time-consuming task for radiologists. To address this need, we present RadGPT, an Anatomy-Aware Vision-Language AI Agent for generating detailed reports from CT scans. RadGPT first segments tumors, including benign cysts and malignant tumors, and their surrounding anatomical structures, then transforms this information into both structured reports and narrative reports. These reports provide tumor size, shape, location, attenuation, volume, and interactions with surrounding blood vessels and organs. Extensive evaluation on unseen hospitals shows that RadGPT can produce accurate reports, with high sensitivity/specificity for small tumor (<2 cm) detection: 80/73% for liver tumors, 92/78% for kidney tumors, and 77/77% for pancreatic tumors. For large tumors, sensitivity ranges from 89% to 97%. The results significantly surpass the state-of-the-art in abdominal CT report generation. RadGPT generated reports for 17 public datasets. Through radiologist review and refinement, we have ensured the reports' accuracy, and created the first publicly available image-text 3D medical dataset, comprising over 1.8 million text tokens and 2.7 million images from 9,262 CT scans, including 2,947 tumor scans/reports of 8,562 tumor instances. Our reports can: (1) localize tumors in eight liver sub-segments and three pancreatic sub-segments annotated per-voxel; (2) determine pancreatic tumor stage (T1-T4) in 260 reports; and (3) present individual analyses of multiple tumors--rare in human-made reports. Importantly, 948 of the reports are for early-stage tumors.
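To make the segmentation-to-structured-report step concrete, here is a minimal sketch that converts a per-voxel tumor mask into report fields such as volume, maximum diameter, and location. The field names and the 2 cm size cutoff mirror the abstract, but the schema is illustrative and not RadGPT's actual output format.

```python
# Minimal sketch: derive structured report fields from a per-voxel tumor mask.
# Field names and thresholds are illustrative, not RadGPT's schema.
import numpy as np

def tumor_report(mask: np.ndarray, spacing_mm=(1.0, 1.0, 1.0), organ="liver") -> dict:
    """mask: boolean 3D array for a single tumor instance."""
    voxels = int(mask.sum())
    voxel_volume_ml = float(np.prod(spacing_mm)) / 1000.0
    zs, ys, xs = np.nonzero(mask)
    extent_mm = [float((ax.max() - ax.min() + 1) * s)
                 for ax, s in zip((zs, ys, xs), spacing_mm)]
    diameter_mm = max(extent_mm)
    return {
        "organ": organ,
        "volume_ml": round(voxels * voxel_volume_ml, 2),
        "max_diameter_mm": round(diameter_mm, 1),
        "size_category": "small (<2 cm)" if diameter_mm < 20 else "large",
        "centroid_voxel": [float(zs.mean()), float(ys.mean()), float(xs.mean())],
    }

# Toy example: a 10x10x10-voxel cube in a 64^3 volume.
mask = np.zeros((64, 64, 64), dtype=bool)
mask[20:30, 20:30, 20:30] = True
print(tumor_report(mask))
```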


ScaleMAI: Accelerating the Development of Trusted Datasets and AI Models

January 2025 · 13 Reads

Building trusted datasets is critical for transparent and responsible Medical AI (MAI) research, but creating even small, high-quality datasets can take years of effort from multidisciplinary teams. This process often delays AI benefits, as human-centric data creation and AI-centric model development are treated as separate, sequential steps. To overcome this, we propose ScaleMAI, an agent of AI-integrated data curation and annotation, allowing data quality and AI performance to improve in a self-reinforcing cycle and reducing development time from years to months. We adopt pancreatic tumor detection as an example. First, ScaleMAI progressively creates a dataset of 25,362 CT scans, including per-voxel annotations for benign/malignant tumors and 24 anatomical structures. Second, through progressive human-in-the-loop iterations, ScaleMAI provides a Flagship AI Model that can approach the proficiency of expert annotators (30 years of experience) in detecting pancreatic tumors. The Flagship Model significantly outperforms models developed from smaller, fixed-quality datasets, with substantial gains in tumor detection (+14%), segmentation (+5%), and classification (72%) on three prestigious benchmarks. In summary, ScaleMAI transforms the speed, scale, and reliability of medical dataset creation, paving the way for a variety of impactful, data-driven applications.
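The self-reinforcing cycle described above can be sketched as a simple human-in-the-loop loop: the current model pseudo-labels new scans, low-confidence cases go to experts, and the model is retrained on the grown dataset. Everything below is a stub for illustration, not the ScaleMAI agent.

```python
# Toy human-in-the-loop curation cycle: pseudo-label, route uncertain cases to
# experts, grow the dataset, retrain. All components are stubs.
import random

class ToyModel:
    """Stand-in detection/segmentation model."""
    def predict_with_confidence(self, scan):
        return f"label({scan})", random.random()

def expert_annotate(scan):
    return f"expert_label({scan})"

def retrain(model, labeled):
    # Placeholder: a real cycle would fine-tune on the enlarged dataset.
    return model

def curation_cycle(model, labeled, unlabeled, rounds=3, conf_threshold=0.9):
    for _ in range(rounds):
        accepted, flagged = [], []
        for scan in unlabeled:
            label, conf = model.predict_with_confidence(scan)
            (accepted if conf >= conf_threshold else flagged).append((scan, label))
        reviewed = [(s, expert_annotate(s)) for s, _ in flagged]
        labeled = labeled + accepted + reviewed   # dataset quality/size grows
        model = retrain(model, labeled)           # model improves in turn
        unlabeled = []                            # new scans would be queued here
    return model, labeled

model, dataset = curation_cycle(ToyModel(), labeled=[],
                                unlabeled=[f"scan_{i}" for i in range(5)])
print(len(dataset))
```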


MedShapeNet - a large-scale dataset of 3D medical shapes for computer vision

December 2024 · 174 Reads · 18 Citations

Objectives: Shape is commonly used to describe objects. State-of-the-art algorithms in medical imaging are predominantly diverging from computer vision, where voxel grids, meshes, point clouds, and implicit surface models are used. This is seen from the growing popularity of ShapeNet (51,300 models) and Princeton ModelNet (127,915 models). However, a large collection of anatomical shapes (e.g., bones, organs, vessels) and 3D models of surgical instruments is missing. Methods: We present MedShapeNet to translate data-driven vision algorithms to medical applications and to adapt state-of-the-art vision algorithms to medical problems. As a unique feature, we directly model the majority of shapes on the imaging data of real patients. We present use cases in classifying brain tumors, skull reconstructions, multi-class anatomy completion, education, and 3D printing. Results: By now, MedShapeNet includes 23 datasets with more than 100,000 shapes that are paired with annotations (ground truth). Our data is freely accessible via a web interface and a Python application programming interface and can be used for discriminative, reconstructive, and variational benchmarks as well as various applications in virtual, augmented, or mixed reality, and 3D printing. Conclusions: MedShapeNet contains medical shapes from anatomy and surgical instruments and will continue to collect data for benchmarks and applications. The project page is https://medshapenet.ikim.nrw/.
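The dataset is accessed through the web interface or the project's Python API (see the project page for the actual package and calls). Purely as an illustration of what one might do with a downloaded shape, the snippet below uses the generic trimesh library on a hypothetical STL file; it is not the MedShapeNet API.

```python
# Neutral illustration only: load a downloaded anatomical shape with trimesh
# and compute simple statistics. The filename is hypothetical.
import os
import trimesh

shape_path = "example_liver.stl"  # hypothetical file obtained from MedShapeNet
if os.path.exists(shape_path):
    mesh = trimesh.load(shape_path)
    print(f"vertices: {len(mesh.vertices)}, faces: {len(mesh.faces)}")
    print(f"watertight: {mesh.is_watertight}, volume: {mesh.volume:.1f}")
    points = mesh.sample(2048)  # point cloud for a shape-analysis benchmark
else:
    print("Download a shape from https://medshapenet.ikim.nrw/ first.")
```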


Figure 6. Comparison of real and synthetic pancreatic tumors across three types: Cyst, PDAC, and PNET. The top row displays real medical images, highlighting the distinct characteristics of each tumor type: cysts with smooth, fluid-filled appearances, hypoattenuating PDAC masses, and hypervascular PNET lesions with bright enhancement. The bottom row shows their corresponding synthetic counterparts, demonstrating the ability of the model to replicate texture, shape, and contrast features unique to each tumor type.
Text-Driven Tumor Synthesis

December 2024 · 6 Reads

Tumor synthesis can generate examples that AI often misses or over-detects, improving AI performance by training on these challenging cases. However, existing synthesis methods, which are typically unconditional -- generating images from random variables -- or conditioned only by tumor shapes, lack controllability over specific tumor characteristics such as texture, heterogeneity, boundaries, and pathology type. As a result, the generated tumors may be overly similar to or duplicates of existing training data, failing to effectively address AI's weaknesses. We propose a new text-driven tumor synthesis approach, termed TextoMorph, that provides textual control over tumor characteristics. This is particularly beneficial for examples that confuse the AI the most, such as early tumor detection (increasing Sensitivity by +8.5%), tumor segmentation for precise radiotherapy (increasing DSC by +6.3%), and classification between benign and malignant tumors (improving Sensitivity by +8.2%). By incorporating text mined from radiology reports into the synthesis process, we increase the variability and controllability of the synthetic tumors to target AI's failure cases more precisely. Moreover, TextoMorph uses contrastive learning across different texts and CT scans, significantly reducing dependence on scarce image-report pairs (only 141 pairs used in this study) by leveraging a large corpus of 34,035 radiology reports. Finally, we have developed rigorous tests to evaluate synthetic tumors, including a Text-Driven Visual Turing Test and Radiomics Pattern Analysis, showing that our synthetic tumors are realistic and diverse in texture, heterogeneity, boundaries, and pathology.
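The contrastive learning across texts and CT scans mentioned in the abstract can be illustrated with a CLIP-style InfoNCE objective between report-text embeddings and tumor-patch embeddings. The toy encoders and dimensions below are assumptions for the sketch, not TextoMorph's architecture.

```python
# Minimal sketch of a CLIP-style InfoNCE loss aligning report-text embeddings
# with CT tumor-patch embeddings. Encoders are toy linear layers.
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, image_emb: torch.Tensor, temperature=0.07):
    """text_emb, image_emb: (batch, dim) paired embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature   # similarity matrix
    targets = torch.arange(len(text_emb))             # i-th text matches i-th patch
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy stand-ins for a text encoder and a 3D patch encoder.
text_encoder = torch.nn.Linear(128, 64)
patch_encoder = torch.nn.Linear(4096, 64)
report_tokens = torch.randn(8, 128)   # dummy encoded radiology-report text
tumor_patches = torch.randn(8, 4096)  # dummy flattened CT tumor patches
loss = info_nce(text_encoder(report_tokens), patch_encoder(tumor_patches))
loss.backward()
```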


Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

December 2024 · 1 Read

Diffusion models, and their generalization, flow matching, have had a remarkable impact on the field of media generation. Here, the conventional approach is to learn the complex mapping from a simple source distribution of Gaussian noise to the target media distribution. For cross-modal tasks such as text-to-image generation, this same mapping from noise to image is learnt whilst including a conditioning mechanism in the model. One key and thus far relatively unexplored feature of flow matching is that, unlike diffusion models, it is not constrained to use noise as the source distribution. Hence, in this paper, we propose a paradigm shift, and ask whether we can instead train flow matching models to learn a direct mapping from the distribution of one modality to the distribution of another, thus obviating the need for both the noise distribution and the conditioning mechanism. We present a general and simple framework, CrossFlow, for cross-modal flow matching. We show the importance of applying Variational Encoders to the input data, and introduce a method to enable classifier-free guidance. Surprisingly, for text-to-image, CrossFlow with a vanilla transformer without cross attention slightly outperforms standard flow matching, and we show that it scales better with training steps and model size, while also allowing for interesting latent arithmetic which results in semantically meaningful edits in the output space. To demonstrate the generalizability of our approach, we also show that CrossFlow is on par with or outperforms the state-of-the-art for various cross-modal / intra-modal mapping tasks, viz. image captioning, depth estimation, and image super-resolution. We hope this paper contributes to accelerating progress in cross-modal media generation.
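A minimal sketch of the cross-modal flow-matching objective described above: rather than mapping Gaussian noise to images, regress the velocity of a straight path from a text latent (source) to an image latent (target), with no conditioning mechanism. The toy MLP and latent sizes are assumptions, not CrossFlow.

```python
# Minimal flow-matching sketch with a text latent as the source distribution
# and an image latent as the target. The MLP and latent dims are placeholders.
import torch
import torch.nn as nn

dim = 64
velocity_net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

def flow_matching_loss(text_latent, image_latent):
    """text_latent, image_latent: (batch, dim) paired samples of the two modalities."""
    t = torch.rand(text_latent.size(0), 1)                      # random time in [0, 1]
    x_t = (1 - t) * text_latent + t * image_latent              # linear interpolation path
    target_velocity = image_latent - text_latent                # d x_t / d t along the path
    pred_velocity = velocity_net(torch.cat([x_t, t], dim=-1))   # time-conditioned prediction
    return ((pred_velocity - target_velocity) ** 2).mean()

text_latent = torch.randn(16, dim)    # e.g. output of a variational text encoder
image_latent = torch.randn(16, dim)   # e.g. VAE latent of the paired image
loss = flow_matching_loss(text_latent, image_latent)
loss.backward()
```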


FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching

December 2024 · 3 Reads

Autoregressive (AR) modeling has achieved remarkable success in natural language processing by enabling models to generate text with coherence and contextual understanding through next token prediction. Recently, in image generation, VAR proposes scale-wise autoregressive modeling, which extends the next token prediction to the next scale prediction, preserving the 2D structure of images. However, VAR encounters two primary challenges: (1) its complex and rigid scale design limits generalization in next scale prediction, and (2) the generator's dependence on a discrete tokenizer with the same complex scale structure restricts modularity and flexibility in updating the tokenizer. To address these limitations, we introduce FlowAR, a general next scale prediction method featuring a streamlined scale design, where each subsequent scale is simply double the previous one. This eliminates the need for VAR's intricate multi-scale residual tokenizer and enables the use of any off-the-shelf Variational AutoEncoder (VAE). Our simplified design enhances generalization in next scale prediction and facilitates the integration of Flow Matching for high-quality image synthesis. We validate the effectiveness of FlowAR on the challenging ImageNet-256 benchmark, demonstrating superior generation performance compared to previous methods. Code will be available at https://github.com/OliverRensu/FlowAR.
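The streamlined scale schedule ("each subsequent scale is simply double the previous one") can be sketched as an upsample-and-refine loop; the refiner below is a toy convolution standing in for FlowAR's autoregressive/flow-matching module.

```python
# Toy next-scale-prediction loop with a doubling scale schedule.
import torch
import torch.nn as nn
import torch.nn.functional as F

scales = [4, 8, 16, 32]                                # each resolution doubles the previous one
refiner = nn.Conv2d(8, 8, kernel_size=3, padding=1)    # stand-in for the AR/flow module

latent = torch.randn(1, 8, scales[0], scales[0])       # coarsest-scale latent
for size in scales[1:]:
    upsampled = F.interpolate(latent, size=(size, size), mode="nearest")
    latent = upsampled + refiner(upsampled)            # next-scale prediction as residual refinement
print(latent.shape)  # torch.Size([1, 8, 32, 32])
```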


Figure 6 | Three exploration modes (interactive, GPT-assisted, and goal-driven), each defined by distinct exploration instructions.
Figure 7 | GenEx-driven imaginative exploration can gather observations that are just as informed as those obtained through physical exploration. Algorithm 2 (Imagination-Augmented Policy). Require: an initial observation i_0 and a world initialization description l_0; a goal g for answering embodied questions (e.g., "Danger ahead: stop or go ahead?"); a navigation instruction I (e.g., "Navigate to the unseen parts of the environment."); GenEx p(x_{0:T} | i_0, l_0, I) as defined in §2.1 and Algorithm 1; and an embodied policy π_θ(A | o, g) conditioned on an observation variable o and the goal g. Step 1: gather imagined observations with GenEx, x_{0:T} ~ p(x_{0:T} | i_0, l_0, I). Step 2: select the action that maximizes the policy given the imagined observations, A = argmax_A π_θ(A | i_0, x_{0:T}, g).
Figure 8 | Single agent reasoning with imagination and multi-agent reasoning and planning with imagination. (a) The single agent can imagine previously unobserved views to better understand the environment. (b) In the multi-agent scenario, the agent infers the perspective of others to make decisions based on a more complete understanding of the situation. Input and generated images are panoramic; cubes are extracted for visualization.
Figure 12 | Active 3D mapping from a single image.
GenEx: Generating an Explorable World

December 2024 · 15 Reads

Understanding, navigating, and exploring the 3D physical real world has long been a central challenge in the development of artificial intelligence. In this work, we take a step toward this goal by introducing GenEx, a system capable of planning complex embodied world exploration, guided by its generative imagination that forms priors (expectations) about the surrounding environments. GenEx generates an entire 3D-consistent imaginative environment from as little as a single RGB image, bringing it to life through panoramic video streams. Leveraging scalable 3D world data curated from Unreal Engine, our generative model is grounded in the physical world. It captures a continuous 360-degree environment with little effort, offering a boundless landscape for AI agents to explore and interact with. GenEx achieves high-quality world generation, robust loop consistency over long trajectories, and demonstrates strong 3D capabilities such as consistency and active 3D mapping. Powered by generative imagination of the world, GPT-assisted agents are equipped to perform complex embodied tasks, including both goal-agnostic exploration and goal-driven navigation. These agents utilize predictive expectation regarding unseen parts of the physical world to refine their beliefs, simulate different outcomes based on potential decisions, and make more informed choices. In summary, we demonstrate that GenEx provides a transformative platform for advancing embodied AI in imaginative spaces and brings potential for extending these capabilities to real-world exploration.
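Tying this to Algorithm 2 reconstructed in the figure captions above, a minimal sketch of the imagination-augmented decision loop is: imagine unobserved views with the world model, then score candidate actions against the real and imagined observations. All components below are stubs, not GenEx.

```python
# Toy imagination-augmented policy: imagine views, then pick the action the
# policy scores highest given real + imagined observations.
import random

def genex_imagine(initial_obs, world_description, instruction, horizon=4):
    """Stand-in for the generative world model p(x_{0:T} | i_0, l_0, I)."""
    return [f"imagined_view_{t}" for t in range(horizon)]

def policy_score(action, observations, goal):
    """Stand-in for pi_theta(A | o, g); returns a scalar preference."""
    random.seed(hash((action, goal)) % (2 ** 32))
    return random.random()

def choose_action(initial_obs, world_description, instruction, goal, actions):
    imagined = genex_imagine(initial_obs, world_description, instruction)
    observations = [initial_obs] + imagined
    return max(actions, key=lambda a: policy_score(a, observations, goal))

print(choose_action("panorama_0", "city street at dusk",
                    "Navigate to the unseen parts of the environment.",
                    "Danger ahead: stop or go ahead?", ["stop", "go ahead"]))
```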


Figure 3. Challenges of 3D spatial reasoning questions in our 3DSRBench. See Sec. 3.2. (a) Height questions require 3D spatial reasoning over a combination of camera extrinsics and object 3D locations. Notice how different camera pitch rotations play a crucial role in determining the final answer. (b) Previous 2D spatial reasoning questions can be addressed by analyzing objects' 2D locations and depths, while our orientation questions require complex 3D spatial reasoning on objects' 3D orientations and 3D locations.
Figure 6. Distribution of closed-form answers in our 3DSRBench.
Figure 7. Two example questions for each of the 12 question types (part I): height and location questions.
Figure 8. Two example questions for each of the 12 question types (part II): orientation questions.
Figure 9. Two example questions for each of the 12 question types (part III): multi-object reasoning questions.
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

December 2024 · 6 Reads

3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects within 3D space. This allows models to develop a comprehensive understanding of the 3D scene, enabling their applicability to a broader range of areas, such as autonomous navigation, robotics, and AR/VR. While large multi-modal models (LMMs) have achieved remarkable progress in a wide range of image and video understanding tasks, their capabilities to perform 3D spatial reasoning on diverse natural images are less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 2,772 manually annotated visual question-answer pairs across 12 question types. We conduct a robust and thorough evaluation of 3D spatial reasoning capabilities by balancing the data distribution and adopting a novel FlipEval strategy. To further study the robustness of 3D spatial reasoning w.r.t. camera 3D viewpoints, our 3DSRBench includes two subsets with 3D spatial reasoning questions on paired images with common and uncommon viewpoints. We benchmark a wide range of open-sourced and proprietary LMMs, uncovering their limitations in various aspects of 3D awareness, such as height, orientation, location, and multi-object reasoning, as well as their degraded performance on images with uncommon camera viewpoints. Our 3DSRBench provides valuable findings and insights for the future development of LMMs with strong 3D reasoning capabilities. Our project page and dataset are available at https://3dsrbench.github.io.
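The abstract reports limitations per aspect (height, orientation, location, multi-object reasoning), which suggests a per-question-type accuracy breakdown; the sketch below shows such a breakdown on dummy records and is not the 3DSRBench evaluation code.

```python
# Toy per-question-type accuracy breakdown for a VQA-style benchmark.
from collections import defaultdict

def per_type_accuracy(records):
    """records: iterable of dicts with 'type', 'prediction', and 'answer' keys."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["type"]] += 1
        correct[r["type"]] += int(r["prediction"] == r["answer"])
    return {qtype: correct[qtype] / total[qtype] for qtype in total}

records = [
    {"type": "height", "prediction": "A", "answer": "A"},
    {"type": "height", "prediction": "B", "answer": "A"},
    {"type": "orientation", "prediction": "C", "answer": "C"},
    {"type": "multi-object", "prediction": "D", "answer": "B"},
]
print(per_type_accuracy(records))  # e.g. {'height': 0.5, 'orientation': 1.0, ...}
```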


Citations (42)


... They require intrusion into the training code and additional computational cost due to retraining, which involves large labeled datasets and extensive training time. In the medical imaging domain, this is particularly challenging due to data privacy concerns [5,13,35] and the substantial size of medical datasets [19,21,36]. ...

Reference:

Post-Training Quantization for 3D Medical Image Segmentation: A Practical Study on Real Inference Engines
MedShapeNet - a large-scale dataset of 3D medical shapes for computer vision
  • Citing Article
  • December 2024

... Zero-shot tasks with diffusion models. Recently, diffusion models have also been extended to zero-shot tasks in RGB-based vision applications, such as semantic correspondence [65], segmentation [2,52], image captioning [64], and image classification [10,29]. These approaches often rely on large-scale pretrained diffusion models, such as LDMs [14,45], trained on datasets like LAION-5B [48] with billions of text-image pairs. ...

IG Captioner: Information Gain Captioners Are Strong Zero-Shot Classifiers
  • Citing Chapter
  • October 2024

... Recent large multi-modal models (LMMs) [1,3,43] have achieved significant improvements in a wide range of image and video understanding tasks, such as image captioning [2,28], visual question answering [10,20,24,31,44], visual grounding [49], decision-making [7,26,34], and action recognition [35,48]. However, recent studies have shown that even state-of-the-art LMMs exhibited limited 3D awareness [17,36] and understanding of spatial relationships [46,47], which are crucial for LMMs to develop ...

Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data
  • Citing Chapter
  • October 2024

... Driven by the goal of achieving rapid training speeds without compromising on accuracy in 3D language embedding fields, this paper presents Laser, an efficient end-to-end framework for segmenting 3D neural radiance fields, designed to balance high performance with computational efficiency (the overall workflow of our work is shown in Figure 1 (b)). In order to obtain pixel-level features from CLIP, we draw inspiration from recent advancements in CLIP dense prediction research, such as MaskCLIP [13], CAT-Seg [14], and SCLIP [15]. These methods modify the final pooling layer within the CLIP encoder to derive dense image embeddings. ...

SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
  • Citing Chapter
  • October 2024

... Wang et al. [51] built a dataset of 100K CTs for pretraining but it is not publicly available for research. To collect large-scale 3D data for pre-training, the necessity arises to aggregate datasets from diverse sources, encompassing various hospitals across different regions and countries [29], [62]. This will lead to diverse image characteristics and inconsistent imaging quality in the dataset, introducing new challenges to pre-training. ...

Embracing Massive Medical Data
  • Citing Chapter
  • October 2024

... We hypothesize that the uniform structure of synthetic lesions stems from insufficient guidance, such as the conditions [6,8] used in conditional diffusion models [9,10], and poor modeling of the lesion's internal structure [1,5,11]. For LUS images in particular, we further hypothesize that lesion location plays a critical role in synthesizing realistic images. ...

From Pixel to Cancer: Cellular Automata in Computed Tomography
  • Citing Chapter
  • October 2024

... To generate views, these 3D Gaussians are projected onto the camera's imaging plane and rendered using point-based volume rendering [6]. Due to its compactness and rasterization speed, 3D-GS is applied to various scenarios, including 3D generation [58]- [60], autonomous driving [61], [62], scene understanding [63], and medical imaging [64]- [66]. ...

Radiative Gaussian Splatting for Efficient X-Ray Novel View Synthesis
  • Citing Chapter
  • September 2024

... Humans are good at detecting and isolating outliers (Chai et al., 2020). This is not the case when it comes to training machine learning models (Sukhbaatar et al., 2015;Wang et al., 2024a;Sabour et al., 2023). Robustly training deep learning models in the presence of outliers is an important challenge. ...

Benchmarking Robustness in Neural Radiance Fields
  • Citing Conference Paper
  • June 2024

... This challenge is one of the NTIRE 2024 Workshop 1 associated challenges on: dense and non-homogeneous dehazing [3], blind compressed image enhancement [40], shadow removal [35], efficient super resolution [32], image super resolution (×4) [13], light field image superresolution [38], stereo image super-resolution [37], HR depth from images of specular and transparent surfaces [43], bracketing image restoration and enhancement [45], portrait quality assessment [10], quality assessment for AI-generated content [27], restore any image model (RAIM) in the wild [25], RAW image superresolution [14], short-form UGC video quality assessment [24], low light enhancement [28], and RAW burst alignment and ISP challenge. ...

NTIRE 2024 Challenge on Low Light Image Enhancement: Methods and Results

... 3D Gaussian Splatting (3DGS) (Kerbl et al., 2023) has emerged as a significant advancement in 3D reconstruction, offering high-resolution real-time rendering capabilities that surpass traditional Neural Radiance Field (NeRF) methods (Pumarola et al., 2021; Park et al., 2021; Barron et al., 2021; 2023; Xu et al., 2022). This efficiency facilitates various downstream applications (Cai et al., 2024b; Liu et al., 2024b; Qin et al., 2024a; Wu et al., 2024a; Cai et al., 2024a; Zhang et al., 2024a; Xu et al., 2024; Ren et al., 2024; Yu et al., 2024; Keetha et al., 2024; Liu et al., 2024a). For dynamic scenes, 4D-GS (Wu et al., 2024a) has been adapted to handle temporal changes efficiently, enabling real-time rendering of deformable scenes. ...

Structure-Aware Sparse-View X-Ray 3D Reconstruction
  • Citing Conference Paper
  • June 2024