Yan-Pei Cao’s research while affiliated with Tsinghua University and other places


Publications (29)


MV-Adapter: Multi-view Consistent Image Generation Made Easy
  • Preprint
  • File available

December 2024 · 12 Reads

Zehuan Huang · Haoran Wang · [...] · Lu Sheng

Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, leading to (1) high computational costs, especially with large base models and high-resolution images, and (2) degradation in image quality due to optimization difficulties and scarce high-quality 3D data. In this paper, we propose the first adapter-based solution for multi-view image generation, and introduce MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without altering the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in pre-trained models, mitigating overfitting risks. To efficiently model the 3D geometric knowledge within the adapter, we introduce innovative designs that include duplicated self-attention layers and parallel attention architecture, enabling the adapter to inherit the powerful priors of the pre-trained models to model the novel 3D knowledge. Moreover, we present a unified condition encoder that seamlessly integrates camera parameters and geometric information, facilitating applications such as text- and image-based 3D generation and texturing. MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL), and demonstrates adaptability and versatility. It can also be extended to arbitrary view generation, enabling broader applications. We demonstrate that MV-Adapter sets a new quality standard for multi-view image generation, and opens up new possibilities due to its efficiency, adaptability and versatility.
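The "duplicated self-attention layers and parallel attention architecture" described above can be read as a trainable attention branch, initialized from the frozen T2I self-attention, that attends across views and runs alongside the original layer. The sketch below illustrates that reading only; it is not the MV-Adapter implementation, and the class name ParallelMultiViewAdapter and all tensor shapes are assumptions.

# Minimal sketch, not the authors' code: a trainable multi-view attention branch,
# initialized by duplicating a frozen self-attention layer, runs in parallel with
# the original layer so the pre-trained feature space stays untouched.
import copy
import torch
import torch.nn as nn

class ParallelMultiViewAdapter(nn.Module):  # hypothetical name
    def __init__(self, frozen_attn: nn.MultiheadAttention, num_views: int):
        super().__init__()
        self.frozen_attn = frozen_attn              # original T2I self-attention (kept frozen)
        self.mv_attn = copy.deepcopy(frozen_attn)   # duplicated copy, the only trainable part
        self.num_views = num_views
        for p in self.frozen_attn.parameters():
            p.requires_grad_(False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * num_views, tokens, dim) image tokens of each view
        base, _ = self.frozen_attn(x, x, x)         # per-view self-attention, unchanged
        bv, t, d = x.shape
        b = bv // self.num_views
        mv = x.reshape(b, self.num_views * t, d)    # tokens of all views attend jointly
        mv, _ = self.mv_attn(mv, mv, mv)
        return base + mv.reshape(bv, t, d)          # parallel branches, summed

# toy usage: 2 samples x 4 views, 16 tokens each, 64-dim features
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out = ParallelMultiViewAdapter(attn, num_views=4)(torch.randn(8, 16, 64))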


MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

[Figure previews: Figure 5, qualitative comparisons on synthetic datasets, including 3D-Front [15] and BlendSwap [1]; Figure 9, detailed comparison between existing compositional generation methods and our multi-instance diffusion; ablation table over the number of multi-instance attention layers (#K), the inclusion of the global scene image (S.) input, and the use of Objaverse [9] (O.) for mixed training, reporting CD-S↓, F-Score-S↑, CD-O↓, F-Score-O↑, and IoU-B↑.]

December 2024 · 23 Reads

This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
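One way to picture the multi-instance attention described above: the latent tokens of every object instance attend jointly to the tokens of all instances plus a global scene-image embedding, so inter-object layout is modeled within a single denoising pass. The sketch below is an illustration under assumed shapes, not MIDI's released code; MultiInstanceAttention is a hypothetical name.

# Minimal sketch, not MIDI's code: one "multi-instance attention" step where the
# tokens of all object instances and a global scene embedding attend jointly.
import torch
import torch.nn as nn

class MultiInstanceAttention(nn.Module):  # hypothetical name
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, inst_tokens: torch.Tensor, scene_tokens: torch.Tensor) -> torch.Tensor:
        # inst_tokens:  (batch, num_instances, tokens_per_instance, dim)
        # scene_tokens: (batch, num_scene_tokens, dim) from the global scene image
        b, n, t, d = inst_tokens.shape
        q = inst_tokens.reshape(b, n * t, d)         # queries: all instance tokens
        kv = torch.cat([q, scene_tokens], dim=1)     # keys/values: instances + scene context
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, n, t, d)

# toy usage: 2 scenes, 3 instances with 32 latent tokens each, 77 scene tokens
layer = MultiInstanceAttention(dim=64)
y = layer(torch.randn(2, 3, 32, 64), torch.randn(2, 77, 64))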


OctFusion: Octree-based Diffusion Models for 3D Shape Generation

August 2024 · 20 Reads

Diffusion models have emerged as a popular method for 3D generation. However, it is still challenging for diffusion models to efficiently generate diverse and high-quality 3D shapes. In this paper, we introduce OctFusion, which can generate 3D shapes with arbitrary resolutions in 2.5 seconds on a single Nvidia 4090 GPU, and the extracted meshes are guaranteed to be continuous and manifold. The key components of OctFusion are the octree-based latent representation and the accompanying diffusion models. The representation combines the benefits of both implicit neural representations and explicit spatial octrees and is learned with an octree-based variational autoencoder. The proposed diffusion model is a unified multi-scale U-Net that enables weights and computation sharing across different octree levels and avoids the complexity of widely used cascaded diffusion schemes. We verify the effectiveness of OctFusion on the ShapeNet and Objaverse datasets and achieve state-of-the-art performances on shape generation tasks. We demonstrate that OctFusion is extendable and flexible by generating high-quality color fields for textured mesh generation and high-quality 3D shapes conditioned on text prompts, sketches, or category labels. Our code and pre-trained models are available at https://github.com/octree-nn/octfusion.
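The weight sharing across octree levels mentioned above can be illustrated by a single denoising block applied to every level of a latent hierarchy. The sketch below is only an illustration of that idea and not the OctFusion implementation: dense voxel grids stand in for sparse octree nodes, and SharedLevelDenoiser is a hypothetical name.

# Minimal sketch, not the OctFusion implementation: a denoiser whose weights are
# shared across octree depths; dense grids stand in for sparse octree latents.
import torch
import torch.nn as nn

class SharedLevelDenoiser(nn.Module):  # hypothetical name
    def __init__(self, channels: int, num_levels: int):
        super().__init__()
        self.block = nn.Sequential(                  # one set of weights for every level
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.GroupNorm(8, channels),
            nn.SiLU(),
            nn.Conv3d(channels, channels, 3, padding=1),
        )
        self.level_embed = nn.Embedding(num_levels, channels)  # tells the block its depth

    def forward(self, latents: list) -> list:
        outs = []
        for level, x in enumerate(latents):          # coarse -> fine grids
            emb = self.level_embed(torch.tensor(level)).view(1, -1, 1, 1, 1)
            outs.append(x + self.block(x + emb))     # residual denoising step
        return outs

# toy usage: a 16^3 and a 32^3 latent grid denoised by the same weights
model = SharedLevelDenoiser(channels=32, num_levels=2)
denoised = model([torch.randn(1, 32, 16, 16, 16), torch.randn(1, 32, 32, 32, 32)])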





BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis

December 2023 · 13 Reads · 10 Citations

ACM Transactions on Graphics

Synthesizing photorealistic 4D human head avatars from videos is essential for VR/AR, telepresence, and video game applications. Although existing Neural Radiance Fields (NeRF)-based methods achieve high-fidelity results, the computational expense limits their use in real-time applications. To overcome this limitation, we introduce BakedAvatar, a novel representation for real-time neural head avatar synthesis, deployable in a standard polygon rasterization pipeline. Our approach extracts deformable multi-layer meshes from learned isosurfaces of the head and computes expression-, pose-, and view-dependent appearances that can be baked into static textures for efficient rasterization. We thus propose a three-stage pipeline for neural head avatar synthesis, which includes learning continuous deformation, manifold, and radiance fields, extracting layered meshes and textures, and fine-tuning texture details with differential rasterization. Experimental results demonstrate that our representation generates synthesis results of comparable quality to other state-of-the-art methods while significantly reducing the inference time required. We further showcase various head avatar synthesis results from monocular videos, including view synthesis, face reenactment, expression editing, and pose editing, all at interactive frame rates on commodity devices. Source codes and demos are available on our project page.
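A simplified picture of the baking idea described above: dynamic appearance is approximated at render time by blending a small set of static baked textures with expression-dependent weights, which a standard rasterizer can sample cheaply. The sketch below is an assumption-laden illustration, not the paper's pipeline; BakedTextureBlender and the basis layout are hypothetical.

# Minimal sketch of the baking idea, not the paper's pipeline: blend a few static
# baked textures with expression-dependent weights for rasterization-time shading.
import torch
import torch.nn as nn

class BakedTextureBlender(nn.Module):  # hypothetical name
    def __init__(self, num_bases: int, expr_dim: int, tex_res: int = 256):
        super().__init__()
        # static RGB texture basis, baked once after training
        self.basis = nn.Parameter(torch.rand(num_bases, 3, tex_res, tex_res))
        # tiny MLP mapping expression/pose codes to blend weights
        self.to_weights = nn.Sequential(
            nn.Linear(expr_dim, 64), nn.ReLU(), nn.Linear(64, num_bases)
        )

    def forward(self, expr: torch.Tensor) -> torch.Tensor:
        # expr: (batch, expr_dim) -> blended texture (batch, 3, tex_res, tex_res)
        w = torch.softmax(self.to_weights(expr), dim=-1)
        return torch.einsum("bk,kchw->bchw", w, self.basis)

# toy usage: one texture per frame from a 32-dim expression code
tex = BakedTextureBlender(num_bases=8, expr_dim=32)(torch.randn(4, 32))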





Citations (18)


... Videop2p [198], 2023, Diffusion Model (U-Net); Dreamix [199], 2023, Diffusion Model (U-Net); DynVideo [200], 2023, Diffusion Model (U-Net); Anyv2v [201], 2023, Diffusion Model (U-Net); MagicCrop [202], 2023, Diffusion Model (U-Net); ControlAVideo [203], 2023, Diffusion Model (U-Net); CCedit [204], 2024 ...

Reference:

Artificial Intelligence for Biomedical Video Generation
DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing
  • Citing Conference Paper
  • June 2024

... Szymanowicz et al. [40] proposed Splatter Image, which uses a 2D CNN to generate pseudo-images with colored 3D Gaussians per pixel, achieving efficient single-view 3D reconstruction with state-of-the-art performance, but requires multi-view supervision for training. Zou et al. [41] introduced a hybrid Triplane-Gaussian representation, combining explicit and implicit representations to enable fast and high-quality single-view 3D reconstruction. However, generating a complete point cloud of the object from a single image is a challenging task that requires the 3D ground truth to supervise the generation of Gaussian kernels. ...

Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers
  • Citing Conference Paper
  • June 2024
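As a concrete reading of the Splatter Image description in the context quoted above, here is a minimal sketch of a per-pixel Gaussian prediction head. The class name, channel layout, and shapes are assumptions for illustration only, not the cited paper's code.

# Minimal sketch, not the Splatter Image code: a 1x1 convolutional head that turns
# backbone features into one colored 3D Gaussian per pixel.
import torch
import torch.nn as nn

class PerPixelGaussianHead(nn.Module):  # hypothetical name
    GAUSSIAN_CHANNELS = 3 + 3 + 4 + 1 + 3   # xyz + scale + quaternion + opacity + RGB (assumed layout)

    def __init__(self, in_channels: int = 64):
        super().__init__()
        self.head = nn.Conv2d(in_channels, self.GAUSSIAN_CHANNELS, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, in_channels, H, W) -> (batch, H*W, 14) Gaussians
        g = self.head(feats)
        return g.flatten(2).transpose(1, 2)

gaussians = PerPixelGaussianHead()(torch.randn(1, 64, 128, 128))  # (1, 16384, 14)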

... To achieve practical results, we introduce diverse degradation augmentations [69] to the input multi-view images, simulating the distribution of coarse 3D data. In addition, we incorporate efficient multi-view row attention [28,36] to ensure consistency across multi-view features. To further reinforce coherent 3D textures and structures under significant viewpoint changes, we also introduce near-view epipolar aggregation modules, which directly propagate corresponding tokens across near views using epipolar-constrained feature matching [11,18]. ...

EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
  • Citing Conference Paper
  • June 2024
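For the multi-view row attention mentioned in the context above, here is a minimal sketch under assumed tensor shapes: tokens on the same feature-map row of every view attend to each other. It is illustrative only and not the cited paper's module; MultiViewRowAttention is a hypothetical name.

# Minimal sketch, an assumption rather than the paper's module: "row attention"
# across views as one cheap way to encourage multi-view consistency.
import torch
import torch.nn as nn

class MultiViewRowAttention(nn.Module):  # hypothetical name
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, H, W, dim)
        b, v, h, w, d = x.shape
        rows = x.permute(0, 2, 1, 3, 4).reshape(b * h, v * w, d)  # same row, all views
        out, _ = self.attn(rows, rows, rows)
        return out.reshape(b, h, v, w, d).permute(0, 2, 1, 3, 4)

y = MultiViewRowAttention(dim=32)(torch.randn(1, 4, 8, 8, 32))  # 4 consistent views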

... Each representation offers distinct advantages and disadvantages concerning the FoVs and distortion levels. 2) Various spatial transformations: They occur when 360° images are not captured vertically, which makes the appearance of 360° images vary greatly [21]. For instance, in the VR environment, users tend to transform 360° images by changing their viewing directions and zooming in on objects of interest [22]. ...

OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution
  • Citing Conference Paper
  • October 2023

... PIFu [58] first introduces the implicit function for modeling, which is widely adopted by many subsequent methods [7,20,25,59,72,73,82]. With the rapid development of neural radiance fields, many methods [40,53,70] have adopted NeRF [48] to represent the human body. ...

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video
  • Citing Conference Paper
  • October 2023

... Given a raw audio clip A, we first obtain F_m and F_f via Eq. (1). The speaker's lip shape and expression are generally related to the phonemes, rhythm, and other information in the audio (Wu et al., 2023). For the generation of facial expressions for the speaker, we follow (Yi et al., 2023) and model it as a regression task. ...

Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video
  • Citing Conference Paper
  • October 2023

... Multiple recent works have leveraged 3D Morphable Face Models (3DMM) [Blanz and Vetter 1999] as strong priors. This has led to the development of avatars that can be driven by 3DMM expression coefficients [Buehler et al. 2021; Duan et al. 2023; Niemeyer et al. 2022; Zheng et al. 2022, 2023]. Such avatars are typically trained on a sequence of monocular video frames showing various head poses and facial expressions. ...

BakedAvatar: Baking Neural Fields for Real-Time Head Avatar Synthesis
  • Citing Article
  • December 2023

ACM Transactions on Graphics

... In these tasks, it is critical to have a good understanding of the spatial structure of an object. On the other hand, point cloud completion aims to estimate the complete shape of objects from partial observations [28,34,36,38], which pays more attention to the geometric details. Manipulation Tasks. ...

Snowflake Point Deconvolution for Point Cloud Completion and Generation With Skip-Transformer
  • Citing Article
  • October 2022

IEEE Transactions on Pattern Analysis and Machine Intelligence

... SelfRecon [2] combines explicit SMPL+D [48] and implicit IDR [77] to obtain coherent geometry. DoubleField [60] combines Neural Surface Field [45] and NeRF at the feature level in an implicit manner. Some methods [55,61,69,81] use a hybrid representation that binds 3DGS to a single mesh achieving deformation of the avatar. ...

DoubleField: Bridging the Neural Surface and Radiance Fields for High-fidelity Human Reconstruction and Rendering
  • Citing Conference Paper
  • June 2022

... Several studies have explored augmenting NeRF with temporal positional encoding to render dynamic scenes at various time points [33,39]. Other works have focused on integrating temporal information with voxel representations to significantly reduce training time [5,19]. Additionally, some research has adopted k-plane representations to optimize temporal and spatial dimensions [7,10,2,37]. ...

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes