Xun Huang’s research while affiliated with NVIDIA and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (24)


JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
  • Preprint
  • File available

July 2024 · 12 Reads

Yu Zeng · Vishal M. Patel · Haochen Wang · [...] · Yogesh Balaji

Personalized text-to-image generation models enable users to create images that depict their individual possessions in diverse scenes, finding applications in various domains. To achieve the personalization capability, existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which can be non-trivial for general users, resource-intensive, and time-consuming. Despite attempts to develop finetuning-free methods, their generation quality is much lower compared to their finetuning counterparts. In this paper, we propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time by simply using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines.
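
The test-time recipe described above (reference images fed in during sampling, no finetuning) can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in rather than the authors' code: joint_denoiser is an identity stub for the joint-image diffusion model, and the noise schedule is a simplified variance-preserving one.

    import numpy as np

    def joint_denoiser(noisy_set, prompt, t):
        # Placeholder: a joint-image diffusion model would predict the clean
        # version of every image in the set, conditioned on the shared prompt.
        return noisy_set  # identity stub so the sketch runs end to end

    def personalize(reference_images, prompt, steps=50, shape=(64, 64, 3), seed=0):
        rng = np.random.default_rng(seed)
        target = rng.standard_normal(shape)        # target image starts as pure noise
        for step in range(steps, 0, -1):
            t = step / steps
            # Noise the references to the current level so the joint set is consistent,
            # but never update them: they anchor the shared subject.
            noisy_refs = [np.sqrt(1 - t) * r + np.sqrt(t) * rng.standard_normal(shape)
                          for r in reference_images]
            x0_pred = joint_denoiser(noisy_refs + [target], prompt, t)[-1]
            t_next = (step - 1) / steps
            # Simplified ancestral step: move the target toward its predicted clean image.
            target = np.sqrt(1 - t_next) * x0_pred + np.sqrt(t_next) * rng.standard_normal(shape)
        return target

    sample = personalize([np.zeros((64, 64, 3))], "a photo of my mug on a beach")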





DiffCollage: Parallel Generation of Large Content with Diffusion Models

March 2023 · 234 Reads

We present DiffCollage, a compositional diffusion model that can generate large content by leveraging diffusion models trained on generating pieces of the large content. Our approach is based on a factor graph representation where each factor node represents a portion of the content and a variable node represents their overlap. This representation allows us to aggregate intermediate outputs from diffusion models defined on individual nodes to generate content of arbitrary size and shape in parallel without resorting to an autoregressive generation procedure. We apply DiffCollage to various tasks, including infinite image generation, panorama image generation, and long-duration text-guided motion generation. Extensive experimental results with a comparison to strong autoregressive baselines verify the effectiveness of our approach.
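
The factor-graph aggregation can be sketched for the simplest case: two horizontally overlapping pieces of a wide image. This is a hedged illustration under the assumption that factor (piece) scores are added and the shared overlap's score is subtracted so it is not counted twice; piece_score and overlap_score are stubs, not the released model.

    import numpy as np

    H, W, OVERLAP = 64, 64, 16        # each piece is H x W; adjacent pieces share OVERLAP columns
    FULL_W = 2 * W - OVERLAP          # width of the assembled image

    def piece_score(x, t):
        # Placeholder for a pretrained diffusion model's score on an H x W piece.
        return -x  # stub: score of a standard Gaussian

    def overlap_score(x, t):
        # Placeholder for the score on the shared H x OVERLAP strip (a variable node).
        return -x

    def collage_score(x_full, t):
        """Aggregate per-piece scores into one score for the full-width image:
        add both factor (piece) scores, subtract the overlap score counted twice."""
        s = np.zeros_like(x_full)
        s[:, :W] += piece_score(x_full[:, :W], t)                    # left piece
        s[:, FULL_W - W:] += piece_score(x_full[:, FULL_W - W:], t)  # right piece
        s[:, W - OVERLAP:W] -= overlap_score(x_full[:, W - OVERLAP:W], t)
        return s

    # One score-based update on the whole canvas; both pieces are handled in parallel.
    x = np.random.default_rng(0).standard_normal((H, FULL_W))
    x = x + 0.1 * collage_score(x, t=0.5)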


Magic3D: High-Resolution Text-to-3D Content Creation

November 2022 · 463 Reads · 12 Citations

DreamFusion has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework. First, we obtain a coarse model using a low-resolution diffusion prior and accelerate with a sparse 3D hash grid structure. Using the coarse representation as the initialization, we further optimize a textured 3D mesh model with an efficient differentiable renderer interacting with a high-resolution latent diffusion model. Our method, dubbed Magic3D, can create high-quality 3D mesh models in 40 minutes, which is 2x faster than DreamFusion (reportedly taking 1.5 hours on average), while also achieving higher resolution. User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues to various creative applications.
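
The two-stage pipeline can be summarized with a hedged pseudocode-style sketch. The render and sds_gradient functions are stubs (the real system uses a hash-grid NeRF, an efficient differentiable mesh renderer, and score distillation from diffusion models), and none of the names below are the actual API.

    import numpy as np

    rng = np.random.default_rng(0)

    def render(params, resolution):
        # Stage 1 would render a coarse hash-grid NeRF at low resolution;
        # stage 2 a textured mesh through an efficient differentiable renderer.
        return params + rng.standard_normal((resolution, resolution, 3))

    def sds_gradient(image, prompt, prior):
        # Stand-in for a score distillation sampling (SDS) gradient from a
        # text-to-image diffusion model (low-res in stage 1, latent/high-res in stage 2).
        return 0.01 * image

    def optimize(params, prompt, prior, resolution, steps, lr=1e-2):
        for _ in range(steps):
            img = render(params, resolution)
            grad = sds_gradient(img, prompt, prior)
            params = params - lr * float(grad.mean())  # stand-in for backprop through the renderer
        return params

    prompt = "a stone castle on a hill"
    coarse = optimize(0.0, prompt, prior="low-res diffusion", resolution=64, steps=100)             # stage 1
    fine = optimize(coarse, prompt, prior="high-res latent diffusion", resolution=512, steps=100)   # stage 2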


eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

November 2022 · 897 Reads · 13 Citations

Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiffi's "paint-with-words" capability. A user can select a word in the input text and paint it on a canvas to control the output, which is handy for crafting the desired image. The project page is available at https://deepimagination.cc/eDiffi/
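
The staged-expert idea can be illustrated with a hedged sketch that routes each denoising step to a different expert depending on the noise level. The expert functions and the switch point are placeholders, not the released ensemble.

    import numpy as np

    def expert_early(x, t, prompt):
        # Stand-in for the expert specialized for high-noise steps, where the
        # text prompt dominates and the global layout is decided.
        return 0.98 * x

    def expert_late(x, t, prompt):
        # Stand-in for the expert specialized for low-noise steps, where fine
        # visual detail is refined and text conditioning matters less.
        return 0.999 * x

    def sample(prompt, steps=50, switch_t=0.4, shape=(64, 64, 3), seed=0):
        x = np.random.default_rng(seed).standard_normal(shape)
        for step in range(steps, 0, -1):
            t = step / steps                        # 1.0 = pure noise, ~0.0 = nearly clean
            expert = expert_early if t > switch_t else expert_late
            x = expert(x, t, prompt)                # route the step to the matching expert
        return x

    img = sample("a red barn in a snowy field")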


Multimodal Conditional Image Synthesis with Product-of-Experts GANs

October 2022 · 28 Reads · 76 Citations

Lecture Notes in Computer Science

Existing conditional image synthesis frameworks generate images based on user inputs in a single modality, such as text, segmentation, or sketch. They do not allow users to simultaneously use inputs in multiple modalities to control the image synthesis output. This reduces their practicality as multimodal inputs are more expressive and complement each other. To address this limitation, we propose the Product-of-Experts Generative Adversarial Networks (PoE-GAN) framework, which can synthesize images conditioned on multiple input modalities or any subset of them, even the empty set. We achieve this capability with a single trained model. PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator. Through our carefully designed training scheme, PoE-GAN learns to synthesize images with high quality and diversity. Besides advancing the state of the art in multimodal conditional image synthesis, PoE-GAN also outperforms the best existing unimodal conditional image synthesis approaches when tested in the unimodal setting. The project website is available at this link.

Keywords: Image synthesis · Multimodal learning · GAN
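
Conditioning on any subset of modalities can be illustrated with the standard product-of-Gaussian-experts identity (summed precisions, precision-weighted mean), which is one way to fuse per-modality conditioning signals. The encoders below are stubs, and an always-on prior expert keeps the product defined for the empty set; this is a sketch of the general idea, not the paper's exact architecture.

    import numpy as np

    def product_of_gaussians(means, precisions):
        """Product of Gaussian experts: summed precisions, precision-weighted mean."""
        prec = sum(precisions)
        mean = sum(p * m for m, p in zip(means, precisions)) / prec
        return mean, prec

    DIM = 8
    PRIOR = (np.zeros(DIM), np.ones(DIM))                  # prior expert, always active
    ENCODERS = {                                           # placeholder per-modality (mean, precision)
        "text":   (np.full(DIM, 0.5), np.full(DIM, 4.0)),
        "sketch": (np.full(DIM, -0.2), np.full(DIM, 2.0)),
    }

    def fuse(active_modalities):
        means = [PRIOR[0]] + [ENCODERS[m][0] for m in active_modalities]
        precs = [PRIOR[1]] + [ENCODERS[m][1] for m in active_modalities]
        return product_of_gaussians(means, precs)

    fuse(["text", "sketch"])   # both modalities
    fuse(["text"])             # text only
    fuse([])                   # empty set: falls back to the prior (unconditional)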


Multimodal Conditional Image Synthesis with Product-of-Experts GANs

December 2021 · 31 Reads · 1 Citation

Existing conditional image synthesis frameworks generate images based on user inputs in a single modality, such as text, segmentation, sketch, or style reference. They are often unable to leverage multimodal user inputs when available, which reduces their practicality. To address this limitation, we propose the Product-of-Experts Generative Adversarial Networks (PoE-GAN) framework, which can synthesize images conditioned on multiple input modalities or any subset of them, even the empty set. PoE-GAN consists of a product-of-experts generator and a multimodal multiscale projection discriminator. Through our carefully designed training scheme, PoE-GAN learns to synthesize images with high quality and diversity. Besides advancing the state of the art in multimodal conditional image synthesis, PoE-GAN also outperforms the best existing unimodal conditional image synthesis approaches when tested in the unimodal setting. The project website is available at https://deepimagination.github.io/PoE-GAN.
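
One plausible ingredient behind handling arbitrary modality subsets with a single trained model is to drop modalities at random during training so that every subset, including the empty one, is encountered. This is a hedged sketch of that idea, not the paper's exact training scheme.

    import random

    MODALITIES = ["text", "segmentation", "sketch", "style"]

    def sample_modality_subset(keep_prob=0.5, rng=random):
        # Each modality is independently kept or dropped, so any subset can occur;
        # an empty subset corresponds to an unconditional training example.
        return [m for m in MODALITIES if rng.random() < keep_prob]

    for _ in range(5):
        print(sample_modality_subset())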


Generative Adversarial Networks for Image and Video Synthesis: Algorithms and Applications

February 2021 · 486 Reads · 194 Citations

Proceedings of the IEEE

The generative adversarial network (GAN) framework has emerged as a powerful tool for various image and video synthesis tasks, allowing the synthesis of visual content in an unconditional or input-conditional manner. It has enabled the generation of high-resolution photorealistic images and videos, a task that was challenging or impossible with prior methods. It has also led to the creation of many new applications in content creation. In this article, we provide an overview of GANs with a special focus on algorithms and applications for visual synthesis. We cover several important techniques to stabilize GAN training, which has a reputation for being notoriously difficult. We also discuss its applications to image translation, image processing, video synthesis, and neural rendering.
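
As a concrete example of the training-stabilization techniques such an overview covers, the sketch below contrasts the original saturating generator loss with the widely used non-saturating variant. It is illustrative only and not tied to any specific method from the article.

    import numpy as np

    def generator_loss_saturating(d_fake):
        # log(1 - D(G(z))): provides weak gradients when the discriminator
        # confidently rejects generated samples early in training.
        return np.log(1.0 - d_fake)

    def generator_loss_non_saturating(d_fake):
        # -log(D(G(z))): the common alternative that keeps gradients useful
        # in exactly that regime.
        return -np.log(d_fake)

    for d in (0.01, 0.5, 0.99):   # D's probability that a generated sample is real
        print(d, generator_loss_saturating(d), generator_loss_non_saturating(d))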


Citations (20)


... 2D generation has rapidly advanced across generative modeling, customization, conditional control, editing, and stylization. Initial breakthroughs in 2D synthesis with VAEs and GANs [2,20,28] were furthered by diffusion models [35,55,78,83], enhancing image quality and diversity for complex manipulation. For efficiency, frequency-based fine-tuning and wavelet VAEs have enabled lightweight models [18,57]. ...

Reference:

StyleMe3D: Stylization with Disentangled Priors by Multiple Encoders on 3D Gaussians
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
  • Citing Conference Paper
  • June 2024

... Arbitrarily long motion. Some works propose motion diffusion models that can generalize to motions longer than training instances [5,39,46,64]. For example, DoubleTake [46], STMC [39], and DiffCollage [64] propose generating multiple motion segments, each with a temporal length within the training distribution, and then applying a special sampling mechanism to smoothly combine them into a longer motion. ...

DiffCollage: Parallel Generation of Large Content with Diffusion Models
  • Citing Conference Paper
  • June 2023

... Building upon advancements in 2D diffusion models, DreamFusion [43] introduced score distillation sampling (SDS) to train 3D representation models like NeRF [41] and 3DGS [22] based on text input. Subsequently, numerous methods have been developed to enhance this approach [1,4,8,20,28,29,29,31,38,40,45,50,53,60,62,78,85]. However, a significant limitation of these methods is the need to train a separate 3D model for each text input, which can take tens of minutes or even hours per text. ...

Magic3D: High-Resolution Text-to-3D Content Creation
  • Citing Conference Paper
  • June 2023

... This paper proposes a transformative approach that harnesses the power of edge computing and Machine Learning (ML) to expedite these practices through real-time 3D modeling and integration with a live commerce framework. We present an advanced system that leverages NVIDIA's Magic3D technology and Gaussian Splatting techniques to generate accurate, high-resolution 3D models of laboratory settings, enabling rapid virtual evaluations and modifications (Chen, Wang, & Liu, 2023; Kerbl et al., 2023; Lin et al., 2023). Furthermore, we integrate this with an AI-driven live commerce platform, creating a seamless transition from virtual modeling to real-world procurement. ...

Magic3D: High-Resolution Text-to-3D Content Creation

... We consider the underlying GenAI and VLFMs as the engineering primitives to drive the novel interaction experience, where the feedback providers could efficiently create companion reference images for the feedback comments while focusing on text typing. Enhancements to the inference quality of recent SOTA text-to-image models could help generalize the practical applicability of MemoVis through potentially more photorealistic synthesized images [17], simpler and more intuitive prompts [42], and reduced inference latency [105]. For example, SOTA pipelines such as Promptist [42], which optimizes text-to-image GenAI prompts, could potentially reduce failures when feedback providers write low-quality prompts. ...

eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

... The first limitation is the difficulty of maintaining visual content in unedited background regions. Existing multimodal facial editing techniques [6], [7], [8], [9] can only edit the facial image as a whole and are prone to introducing unwanted changes to unedited background regions. When users are not satisfied with some local effects, these techniques fail to edit the local regions incrementally. ...

Multimodal Conditional Image Synthesis with Product-of-Experts GANs
  • Citing Chapter
  • October 2022

Lecture Notes in Computer Science

... For scene composition, scene generation is involved. This recalls various Generative Adversarial Network (GAN) [1] methods, including iGAN [2], GANBrush [3], and PoE-GAN [4]. However, the above methods make use of image priors, which limits the user inputs to certain types. ...

Multimodal Conditional Image Synthesis with Product-of-Experts GANs
  • Citing Preprint
  • December 2021

... However, a significant portion of these works is confined to 2D images [20]. Moreover, segmenting brain MR with conditional GAN generators allows training the CycleGAN model to accurately recognize geometric violations in growing MR and generates more training data for BT segmentation [21]. The main contribution of the paper is summarized as, ...

Generative Adversarial Networks for Image and Video Synthesis: Algorithms and Applications
  • Citing Article
  • February 2021

Proceedings of the IEEE

... Generative adversarial network-based enhancements bolster compression, offering improved compression and decompression. They have also been demonstrated to perform better than standard video coding approaches in the low bit-rate range [3,4]. Some of the other prominent works in the field of deep learning-based video compression are listed in Table 1. ...

Generative Adversarial Networks for Image and Video Synthesis: Algorithms and Applications
  • Citing Preprint
  • August 2020

... Image-to-image (I2I) translation tasks are a key component of many image processing, computer graphics, and computer vision problems, as well as other similar problems. Proposed I2I methods include, for example, [1][2][3][4] for semantic image synthesis, [5][6][7][8] for image-to-image translation, and [9,10] for image super-resolution. These methods construct a mapping that translates images from a source domain to one or more target domains, preserving the content of the image while changing its style from that of the source domain to that of the target. ...

Few-Shot Unsupervised Image-to-Image Translation
  • Citing Conference Paper
  • October 2019