Diederik P. Kingma’s research while affiliated with Google Inc. and other places


Publications (34)


Figure 1: Before and after MCMC correction. In (a)(b), the left columns are x = g_θ(z), the right columns are updated x after 300 steps of MCMC sampling jointly on x and z. (a) illustrates the effect of correction in ImageNet. Note that the off-manifold images are corrected. (b) illustrates the correction in the embedding space of Stable Diffusion v1.5, which are decoded to image space in (c). Note the disentanglement of the cats and sharpness of the sofa. Zoom in for better viewing.
Figure 3: (a)(b) Gradient norms and FIDs for complete noise cancellation, last-step noise cancellation and no noise cancellation. (c)(d) FIDs and Recalls of EMD with different numbers of Langevin steps.
Class-conditional generation on ImageNet 64×64.
Class-conditional generation on ImageNet 128×128.
FID-30k for text-to-image generation in MSCOCO. † Results are evaluated by Yin et al. [23].


EM Distillation for One-step Diffusion Models
  • Preprint

May 2024 · 142 Reads

Sirui Xie · Zhisheng Xiao · Diederik P Kingma · [...] · Ruiqi Gao

While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection of our method with existing methods that minimize mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.



Figure 2: A visualization of our result from Section 3.4. The weighted loss L_w equals the area under the curves in the left and right figures. The two figures illustrate two alternative methods for calculating the area. Equation 7 corresponds to the left figure, where L_w is calculated as the sum of two areas: a rectangular area w(λ_min)L(λ_min) equal to the weighted prior loss, plus a curved area equal to the weighted loss. This area is an integral that can be intuitively understood as a Riemann sum over many tiny rectangles going from left (λ_min) to right (λ_max), each with height w(λ) and infinitesimal width dL(λ). Equation 10 corresponds to the right figure, where the same total area is computed as the sum of a rectangular area w(λ_max)L(λ_max) and an integral going upwards from λ_max to λ_min, with each tiny rectangle having width L(λ) and height dw(λ). The area of each of those tiny rectangles can be understood as the ELBO at that noise level, L(λ), times the weight of the ELBO at that noise level, dw(λ).
Understanding the Diffusion Objective as a Weighted Integral of ELBOs

March 2023 · 121 Reads · 1 Citation

Diffusion models in the literature are optimized with various objectives that are special cases of a weighted loss, where the weighting function specifies the weight per noise level. Uniform weighting corresponds to maximizing the ELBO, a principled approximation of maximum likelihood. In current practice diffusion models are optimized with non-uniform weighting due to better results in terms of sample quality. In this work we expose a direct relationship between the weighted loss (with any weighting) and the ELBO objective. We show that the weighted loss can be written as a weighted integral of ELBOs, with one ELBO per noise level. If the weighting function is monotonic, then the weighted loss is a likelihood-based objective: it maximizes the ELBO under simple data augmentation, namely Gaussian noise perturbation. Our main contribution is a deeper theoretical understanding of the diffusion objective, but we also performed some experiments comparing monotonic with non-monotonic weightings, finding that monotonic weighting performs competitively with the best published results.
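The two ways of computing the weighted loss described in the Figure 2 caption are related by integration by parts. The following is a minimal numerical sketch of that identity, not the paper's code: the weighting `w` and the per-noise-level loss `loss` are hypothetical toy functions chosen only so the example runs, and the helper names are assumptions.

```python
import math

def w(lam):
    # Hypothetical monotonic weighting: a sigmoid that decreases in lambda.
    return 1.0 / (1.0 + math.exp(lam))

def loss(lam):
    # Hypothetical smooth stand-in for the per-noise-level loss L(lambda).
    return math.log1p(math.exp(-lam))

def stieltjes(f, g, grid):
    # Trapezoid approximation of the integral of f dg along the grid order.
    return sum(0.5 * (f(a) + f(b)) * (g(b) - g(a)) for a, b in zip(grid, grid[1:]))

def both_areas(lam_min=-10.0, lam_max=10.0, n=2000):
    grid = [lam_min + (lam_max - lam_min) * i / n for i in range(n + 1)]
    # Left picture: rectangle at lam_min plus the integral of w dL, left to right.
    left = w(lam_min) * loss(lam_min) + stieltjes(w, loss, grid)
    # Right picture: rectangle at lam_max plus the integral of L dw,
    # taken from lam_max back to lam_min.
    right = w(lam_max) * loss(lam_max) + stieltjes(loss, w, grid[::-1])
    return left, right

left_area, right_area = both_areas()
print(left_area, right_area)
```

Under these assumptions the two values agree up to floating-point error, since the trapezoid sums satisfy summation by parts exactly.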


On Distillation of Guided Diffusion Models

October 2022 · 63 Reads · 1 Citation

Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they have been widely used in large-scale diffusion frameworks including DALL-E 2, GLIDE and Imagen. However, a downside of classifier-free guided diffusion models is that they are computationally expensive at inference time since they require evaluating two diffusion models, a class-conditional model and an unconditional model, hundreds of times. To deal with this limitation, we propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from: Given a pre-trained classifier-free guided model, we first learn a single model to match the output of the combined conditional and unconditional models, and then progressively distill that model to a diffusion model that requires far fewer sampling steps. On ImageNet 64x64 and CIFAR-10, our approach is able to generate images visually comparable to those of the original model using as few as 4 sampling steps, achieving FID/IS scores comparable to those of the original model while being up to 256 times faster to sample from.
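To illustrate why guided sampling needs two model evaluations per step, here is a minimal sketch of how conditional and unconditional predictions are commonly combined in classifier-free guidance. The abstract does not state the formula; the convention used here, (1 + w) times the conditional output minus w times the unconditional output, and the function name `guided_prediction` are assumptions for illustration.

```python
def guided_prediction(cond, uncond, guidance_weight):
    """Combine conditional and unconditional predictions element-wise.

    Assumed convention: (1 + w) * cond - w * uncond, so that w = 0
    recovers the purely conditional prediction.
    """
    return [(1.0 + guidance_weight) * c - guidance_weight * u
            for c, u in zip(cond, uncond)]

# Toy predictions standing in for the outputs of the two networks.
print(guided_prediction([1.0, 2.0], [0.5, 1.0], guidance_weight=1.0))
```

The first distillation stage described in the abstract would train a single network to reproduce this combined output directly.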


Imagen Video: High Definition Video Generation with Diffusion Models

October 2022 · 186 Reads · 24 Citations

We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding. See https://imagen.research.google/video/ for samples.


Variational Diffusion Models

July 2021 · 929 Reads · 4 Citations

Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to turn the model into a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum.
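Since the abstract expresses the bound in terms of the signal-to-noise ratio, a small sketch of that quantity may help. It assumes a variance-preserving parameterization, where the squared signal scale and the noise variance sum to one, and defines the signal-to-noise ratio as the squared signal scale divided by the noise variance. The function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def alpha_sigma_from_log_snr(log_snr):
    # Variance-preserving assumption: alpha^2 = sigmoid(log_snr),
    # sigma^2 = sigmoid(-log_snr), hence alpha^2 + sigma^2 = 1.
    alpha = math.sqrt(sigmoid(log_snr))
    sigma = math.sqrt(sigmoid(-log_snr))
    return alpha, sigma

def snr(alpha, sigma):
    # Signal-to-noise ratio of the diffused data z = alpha * x + sigma * eps.
    return alpha ** 2 / sigma ** 2

alpha, sigma = alpha_sigma_from_log_snr(2.0)
print(alpha, sigma, snr(alpha, sigma))
```

With this parameterization a noise schedule is fully described by a single function, the log signal-to-noise ratio over time.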



How to Train Your Energy-Based Models

January 2021 · 207 Reads · 4 Citations

Energy-Based Models (EBMs), also known as non-normalized probabilistic models, specify probability density or mass functions up to an unknown normalizing constant. Unlike most other probabilistic models, EBMs do not place a restriction on the tractability of the normalizing constant, thus are more flexible to parameterize and can model a more expressive family of probability distributions. However, the unknown normalizing constant of EBMs makes training particularly difficult. Our goal is to provide a friendly introduction to modern approaches for EBM training. We start by explaining maximum likelihood training with Markov chain Monte Carlo (MCMC), and proceed to elaborate on MCMC-free approaches, including Score Matching (SM) and Noise Contrastive Estimation (NCE). We highlight theoretical connections among these three approaches, and end with a brief survey on alternative training methods, which are still under active research. Our tutorial is targeted at an audience with basic understanding of generative models who want to apply EBMs or start a research project in this direction.
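As a concrete picture of "density up to a normalizing constant", the sketch below uses a hypothetical one-dimensional quadratic energy, for which the constant can be found by brute-force numerical integration. This only works because the example is one-dimensional; in the high-dimensional settings the tutorial targets, such integration is infeasible, which is the difficulty the abstract refers to. All names and the choice of energy are assumptions.

```python
import math

def energy(x):
    # Hypothetical quadratic energy; exp(-E) is an unnormalized standard Gaussian.
    return 0.5 * x * x

def unnormalized_density(x):
    return math.exp(-energy(x))

def normalizing_constant(lo=-10.0, hi=10.0, n=20000):
    # Midpoint-rule approximation of the integral of exp(-E(x)) over [lo, hi].
    dx = (hi - lo) / n
    return dx * sum(unnormalized_density(lo + (i + 0.5) * dx) for i in range(n))

def density(x, z):
    # Normalized density p(x) = exp(-E(x)) / Z.
    return unnormalized_density(x) / z

z = normalizing_constant()
print(z, density(0.0, z))
```

For this energy the exact constant is the square root of 2π, which the numerical estimate should match closely.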


Learning Energy-Based Models by Diffusion Recovery Likelihood

December 2020 · 55 Reads

While energy-based models (EBMs) exhibit a number of desirable properties, training and sampling on high-dimensional datasets remains challenging. Inspired by recent progress on diffusion probabilistic models, we present a diffusion recovery likelihood method to tractably learn and sample from a sequence of EBMs trained on increasingly noisy versions of a dataset. Each EBM is trained by maximizing the recovery likelihood: the conditional probability of the data at a certain noise level given their noisy versions at a higher noise level. The recovery likelihood objective is more tractable than the marginal likelihood objective, since it only requires MCMC sampling from a relatively concentrated conditional distribution. Moreover, we show that this estimation method is theoretically consistent: it learns the correct conditional and marginal distributions at each noise level, given sufficient data. After training, synthesized images can be generated efficiently by a sampling process that initializes from a spherical Gaussian distribution and progressively samples the conditional distributions at decreasingly lower noise levels. Our method generates high fidelity samples on various image datasets. On unconditional CIFAR-10 our method achieves FID 9.60 and inception score 8.58, superior to the majority of GANs. Moreover, we demonstrate that unlike previous work on EBMs, our long-run MCMC samples from the conditional distributions do not diverge and still represent realistic images, allowing us to accurately estimate the normalized density of data even for high-dimensional datasets.


Score-Based Generative Modeling through Stochastic Differential Equations

November 2020 · 633 Reads · 56 Citations

Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the reverse-time SDE depends only on the time-dependent gradient field (a.k.a., score) of the perturbed data distribution. By leveraging advances in score-based generative modeling, we can accurately estimate these scores with neural networks, and use numerical SDE solvers to generate samples. We show that this framework encapsulates previous approaches in diffusion probabilistic modeling and score-based generative modeling, and allows for new sampling procedures. In particular, we introduce a predictor-corrector framework to correct errors in the evolution of the discretized reverse-time SDE. We also derive an equivalent neural ODE that samples from the same distribution as the SDE, which enables exact likelihood computation, and improved sampling efficiency. In addition, our framework enables conditional generation with an unconditional model, as we demonstrate with experiments on class-conditional generation, image inpainting, and colorization. Combined with multiple architectural improvements, we achieve record-breaking performance for unconditional image generation on CIFAR-10 with an Inception score of 9.89 and FID of 2.20, a competitive likelihood of 3.10 bits/dim, and demonstrate high fidelity generation of 1024×1024 images for the first time from a score-based generative model.
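The deterministic ODE mentioned in the abstract can be sketched on a toy problem where the score is known in closed form, so no neural network is needed. The example assumes one-dimensional Gaussian data, a variance-preserving forward SDE with a constant noise rate, and a plain Euler solver run backward in time. The constants and function names are hypothetical and chosen only for illustration.

```python
import math

BETA = 2.0        # assumed constant noise rate of the forward SDE
DATA_MEAN = 2.0   # assumed toy data distribution: N(2.0, 0.5^2)
DATA_STD = 0.5

def marginal(t):
    # Mean and variance of the perturbed data at time t under
    # dx = -0.5 * BETA * x dt + sqrt(BETA) dw.
    decay = math.exp(-BETA * t)
    mean = DATA_MEAN * math.exp(-0.5 * BETA * t)
    var = DATA_STD ** 2 * decay + (1.0 - decay)
    return mean, var

def score(x, t):
    # Exact score of the Gaussian marginal; a network would estimate this.
    mean, var = marginal(t)
    return -(x - mean) / var

def probability_flow_sample(x_T, T=1.0, n_steps=5000):
    # Euler integration of dx/dt = f(x, t) - 0.5 * g^2 * score from t = T to 0,
    # with f = -0.5 * BETA * x and g^2 = BETA.
    dt = T / n_steps
    x = x_T
    for i in range(n_steps):
        t = T - i * dt
        drift = -0.5 * BETA * x - 0.5 * BETA * score(x, t)
        x -= drift * dt
    return x

mean_T, var_T = marginal(1.0)
# Start one standard deviation above the mean of the time-T marginal.
print(probability_flow_sample(mean_T + math.sqrt(var_T)))
```

In this one-dimensional Gaussian case the ODE preserves how many standard deviations a point lies from the mean, so a start one standard deviation above the time-T mean should land near DATA_MEAN + DATA_STD.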


Citations (27)


... Therefore, researchers typically use knowledge distillation to maintain sample quality, training a new (step-reduced) network to emulate the original (high-latency) DPM. For example, progressive distillation (Salimans & Ho, 2022; Meng et al., 2023) iteratively trains a student network to predict the original DPM's two-step output in a single step, repeatedly halving the number of calls to the U-Net. Since then, a variety of new methods have been proposed to extend this simple approach. ...

Reference:

Fast Sampling Through The Reuse Of Attention Maps In Diffusion Models
On Distillation of Guided Diffusion Models
  • Citing Conference Paper
  • June 2023

... During training, we use a dynamic noise scheduling (D. P. Kingma & Gao, 2023), which is adapted to the approximation error of the neural network and further explained in Appendix A4 of Appendix A. On the one end, by setting λ(0) large enough, we achieve α_0 ≈ 1 and approximately recover x_{t+12 h} from z_0. On the other end, by setting λ(1) small enough, the signal amplitude goes toward zero, α_1 ≈ 0, and p(z_1) ≈ N(0, I) approximately holds (D. Kingma et al., 2021). ...

Understanding the Diffusion Objective as a Weighted Integral of ELBOs

... Within these models, utilizing a more efficient sampling approach, the score-based generative model further enhances the generative capabilities. There has been a recent surge in attention towards diffusion model and SBGM (23)(24)(25), with notable interest reflected in the works of Austin et al. (26)(27)(28). This increased attention has led to significant progress in advancing the modeling of continuous data. ...

Variational Diffusion Models

... The CVAE-VC framework offers three key advantages: (1) end-to-end waveform generation enables high-quality audio synthesis [11,12,13], (2) learning latent acoustic tokens from the VAE encoder, rather than relying on a fixed input representation (e.g., mel spectrograms), enhances efficiency and effectiveness in diverse sound generation [14], and (3) a learnable prior network allows more complex and flexible distribution compared to a fixed and simple normal distribution [4,11,15,16]. ...

Wave-Tacotron: Spectrogram-Free End-to-End Text-to-Speech Synthesis
  • Citing Conference Paper
  • June 2021

... Energy-based models define an unnormalized probability landscape via a learned energy function and typically rely on sampling techniques to estimate likelihoods [46,47]. More recently, score-based diffusion models have been introduced, which approximate distributions by progressively corrupting data with noise and learning to reverse this process through denoising dynamics [48,49]. Other approaches shift focus from the distribution itself to learning the underlying time evolution equations, such as the chemical master equation, using neural networks thereby replacing traditional solvers during inference [50][51][52]. ...

Score-Based Generative Modeling through Stochastic Differential Equations
  • Citing Preprint
  • November 2020

... The Variational Autoencoder (VAE) (Kingma (2013); Kingma and Welling (2019)) integrates the variational Bayesian theory (Rezende et al. (2014)) with an encoder-decoder architecture (Rumelhart et al. (1986); Bourlard and Kamp (1988); Rumelhart et al. (1986)). The encoding process can be conceptualized as a mapping from observed data X to latent variables Z, that is, Z = f^{-1}(X) (Wei et al. (2024a); Wei et al. (2024b)), or equivalently as the calculation of posterior probability P(Z|X) (Wei et al. (2024a)). ...

An Introduction to Variational Autoencoders
  • Citing Book
  • January 2019