Diffusion Primer

Diffusion, a class of generative models, boils down to an MSE loss between the noise added to an image and the noise predicted by a network.
Why would learning to predict the noise help with image generation (which is what diffusion is most used for)? How did we arrive at an MSE? This post dives deep into the math to answer these questions.

Background

One way to interpret Diffusion is as a continuous VAE (Variational Auto Encoder).
A VAE computes a lower bound on the likelihood of generating real data samples ($p_\theta(\mathbf{x})$) by approximating the unknown posterior $p(\mathbf{z}|\mathbf{x})$ with a learnable one $q_\phi(\mathbf{z}|\mathbf{x})$ [Fig. 1]:

$$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{\mathbf{z} \sim q_\phi(\mathbf{z}|\mathbf{x})}\big[\log p_\theta(\mathbf{x}|\mathbf{z})\big] - D_{KL}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p(\mathbf{z})\big)$$

Figure 1: VAE

The process of encoding and decoding leads to a regularized compression of images in the latent ($\mathbf{z}$) space (hence the name autoencoder; "variational" comes from approximating distributions). During inference, random images can be generated from the decoder by sampling $\mathbf{z} \sim p(\mathbf{z})$.
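As a concrete reference point for the ELBO, here is a minimal PyTorch sketch of the VAE loss; `encoder` and `decoder` are placeholders for arbitrary networks, not any specific implementation:

```python
import torch
import torch.nn.functional as F

def vae_loss(x, encoder, decoder):
    # Encoder outputs the parameters of q(z|x) = N(mu, diag(sigma^2)).
    mu, log_var = encoder(x)
    # Reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I).
    z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
    # Reconstruction term E_q[log p(x|z)]; a Gaussian decoder reduces to MSE.
    recon = F.mse_loss(decoder(z), x, reduction="sum")
    # KL(q(z|x) || N(0, I)) in closed form for diagonal Gaussians.
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl  # negative ELBO
```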

Derivation

Figure 2: Markov chain of the forward [$q(\mathbf{x}_t|\mathbf{x}_{t-1})$] (reverse [$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$]) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020)

The forward process gradually adds Gaussian noise according to a variance schedule $\beta_t$: $q(\mathbf{x}_t|\mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t; \sqrt{1 - \beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\big)$, with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.

Similar to a VAE, diffusion models compute a lower bound on the likelihood of generating real data samples ($p_\theta(\mathbf{x}_0)$) by approximating the unknown posterior $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ with a computable one $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$, for all $t$ [Fig. 2] (it follows the same math as a VAE at every step $t$, so diffusion could be thought of as its continuous version):

$$-\log p_\theta(\mathbf{x}_0) \leq \mathbb{E}_{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\left[\log \frac{q(\mathbf{x}_{1:T}|\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})}\right] = L_{VLB}$$

Solving for the R.H.S.:

$$L_{VLB} = \mathbb{E}_q\Big[\underbrace{D_{KL}\big(q(\mathbf{x}_T|\mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_T)\big)}_{L_T} + \sum_{t=2}^{T} \underbrace{D_{KL}\big(q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)\big)}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0|\mathbf{x}_1)}_{L_0}\Big]$$

Taking each piece apart:

$L_T$: Can be ignored while training since $q$ has no learnable parameters and $\mathbf{x}_T$ is just Gaussian noise.

$L_{t-1}$: Match the unknown quantity ($p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$) to the tractable one ($q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$). Define the reverse process as:

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)$$

Simplifying to a known variance:

$$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I}\big)$$

Solving for the tractable quantity via Bayes' rule:

$$q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0)\,\frac{q(\mathbf{x}_{t-1}|\mathbf{x}_0)}{q(\mathbf{x}_t|\mathbf{x}_0)} = \mathcal{N}\big(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I}\big)$$

Scratch Work

Computing the parameters ($\tilde{\boldsymbol{\mu}}_t$ and $\tilde{\beta}_t$) of the new Gaussian, using the closed-form forward jump $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_t$ to express $\mathbf{x}_0$ in terms of the noise $\boldsymbol{\epsilon}_t$:

$$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_t\Big), \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\,\beta_t$$
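As a sanity check, a small NumPy sketch verifying numerically that the $\boldsymbol{\epsilon}$-parameterization of $\tilde{\boldsymbol{\mu}}_t$ matches the mean obtained directly from Bayes' rule; the schedule values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

t = 500
x0 = rng.normal()
eps = rng.normal()
x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps

# Posterior mean straight from Bayes' rule on the two Gaussians.
mu_bayes = (np.sqrt(alphas[t]) * (1 - alpha_bars[t - 1]) * x_t
            + np.sqrt(alpha_bars[t - 1]) * betas[t] * x0) / (1 - alpha_bars[t])
# Posterior mean in the epsilon parameterization.
mu_eps = (x_t - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])

print(np.isclose(mu_bayes, mu_eps))  # True
```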

$\boldsymbol{\mu}_\theta$ must match $\tilde{\boldsymbol{\mu}}_t$, and since $\mathbf{x}_t$ is given as input, the network only needs to learn to predict the noise at step $t$, $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$.

Which means:

$$\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big)$$

The KL divergence between two Gaussians with the mean as the only free parameter reduces to a scaled squared difference of means, which leads to the following loss:

$$L_t = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}_t}\left[\frac{(1 - \alpha_t)^2}{2\alpha_t(1 - \bar{\alpha}_t)\sigma_t^2}\,\big\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_t,\, t\big)\big\|^2\right]$$

In the DDPM paper[1], it is found empirically that training works better if the scaling factor is omitted, i.e., the loss can be simplified to:

$$L_t^{\text{simple}} = \mathbb{E}_{\mathbf{x}_0, \boldsymbol{\epsilon}_t}\left[\big\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_t,\, t\big)\big\|^2\right]$$

$L_0$: The image ($\mathbf{x}_0$) is originally constructed from $D$ pixels with values scaled to the range $[-1, 1]$. The DDPM paper[1] suggests using an independent discrete decoder derived from the Gaussian $\mathcal{N}\big(\mathbf{x}_0; \boldsymbol{\mu}_\theta(\mathbf{x}_1, 1), \sigma_1^2 \mathbf{I}\big)$ (since a continuous Gaussian density doesn't directly apply to discrete pixel values):

$$p_\theta(\mathbf{x}_0|\mathbf{x}_1) = \prod_{i=1}^{D} \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\big(x;\, \mu_\theta^i(\mathbf{x}_1, 1),\, \sigma_1^2\big)\,dx, \qquad \delta_+(x) = \begin{cases}\infty & x = 1\\ x + \frac{1}{255} & x < 1\end{cases}, \quad \delta_-(x) = \begin{cases}-\infty & x = -1\\ x - \frac{1}{255} & x > -1\end{cases}$$

Intuitively, the network predicts the mean of a pixel, which parameterizes a Gaussian distribution that is integrated[2] over a small bin around the real pixel value in the original image, from the real value minus $\frac{1}{255}$ to the real value plus $\frac{1}{255}$. If the predicted mean is close to the original value of the pixel, the result of the integral will be high.

The integral is approximated by the Gaussian probability density function times the bin width, which makes the math work out similar to the $L_{t-1}$ case:

$$-\log p_\theta(\mathbf{x}_0|\mathbf{x}_1) \approx -\sum_{i=1}^{D} \log\Big[\mathcal{N}\big(x_0^i;\, \mu_\theta^i(\mathbf{x}_1, 1),\, \sigma_1^2\big) \cdot \tfrac{2}{255}\Big] = \frac{\|\mathbf{x}_0 - \boldsymbol{\mu}_\theta(\mathbf{x}_1, 1)\|^2}{2\sigma_1^2} + \text{const}$$
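A sketch of the discrete decoder's log-likelihood as a direct transcription of the integral above, using the Gaussian CDF from `torch.distributions`; the NCHW tensor layout is an assumption:

```python
import torch
from torch.distributions import Normal

def discretized_gaussian_log_likelihood(x0, mu, sigma):
    """log p(x0 | x1) for pixels in [-1, 1] quantized to 256 levels.

    x0: real pixel values in [-1, 1] (NCHW); mu: predicted means; sigma: std.
    """
    dist = Normal(mu, sigma)
    bin_half_width = 1.0 / 255.0
    # Integrate the Gaussian over [x0 - 1/255, x0 + 1/255] via the CDF,
    # with the edge bins extended to +/- infinity.
    cdf_plus = torch.where(x0 >= 1.0 - 1e-6, torch.ones_like(x0),
                           dist.cdf(x0 + bin_half_width))
    cdf_minus = torch.where(x0 <= -1.0 + 1e-6, torch.zeros_like(x0),
                            dist.cdf(x0 - bin_half_width))
    # Sum per-pixel log-probabilities over channels and spatial dims.
    return torch.log((cdf_plus - cdf_minus).clamp(min=1e-12)).sum(dim=(1, 2, 3))
```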

Putting the pieces together, the DDPM paper[1] suggests using this loss function for training (followed by the training algorithm):

$$L_{\text{simple}} = \mathbb{E}_{t \sim \mathcal{U}[1, T],\, \mathbf{x}_0,\, \boldsymbol{\epsilon}_t}\left[\big\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_t,\, t\big)\big\|^2\right]$$

Algorithm 1 DDPM Training

1. repeat
2. $\mathbf{x}_0 \sim q(\mathbf{x}_0)$
3. $t \sim \text{Uniform}(\{1, \dots, T\})$
4. $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
5. Take a gradient descent step on $\nabla_\theta \big\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},\, t\big)\big\|^2$
6. until converged
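To make Algorithm 1 concrete, here is a minimal PyTorch sketch of one training step; `model` (any network taking a noisy batch and timesteps) and the NCHW image layout are assumptions, and the linear schedule values follow the DDPM paper:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)      # linear schedule from DDPM
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # \bar{alpha}_t

def training_step(model, x0):
    # t ~ Uniform({0, ..., T-1}) (0-indexed), one per batch element.
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)             # eps ~ N(0, I)
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    # Closed-form forward jump: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps.
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps
    # Simplified DDPM objective: MSE between true and predicted noise.
    return F.mse_loss(model(x_t, t), eps)
```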

We saw how to go from noise to an image via the reverse process, but that yields a random image (an aggregate of the training set). How do we go from random noise to an image of interest? The next section talks about this.

Conditioning

Dhariwal & Nichol (2021)[3] use the gradients of a classifier $f_\phi(y|\mathbf{x}_t)$ to condition diffusion sampling on $y$ (e.g. a target class label), where $\mathbf{x}_t$ is the noisy image. The score function for the joint distribution $q(\mathbf{x}_t, y)$ is computed as:

$$\nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t, y) = \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log q(y|\mathbf{x}_t) \approx -\frac{1}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) + \nabla_{\mathbf{x}_t} \log f_\phi(y|\mathbf{x}_t)$$
Thus, a new classifier-guided predictor $\bar{\boldsymbol{\epsilon}}_\theta$ modifies the initial predictor as follows:

$$\bar{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \sqrt{1 - \bar{\alpha}_t}\,\nabla_{\mathbf{x}_t} \log f_\phi(y|\mathbf{x}_t)$$
To control the strength of the classifier guidance, a weight $w$ is added to the delta part:

$$\bar{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \sqrt{1 - \bar{\alpha}_t}\, w\, \nabla_{\mathbf{x}_t} \log f_\phi(y|\mathbf{x}_t)$$
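In code, classifier guidance is a small adjustment to the predicted noise. A sketch assuming a noise-robust classifier `classifier(x_t, t)` that returns class logits (a hypothetical interface), with `alpha_bar_t` passed as a tensor:

```python
import torch

def guided_eps(model, classifier, x_t, t, y, alpha_bar_t, w=1.0):
    eps = model(x_t, t)  # unconditional noise prediction eps_theta
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_prob = classifier(x_in, t).log_softmax(dim=-1)
        selected = log_prob[torch.arange(len(y)), y].sum()
        # grad_{x_t} log f_phi(y | x_t)
        grad = torch.autograd.grad(selected, x_in)[0]
    # eps_bar = eps - sqrt(1 - alpha_bar_t) * w * grad
    return eps - (1 - alpha_bar_t).sqrt() * w * grad
```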
Ho & Salimans (2021)[4] propose to run conditional diffusion without an independent classifier $f_\phi$. Consider an unconditional denoising diffusion model $p_\theta(\mathbf{x})$ parameterized through a score estimator $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$ and a conditional model $p_\theta(\mathbf{x}|y)$ parameterized through $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y)$. These two models can be learned via a single neural network. Specifically, a conditional diffusion model is trained on paired data $(\mathbf{x}, y)$, where the conditioning information $y$ gets discarded at random during training such that the model knows how to generate images unconditionally as well, i.e. $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y = \varnothing)$.

The gradient of an implicit classifier $p(y|\mathbf{x}_t) \propto \frac{p(\mathbf{x}_t|y)}{p(\mathbf{x}_t)}$ can be expressed with the two predictors. Taking the log and differentiating:

$$\nabla_{\mathbf{x}_t} \log p(y|\mathbf{x}_t) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t|y) - \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) = -\frac{1}{\sqrt{1 - \bar{\alpha}_t}}\big(\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\big)$$

Plugging this into the classifier-guided predictor above gives a sampler that needs no separate classifier:

$$\bar{\boldsymbol{\epsilon}}_\theta(\mathbf{x}_t, t, y) = (1 + w)\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, y) - w\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$$
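Classifier-free guidance then needs no gradients at sampling time. A sketch assuming the model's signature accepts `y=None` for the unconditional branch (a hypothetical interface):

```python
def cfg_eps(model, x_t, t, y, w=1.0):
    # Conditional and unconditional predictions from the same network;
    # at training time y is dropped at random so both branches are learned.
    eps_cond = model(x_t, t, y)        # eps_theta(x_t, t, y)
    eps_uncond = model(x_t, t, None)   # eps_theta(x_t, t), y discarded
    # eps_bar = (1 + w) * eps_cond - w * eps_uncond
    return (1.0 + w) * eps_cond - w * eps_uncond
```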
Below is a comparison of sampling algorithms across DDPM, DDIM and DDIM with classifier guidance.

Algorithm 2 DDPM

1. $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
2. for $t = T, \dots, 1$ do
3. $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ if $t > 1$, else $\mathbf{z} = \mathbf{0}$
4. $\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big) + \sigma_t \mathbf{z}$
5. end for
6. return $\mathbf{x}_0$

Algorithm 3 DDIM

1. $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
2. for $t = T, \dots, 1$ do
3. $\hat{\mathbf{x}}_0 = \big(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\big) / \sqrt{\bar{\alpha}_t}$
4. $\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)$
5. end for
6. return $\mathbf{x}_0$

Algorithm 4 DDIM with classifier guidance

1. $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$
2. for $t = T, \dots, 1$ do
3. $\hat{\boldsymbol{\epsilon}} = \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) - \sqrt{1 - \bar{\alpha}_t}\, w\, \nabla_{\mathbf{x}_t} \log f_\phi(y|\mathbf{x}_t)$
4. $\hat{\mathbf{x}}_0 = \big(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\hat{\boldsymbol{\epsilon}}\big) / \sqrt{\bar{\alpha}_t}$
5. $\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{\mathbf{x}}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\,\hat{\boldsymbol{\epsilon}}$
6. end for
7. return $\mathbf{x}_0$
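A minimal sketch of the DDPM sampler (Algorithm 2) in PyTorch, reusing the `T`/`betas`/`alphas`/`alpha_bars` schedule and the assumed `model` interface from the training sketch above:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape):
    x = torch.randn(shape)  # x_T ~ N(0, I)
    for t in reversed(range(T)):
        # Inject fresh noise at every step except the last.
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        eps = model(x, torch.full((shape[0],), t))
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        # mu_theta(x_t, t) plus the fixed-variance term sigma_t * z,
        # with sigma_t^2 = beta_t.
        x = (x - coef * eps) / alphas[t].sqrt() + betas[t].sqrt() * z
    return x
```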

Optimization

So far we discussed training and sampling algorithms, but what is a good value for $T$? Can we make the sampling process more efficient and reduce the number of steps? We explore that here.

Langevin dynamics can generate samples from a probability density $p(\mathbf{x})$ using only its score $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ in a Markov chain of updates:

$$\mathbf{x}_t = \mathbf{x}_{t-1} + \frac{\delta}{2}\,\nabla_{\mathbf{x}} \log p(\mathbf{x}_{t-1}) + \sqrt{\delta}\,\boldsymbol{\epsilon}_t, \qquad \boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

where $\delta$ is the step size. When $T \to \infty$ and $\delta \to 0$, $\mathbf{x}_T$ is distributed according to the true probability density $p(\mathbf{x})$. Compared to SGD, Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapsing into local minima.
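A toy NumPy sketch of Langevin sampling from a 1D standard Gaussian, whose score is known in closed form ($\nabla_x \log p(x) = -x$); the step size and step count are arbitrary illustrative choices:

```python
import numpy as np

def langevin_sample(n_steps=1000, delta=0.01):
    rng = np.random.default_rng()
    x = rng.normal()                    # arbitrary initialization
    for _ in range(n_steps):
        score = -x                      # grad log p(x) for p = N(0, 1)
        x = x + 0.5 * delta * score + np.sqrt(delta) * rng.normal()
    return x

samples = [langevin_sample() for _ in range(500)]
print(np.mean(samples), np.std(samples))  # approaches 0 and 1 as delta -> 0
```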

Score-based generative modeling trains a score network $\mathbf{s}_\theta$ to estimate the above gradient, i.e. $\mathbf{s}_\theta(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$. To make the data points cover the whole space, the score network jointly estimates the scores of data perturbed at different noise levels (a Noise Conditional Score Network), i.e. $\mathbf{s}_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)$. The schedule of increasing noise levels resembles the forward diffusion process.

Given a Gaussian distribution $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \sigma^2 \mathbf{I})$, the derivative of the logarithm of its density function is equal to $\nabla_{\mathbf{x}} \log p(\mathbf{x}) = -\frac{\mathbf{x} - \boldsymbol{\mu}}{\sigma^2} = -\frac{\boldsymbol{\epsilon}}{\sigma}$ where $\boldsymbol{\epsilon} = \frac{\mathbf{x} - \boldsymbol{\mu}}{\sigma} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.

Recall that $q(\mathbf{x}_t|\mathbf{x}_0) = \mathcal{N}\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}\big)$. Therefore:

$$\mathbf{s}_\theta(\mathbf{x}_t, t) \approx \nabla_{\mathbf{x}_t} \log q(\mathbf{x}_t) = -\frac{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$

i.e. the noise predictor doubles as a (scaled) score estimator.

Figure 3: Non-Markovian chain of the forward [$q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0)$] (reverse [$p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$]) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Song et al. 2022)

To make diffusion more efficient, DDIM [Fig. 3] proposes fewer sampling steps by modeling the forward process as a non-Markovian chain: $q(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0)$ (DDPM omits the $\mathbf{x}_0$ conditioning). Similar to [3], $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ can be written as:

$$\mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\underbrace{\left(\frac{\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}{\sqrt{\bar{\alpha}_t}}\right)}_{\text{predicted } \mathbf{x}_0} + \underbrace{\sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\;\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)}_{\text{direction pointing to } \mathbf{x}_t} + \underbrace{\sigma_t \boldsymbol{\epsilon}_t}_{\text{random noise}}$$
If $\sigma_t$ is $0$, the sampling process is deterministic, and because $\mathbf{x}_0$ is predicted at every step (which is what we finally want, plus we can replace $\mathbf{x}_{t-1}$ with $\mathbf{x}_s$ for any earlier step $s < t$), some steps in $[T..1]$ can be skipped.
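A sketch of the deterministic DDIM update ($\sigma_t = 0$) over a strided subsequence of timesteps, again reusing the schedule and `model` interface assumed earlier; the stride is what buys the speedup:

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, n_steps=50):
    # Evaluate only a strided subsequence of the T training steps.
    timesteps = list(range(0, T, T // n_steps))[::-1]
    x = torch.randn(shape)
    for i, t in enumerate(timesteps):
        ab_t = alpha_bars[t]
        ab_prev = (alpha_bars[timesteps[i + 1]]
                   if i + 1 < len(timesteps) else torch.tensor(1.0))
        eps = model(x, torch.full((shape[0],), t))
        # Predicted x_0, then the deterministic "direction pointing to x_t".
        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
    return x
```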

Conclusion

We answered all the questions we asked: the loss formulation, generating a desired image via conditioning, sampling efficiency, and the background connection to VAEs. There are many additional papers that have come out since this initial set, but this post focuses on the fundamentals and therefore ends here.

REFERENCES

  1. ^ Ho, J., Jain, A., & Abbeel, P. "Denoising Diffusion Probabilistic Models." NeurIPS 2020.

  2. ^ The integration of the Gaussian distribution over the given range is necessary because $\mathbf{x}_0$ is discrete. The alternative is to treat $x_0^i$ as continuous and compute the value of the Gaussian density at $x_0^i$. Both approaches lead to an MSE though, since the integral is approximated in DDPM.

  3. ^ Dhariwal, P., & Nichol, A. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 2021.

  4. ^ Ho, J., & Salimans, T. "Classifier-Free Diffusion Guidance." NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.