Diffusion Primer
Diffusion models, a class of generative models, boil down to an MSE loss between the noise added to an image and the noise the network predicts.
Why would learning to predict the noise help with image generation (which is what diffusion is most used for)? How did we arrive at an MSE? This post dives deep into the math to answer these questions.
Background
One way to interpret diffusion is as a continuous VAE (Variational Autoencoder).
A VAE computes a lower bound on the likelihood of generating real data samples (logpθ(x)) by approximating the unknown posterior pθ(z|x) with a learnable one qϕ(z|x) [Fig. 1]:
$$
\begin{aligned}
-\log p_\theta(x) &\le -\log p_\theta(x) + D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) &&\text{[KL is always positive]}\\
&= -\log p_\theta(x) + \int q_\phi(z|x)\log\frac{q_\phi(z|x)}{p_\theta(z|x)}\,dz &&\text{[definition of KL]}\\
&= -\log p_\theta(x) + \int q_\phi(z|x)\log\frac{q_\phi(z|x)\,p_\theta(x)}{p_\theta(z,x)}\,dz &&\text{[conditional to joint]}\\
&= -\log p_\theta(x) + \int q_\phi(z|x)\Big(\log p_\theta(x) + \log\frac{q_\phi(z|x)}{p_\theta(z,x)}\Big)\,dz\\
&= -\log p_\theta(x) + \log p_\theta(x) + \int q_\phi(z|x)\log\frac{q_\phi(z|x)}{p_\theta(x|z)\,p_\theta(z)}\,dz &&\text{[$p_\theta(x)$ independent of $z$; joint to conditional]}\\
&= \mathbb{E}_{z\sim q_\phi(z|x)}\Big[\log\frac{q_\phi(z|x)}{p_\theta(z)} - \log p_\theta(x|z)\Big] &&\text{[definition of $\mathbb{E}$ for continuous $z$]}\\
&= -\mathbb{E}_{z\sim q_\phi(z|x)}\log p_\theta(x|z) + D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) &&\text{[tractable]}
\end{aligned}
$$
The process of encoding and decoding leads to regularized compression of images in the z space (hence the name auto encoder; variational comes from approximating distributions). During inference, random images can be generated from the decoder, based on sampling of z.
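As a small worked example of the final (tractable) form above: for the common choice qϕ(z|x) = N(μ, diag(σ²)) with prior pθ(z) = N(0, I), the KL term has a well-known closed form (a standard result, stated here without derivation and not part of the original post):

$$
D_{KL}\big(\mathcal N(\mu,\mathrm{diag}(\sigma^2))\,\|\,\mathcal N(0,I)\big) = \frac12\sum_{j=1}^{d}\big(\mu_j^2+\sigma_j^2-\log\sigma_j^2-1\big)
$$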
Derivation
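Figure 2: Markov chain of the forward [q] (reverse [pθ]) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020)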
Similar to a VAE, diffusion models compute a lower bound on the likelihood of generating real data samples (logpθ(x0)) by approximating the unknown posterior pθ(xt−1|xt) with a tractable one q(xt−1|xt,x0), for all t [Fig. 2]. The math mirrors the VAE derivation at every step t, so diffusion can be thought of as a continuous (many-step) version of the VAE:
$$
\begin{aligned}
-\log p_\theta(x_0) &\le -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T}|x_0)\,\|\,p_\theta(x_{1:T}|x_0)\big) &&\text{[KL is always positive]}\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{1:T}|x_0)}\Big]\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_0|x_{1:T})\,p_\theta(x_{1:T})/p_\theta(x_0)}\Big] &&\text{[Bayes' rule]}\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_0,x_{1:T})/p_\theta(x_0)}\Big] &&\text{[conditional to joint]}\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})/p_\theta(x_0)}\Big] &&\text{[merge joint]}\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\Big] + \log p_\theta(x_0)\\[4pt]
\mathbb{E}_q\big[-\log p_\theta(x_0)\big] &\le \mathbb{E}_q\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\Big]
\end{aligned}
$$
Expanding the R.H.S.:
$$
\begin{aligned}
\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}
&= \log\frac{\prod_{t=1}^T q(x_t|x_{t-1})}{p_\theta(x_T)\prod_{t=1}^T p_\theta(x_{t-1}|x_t)}\\
&= -\log p_\theta(x_T) + \sum_{t=1}^T \log\frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t)\,q(x_t)}{p_\theta(x_{t-1}|x_t)\,q(x_{t-1})} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)\,q(x_t|x_0)}{p_\theta(x_{t-1}|x_t)\,q(x_{t-1}|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} &&\text{[conditioning on $x_0$ for tractability]}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} + \sum_{t=2}^T \log\frac{q(x_t|x_0)}{q(x_{t-1}|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} + \log\frac{q(x_2|x_0)\cdots q(x_{T-1}|x_0)\,q(x_T|x_0)}{q(x_1|x_0)\,q(x_2|x_0)\cdots q(x_{T-1}|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} + \log\frac{q(x_T|x_0)}{q(x_1|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= \log q(x_T|x_0) - \log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} - \log p_\theta(x_0|x_1)\\
&= \log\frac{q(x_T|x_0)}{p_\theta(x_T)} + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} - \log p_\theta(x_0|x_1)
\end{aligned}
$$

Taking the expectation over q turns each log-ratio into a KL divergence:

$$
\mathbb{E}_q\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\Big]
= \underbrace{D_{KL}\big(q(x_T|x_0)\,\|\,p_\theta(x_T)\big)}_{L_T}
+ \sum_{t=2}^T \underbrace{D_{KL}\big(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big)}_{L_{t-1}}
\underbrace{-\,\mathbb{E}_q\log p_\theta(x_0|x_1)}_{L_0}
$$
Taking each piece apart:
LT: Can be ignored while training since q has no learnable parameters, and xT is just Gaussian noise.
Lt−1: Match the unknown (pθ) to the tractable quantity (q). Define the reverse process as:
$$p_\theta(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\,\mu_\theta(x_t,t),\,\Sigma_\theta(x_t,t)\big)$$
Simplifying to known variance:
$$p_\theta(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\,\mu_\theta(x_t,t),\,\sigma_t^2 I\big)$$
Solving for the tractable posterior q(xt−1|xt,x0):
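As a sketch of this step (a standard derivation, assuming the usual Gaussian forward process q(xt|xt−1) = N(xt; √αt xt−1, βt I); its closed form xt = √(¯αt) x0 + √(1−¯αt) ϵt is what the "[From 1]" annotations below refer to), Bayes' rule gives a product of Gaussians whose exponent is quadratic in xt−1, i.e. another Gaussian:

$$
\begin{aligned}
q(x_{t-1}|x_t,x_0) &= q(x_t|x_{t-1},x_0)\,\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}\\
&\propto \exp\Big(-\tfrac12\Big(\frac{(x_t-\sqrt{\alpha_t}\,x_{t-1})^2}{\beta_t} + \frac{(x_{t-1}-\sqrt{\bar\alpha_{t-1}}\,x_0)^2}{1-\bar\alpha_{t-1}} - \frac{(x_t-\sqrt{\bar\alpha_t}\,x_0)^2}{1-\bar\alpha_t}\Big)\Big)\\
&= \exp\Big(-\tfrac12\Big(\Big(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar\alpha_{t-1}}\Big)x_{t-1}^2 - 2\Big(\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}x_0\Big)x_{t-1} + C(x_t,x_0)\Big)\Big)
\end{aligned}
$$

Here C(xt,x0) collects the terms that do not involve xt−1.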
Computing the parameters (σ and μ) of the new Gaussian:
$$
\begin{aligned}
\sigma^2 = \tilde\beta_t &= 1\Big/\Big(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar\alpha_{t-1}}\Big) = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t\\
\tilde\mu_t(x_t,x_0) &= \Big(\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}x_0\Big)\,\sigma^2
= \Big(\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}x_0\Big)\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t
= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t + \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}x_0\\
\tilde\mu_t(x_t,x_0) &= \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_t\Big) &&\text{[From 1]}
\end{aligned}
$$
μθ must match ~μt, and since xt is given as input, the network only needs to learn to predict the noise ϵt at step t:
$$\mu_\theta(x_t,t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t)\Big)\qquad[2]$$
Which means
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t)\Big) + \sigma_t z,\qquad z\sim\mathcal N(0,I)\qquad[3]$$
The KL divergence between two Gaussians that differ only in their means leads to the following loss (using the Gaussian KL identity given after the derivation):
$$
\begin{aligned}
L_t &= D_{KL}\big(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big)
= \frac{1}{2\sigma_t^2}\big\|\tilde\mu_t(x_t,x_0)-\mu_\theta(x_t,t)\big\|^2\\
&= \frac{1}{2\sigma_t^2}\Big\|\frac{1}{\sqrt{\alpha_t}}\Big(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_t\Big)-\frac{1}{\sqrt{\alpha_t}}\Big(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t)\Big)\Big\|^2\\
&= \frac{(1-\alpha_t)^2}{2\alpha_t(1-\bar\alpha_t)\,\sigma_t^2}\big\|\epsilon_t-\epsilon_\theta(x_t,t)\big\|^2
\end{aligned}
$$
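The step from the KL divergence to the squared difference of means uses a standard identity (stated here for completeness, not derived in the post): for two Gaussians with the same isotropic covariance,

$$D_{KL}\big(\mathcal N(\mu_1,\sigma^2 I)\,\|\,\mathcal N(\mu_2,\sigma^2 I)\big)=\frac{\|\mu_1-\mu_2\|^2}{2\sigma^2}$$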
In the DDPM paper[1], it is found empirically that the training works better if the scaling factor is omitted, i.e., the loss can be simplified to
Lt=∥ϵt−ϵθ(xt,t)∥2
L0: The image (x0) consists of D discrete pixel values in {0,1,…,255}, scaled linearly to the range [−1,1]. The DDPM paper[1] suggests using an independent discrete decoder derived from the Gaussian (since Lt doesn’t apply here):
$$p_\theta(x_0|x_1) = \prod_{i=1}^{D}\int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal N\big(x;\,\mu_\theta^i(x_1,1),\,\sigma_1^2\big)\,dx$$
$$
\delta_+(x)=\begin{cases}\infty & \text{if } x = 1\\ x+\frac{1}{255} & \text{if } x < 1\end{cases}
\qquad
\delta_-(x)=\begin{cases}-\infty & \text{if } x = -1\\ x-\frac{1}{255} & \text{if } x > -1\end{cases}
$$
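Intuitively, the network predicts the mean of each pixel, which parameterizes a Gaussian distribution that is then integrated[2] over a small bin around the real pixel value of the original image, from the real value minus 1/255 to the real value plus 1/255. If the predicted mean is close to the original value of the pixel, the result of the integral will be high.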
The integral is approximated by the Gaussian probability density function times the bin width, which makes the math work out similarly to the Lt−1 case:

$$
\begin{aligned}
p_\theta(x_0|x_1) &\approx \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\Big(-\frac12\,\frac{\|x_0-\mu_\theta(x_1,1)\|^2}{\sigma_1^2}\Big)\Big(\frac{2}{255}\Big)\\
L_0 = -\log p_\theta(x_0|x_1) &\approx \frac{1}{2\sigma_1^2}\|x_0-\mu_\theta(x_1,1)\|^2 + C\\
&\approx \frac{1}{2\sigma_1^2}\Big\|\frac{1}{\sqrt{\alpha_1}}\big(x_1-\sqrt{1-\alpha_1}\,\epsilon_1\big)-\frac{1}{\sqrt{\alpha_1}}\Big(x_1-\frac{1-\alpha_1}{\sqrt{1-\alpha_1}}\,\epsilon_\theta(x_1,1)\Big)\Big\|^2 + C &&\text{[From 1, 2]}\\
&\approx \|\epsilon_1-\epsilon_\theta(x_1,1)\|^2
\end{aligned}
$$
Putting the pieces together, the DDPM paper[1] suggests using this loss function for training (followed by the algorithm):
$$L_{\text{simple}} := \mathbb{E}_{t\sim U[1,T],\,x_0,\,\epsilon}\Big[\|\epsilon-\epsilon_\theta(x_t,t)\|^2\Big] = \mathbb{E}_{t\sim U[1,T],\,x_0,\,\epsilon}\Big[\big\|\epsilon-\epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon,\;t\big)\big\|^2\Big]$$
Algorithm 1 DDPM Training
repeat
    x0 ∼ q(x0)
    t ∼ Uniform({1, …, T})
    ϵ ∼ N(0, I)
    take a gradient descent step on ∇θ ∥ϵ − ϵθ(√(¯αt) x0 + √(1−¯αt) ϵ, t)∥²
until converged
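As a concrete illustration of Algorithm 1, here is a minimal PyTorch-style sketch of one training step. The names `model(x_t, t)` (a network that predicts the noise) and `alpha_bar` (a precomputed tensor of the cumulative products ¯αt) are assumptions for this sketch, not names from the post.

```python
import torch

def ddpm_training_step(model, x0, alpha_bar, optimizer):
    """One DDPM training step (Algorithm 1): regress the added noise with MSE."""
    batch = x0.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)        # t ~ Uniform({1..T}), 0-indexed
    eps = torch.randn_like(x0)                                  # ε ~ N(0, I)
    ab = alpha_bar[t].view(batch, *([1] * (x0.dim() - 1)))      # broadcast ᾱ_t over image dims
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                # x_t = √ᾱ_t·x0 + √(1-ᾱ_t)·ε
    loss = torch.nn.functional.mse_loss(model(x_t, t), eps)     # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```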
We saw how we can go from noise to an image via the reverse process, but how do we go from mapping random noise to a random image (an aggregate of the training set) to mapping random noise to an image of interest? The next section talks about this.
Conditioning
Dhariwal & Nichol (2021)[3] use the gradients ∇xt log fϕ(y|xt) of a classifier fϕ(y|xt,t) to condition diffusion sampling on y (e.g. a target class label), where xt is the noisy image at step t. The score function for the joint distribution q(xt,y) is computed as,
$$
\begin{aligned}
\nabla_{x_t}\log q(x_t,y) &= \nabla_{x_t}\log q(x_t) + \nabla_{x_t}\log q(y|x_t)\\
&\approx -\frac{1}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t) + \nabla_{x_t}\log f_\phi(y|x_t)\\
&= -\frac{1}{\sqrt{1-\bar\alpha_t}}\Big(\epsilon_\theta(x_t,t) - \sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\log f_\phi(y|x_t)\Big)
\end{aligned}
$$
Thus, a new classifier-guided predictor ¯ϵθ modifies the initial predictor ϵθ as follows,
$$\bar\epsilon_\theta(x_t,t) = \epsilon_\theta(x_t,t) - \sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\log f_\phi(y|x_t)$$
To control the strength of the classifier guidance, a weight w is applied to the guidance (delta) term,
$$\bar\epsilon_\theta(x_t,t) = \epsilon_\theta(x_t,t) - \sqrt{1-\bar\alpha_t}\;w\,\nabla_{x_t}\log f_\phi(y|x_t)\qquad[4]$$
Ho & Salimans (2021)[4] propose running conditional diffusion without a separate classifier fϕ. Consider an unconditional denoising diffusion model pθ(x) parameterized through a score estimator ϵθ(xt,t) and a conditional model pθ(x|y) parameterized through ϵθ(xt,t,y). These two models can be learned via a single neural network: a conditional diffusion model pθ(x|y) is trained on paired data (x,y), where the conditioning information y is randomly dropped during training so that the model also learns to generate images unconditionally, i.e. ϵθ(xt,t)=ϵθ(xt,t,y=∅).
An implicit classifier can be formed via Bayes' rule: p(y|xt) is proportional to p(xt|y)/p(xt). Taking the log and then the gradient with respect to xt,
$$
\begin{aligned}
\nabla_{x_t}\log p(y|x_t) &= \nabla_{x_t}\log p(x_t|y) - \nabla_{x_t}\log p(x_t)
= -\frac{1}{\sqrt{1-\bar\alpha_t}}\Big(\epsilon_\theta(x_t,t,y) - \epsilon_\theta(x_t,t)\Big)\\
\bar\epsilon_\theta(x_t,t,y) &= \epsilon_\theta(x_t,t,y) - \sqrt{1-\bar\alpha_t}\;w\,\nabla_{x_t}\log p(y|x_t) &&\text{[From 4]}\\
&= \epsilon_\theta(x_t,t,y) + w\big(\epsilon_\theta(x_t,t,y) - \epsilon_\theta(x_t,t)\big)\\
&= (w+1)\,\epsilon_\theta(x_t,t,y) - w\,\epsilon_\theta(x_t,t)
\end{aligned}
$$
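A minimal sketch of this final combination, assuming a hypothetical `model(x_t, t, y)` that accepts `y=None` for the unconditional branch (a common but not universal convention, not taken from the post):

```python
def cfg_epsilon(model, x_t, t, y, w):
    """Classifier-free guided noise prediction: (w + 1)·ε_θ(x_t, t, y) − w·ε_θ(x_t, t)."""
    eps_cond = model(x_t, t, y)        # conditional prediction ε_θ(x_t, t, y)
    eps_uncond = model(x_t, t, None)   # unconditional prediction ε_θ(x_t, t, ∅)
    return (w + 1) * eps_cond - w * eps_uncond
```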
Below is a comparison of sampling algorithms across DDPM, DDIM and DDIM with classifier guidance.
Algorithm 2 DDPM
Input: diffusion model ϵθ(xt, t)
xT ∼ N(0, I)
for t from T to 1:
    z ∼ N(0, I) if t > 1, else z = 0
    xt−1 ← (1/√αt) (xt − ((1−αt)/√(1−¯αt)) ϵθ(xt, t)) + σt z
end for
return x0
Algorithm 3 DDIM
Input: diffusion model ϵθ(xt, t)
xT ∼ N(0, I)
for t from T to 1:
    xt−1 ← √(¯αt−1) ((xt − √(1−¯αt) ϵθ(xt, t)) / √(¯αt)) + √(1−¯αt−1) ϵθ(xt, t)
end for
return x0
Algorithm 4 DDIM with classifier guidance
Input: class label y, diffusion model ϵθ(xt, t) and classifier fϕ(y|xt)
xT ∼ N(0, I)
for t from T to 1:
    ^ϵ ← ϵθ(xt, t) − √(1−¯αt) ∇xt log fϕ(y|xt)
    xt−1 ← √(¯αt−1) ((xt − √(1−¯αt) ^ϵ) / √(¯αt)) + √(1−¯αt−1) ^ϵ
end for
return x0
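For concreteness, here is a minimal PyTorch-style sketch of Algorithm 2 (DDPM sampling). The tensors `alpha`, `alpha_bar`, `sigma` (holding αt, ¯αt, σt for t = 1…T) and the noise-prediction `model(x, t)` are assumed names for this sketch, not names from the post.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, alpha, alpha_bar, sigma):
    """Algorithm 2: x_{t-1} = (x_t - (1-α_t)/√(1-ᾱ_t)·ε_θ(x_t,t)) / √α_t + σ_t·z."""
    T = alpha.shape[0]
    x = torch.randn(shape)                                     # x_T ~ N(0, I)
    for t in reversed(range(T)):                               # t = T..1 (0-indexed here)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                                # predicted noise ε_θ(x_t, t)
        x = (x - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt() + sigma[t] * z
    return x
```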
Optimization
So far we discussed training and sampling algorithms, but what is a good value for T? Can we make the sampling process more efficient and reduce the number of steps? We explore these questions here.
Langevin dynamics can generate samples from a probability density q(x) using only the gradients ∇xlogq(x) in a Markov chain of updates:
$$x_t = x_{t-1} + \frac{\delta}{2}\,\nabla_x\log q(x_{t-1}) + \sqrt{\delta}\,\epsilon_t,\qquad \epsilon_t\sim\mathcal N(0,I)$$
where δ is the step size. As T→∞ and δ→0, xT converges to a sample from the true density q(x). Compared to SGD, Langevin dynamics injects Gaussian noise into the updates, which helps avoid collapsing into local minima.
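A minimal sketch of this update rule, assuming a hypothetical `score_fn(x)` that returns ∇x log q(x) (e.g. a trained score network):

```python
import torch

def langevin_sample(score_fn, x_init, delta=1e-3, steps=1000):
    """Langevin dynamics: x ← x + (δ/2)·∇_x log q(x) + √δ·ε, with ε ~ N(0, I)."""
    x = x_init.clone()
    for _ in range(steps):
        eps = torch.randn_like(x)
        x = x + 0.5 * delta * score_fn(x) + delta ** 0.5 * eps
    return x
```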
Score-based generative modeling trains a score network to estimate the above gradient i.e. sθ(x)≈∇xlogq(x). To make the data points cover the whole space, the score network jointly estimates the scores of data perturbed at different noise levels (Noise Conditional Score Network) i.e. sθ(xt,t)≈∇xtlogq(xt). The schedule of increasing noise levels resembles the forward diffusion process.
Given a Gaussian distribution x∼N(μ,σ²I), the gradient of the log of its density is

$$\nabla_x\log p(x) = \nabla_x\Big(-\frac12\Big(\frac{x-\mu}{\sigma}\Big)^2\Big) = -\frac{x-\mu}{\sigma^2} = -\frac{\epsilon}{\sigma},\qquad \text{where } x=\mu+\sigma\epsilon,\ \epsilon\sim\mathcal N(0,I).$$
Recall that q(xt|x0) = N(xt; √(¯αt) x0, (1−¯αt) I). Therefore,

$$s_\theta(x_t,t)\approx\nabla_{x_t}\log q(x_t)=\mathbb{E}_{q(x_0)}\big[\nabla_{x_t}\log q(x_t|x_0)\big]=\mathbb{E}_{q(x_0)}\Big[-\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar\alpha_t}}\Big]=-\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar\alpha_t}}$$
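Figure 3: Non-Markovian chain of the forward [q] (reverse [pθ]) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Song et al. 2022)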
To make diffusion more efficient, DDIM [Fig. 3] proposes fewer sampling steps by modeling the forward process as a non-Markovian chain: q(x1:T|x0) = q(xT|x0) ∏_{t=2}^{T} q(xt−1|xt,x0) (DDPM omits the x0 conditioning). Similar to Eq. [3], xt−1 can be written as
$$
\begin{aligned}
x_{t-1} &= \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t,t) + \sigma_t\epsilon_t\\
&= \underbrace{\sqrt{\bar\alpha_{t-1}}\Big(\frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}\Big)}_{\text{``predicted } x_0\text{''}}
+ \underbrace{\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t,t)}_{\text{``direction pointing to } x_t\text{''}}
+ \underbrace{\sigma_t\epsilon_t}_{\text{random noise}}\\
&= \sqrt{\bar\alpha_{t-1}}\Big(\frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}\Big) + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t,t) &&[\sigma_t=0]
\end{aligned}
$$
If σt is 0, the sampling process is deterministic, and because a prediction of x0 (which is what we ultimately want) is available at every step, the update can be run over a shorter subsequence of timesteps τ ⊂ [1..T], i.e. some of the T steps can be skipped.
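Here is a minimal PyTorch-style sketch of deterministic DDIM sampling (σ = 0) over such a subsequence; `alpha_bar`, `taus`, and `model(x, t)` are assumed names for this sketch, not taken from the post.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, taus):
    """Deterministic DDIM: x_{t-1} = √ᾱ_{t-1}·(predicted x0) + √(1-ᾱ_{t-1})·ε_θ(x_t, t)."""
    x = torch.randn(shape)                                      # start from pure noise
    for i, t in enumerate(taus):                                # taus: decreasing timestep indices
        t_prev = taus[i + 1] if i + 1 < len(taus) else None
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[t_prev] if t_prev is not None else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)
        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()   # "predicted x0"
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
    return x
```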
Conclusion
We answered the questions we started with: how the MSE loss arises, why predicting noise enables image generation, how sampling can be made efficient, and how diffusion connects back to the VAE. Many additional papers have come out since this initial set, but this post focuses on the fundamentals and therefore ends here.
References
Denoising Diffusion Probabilistic Models
The integration of the Gaussian distribution over the given range is necessary because x0 is discrete. The alternative is to treat x0 as continuous and compute the value of the Gaussian density at x0. Both approaches lead to an MSE though, since the integral is approximated in DDPM.
Diffusion Models Beat GANs on Image Synthesis
Denoising Diffusion Implicit Models
https://dzdata.medium.com/intro-to-diffusion-model-part-3-5d699e5f0714
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/#nice
https://medium.com/better-programming/diffusion-models-ddpms-ddims-and-classifier-free-guidance-e07b297b2869
https://mbernste.github.io/posts/vae/