Diffusion Primer
Diffusion models, a class of generative models, boil down to an MSE loss between the noise added to an image and the noise the network predicts.
Why would learning to predict the noise help with image generation (which is what diffusion is most used for)? How did we arrive at an MSE? This post dives deep into the math to answer these questions.
Background
One way to interpret diffusion is as a continuous VAE (Variational Autoencoder).
A VAE computes a lower bound on the likelihood of generating real data samples (logpθ(x)) by approximating the unknown posterior pθ(z|x) with a learnable one qϕ(z|x) [Fig. 1]:
$$
\begin{aligned}
-\log p_\theta(x) &\le -\log p_\theta(x) + D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) &&\text{[KL is always positive]}\\
&= -\log p_\theta(x) + \int q_\phi(z|x)\log\frac{q_\phi(z|x)}{p_\theta(z|x)}\,dz &&\text{[definition of KL]}\\
&= -\log p_\theta(x) + \int q_\phi(z|x)\log\frac{q_\phi(z|x)\,p_\theta(x)}{p_\theta(z,x)}\,dz &&\text{[conditional to joint]}\\
&= -\log p_\theta(x) + \int q_\phi(z|x)\Big(\log p_\theta(x) + \log\frac{q_\phi(z|x)}{p_\theta(z,x)}\Big)\,dz\\
&= -\log p_\theta(x) + \log p_\theta(x) + \int q_\phi(z|x)\log\frac{q_\phi(z|x)}{p_\theta(x|z)\,p_\theta(z)}\,dz &&\text{[$p_\theta(x)$ independent of $z$; joint to conditional]}\\
&= \mathbb{E}_{z\sim q_\phi(z|x)}\Big[\log\frac{q_\phi(z|x)}{p_\theta(z)} - \log p_\theta(x|z)\Big] &&\text{[definition of $\mathbb{E}$ for continuous $z$]}\\
&= -\mathbb{E}_{z\sim q_\phi(z|x)}\log p_\theta(x|z) + D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z)\big) &&\text{[tractable]}
\end{aligned}
$$
The process of encoding and decoding leads to regularized compression of images in the z space (hence the name auto encoder; variational comes from approximating distributions). During inference, random images can be generated from the decoder, based on sampling of z.
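As a small worked example of the final (tractable) form above: for the common choice qϕ(z|x) = N(μ, diag(σ²)) with prior pθ(z) = N(0, I), the KL term has a well-known closed form (a standard result, stated here without derivation and not part of the original post):

$$
D_{KL}\big(\mathcal N(\mu,\mathrm{diag}(\sigma^2))\,\|\,\mathcal N(0,I)\big) = \frac12\sum_{j=1}^{d}\big(\mu_j^2+\sigma_j^2-\log\sigma_j^2-1\big)
$$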
Derivation
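Figure 2: Markov chain of the forward [q] (reverse [pθ]) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020)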
Similar to a VAE, diffusion models compute a lower bound on the likelihood of generating real data samples (logpθ(x0)) by approximating the unknown posterior pθ(xt−1|xt) with a tractable one q(xt−1|xt,x0), for all t [Fig. 2]. The math mirrors the VAE derivation at every step t, so diffusion can be thought of as a continuous (many-step) version of the VAE:
$$
\begin{aligned}
-\log p_\theta(x_0) &\le -\log p_\theta(x_0) + D_{KL}\big(q(x_{1:T}|x_0)\,\|\,p_\theta(x_{1:T}|x_0)\big) &&\text{[KL is always positive]}\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{1:T}|x_0)}\Big]\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_0|x_{1:T})\,p_\theta(x_{1:T})/p_\theta(x_0)}\Big] &&\text{[Bayes' rule]}\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_0,x_{1:T})/p_\theta(x_0)}\Big] &&\text{[conditional to joint]}\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})/p_\theta(x_0)}\Big] &&\text{[merge joint]}\\
&= -\log p_\theta(x_0) + \mathbb{E}_{x_{1:T}\sim q(x_{1:T}|x_0)}\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\Big] + \log p_\theta(x_0)\\[4pt]
\mathbb{E}_q\big[-\log p_\theta(x_0)\big] &\le \mathbb{E}_q\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\Big]
\end{aligned}
$$
Expanding the R.H.S.:
$$
\begin{aligned}
\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}
&= \log\frac{\prod_{t=1}^T q(x_t|x_{t-1})}{p_\theta(x_T)\prod_{t=1}^T p_\theta(x_{t-1}|x_t)}\\
&= -\log p_\theta(x_T) + \sum_{t=1}^T \log\frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_t|x_{t-1})}{p_\theta(x_{t-1}|x_t)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t)\,q(x_t)}{p_\theta(x_{t-1}|x_t)\,q(x_{t-1})} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)\,q(x_t|x_0)}{p_\theta(x_{t-1}|x_t)\,q(x_{t-1}|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)} &&\text{[conditioning on $x_0$ for tractability]}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} + \sum_{t=2}^T \log\frac{q(x_t|x_0)}{q(x_{t-1}|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} + \log\frac{q(x_2|x_0)\cdots q(x_{T-1}|x_0)\,q(x_T|x_0)}{q(x_1|x_0)\,q(x_2|x_0)\cdots q(x_{T-1}|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= -\log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} + \log\frac{q(x_T|x_0)}{q(x_1|x_0)} + \log\frac{q(x_1|x_0)}{p_\theta(x_0|x_1)}\\
&= \log q(x_T|x_0) - \log p_\theta(x_T) + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} - \log p_\theta(x_0|x_1)\\
&= \log\frac{q(x_T|x_0)}{p_\theta(x_T)} + \sum_{t=2}^T \log\frac{q(x_{t-1}|x_t,x_0)}{p_\theta(x_{t-1}|x_t)} - \log p_\theta(x_0|x_1)
\end{aligned}
$$

Taking the expectation over q turns each log-ratio into a KL divergence:

$$
\mathbb{E}_q\Big[\log\frac{q(x_{1:T}|x_0)}{p_\theta(x_{0:T})}\Big]
= \underbrace{D_{KL}\big(q(x_T|x_0)\,\|\,p_\theta(x_T)\big)}_{L_T}
+ \sum_{t=2}^T \underbrace{D_{KL}\big(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big)}_{L_{t-1}}
\underbrace{-\,\mathbb{E}_q\log p_\theta(x_0|x_1)}_{L_0}
$$
Taking each piece apart:
LT: Can be ignored while training since q has no learnable parameters, and xT is just Gaussian noise.
Lt−1: Match the unknown (pθ) to the tractable quantity (q). Define the reverse process as:
$$p_\theta(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\,\mu_\theta(x_t,t),\,\Sigma_\theta(x_t,t)\big)$$
Simplifying to known variance:
$$p_\theta(x_{t-1}|x_t) = \mathcal{N}\big(x_{t-1};\,\mu_\theta(x_t,t),\,\sigma_t^2 I\big)$$
Solving for the tractable posterior q(xt−1|xt,x0):
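As a sketch of this step (a standard derivation, assuming the usual Gaussian forward process q(xt|xt−1) = N(xt; √αt xt−1, βt I); its closed form xt = √(¯αt) x0 + √(1−¯αt) ϵt is what the "[From 1]" annotations below refer to), Bayes' rule gives a product of Gaussians whose exponent is quadratic in xt−1, i.e. another Gaussian:

$$
\begin{aligned}
q(x_{t-1}|x_t,x_0) &= q(x_t|x_{t-1},x_0)\,\frac{q(x_{t-1}|x_0)}{q(x_t|x_0)}\\
&\propto \exp\Big(-\tfrac12\Big(\frac{(x_t-\sqrt{\alpha_t}\,x_{t-1})^2}{\beta_t} + \frac{(x_{t-1}-\sqrt{\bar\alpha_{t-1}}\,x_0)^2}{1-\bar\alpha_{t-1}} - \frac{(x_t-\sqrt{\bar\alpha_t}\,x_0)^2}{1-\bar\alpha_t}\Big)\Big)\\
&= \exp\Big(-\tfrac12\Big(\Big(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar\alpha_{t-1}}\Big)x_{t-1}^2 - 2\Big(\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}x_0\Big)x_{t-1} + C(x_t,x_0)\Big)\Big)
\end{aligned}
$$

Here C(xt,x0) collects the terms that do not involve xt−1.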
Computing the parameters (σ and μ) of the new Gaussian:
$$
\begin{aligned}
\sigma^2 = \tilde\beta_t &= 1\Big/\Big(\frac{\alpha_t}{\beta_t}+\frac{1}{1-\bar\alpha_{t-1}}\Big) = \frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t\\
\tilde\mu_t(x_t,x_0) &= \Big(\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}x_0\Big)\,\sigma^2
= \Big(\frac{\sqrt{\alpha_t}}{\beta_t}x_t + \frac{\sqrt{\bar\alpha_{t-1}}}{1-\bar\alpha_{t-1}}x_0\Big)\frac{1-\bar\alpha_{t-1}}{1-\bar\alpha_t}\,\beta_t
= \frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}x_t + \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}x_0\\
\tilde\mu_t(x_t,x_0) &= \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_t\Big) &&\text{[From 1]}
\end{aligned}
$$
μθ must match ~μt, and since xt is given as input, the network only needs to learn to predict the noise ϵt at step t:
$$\mu_\theta(x_t,t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t)\Big)\qquad[2]$$
Which means
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t)\Big) + \sigma_t z,\qquad z\sim\mathcal N(0,I)\qquad[3]$$
The KL divergence between two Gaussians that differ only in their means leads to the following loss (using the Gaussian KL identity given after the derivation):
$$
\begin{aligned}
L_t &= D_{KL}\big(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\big)
= \frac{1}{2\sigma_t^2}\big\|\tilde\mu_t(x_t,x_0)-\mu_\theta(x_t,t)\big\|^2\\
&= \frac{1}{2\sigma_t^2}\Big\|\frac{1}{\sqrt{\alpha_t}}\Big(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_t\Big)-\frac{1}{\sqrt{\alpha_t}}\Big(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t)\Big)\Big\|^2\\
&= \frac{(1-\alpha_t)^2}{2\alpha_t(1-\bar\alpha_t)\,\sigma_t^2}\big\|\epsilon_t-\epsilon_\theta(x_t,t)\big\|^2
\end{aligned}
$$
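The step from the KL divergence to the squared difference of means uses a standard identity (stated here for completeness, not derived in the post): for two Gaussians with the same isotropic covariance,

$$D_{KL}\big(\mathcal N(\mu_1,\sigma^2 I)\,\|\,\mathcal N(\mu_2,\sigma^2 I)\big)=\frac{\|\mu_1-\mu_2\|^2}{2\sigma^2}$$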
In the DDPM paper[1], it is found empirically that the training works better if the scaling factor is omitted, i.e., the loss can be simplified to
Lt=∥ϵt−ϵθ(xt,t)∥2
L0: The image (x0) consists of D discrete pixel values in {0,1,…,255}, scaled linearly to the range [−1,1]. The DDPM paper[1] suggests using an independent discrete decoder derived from the Gaussian (since Lt doesn’t apply here):
$$p_\theta(x_0|x_1) = \prod_{i=1}^{D}\int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal N\big(x;\,\mu_\theta^i(x_1,1),\,\sigma_1^2\big)\,dx$$
$$
\delta_+(x)=\begin{cases}\infty & \text{if } x = 1\\ x+\frac{1}{255} & \text{if } x < 1\end{cases}
\qquad
\delta_-(x)=\begin{cases}-\infty & \text{if } x = -1\\ x-\frac{1}{255} & \text{if } x > -1\end{cases}
$$
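Intuitively, the network predicts the mean of each pixel, which parameterizes a Gaussian distribution that is then integrated[2] over a small bin around the real pixel value of the original image, from the real value minus 1/255 to the real value plus 1/255. If the predicted mean is close to the original value of the pixel, the result of the integral will be high.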
The integral is approximated by the Gaussian probability density function times the bin width, which makes the math work out similarly to the Lt−1 case:

$$
\begin{aligned}
p_\theta(x_0|x_1) &\approx \frac{1}{\sqrt{2\pi}\,\sigma_1}\exp\Big(-\frac12\,\frac{\|x_0-\mu_\theta(x_1,1)\|^2}{\sigma_1^2}\Big)\Big(\frac{2}{255}\Big)\\
L_0 = -\log p_\theta(x_0|x_1) &\approx \frac{1}{2\sigma_1^2}\|x_0-\mu_\theta(x_1,1)\|^2 + C\\
&\approx \frac{1}{2\sigma_1^2}\Big\|\frac{1}{\sqrt{\alpha_1}}\big(x_1-\sqrt{1-\alpha_1}\,\epsilon_1\big)-\frac{1}{\sqrt{\alpha_1}}\Big(x_1-\frac{1-\alpha_1}{\sqrt{1-\alpha_1}}\,\epsilon_\theta(x_1,1)\Big)\Big\|^2 + C &&\text{[From 1, 2]}\\
&\approx \|\epsilon_1-\epsilon_\theta(x_1,1)\|^2
\end{aligned}
$$
Putting the pieces together, the DDPM paper[1] suggests using this loss function for training (followed by the algorithm):
$$L_{\text{simple}} := \mathbb{E}_{t\sim U[1,T],\,x_0,\,\epsilon}\Big[\|\epsilon-\epsilon_\theta(x_t,t)\|^2\Big] = \mathbb{E}_{t\sim U[1,T],\,x_0,\,\epsilon}\Big[\big\|\epsilon-\epsilon_\theta\big(\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon,\;t\big)\big\|^2\Big]$$
Algorithm 1 DDPM Training
repeat
    x0 ∼ q(x0)
    t ∼ Uniform({1, …, T})
    ϵ ∼ N(0, I)
    take a gradient descent step on ∇θ ∥ϵ − ϵθ(√(¯αt) x0 + √(1−¯αt) ϵ, t)∥²
until converged
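As a concrete illustration of Algorithm 1, here is a minimal PyTorch-style sketch of one training step. The names `model(x_t, t)` (a network that predicts the noise) and `alpha_bar` (a precomputed tensor of the cumulative products ¯αt) are assumptions for this sketch, not names from the post.

```python
import torch

def ddpm_training_step(model, x0, alpha_bar, optimizer):
    """One DDPM training step (Algorithm 1): regress the added noise with MSE."""
    batch = x0.shape[0]
    T = alpha_bar.shape[0]
    t = torch.randint(0, T, (batch,), device=x0.device)        # t ~ Uniform({1..T}), 0-indexed
    eps = torch.randn_like(x0)                                  # ε ~ N(0, I)
    ab = alpha_bar[t].view(batch, *([1] * (x0.dim() - 1)))      # broadcast ᾱ_t over image dims
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps                # x_t = √ᾱ_t·x0 + √(1-ᾱ_t)·ε
    loss = torch.nn.functional.mse_loss(model(x_t, t), eps)     # L_simple
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```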
We saw how we can go from noise to an image via the reverse process, but how do we go from mapping random noise to a random image (an aggregate of the training set) to mapping random noise to an image of interest? The next section talks about this.
Conditioning
Dhariwal & Nichol (2021)[3] use the gradients ∇xt log fϕ(y|xt) of a classifier fϕ(y|xt,t) to condition diffusion sampling on y (e.g. a target class label), where xt is the noisy image at step t. The score function for the joint distribution q(xt,y) is computed as,
$$
\begin{aligned}
\nabla_{x_t}\log q(x_t,y) &= \nabla_{x_t}\log q(x_t) + \nabla_{x_t}\log q(y|x_t)\\
&\approx -\frac{1}{\sqrt{1-\bar\alpha_t}}\,\epsilon_\theta(x_t,t) + \nabla_{x_t}\log f_\phi(y|x_t)\\
&= -\frac{1}{\sqrt{1-\bar\alpha_t}}\Big(\epsilon_\theta(x_t,t) - \sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\log f_\phi(y|x_t)\Big)
\end{aligned}
$$
Thus, a new classifier-guided predictor ¯ϵθ modifies the initial predictor ϵθ as follows,
$$\bar\epsilon_\theta(x_t,t) = \epsilon_\theta(x_t,t) - \sqrt{1-\bar\alpha_t}\,\nabla_{x_t}\log f_\phi(y|x_t)$$
To control the strength of the classifier guidance, a weight w is applied to the guidance (delta) term,
$$\bar\epsilon_\theta(x_t,t) = \epsilon_\theta(x_t,t) - \sqrt{1-\bar\alpha_t}\;w\,\nabla_{x_t}\log f_\phi(y|x_t)\qquad[4]$$
Ho & Salimans (2021)[4] propose running conditional diffusion without a separate classifier fϕ. Consider an unconditional denoising diffusion model pθ(x) parameterized through a score estimator ϵθ(xt,t) and a conditional model pθ(x|y) parameterized through ϵθ(xt,t,y). These two models can be learned via a single neural network: a conditional diffusion model pθ(x|y) is trained on paired data (x,y), where the conditioning information y is randomly dropped during training so that the model also learns to generate images unconditionally, i.e. ϵθ(xt,t)=ϵθ(xt,t,y=∅).
An implicit classifier can be formed via Bayes' rule: p(y|xt) is proportional to p(xt|y)/p(xt). Taking the log and then the gradient with respect to xt,
$$
\begin{aligned}
\nabla_{x_t}\log p(y|x_t) &= \nabla_{x_t}\log p(x_t|y) - \nabla_{x_t}\log p(x_t)
= -\frac{1}{\sqrt{1-\bar\alpha_t}}\Big(\epsilon_\theta(x_t,t,y) - \epsilon_\theta(x_t,t)\Big)\\
\bar\epsilon_\theta(x_t,t,y) &= \epsilon_\theta(x_t,t,y) - \sqrt{1-\bar\alpha_t}\;w\,\nabla_{x_t}\log p(y|x_t) &&\text{[From 4]}\\
&= \epsilon_\theta(x_t,t,y) + w\big(\epsilon_\theta(x_t,t,y) - \epsilon_\theta(x_t,t)\big)\\
&= (w+1)\,\epsilon_\theta(x_t,t,y) - w\,\epsilon_\theta(x_t,t)
\end{aligned}
$$
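A minimal sketch of this final combination, assuming a hypothetical `model(x_t, t, y)` that accepts `y=None` for the unconditional branch (a common but not universal convention, not taken from the post):

```python
def cfg_epsilon(model, x_t, t, y, w):
    """Classifier-free guided noise prediction: (w + 1)·ε_θ(x_t, t, y) − w·ε_θ(x_t, t)."""
    eps_cond = model(x_t, t, y)        # conditional prediction ε_θ(x_t, t, y)
    eps_uncond = model(x_t, t, None)   # unconditional prediction ε_θ(x_t, t, ∅)
    return (w + 1) * eps_cond - w * eps_uncond
```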
Below is a comparison of sampling algorithms across DDPM, DDIM and DDIM with classifier guidance.
Algorithm 2 DDPM
Input: diffusion model ϵθ(xt, t)
xT ∼ N(0, I)
for t from T to 1:
    z ∼ N(0, I) if t > 1, else z = 0
    xt−1 ← (1/√αt) (xt − ((1−αt)/√(1−¯αt)) ϵθ(xt, t)) + σt z
end for
return x0
Algorithm 3 DDIM
Input: diffusion model ϵθ(xt, t)
xT ∼ N(0, I)
for t from T to 1:
    xt−1 ← √(¯αt−1) ((xt − √(1−¯αt) ϵθ(xt, t)) / √(¯αt)) + √(1−¯αt−1) ϵθ(xt, t)
end for
return x0
Algorithm 4 DDIM with classifier guidance
Input: class label y, diffusion model ϵθ(xt, t) and classifier fϕ(y|xt)
xT ∼ N(0, I)
for t from T to 1:
    ^ϵ ← ϵθ(xt, t) − √(1−¯αt) ∇xt log fϕ(y|xt)
    xt−1 ← √(¯αt−1) ((xt − √(1−¯αt) ^ϵ) / √(¯αt)) + √(1−¯αt−1) ^ϵ
end for
return x0
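For concreteness, here is a minimal PyTorch-style sketch of Algorithm 2 (DDPM sampling). The tensors `alpha`, `alpha_bar`, `sigma` (holding αt, ¯αt, σt for t = 1…T) and the noise-prediction `model(x, t)` are assumed names for this sketch, not names from the post.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, alpha, alpha_bar, sigma):
    """Algorithm 2: x_{t-1} = (x_t - (1-α_t)/√(1-ᾱ_t)·ε_θ(x_t,t)) / √α_t + σ_t·z."""
    T = alpha.shape[0]
    x = torch.randn(shape)                                     # x_T ~ N(0, I)
    for t in reversed(range(T)):                               # t = T..1 (0-indexed here)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                                # predicted noise ε_θ(x_t, t)
        x = (x - (1 - alpha[t]) / (1 - alpha_bar[t]).sqrt() * eps) / alpha[t].sqrt() + sigma[t] * z
    return x
```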
Optimization
So far we discussed training and sampling algorithms, but what is a good value for T? Can we make the sampling process more efficient and reduce the number of steps? We explore these questions here.
Langevin dynamics can generate samples from a probability density q(x) using only the gradients ∇xlogq(x) in a Markov chain of updates:
$$x_t = x_{t-1} + \frac{\delta}{2}\,\nabla_x\log q(x_{t-1}) + \sqrt{\delta}\,\epsilon_t,\qquad \epsilon_t\sim\mathcal N(0,I)$$
where δ is the step size. As T→∞ and δ→0, xT converges to a sample from the true density q(x). Compared to SGD, Langevin dynamics injects Gaussian noise into the updates, which helps avoid collapsing into local minima.
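A minimal sketch of this update rule, assuming a hypothetical `score_fn(x)` that returns ∇x log q(x) (e.g. a trained score network):

```python
import torch

def langevin_sample(score_fn, x_init, delta=1e-3, steps=1000):
    """Langevin dynamics: x ← x + (δ/2)·∇_x log q(x) + √δ·ε, with ε ~ N(0, I)."""
    x = x_init.clone()
    for _ in range(steps):
        eps = torch.randn_like(x)
        x = x + 0.5 * delta * score_fn(x) + delta ** 0.5 * eps
    return x
```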
Score-based generative modeling trains a score network to estimate the above gradient i.e. sθ(x)≈∇xlogq(x). To make the data points cover the whole space, the score network jointly estimates the scores of data perturbed at different noise levels (Noise Conditional Score Network) i.e. sθ(xt,t)≈∇xtlogq(xt). The schedule of increasing noise levels resembles the forward diffusion process.
Given a Gaussian distribution x∼N(μ,σ²I), the gradient of the log of its density is

$$\nabla_x\log p(x) = \nabla_x\Big(-\frac12\Big(\frac{x-\mu}{\sigma}\Big)^2\Big) = -\frac{x-\mu}{\sigma^2} = -\frac{\epsilon}{\sigma},\qquad \text{where } x=\mu+\sigma\epsilon,\ \epsilon\sim\mathcal N(0,I).$$
Recall that q(xt|x0) = N(xt; √(¯αt) x0, (1−¯αt) I). Therefore,

$$s_\theta(x_t,t)\approx\nabla_{x_t}\log q(x_t)=\mathbb{E}_{q(x_0)}\big[\nabla_{x_t}\log q(x_t|x_0)\big]=\mathbb{E}_{q(x_0)}\Big[-\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar\alpha_t}}\Big]=-\frac{\epsilon_\theta(x_t,t)}{\sqrt{1-\bar\alpha_t}}$$
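Figure 3: Non-Markovian chain of the forward [q] (reverse [pθ]) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Song et al. 2022)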
To make diffusion more efficient, DDIM [Fig. 3] proposes fewer sampling steps by modeling the forward process as a non-Markovian chain: q(x1:T|x0) = q(xT|x0) ∏_{t=2}^{T} q(xt−1|xt,x0) (DDPM omits the x0 conditioning). Similar to Eq. [3], xt−1 can be written as
$$
\begin{aligned}
x_{t-1} &= \sqrt{\bar\alpha_{t-1}}\,x_0 + \sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t,t) + \sigma_t\epsilon_t\\
&= \underbrace{\sqrt{\bar\alpha_{t-1}}\Big(\frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}\Big)}_{\text{``predicted } x_0\text{''}}
+ \underbrace{\sqrt{1-\bar\alpha_{t-1}-\sigma_t^2}\,\epsilon_\theta(x_t,t)}_{\text{``direction pointing to } x_t\text{''}}
+ \underbrace{\sigma_t\epsilon_t}_{\text{random noise}}\\
&= \sqrt{\bar\alpha_{t-1}}\Big(\frac{x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar\alpha_t}}\Big) + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t,t) &&[\sigma_t=0]
\end{aligned}
$$
If σt is 0, the sampling process is deterministic, and because a prediction of x0 (which is what we ultimately want) is available at every step, the update can be run over a shorter subsequence of timesteps τ ⊂ [1..T], i.e. some of the T steps can be skipped.
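Here is a minimal PyTorch-style sketch of deterministic DDIM sampling (σ = 0) over such a subsequence; `alpha_bar`, `taus`, and `model(x, t)` are assumed names for this sketch, not taken from the post.

```python
import torch

@torch.no_grad()
def ddim_sample(model, shape, alpha_bar, taus):
    """Deterministic DDIM: x_{t-1} = √ᾱ_{t-1}·(predicted x0) + √(1-ᾱ_{t-1})·ε_θ(x_t, t)."""
    x = torch.randn(shape)                                      # start from pure noise
    for i, t in enumerate(taus):                                # taus: decreasing timestep indices
        t_prev = taus[i + 1] if i + 1 < len(taus) else None
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[t_prev] if t_prev is not None else torch.tensor(1.0)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)
        x0_pred = (x - (1 - ab_t).sqrt() * eps) / ab_t.sqrt()   # "predicted x0"
        x = ab_prev.sqrt() * x0_pred + (1 - ab_prev).sqrt() * eps
    return x
```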
Conclusion
We answered the questions we started with: how the MSE loss arises, why predicting noise enables image generation, how sampling can be made efficient, and how diffusion connects back to the VAE. Many additional papers have come out since this initial set, but this post focuses on the fundamentals and therefore ends here.
References
Denoising Diffusion Probabilistic Models
The integration of the Gaussian distribution over the given range is necessary because x0 is discrete. The alternative is to treat x0 as continuous and compute the value of the Gaussian density at x0. Both approaches lead to an MSE though, since the integral is approximated in DDPM.
Diffusion Models Beat GANs on Image Synthesis
Denoising Diffusion Implicit Models
https://dzdata.medium.com/intro-to-diffusion-model-part-3-5d699e5f0714
https://lilianweng.github.io/posts/2021-07-11-diffusion-models/#nice
https://medium.com/better-programming/diffusion-models-ddpms-ddims-and-classifier-free-guidance-e07b297b2869
https://mbernste.github.io/posts/vae/