From Thermodynamics to Sora: A Comprehensive Introduction to Denoising Diffusion for Video Generation

Video diffusion models have recently seen a sharp uptick in interest, both academically and in popular use. Despite this increase in impact, video diffusion models have received far less attention at the interpretability level than LLMs and other popular architectures. To boost understanding and lay a framework for model understanding, we offer a conceptual explanation of video diffusion models, as well as a mathematical framework in similar parlance to that used for LLMs, with an emphasis on autoregressive approaches. This blog post assumes basic familiarity with neural networks and machine learning.
Background and Intro
The goal of this work is to provide background on video diffusion models and to discuss, intuitively, what we might expect to find from interpretability studies. Understanding video diffusion models involves synthesizing information from a few different areas: generative diffusion (including specific techniques for diffusion-based video generation), transformer models, and model interpretability. The goal of this post is to provide an easily readable guide that takes someone from zero familiarity with diffusion to the point where they can reason about these models and their internal mechanisms. At the end, we'll pose some current open questions about a specific video diffusion architecture. The post is organized into three parts: an introductory conceptual section guiding the reader from basic diffusion all the way to video diffusion, a section detailing the kinds of neural networks used in this process, and a section exploring the internal mechanisms of these networks.
1. Making it Clear: Diffusion From Noise
Understanding Video Diffusion models begins with understanding the video diffusion process. Understanding the video diffusion process involves understanding the following things, in this order:
The purpose of diffusion as a generative method
The “basic” denoising diffusion objective, and how it’s achieved
Diffusion for image generation
Latent image diffusion: A more practical manner of image diffusion
Generating Images Efficiently: DDIM/DDPM
Generating Images from Text: Classifier and Classifier-Free Guidance
Latent video diffusion: Going from an image to a video
Video diffusion forcing: Producing long, high-quality videos
1. The Purpose of Generative Diffusion
The purpose of generative diffusion (and generative content overall) is to produce high-quality, diverse media, as specified by the user, in an automatic manner. If you've used a tool like Stable Diffusion, Veo 3, or Sora, you understand this use case: you start with a short description of what you want, whether in text or from an image you already have, and the tool returns something close to what you've described.
For any given prompt, we all understand there are an effectively infinite number of outputs that would satisfy your criteria: there are an almost infinite number of possible pictures or videos of cats. The goal of generative diffusion is to produce a wide variety of these while staying accurate to what you requested. It wouldn't be very fun to generate images or videos of cats if it was always the same cat, or if it produced random content most of the time instead of a cat. This requires an algorithmic process that incorporates both accuracy and diversity into creating the desired content. Secondarily, though still importantly, we want our algorithm to be reasonably fast. These are objectives and themes to keep in mind as you continue through the post.
Figure 1: A spiral distribution
2. Basic Denoising Diffusion
The origin of this use of diffusion, interestingly enough, has nothing to do with images or videos. It's actually a technique from statistical thermodynamics that's focused on something that seems, at first glance, pretty unrelated: mapping one statistical distribution to another. Consider, for example, mapping a common normal distribution, as encountered in statistics class, to the spiral distribution in figure 1. Though it seems far away from what we just discussed, it's important to dive into how it works. Consider how one might find a way to map these two distributions. One could study both closely and derive a function to map each point from the normal distribution to a corresponding one from the spiral distribution. This is a valid approach, but it can be pretty difficult mathematically, and it would only work for those specific distributions. If I wanted a checkerboard distribution instead, I'd have to work out an entirely new function.

Instead, it's easiest to task a neural network to do this for us, using a surprisingly simple and clever way of thinking about the task. The formulation of the problem is as follows: for a given point from our target distribution, repeatedly apply Gaussian noise (sample from the normal distribution and add it) to the point until it's so noisy that it's indistinguishable from a sample of the normal distribution. Since we know the exact noise that was applied, we know what the real point is as well, and we can task the neural network with helping to predict the original point from the noisy one (the neural network actually predicts the noise itself, which we then subtract). Depending on how much noise we applied to the point, it might take several steps to remove all of it, but eventually we recover a point from the target distribution. This is a mathematical task that would be tedious and difficult for a person, but is very tractable for a neural network. This task, called, fittingly, Denoising Diffusion, is formalized in figure 2:
Figure 2: A training and sampling algorithm for denoising diffusion
(Math warning) It's fairly straightforward. Training: given a sample $x_0$ with some amount of noise $\epsilon$ added (controlled by the schedule term $\bar{\alpha}_t$) and a timestep $t$ (which is just the number of noising steps applied), predict the noise that was added. Sampling, or taking noise and moving it to the new distribution, is also pretty straightforward: start with a noisy point $x_T$. Slightly denoise the point: the noise is estimated by our neural network $\epsilon_\theta(x_t, t)$, scaled by a factor depending on $\alpha_t$, and subtracted from $x_t$. (The few extra terms, $\sigma_t z$ and $\tfrac{1}{\sqrt{\alpha_t}}$, in Algorithm 2, line 4, will be discussed later.) Repeat until you have a point from the target distribution. Now you have an algorithm that can take a normal distribution and turn it into any other distribution.
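To make the two algorithms above concrete, here is a minimal sketch, in PyTorch-style Python, of one training step and the DDPM sampling loop. It assumes a simple linear noise schedule, and `denoise_net` is an illustrative stand-in for whatever noise-prediction network is used; none of these names come from a specific library.

```python
import torch

# Assumed: denoise_net(x_t, t) predicts the noise that was added to x_0.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # illustrative linear schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def training_step(denoise_net, x0):
    """One denoising-diffusion training step: noise a clean sample, regress the noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps          # forward (noising) process
    loss = ((denoise_net(x_t, t) - eps) ** 2).mean()      # predict the added noise
    return loss

@torch.no_grad()
def sample(denoise_net, shape):
    """DDPM sampling: start from pure noise and denoise step by step."""
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps_hat = denoise_net(x, torch.full((shape[0],), t))
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()     # Algorithm 2, line 4
        z = torch.randn_like(x) if t > 0 else 0.0
        x = mean + betas[t].sqrt() * z                     # the sigma_t * z re-noising term
    return x
```

The same two routines apply unchanged whether the "points" are 2D spiral samples or, later, image and video latents; only the shape of `x0` and the size of `denoise_net` change.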
3. Denoising Diffusion for Images
Now that we understand basic denoising diffusion, we can talk about how this might work for images. It's actually pretty similar: you take an image and apply noise to it until you get an image that looks like TV static (which, statistically, is just a sample from a high-dimensional normal distribution). You perform the exact same operation, just with a much larger neural network: for a given image, apply noise, and task the neural network with helping to predict the noise-free image (as before, it predicts the noise, which we remove). Now you have an algorithm that can take an image of pure static and create a new image. It sounds like we're almost at video diffusion already, but unfortunately, things get a little more complex at this stage, and there are a few more topics we'll have to introduce.
A visualization of what image noising looks like, from here
4. Latent Diffusion
This first topic is about efficiency. In the section above, we discussed how we can replicate the basic denoising diffusion process by noising an image directly and using a larger neural network to do the diffusion process. However, the neural network in question needs to be very large to do this directly "in image space", and that's undesirable for a number of reasons. For starters, it can be expensive and slow—remember, we have to run it several times to denoise a single image. There are additional mathematical reasons as well, which we'll cover later, but for now, a speedup is more than valuable enough of an objective to improve the process. The technique formulated to address the difficulties of performing diffusion directly in image space is called latent diffusion. It's so called because the image is mapped to a vector representation of itself called a latent, or latent vector. The mapping is typically done following a standard computer vision paradigm: before using our neural network, we run a convolutional (or similar) image downsampling module, which we call an encoder, to produce our latent. Then, similarly to the basic diffusion case, our neural network denoises the latent vector, which can be done with a much smaller network. At the end of the process, we use a decoder to upsample the clean latent into an image, reversing the chosen downsampling process (there are many options for how to downsample and upsample; none are too complex). A short code sketch of this encode-denoise-decode pipeline appears after the figure below.
A visualization of the encoder, latent vector, and decoder. The “bottleneck” in the middle is our latent vector. Original image is from here.
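As a rough sketch of how the pieces fit together, reusing the `sample` and `training_step` sketches from the previous block; the encoder/decoder here are assumed to be a pretrained VAE pair, and all names are illustrative rather than a specific library's API:

```python
import torch

def generate_image(decoder, denoise_net, latent_shape):
    """Sketch of latent diffusion sampling: denoise in latent space, then decode."""
    latent = sample(denoise_net, latent_shape)   # same denoising loop as before, on a small latent
    return decoder(latent)                       # upsample the clean latent into pixels

def latent_training_step(encoder, denoise_net, x0_image):
    """During training, images are encoded first; the diffusion objective is unchanged."""
    z0 = encoder(x0_image)                       # downsample image -> latent (often ~8x smaller per side)
    return training_step(denoise_net, z0)        # same noise-prediction loss, just on z0
```

The denoiser never sees pixels at all; it only ever operates on the compact latent, which is what makes the network so much smaller and cheaper to run repeatedly.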
5. Efficient Sampling
Now that we’ve changed from noising and denoising images directly, our image diffusion process requires much less computation per denoising step to get samples from noise. However, if possible, it could be nice to reduce the number of steps we need to begin with. Depending on the exact distributions, the process as described now can take 50 or more rounds of denoising to produce a clean datapoint or image. This technique is called DDPM (Denoising Diffusion Probabilistic Models), which is actually the process shown in figure 2. It consists of slightly re-noising an image after each denoising step, to ensure diverse samples from the target distribution. The (σtz) term mentioned earlier is that slight amount of noise added back to the latent. This process matches our desired outcomes: it produces results that lie on the target distribution (accurate), and ensures lots of different points are selected (diverse), though can takes numerous steps to be sampled, which could make it slow.
A faster sampling process, called DDIM (Denoising Diffusion Implicit Models), uses a process mathematically very similar to denoising with the score function of the distribution, which is (roughly) the gradient of the log-probability. The algorithm is also pretty straightforward: have the network predict the score function of the distribution, which points towards more likely regions of the distribution. Simply move the point in the direction of the score function, and the point becomes more likely under the target distribution. The precise details behind DDIM, and how we know the score function, are beyond the scope of this post, but the relevant element is that efficient sampling is done using the score function, which produces a vector that points towards our distribution.
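For the mathematically inclined, the connection between the noise-prediction network and the score function is the standard one from the diffusion literature, stated here without derivation:

$$\nabla_{x_t} \log p(x_t) \;\approx\; -\,\frac{\epsilon_\theta(x_t, t)}{\sqrt{1 - \bar{\alpha}_t}},$$

so a network trained to predict the noise implicitly predicts a scaled version of the score. The deterministic DDIM update then moves $x_t$ along this direction without re-injecting noise:

$$x_{t-1} \;=\; \sqrt{\bar{\alpha}_{t-1}}\,\underbrace{\frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t,t)}{\sqrt{\bar{\alpha}_t}}}_{\text{predicted } x_0} \;+\; \sqrt{1-\bar{\alpha}_{t-1}}\;\epsilon_\theta(x_t, t),$$

which is why DDIM can take far fewer, larger steps than DDPM while landing on the same distribution.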
6. Classifier-Free Guidance and Text Conditioning
So far, we’ve discussed diffusion for mapping one distribution to another, applying this to images, and using an image encoder/decoder to perform latent diffusion, and improvements to the sampling techniques using the score function. This can get us from one distribution (starting noise) to another (our images), which should yield quality images. But, recalling our initial discussion on the goal of generative modeling, we still haven’t specified a way to control what gets generated, other than that it will be similar to our initial distribution. Where we are now, if we wanted to generate cat images, we’d have to train a model exclusively on images of cats, and have a different neural network for each kind of thing we wanted to generate. That would be expensive and time consuming. So, we need to slightly modify our diffusion process so we can control what part of the target distribution our point ends up in. There are two things we need to do to enable this: include text to our process (so the user can specify what they want), and find a way to have that influence our denoising. The first thing is done by including text labels for our desired images. The second is via an additional to the sampling process called, fittingly, guidance. There are two kinds of guidance: classifier guidance, and, yes, classifier-free guidance.
Classifier guidance is pretty simple. Suppose you want to generate cats, dogs, birds, and giraffes. Firstly, to allow text descriptions, each image needs to be labeled with its contents (cat, dog, etc.). Now, in addition to the standard denoising diffusion training, you also train a classifier to predict the image's class from the (noisy) latent. When you sample, you can then request two score functions: one from the diffusion model and one from the classifier. Each score function tells us the direction to move the latent vector to increase, respectively, the likelihood under the data distribution and the likelihood of the desired class. If we move a bit in both directions, we simultaneously increase the odds of the point being in our distribution (the image will look good) and being what we want (an image of a cat, not a dog).
Classifier-free guidance performs the same function as classifier guidance: move towards our desired category and towards valid images. But what if we don't want to train two networks? And what if we don't want to be bound by specific categories? E.g., we want an image with both a cat and a dog. A clever way of doing this is to train our normal denoising model to do two things: generate latents conditionally (with our label, encoded as a second vector, given as an input to the neural network) and unconditionally (with an empty or null label in its place). You can think of the unconditional output as heading indiscriminately towards the data distribution, and the conditional output as heading towards the part of it we want. If we subtract the unconditional output from the conditional output and add that difference back, scaled by a guidance weight, it pushes the sample even further in the direction of the condition, strengthening how closely the result follows the prompt. The update is written out below.
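Written out (this is standard classifier-free guidance; $c$ is the conditioning, $\varnothing$ the null condition, and $w \ge 1$ the guidance weight):

$$\tilde{\epsilon}_\theta(x_t, c) \;=\; \epsilon_\theta(x_t, \varnothing) \;+\; w\,\bigl(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\bigr),$$

and this $\tilde{\epsilon}_\theta$ is simply used in place of $\epsilon_\theta$ in whichever sampler (DDPM or DDIM) we are running. With $w = 1$ it reduces to ordinary conditional sampling; larger $w$ pushes harder toward the condition.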
7. Video Diffusion
Let's summarize what we've talked about so far—which is quite a lot, so congratulations on making it this far. You're basically there.
Our present algorithm is trained by taking images, downsampling them into latent vectors using an encoder, noising those latents, and tasking a neural network with performing denoising diffusion on the latent vector so that it comes to resemble the original image's latent. The network moves the latent, via the score function, both towards the target distribution and towards the image's specific classification within that distribution, using classifier or classifier-free guidance. When the latent is properly denoised, we upsample it using the decoder. Once we've trained, sampling images looks like this: we take a text prompt from the user and begin with a noisy latent vector. We use guidance to move that noise towards the part of the distribution specified by the text prompt. Once we've sampled the latent vector, we upsample it using the decoder into our new image.
Now we have a manner of not only mapping noise and text (our inputs) to a target distribution (a new image), but of sampling efficiently, and even sampling towards a specific part of the distribution that we described. From this, getting to video diffusion needs only two modifications: changing the encoder and decoder to account for the additional information present in a video (relationships between video frames such as the flow of time, spatial relations, changing perspective, etc.), and making sure our neural network can handle multiple latent vectors at once (video frames are a series of images). Otherwise, the process is exactly the same: as long as we have a captioned set of videos as a dataset for our neural network, we can train our denoising diffusion model to produce videos—awesome!
We can create an algorithm that will accept text and noise, and generate a brand new video for us, using the latent denoising diffusion process we described.
8. Autoregressive History-Guided Video Diffusion (DRAFT)
Great—now, we can produce videos from noise. Shouldn’t we be done? You may have noticed the post didn’t end.
Well, some questions may have begun to pop up for you; there's a lot going on here. Could we just pop in noise and get a movie in a few minutes? What resolution of videos can we make? How long can they be? These are all great questions to be asking.
Practically speaking, the approach just described will be able to reliably generate a few frames of coherent video—maybe from a handful to a few hundred—at any given time. But people often watch videos at anywhere from 24 to 60 frames per second, and videos can be from several seconds to several thousand seconds long. So we need a way to repeatedly generate sets of frames if we ever want to make a video longer than a few seconds. This brings us to autoregressive video diffusion. "Autoregressive" is a very popular term that essentially means extrapolating a future extension of something from itself (e.g., the temperature decreased this hour, so I estimate it will continue to decrease over the next). What this means concretely for us is that our algorithm will take our current frames and generate a few more, conditioning on what it has produced so far; a rough sketch of this rollout loop follows.
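A minimal sketch of autoregressive video generation, assuming a denoiser that can accept previously generated ("history") latents as context. The names, the fixed chunk size, and `sample_chunk` (a variant of the earlier `sample` loop that also takes context and conditioning) are illustrative assumptions, not the API of any particular released model:

```python
import torch

@torch.no_grad()
def generate_video(denoise_net, decoder, prompt_emb, num_chunks, chunk_shape, context_len=4):
    """Autoregressive video diffusion: generate a chunk of frame latents,
    append it to the history, and condition the next chunk on that history."""
    history = []                                              # previously generated clean latents
    for _ in range(num_chunks):
        context = torch.cat(history[-context_len:], dim=0) if history else None
        new_latents = sample_chunk(denoise_net, chunk_shape,  # denoising loop as before,
                                   context=context,           # but conditioned on history
                                   condition=prompt_emb)
        history.append(new_latents)
    latents = torch.cat(history, dim=0)                       # full latent video, frame-major
    return decoder(latents)                                   # upsample latents into pixel frames
```

The key point is simply that each new chunk is sampled the same way as before, just with the previous frames available as additional conditioning.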
2. Video Diffusion Transformers and Variational Autoencoders (DRAFT)
This section answers two questions. First: how do you downsample an image (or video) into a latent? That's the job of the VAE, our image downsampler. Second: how do you make a neural network that can denoise those latents, and how does that network work? That's the job of the transformer—specifically, a history-guided diffusion forcing transformer.

So, how does one make a neural network that can understand these relationships? Given the vast number of excellently-written resources on the transformer architecture, I'll direct the reader to those for familiarization if needed: Formal Algorithms for Transformers, the GPT-2 paper, and the GPT-2 GitHub.
There are numerous variations of the transformer architecture. The specific architecture we’re choosing to study, from History-Guided Video Diffusion, is an autoregressive diffusion transformer with 3D attention, trained using diffusion forcing. We’ve covered diffusion forcing just before—the transformer is trained to denoise a sequence of latents, using information from previous clean latents. The rest of those terms we’ll cover now.
Diffusion transformers mostly operate like a standard transformer—they share the same architecture: embeddings (here, patches), some number of residual blocks, each containing an attention mechanism, normalization, and an MLP, and, after the residual blocks, a final output layer. There are a few differences, detailed below, between a standard transformer and the DiT we're studying, mainly in the contents of the initial elements of the residual stream and in the normalization. (Reference: Formal Algorithms for Transformers. Image from lesswrong post.)
The basic structure of a transformer, from this post
Despite the use of the name “3D attention”, the attention mechanism in our chosen video diffusion transformer is the standard transformer attention mechanism, operating on two sequences. The 3D refers to the use of 3D positional embeddings (RoPE) to allow the network to represent relationships spanning vertical, horizontal and temporal position.
The MLP in the diffusion transformer for our chosen architecture is a standard multilayer perceptron as used in various transformers.
Arguably the biggest difference between the diffusion transformer at hand and a standard LLM architecture is the use of zero-initialized adaptive layer normalization, or AdaLN-Zero. Whereas a typical transformer uses LayerNorm, which learns per-channel scale and bias parameters to standardize the distributions going into its attention and MLP sublayers, DiTs use AdaLN-Zero, which takes a conditioning input and includes a zero-initialized gate-scaling mechanism. This means each block initially acts as an identity function on the residual stream (the gate, e.g. gate_msa, starts at 0), and the network, in our formulation, receives input about the level of noise the latents presently have. This may drastically change the effect normalization has on the contents of the residual.
AdaLN-Zero and attention computation: a sketch of how AdaLN-Zero modulates the attention (and MLP) computation within a block follows.
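Here is a minimal sketch of a DiT-style block with AdaLN-Zero, following the pattern from the original DiT paper; the conditioning vector `c` would carry the noise-level (and any other) conditioning. Module names and dimensions are illustrative and simplified.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Sketch of a DiT block with AdaLN-Zero modulation (illustrative, simplified)."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Conditioning -> 6 modulation signals (shift/scale/gate for attention and for MLP).
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[-1].weight)   # "Zero": gates start at 0,
        nn.init.zeros_(self.adaLN[-1].bias)     # so each block starts as the identity.

    def forward(self, x, c):                    # x: [batch, tokens, dim], c: [batch, dim]
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = self.adaLN(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale_a.unsqueeze(1)) + shift_a.unsqueeze(1)
        x = x + gate_a.unsqueeze(1) * self.attn(h, h, h)[0]   # gated attention branch
        h = self.norm2(x) * (1 + scale_m.unsqueeze(1)) + shift_m.unsqueeze(1)
        x = x + gate_m.unsqueeze(1) * self.mlp(h)             # gated MLP branch
        return x
```

Note that the normalization itself has no learned affine parameters; all of the scaling, shifting, and gating is produced from the conditioning vector, which is what makes the block's behavior noise-level dependent.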
VAEs and the residual stream

The residual stream is arguably the defining element of a transformer. It's where all information in the network is stored, both as input to the transformer blocks and as the final model output. In a standard transformer, residual stream elements are initially taken directly from the embedding matrix and represent terms in the model's vocabulary. Here, at the start of the network, the elements are VAE encoder outputs—downsampled local pixel information from a patch of a frame. The network outputs those same VAE latents, enhanced by information the network adds during processing, and the VAE decoder takes the generated sequence of latents and upsamples them back into full video frame patches. Importantly, neither the VAE encoder nor the decoder has any MLP elements, only convolutions and down/upsampling, so the VAE does not introduce new information to the image; it just creates a compressed representation of it. (Add more VAE details? Convs + spatial and temporal downsampling; a rough shape sketch follows.)
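To make the bookkeeping concrete, here is a shape-level sketch of what a video VAE with spatial and temporal downsampling might do. The downsampling factors (8x spatial, 4x temporal) and channel count are illustrative assumptions, not the numbers from any particular model, and a single strided 3D convolution stands in for a real stack of conv blocks.

```python
import torch
import torch.nn as nn

SPATIAL_DOWN, TEMPORAL_DOWN, LATENT_CHANNELS = 8, 4, 16   # illustrative factors only

class ToyVideoVAE(nn.Module):
    """Shape-level sketch: video [B, 3, T, H, W] <-> latents [B, C, T/4, H/8, W/8]."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv3d(3, LATENT_CHANNELS,
                                 kernel_size=(TEMPORAL_DOWN, SPATIAL_DOWN, SPATIAL_DOWN),
                                 stride=(TEMPORAL_DOWN, SPATIAL_DOWN, SPATIAL_DOWN))
        self.decoder = nn.ConvTranspose3d(LATENT_CHANNELS, 3,
                                          kernel_size=(TEMPORAL_DOWN, SPATIAL_DOWN, SPATIAL_DOWN),
                                          stride=(TEMPORAL_DOWN, SPATIAL_DOWN, SPATIAL_DOWN))

    def encode(self, video):    # [B, 3, T, H, W] -> [B, C, T/4, H/8, W/8]
        return self.encoder(video)

    def decode(self, latents):  # [B, C, T/4, H/8, W/8] -> [B, 3, T, H, W]
        return self.decoder(latents)

video = torch.randn(1, 3, 16, 256, 256)
vae = ToyVideoVAE()
z = vae.encode(video)           # torch.Size([1, 16, 4, 32, 32])
recon = vae.decode(z)           # torch.Size([1, 3, 16, 256, 256])
```

The diffusion transformer only ever sees tensors shaped like `z`; everything pixel-sized lives on either side of the VAE.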
The diffusion forcing transformer is an autoregressive transformer model, in the same manner that a decoder-only LLM is (LLaMA, GPT-2, etc.). The notable difference, due to the diffusion forcing training technique, is that causality is only enforced from the context to the primary sequence, and the latents currently being generated may all affect each other. It's important to note that, at least in its present setup, the model we're discussing does not use a causal mask between generated tokens; the causality is between the context (previous frames) and the primary sequence (the frames being generated). A rough sketch of that masking pattern is below. (Reference: Diffusion Forcing, original paper.)
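A sketch of what such an attention pattern could look like under the interpretation described above: context tokens visible to everything, generated tokens fully visible to each other, and, as one possible choice, context tokens not attending to the tokens being generated. This is an illustration of the idea, not the exact mask of the model in question.

```python
import torch

def diffusion_forcing_mask(n_context, n_generated):
    """Boolean attention mask (True = may attend). Rows are queries, columns are keys.

    Context tokens attend only to context; generated tokens attend to all context
    and to every other generated token (no causal mask among generated tokens).
    """
    n = n_context + n_generated
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:n_context, :n_context] = True          # context -> context
    mask[n_context:, :n_context] = True          # generated -> context
    mask[n_context:, n_context:] = True          # generated <-> generated
    return mask

print(diffusion_forcing_mask(2, 3).int())
```

Contrast this with an LLM's strictly lower-triangular mask: the asymmetry here is between "history" and "frames in flight", not between every pair of positions.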
This should cover what we’re studying—an autoregressive diffusion transformer with 3D attention, trained using diffusion forcing.
3. A Mathematical Framework and Intuition for VDTs (DRAFT)
What do we think happens during this process?
So we now understand the general parameters of an algorithm capable of effectively modeling these kinds of relationships. Now we turn our attention to understanding/hypothesizing about how we expect these algorithms to perform these tasks.
We'll use the following model and notation to discuss the different transformer components. A transformer has the architecture denoted in this figure (see https://transformer-circuits.pub/2021/framework/index.html#high-level-architecture):
DiT blocks share the same basic form as standard transformer blocks: the input from the residual stream is normalized, fed to a multi-head attention operation, and added back to the residual (modulated by the gate); it is then normalized again, fed to a multi-layer perceptron, and again added back to the residual, modulated. As opposed to starting with elements of an embedding matrix, the residual stream starts with outputs from the VAE, which are the results of the VAE's video downsampling process. (Ref, github)
The components of a Diffusion Transformer block
Attention head breakdown and intuition—QK/OV
As mentioned, video diffusion transformer attention is the standard transformer attention mechanism. As such, attention circuits will be similar in form to those outlined in A Mathematical Framework for Transformer Circuits for token-based transformers: specifically, the existence of QK circuits (which determine which context token's information to copy) and OV circuits (which determine how that copied information changes the output). Copying information from one token to another (the QK task) remains similar; however, while a language transformer produces a logit distribution trained on an argmax-style next-token task, a DiT produces latents. As such, the OV circuit might behave differently, since its contribution to the residual stream is in service of a different task. (Reference: Attn Circuits.) The QK/OV decomposition is written out below.
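Concretely, in that framework's notation, a single attention head's contribution can be factored into the two circuits (this is the standard decomposition from that paper, restated here for residual-stream inputs $x$):

$$A = \operatorname{softmax}\!\bigl(x^\top W_Q^\top W_K\, x\bigr), \qquad h(x) = \bigl(A \otimes W_O W_V\bigr)\, x,$$

where the QK circuit $W_Q^\top W_K$ determines which positions attend to which (where information is moved from), and the OV circuit $W_O W_V$ determines what gets written into the residual stream once a source position is attended to. The decomposition itself carries over unchanged to the DiT; what changes is the meaning of the vectors being moved around.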
MLP intuition
Given that the task of a diffusion transformer is to conditionally denoise a given latent (token) to push it towards the data distribution, we can expect the MLPs to contain information about the desired condition and about the interaction between the existing latent's content and that condition, as well as geometric/visual information, and to perform the denoising operation itself (no other part of the transformer can do this task). The MLPs are responsible for adding any information not contained in the input sequence and the conditioning. This likely includes a general body of knowledge, the integration of interactions between tokens (movement, reflection, etc.), and perhaps the manipulation of geometric information.
This includes latent/semantic processing, temporal and spatial processing, and information stored for future inference (e.g., "it's a wing, but it's covered by a wall right now").
At the end, the residual stream must contain “what it is and how to display it” for the VAE decoder.
For the moment, we’re going to ignore the contributions of the AdaLN-Zero component, which is to adjust the residual to account for a specific noise level. That being said, how behavior changes conditionally based on noise certainly seems to be a phenomenon worth investigating. Hopefully this can be addressed later in the process.
In order to better understand the contributions of the various components of the video diffusion generation process, it might be beneficial to run experiments ablating various components of the model, such as omitting the diffusion transformer, running it for only one step, and running it without AdaLN, though it remains a question to what degree these can be ablated from the network. Here are five ablations we'll consider:
What do we expect to happen if the diffusion transformer isn't run? This is just running the input through the VAE. Given that the goal of the VAE is to upsample a latent into a video frame, I'd expect the same frames to be generated, perhaps with slight modification due to priors creating a causal bias; e.g., there could be a bias for movement in a particular direction, a bias for the frame's image to change tint, etc.
What do we expect to happen if the diffusion transformer is run only once? Given that this would result in only one ‘partial derivative’ step in the latent space, we’d likely expect a noisy, blurry, or conceptually weak/unrelated image.
What if we 'ablated' AdaLN-Zero by passing in dummy noise levels? It might just break the model, since it was not trained to run without AdaLN-Zero. If you put in an "average" noise level, it might make convergence odd. The likely result is just a noisy or blurry image, depending on how noise levels change during normal sampling. If one were to train a model without AdaLN-Zero from the start, I expect the output would be noisier and conceptually less precise.
What if we trained a version of the diffusion model with few layers? Famously, in A Mathematical Framework, a phase change in token-based transformer models was discovered by running the network with one and then two layers, namely, induction circuits. Induction circuits are a composition of two attention heads, one in an earlier layer than the other: a previous-token head and an induction head, which respectively recognize a token's earlier predecessor, and when that predecessor occurs again. This circuit allows transformers to reliably copy multi-token phrases, and its formation is associated with a visible decrease in loss. Notably, this functionality cannot form when there's only one layer of attention heads in the model.
Attention only? What might we expect to happen if we disabled the MLPs in the DiT blocks? This would mean only information copied between patches via attention could be used to create new frames.
Does an ‘induction circuit’ exist, or do other model circuits exist?
As mentioned, the induction circuit was a notable finding of early token-transformer interpretability work. Naturally, one might wonder whether there is a similarly central mechanism for video diffusion models, also associated with a notable increase in capability. If there were one, what might it look like? Notably, whereas a language transformer deals with what are initially discrete semantic units, the diffusion transformer we study takes inputs that are less directly defined and does not discretely select its outputs. That being said, depending on the kind of video in question, copying and moving concepts, as opposed to pixels, is still a core element of the model's functionality. Would we expect a pair of attention heads similar to the induction circuit, a mechanically different but conceptually similar functionality, or a completely different framework of information transfer? (Also interesting would be a similar mechanism with a different functionality.) What might determine this is the degree to which information crosses patch boundaries. (Reference: Mathematical Framework: Induction Heads)