Text diffusion models are still LLMs, just not autoregressive.
Is this just a semantic quibble, or are you saying there are fundamental similarities between them that are relevant?
I'm not tailcalled, but yeah, the fact that it is (contains?) a transformer does make it pretty similar architecturally. Autoregressive transformers predict one output (e.g. a token) at a time, whereas lots of transformers (like some translation models) are sequence-to-sequence: they take in a whole passage and output a whole passage.
There are differences, but IIRC it's mostly that non-autoregressive transformers have some extra parts autoregressive ones don't need. Lots of overlap, though. More like a different breed than a different species.
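To make the distinction concrete, here's a minimal sketch of the two decoding styles; `model`, `encoder`, and `decoder` are hypothetical placeholders, not any particular library's API:

```python
# Minimal sketch; `model`, `encoder`, and `decoder` are hypothetical
# placeholders, not any particular library's API.

def autoregressive_generate(model, prompt_ids, max_new_tokens):
    """One output per step: each new token conditions on everything so far."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = model(ids)  # forward pass over the sequence so far
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        ids.append(next_id)  # feed the pick back in and repeat
    return ids

def seq2seq_generate(encoder, decoder, source_ids):
    """Whole passage in, whole passage out (schematically)."""
    memory = encoder(source_ids)  # read the entire input at once
    return decoder(memory)        # emit the entire output
```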
Diffusion LLMs and autoregressive LLMs seem like basically the same technology to me.
Agreed. I highly recommend this blog post (https://sander.ai/2024/09/02/spectral-autoregression.html) for concretely understanding why autoregressive and diffusion models are so similar, despite seeming so different.
I disagree. In practice, diffusion models are run autoregressively when generating non-trivial amounts of text. A better way to think about them is as a generalization of multi-token prediction (similar to how DeepSeek does it), where the number of tokens you get to predict in one shot is controllable and steerable. If you use a diffusion model over a longer generation, you end up running it autoregressively over blocks; in the limits, you could make it work like a normal one-token-at-a-time LLM, or go all the way up to one big batch of N tokens in a single shot.
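Roughly, block-wise decoding looks like this; `denoise` is a made-up stand-in for the model's iterative unmasking step, not anyone's real API. The knob is `block_size`: 1 behaves like a normal one-token-at-a-time LLM, N gives you one big batch in a single shot.

```python
# Sketch of block-wise (semi-autoregressive) decoding with a text
# diffusion model. `denoise` and MASK are hypothetical placeholders.

MASK = -1  # hypothetical mask-token id

def diffusion_generate(denoise, prompt_ids, n_tokens, block_size, steps=8):
    out = list(prompt_ids)
    remaining = n_tokens
    while remaining > 0:
        block = [MASK] * min(block_size, remaining)  # fresh, fully-masked block
        for _ in range(steps):
            block = denoise(out, block)  # fill in / revise, conditioned on `out`
        out.extend(block)        # commit the block: autoregressive over blocks,
        remaining -= len(block)  # parallel within a block
    return out
```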
Point is they’re still LLMs.
Yeah, the thing they aren’t is transformers.
EDIT: I stand corrected. I tended to think of diffusion models as necessarily a classic stack of convolution/neural-network layers, but obviously that's just the image case, and there's no reason not to use a transformer instead. So I realise the two things are decoupled: what makes a diffusion model is its training objective, not its architecture.
You can train transformers as diffusion models (example paper), and that’s presumably what Gemini diffusion is.
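For intuition, a training step under a masked-diffusion objective might look roughly like this. It's a PyTorch-style sketch following the discrete "absorbing state" recipe from the masked-diffusion literature; the exact loss weighting varies by paper, and no claim is made that this is Gemini Diffusion's actual setup:

```python
# Sketch of one training step for a transformer under a masked-diffusion
# objective. Edge cases (e.g. a batch with zero masked positions) are
# ignored for brevity.

import torch
import torch.nn.functional as F

def diffusion_train_step(transformer, tokens, mask_id, optimizer):
    # Sample a noise level t, then mask roughly that fraction of positions.
    t = torch.rand(())                                  # t ~ U(0, 1)
    corrupt = torch.rand(tokens.shape) < t              # which positions to mask
    noisy = torch.where(corrupt, torch.full_like(tokens, mask_id), tokens)

    # Same transformer architecture as an LLM, but used bidirectionally:
    # it sees the whole (partially masked) sequence at once.
    logits = transformer(noisy)                         # (batch, seq, vocab)

    # Cross-entropy only on the masked positions: recover the clean tokens.
    loss = F.cross_entropy(logits[corrupt], tokens[corrupt])
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```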
Fair, you can use the same architecture just fine instead of simple NNs. It's really a distinction between your choice of universal function approximator and the objective you optimise it against, I guess.
The thing they aren't is one-step cross-entropy. That's it; everything else is presumably sampled from the same distribution as existing LLMs. (This is like if someone finally upgraded BERT to be a primary model.)
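For contrast with the diffusion loss sketched above, the "one-step cross-entropy" objective being replaced is just next-token prediction with a causal mask (same PyTorch-style sketch conventions; not any specific codebase):

```python
# Standard autoregressive objective: one-step cross-entropy on the next
# token, where position i only attends to positions <= i.

import torch.nn.functional as F

def next_token_loss(transformer, tokens):
    logits = transformer(tokens[:, :-1])   # causal: predict token i+1 from <= i
    targets = tokens[:, 1:]                # shift targets by one position
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```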