Yeah, the thing they aren’t is transformers.
EDIT: I stand corrected. I tended to think of diffusion models as necessarily a classic stack of convolution/neural-network layers, but that’s just the image-domain convention; there’s no reason not to use a transformer instead. So I realise the two things are decoupled: what makes a diffusion model is its training objective, not its architecture.
You can train transformers as diffusion models (example paper), and that’s presumably what Gemini diffusion is.
Fair, you can use the same architecture just fine instead of simple NNs. It’s really a distinction between your choice of universal function approximator and the objective you optimise it against, I guess.
the thing they aren’t is one-step cross-entropy. that’s it; everything else is presumably sampled from the same distribution as existing LLMs. (this is like if someone finally upgraded BERT to be a primary model.)
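To make that objective contrast concrete, here’s a toy numpy sketch. The `toy_logits` stand-in for a transformer, the mask token, and the corruption scheme are all my own illustrative assumptions (not anything Gemini Diffusion has published); the point is only the shape of the two losses: next-token cross-entropy in one step vs. cross-entropy over masked positions at a sampled noise level.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, MASK = 10, 10                      # toy vocab of 10 tokens plus a [MASK] id
seq = rng.integers(0, VOCAB, size=16)     # a toy token sequence

def toy_logits(tokens):
    """Stand-in for a transformer: deterministic pseudo-random logits per position."""
    r = np.random.default_rng(int.from_bytes(tokens.tobytes(), "little"))
    return r.standard_normal((len(tokens), VOCAB))

def xent(logits, targets):
    """Mean cross-entropy of integer targets under softmax(logits)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# Autoregressive objective: one step, predict token i+1 from the prefix.
ar_loss = xent(toy_logits(seq[:-1]), seq[1:])

# Masked-diffusion-style objective: sample a noise level t, corrupt that
# fraction of positions to [MASK], and predict the originals at the masked
# positions in parallel.
t = rng.uniform(0.1, 0.9)                 # noise level for this training step
masked = rng.random(len(seq)) < t         # which positions get corrupted
if not masked.any():
    masked[0] = True                      # ensure at least one position is masked
corrupted = np.where(masked, MASK, seq)
diff_loss = xent(toy_logits(corrupted)[masked], seq[masked])
```

Both losses are ordinary cross-entropies over the same vocabulary; the diffusion variant just averages over noise levels and predicts many positions at once instead of the single next token.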