Yeah, the thing they aren’t is transformers.
EDIT: I stand corrected. I tended to think of diffusion models as necessarily a classic stack of convolutional layers, but that's just the common choice for images, and there's no reason not to use a transformer instead. So I realise the two things are decoupled: what makes a diffusion model is its training objective, not its architecture.
Fair, you can use that architecture just fine instead of simple NNs. It's really a distinction between your choice of universal function approximator and the objective you optimise it against, I guess.
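To make that split concrete, here's a minimal sketch of a DDPM-style noise-prediction objective. The names (`ddpm_loss`, the toy linear beta schedule, the dummy model) are all illustrative, not from any particular library — the point is just that `model` is an opaque callable, so the loss doesn't care whether it's a U-Net, an MLP, or a transformer:

```python
import numpy as np

rng = np.random.default_rng(0)

def ddpm_loss(model, x0, t, alphas_cumprod):
    """DDPM-style training objective: predict the noise added at step t.

    `model` can be any function approximator (conv net, MLP, transformer);
    the objective is what makes this "diffusion", not the architecture.
    """
    noise = rng.standard_normal(x0.shape)
    a_bar = alphas_cumprod[t]  # cumulative product of the noise schedule
    # forward process: interpolate clean data toward pure noise
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise
    pred = model(x_t, t)       # network tries to recover the injected noise
    return np.mean((pred - noise) ** 2)  # simple MSE on epsilon

# toy linear schedule and a dummy "model" that always predicts zero noise
betas = np.linspace(1e-4, 0.02, 100)
alphas_cumprod = np.cumprod(1.0 - betas)
dummy_model = lambda x_t, t: np.zeros_like(x_t)

x0 = rng.standard_normal((4, 8))  # a tiny batch of fake data
loss = ddpm_loss(dummy_model, x0, t=50, alphas_cumprod=alphas_cumprod)
print(loss)  # roughly mean(noise**2), since the dummy predicts zero
```

Swapping `dummy_model` for a transformer that eats `x_t` as a token sequence changes nothing about the loss itself, which is the decoupling being described above.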