I disagree. In practice, diffusion models are autoregressive when generating non-trivial amounts of text. A better way to think about diffusion models is as a generalization of multi-token prediction (similar to how DeepSeek does it), where the number of tokens you get to predict in one shot is controllable and steerable. If you use a diffusion model over a larger generation, you will end up running it autoregressively, and in the limit you could make it work like a normal one-token-at-a-time LLM, or go all the way up to one big batch of N tokens at a time.
Point is they’re still LLMs.
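To make the block-size knob concrete, here is a minimal sketch of that decoding loop. Everything here is hypothetical (`denoise_block` stands in for whatever one denoising pass of the model returns); the point is only the control flow: each block conditions on all previously generated tokens, so across blocks the process is autoregressive.

```python
def generate(denoise_block, prompt, total_len, block_size):
    """Generate `total_len` new tokens by repeatedly predicting blocks.

    block_size=1          -> ordinary one-token-at-a-time LLM decoding
    block_size=total_len  -> one big batch of N tokens in one shot
    anything in between   -> multi-token prediction, run autoregressively
    """
    out = list(prompt)
    while len(out) - len(prompt) < total_len:
        # never predict past the requested length on the final block
        n = min(block_size, total_len - (len(out) - len(prompt)))
        # each call sees everything generated so far, so across
        # blocks the overall process is autoregressive
        out.extend(denoise_block(out, n))
    return out
```

Setting `block_size` to 1 or to `total_len` recovers the two limits described above; intermediate values give the DeepSeek-style multi-token regime.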