This seems like it’s only a big deal if we expect diffusion language models to scale at a pace comparable or better to more traditional autoregressive language transformers, which seems non-obvious to me.
Right now my distribution over possible scaling behaviors is pretty wide, I’m interested to hear more from people.
My understanding was that diffusion refers to a training objective, and isn’t tied to a specific architecture. For example, OpenAI’s Sora is described as a diffusion transformer. Do you mean you expect diffusion transformers to scale worse than autoregressive transformers? Or do you mean you don’t think this model is a transformer in terms of architecture?
Oops, I wrote that without fully thinking about diffusion models. I meant to contrast diffusion LMs to more traditional autoregressive language transformers, yes. Thanks for the correction, I’ll clarify my original comment.
If it’s trained from scratch, and they release details, then it’s one data point for diffusion LLM scaling. But if it’s distilled, then it’s zero points of scaling data.
Because we’re not interested in scaling that is distilled from a larger parent model: it doesn’t push the frontier, since it doesn’t help you get the next, larger parent model.
Apple also has LLM diffusion papers, with code. It seems like diffusion might be helpful for alignment and interp, because it would have a more interpretable and manipulable latent space.
Why would we expect that to be the case? (If the answer is in the Apple paper, just point me there)
Oh, it’s not explicitly in the paper, but in Apple’s version they have an encoder/decoder with an explicit latent space. That space should be much easier to work with and to steer than the hidden states we have in transformers.
With an explicit and nicely behaved latent space, we would have a much better chance of finding a predictive “truth” neuron, where intervening on it reveals deception 99% of the time, even out of sample. Right now, mechinterp research achieves much less, partly because transformers have quite confusing activation spaces (attention sinks, suppressed neurons, etc.).
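To make the hoped-for payoff concrete, here is a minimal sketch of fitting a linear “truth” probe (the direction-valued generalization of a single truth neuron) and intervening along it, assuming we really had a well-behaved per-paragraph latent. Everything here is an assumption for illustration: `encode`/`decode`, the training latents, and the honesty labels are hypothetical, not anything from the Apple paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_truth_direction(latents, labels):
    """Fit a linear probe on paragraph latents labeled honest (1) vs. deceptive (0)."""
    probe = LogisticRegression(max_iter=1000).fit(latents, labels)
    direction = probe.coef_[0]
    return direction / np.linalg.norm(direction), probe

def steer_toward_truth(z, direction, alpha=3.0):
    """Nudge a single latent vector along the 'honest' direction before decoding."""
    return z + alpha * direction

# Hypothetical usage (encode/decode stand in for the diffusion LM's encoder/decoder):
# direction, probe = fit_truth_direction(train_latents, train_labels)
# z = encode(paragraph)
# p_honest = probe.predict_proba(z.reshape(1, -1))[0, 1]
# steered_paragraph = decode(steer_toward_truth(z, direction))
```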
I think what you’re saying is that because the output of the encoder is a semantic embedding vector per paragraph, that results in a coherent latent space that probably has nice algebraic properties (in the same sense that eg the Word2Vec embedding space does). Is that a good representation?
That does seem intuitively plausible, although I could also imagine that there might have to be some messy subspaces for meta-level information, maybe eg ‘I’m answering in language X, with tone Y, to a user with inferred properties Z’. I’m looking forward to seeing some concrete interpretability work on these models.
Yes, that’s exactly what I mean! If we have Word2Vec-like properties, steering and interpretability would be much easier and more reliable. And I do think it’s a promising research direction, though not a certain one.
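As a toy illustration of the kind of algebraic structure we’re hoping carries over from word vectors to paragraph latents (the vectors and vocabulary here are placeholders, not from any real model):

```python
import numpy as np

def nearest(query, vocab, exclude=()):
    """Return the key in `vocab` whose vector is most cosine-similar to `query`."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((k for k in vocab if k not in exclude), key=lambda k: cos(vocab[k], query))

# The classic Word2Vec-style analogy test, which we'd love paragraph latents to pass too:
# vocab = {"king": v1, "queen": v2, "man": v3, "woman": v4, ...}   # embedding vectors
# target = vocab["king"] - vocab["man"] + vocab["woman"]
# nearest(target, vocab, exclude={"king", "man", "woman"})         # ideally "queen"
```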
Facebook also built an interesting tokenizer that lets LLMs operate in a much richer embedding space: https://github.com/facebookresearch/blt. They embed patches of bytes, split by entropy/surprise. So it might be another way to test the hypothesis that a better embedding space would provide nice Word2Vec-like properties.
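My rough mental model of the BLT splitting rule, as a simplified sketch; the threshold value and the small byte-LM supplying `next_byte_dists` are stand-ins, not their actual implementation:

```python
import math

def entropy(probs):
    """Shannon entropy (bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def split_into_patches(byte_seq, next_byte_dists, threshold=2.5):
    """Cut a byte stream into patches wherever the predicted entropy spikes.

    `next_byte_dists[i]` is a small byte-LM's distribution over byte i given the prefix;
    both it and `threshold` are stand-ins for whatever BLT actually uses.
    """
    patches, current = [], []
    for b, dist in zip(byte_seq, next_byte_dists):
        if current and entropy(dist) > threshold:
            patches.append(bytes(current))   # surprise spike: close the patch, start a new one
            current = []
        current.append(b)
    if current:
        patches.append(bytes(current))
    return patches
```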
If Gemini Diffusion is distilled from a bigger LLM, then it’s still useful, because a similar result is obtained with less compute. Consider o3 and o4-mini, where the latter is only a little less powerful and far cheaper. And that’s ignoring the possibility of amplifying Gemini Diffusion and then re-distilling it, obtaining GemDiff^2, etc. If this IDA process turns out to be far cheaper than the one for autoregressive LLMs, then we get a severe increase in capabilities per unit of compute...
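For reference, the distill half of that amplify-and-distill loop usually amounts to matching the student’s output distribution to the amplified teacher’s. A generic sketch, nothing Gemini-specific:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Match the student's next-token distribution to the (amplified) teacher's."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student), scaled by t^2 as in standard distillation.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * t * t

# One IDA-style round (pseudocode names, not a real API):
# amplified = amplify(student)                 # e.g. more denoising steps, or search
# loss = distillation_loss(student(batch), amplified(batch).detach())
```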
Good point! And it’s plausible because diffusion seems to provide more supervision and get better results in generative vision models, so it’s a candidate for scaling.
There are some use-cases where quick and precise inference is vital: for example, many agentic tasks (like playing most MOBAs or solving a physical Rubik’s cube; debatably most non-trivial physical tasks) require quick, effective, and multi-step reasoning.
Current LLMs can’t do many of these tasks for a multitude of reasons; one of those reasons is the time it takes to generate responses, especially with chain-of-thought reasoning. A diffusion-based LLM could actually respond to novel events quickly, using a superbly detailed chain of thought, on only ‘commodity’ and therefore cheaper hardware (no wafer-scale chips or other exotic hardware, just GPUs).
If non-trivial physical tasks (like automatically collecting and doing laundry) require detailed CoTs (somewhat probable, 60%), and these tasks are very economically relevant (this seems highly probable to me, 80%), then the economic case for training diffusion LLMs only requires them to have near-comparable scaling to traditional autoregressive LLMs; the economic use cases for fast inference would more than justify the higher training requirements (~48% for the conjunction).
Yeah, diffusion LLMs could be important not for being better at predicting what action to take, but for hitting real-time latency constraints, because they intrinsically amortize their computation more cleanly over steps. This is part of why people were exploring diffusion models in RL: a regular bidirectional or unidirectional LLM tends to be all-or-nothing, in terms of the forward pass, so even if you are doing the usual optimization tricks, it’s heavyweight. A diffusion model lets you stop in the middle of the diffusing, or use that diffusion step to improve other parts, or pivot to a new output entirely.
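A minimal sketch of that “stop in the middle of the diffusing” property, with `denoise_step` as a hypothetical stand-in for one refinement pass of a diffusion LM:

```python
import time

def anytime_decode(x_noisy, denoise_step, deadline_s, max_steps=64):
    """Run denoising steps until a wall-clock deadline and return whatever we have.

    `denoise_step(x, step)` is a stand-in for one refinement pass of a diffusion LM.
    An autoregressive model has no comparable way to hand back a partial-but-usable
    answer if you cut it off mid-generation.
    """
    start = time.monotonic()
    x = x_noisy
    for step in range(max_steps):
        if time.monotonic() - start > deadline_s:
            break                      # out of time: the current draft is still usable
        x = denoise_step(x, step)      # each step sharpens the whole sequence a bit
    return x
```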
A diffusion LLM in theory can do something like plan a sequence of future actions+states in addition to the token about to be executed, and so each token can be the result of a bunch of diffusion steps from a long time ago. This allows a small fast model to make good use of ‘easy’ timesteps to refine its next action: it just spends the compute to keep refining its model of the future and what it ought to do next, so at the next timestep, the action is ‘already predicted’ (if things were going according to plan). If something goes wrong, then the existing sequence may still be an efficient starting point compared to a blank slate, and quickly update to compensate. And this is quite natural compared to trying to bolt on something to do with MoEs or speculative decoding or something.
So your robot diffusion LLM can be diffusing a big context of thousands of tokens, which represents its plan and predicted environment observations over the next couple of seconds. At each timestep it does a little more thinking to tweak each token a little bit, and despite this being only a few milliseconds of thinking each time by a small model, it eventually turns into a highly capable robot model’s output, with each action-token ready by the time it’s needed (and even if it’s not fully done, at least it is there to be executed: a low-quality action choice is often better than blowing the deadline and doing some default action like a no-op). You could do the same thing with a big classic GPT-style LLM, but the equivalent-quality forward pass might take 100ms, and now it’s not fast enough for good robotics (without spending a lot of time on expensive hardware or optimization).
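Putting that picture together, the control loop might look roughly like the sketch below; every callable in it is a hypothetical stand-in, so read it as a diagram of the scheme rather than how any real robot stack is wired up.

```python
def receding_horizon_loop(refine, observe, execute, plan, pad, ticks=1000, budget_ms=5.0):
    """Rolling-plan control with a diffusion LM; every callable is a hypothetical stand-in.

    refine(plan, obs, budget_ms) -> plan : a few milliseconds of denoising over the window
    observe() -> obs                     : latest sensor reading
    execute(action)                      : act in the world
    pad() -> token                       : blank/noise token appended as the window slides
    """
    for _ in range(ticks):
        obs = observe()
        # Even if reality diverged from the plan, the stale plan is a warm start
        # for refinement rather than a blank slate.
        plan = refine(plan, obs, budget_ms)
        action, plan = plan[0], plan[1:] + [pad()]   # slide the window forward one step
        execute(action)   # a rough-but-ready action beats blowing the real-time deadline
    return plan
```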
For sure. It might be nothing, or it might be everything.