Diffusion language models are probably bad for alignment and safety because there isn’t a clear way to get a (faithful) Chain-of-Thought from them. Even if you can get them to generate something that looks like a CoT, compared with autoregressive LMs, there is even less reason to believe that this CoT is load-bearing and being used in a human-like way.
Agreed, but I’d guess this also hits capabilities unless you have some clever diffusion+reasoning approach that recovers guarantees which aren’t wildly worse than normal CoT guarantees. (Unless you directly generate blocks of reasoning in diffusion neuralese or similar.)
That said, I’m surprised it gets 23% on AIME given this, so I think they must have found some reasoning strategy which works well enough in practice. I wonder how it solves these problems.
Based on how it appears to solve math problems, I’d guess the guarantees you get from looking at the CoT aren’t wildly worse than what you get from autoregressive models, but probably somewhat more confusing to analyze, and there might be a faster path to a particularly bad sort of neuralese. They show a video of it solving a math problem on the (desktop version of the) website; here is the final reasoning:
Note, the video doesn’t show up for me.
Why is this your intuition?
At the moment they seem to just make it imitate normal-ish CoT, which would presumably improve accuracy because the model has more token-positions/space/capacity to do things like check for self-consistency. You’re still scaling up a compute dimension that the model can use for solving things, and you can still do normal RL things to it from that point.
It’s just maybe worse in this case because the causality from reasoning chains → the part of the response containing the answer is worse (it was bad before, but now it is horrible).
Wait, is it obvious that they are worse for faithfulness than the normal CoT approach?
Yes, the outputs are no longer produced sequentially over the sequence dimension, so we definitely don’t have causal faithfulness along the sequence dimension.
But the outputs are produced sequentially along the time dimension. Plausibly by monitoring along that dimension we can get a window into the model’s thinking.
What’s more, I think you actually guarantee strict causal faithfulness in the diffusion case, unlike in the normal autoregressive paradigm. The reason being that in the normal paradigm there is a causal path from the inputs to the final outputs via the model’s activations which doesn’t go through the intermediate produced tokens. Whereas (if I understand correctly)[1] with diffusion models we throw away the model’s activations after each timestep and just feed in the token sequence. So the only way for the model’s thinking at earlier timesteps to affect later timesteps is via the token sequence. Any of the standard measures of chain-of-thought faithfulness that look like corrupting the intermediate reasoning steps and seeing whether the final answer changes will in fact say that the diffusion model is faithful (if applied along the time dimension).
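To make the causal claim concrete, here’s a minimal Python sketch of how I understand the masked-diffusion sampling loop (the function names, the greedy fill-in of masked positions, the linear remasking schedule, and the toy model are my own illustrative assumptions, not LLaDA’s exact procedure). The point is that the only state crossing a timestep boundary is the token sequence itself, which is what makes corrupting an intermediate sequence a well-defined intervention along the time dimension:

```python
import random

MASK = "<mask>"

def denoise_step(model, tokens):
    """One timestep: the model predicts a token for every position; we fill only the masked ones."""
    preds = model(tokens)  # assumed interface: list of predicted tokens, one per position
    return [p if t == MASK else t for t, p in zip(tokens, preds)]

def remask(tokens, fraction, prompt_len):
    """Re-mask a random fraction of the generated positions (the 'diffusion' part)."""
    out = list(tokens)
    gen_positions = list(range(prompt_len, len(tokens)))
    for i in random.sample(gen_positions, int(fraction * len(gen_positions))):
        out[i] = MASK
    return out

def sample(model, prompt, gen_len=16, steps=8):
    tokens = prompt + [MASK] * gen_len
    for step in range(steps):
        tokens = denoise_step(model, tokens)
        # The only state carried forward is `tokens`: no activations or KV cache
        # survive the timestep, so corrupting the sequence here is a clean causal
        # intervention on everything the model does at later timesteps.
        if step < steps - 1:
            tokens = remask(tokens, fraction=1 - (step + 1) / steps, prompt_len=len(prompt))
    return tokens

# Toy stand-in for the network, just to show the plumbing.
dummy_model = lambda toks: ["x"] * len(toks)
print(sample(dummy_model, prompt=["Solve:", "1+1="]))
```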
Of course, this causal notion of faithfulness is necessary but not sufficient for what we really care about, since the intermediate outputs could still fail to make the model’s thinking comprehensible to us. Is there strong reason to think the diffusion setup is worse in that sense?
ETA: maybe “diffusion models + paraphrasing the sequence after each timestep” is promising for getting both causal faithfulness and non-encoded reasoning? By default this probably breaks the diffusion model, but perhaps it could be made to work.
[1] Based on skimming this paper and assuming it’s representative.
One reason why they might be worse is that chain of thought might make less sense for diffusion models than for autoregressive models. If you look at an example of when different tokens are predicted during sampling (from the linked LLaDA paper), the answer tokens are predicted about halfway through instead of at the end:
This doesn’t mean intermediate tokens can’t help, though; they very likely do. But this kind of structure might lend itself to reaching less legible reasoning faster than autoregressive models do.
It does seem likely that this is less legible by default, although we’d need to look at complete examples of how the sequence changes across time to get a clear sense. Unfortunately I can’t see any in the paper.
There are examples for other diffusion models; see this comment.
On diffusion models + paraphrasing the sequence after each timestep: I’m not sure this will actually break the diffusion model. With the current generation of diffusion models (at least the paper you cited and Mercury; who knows about Gemini), they act basically like Masked LMs.
So they guess all of the masked tokens at each step. (Some are re-masked to get the “diffusion process”). I bet there’s a sampling strategy in there of sample, paraphrase, arbitrarily remask; rinse and repeat. But I’m not sure either.
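For concreteness, here’s a rough Python sketch of the loop I’m imagining (pure speculation: the remasking schedule is made up and `paraphrase` is a placeholder for whatever rewriting model you’d plug in, so don’t read this as how LLaDA or Mercury actually sample):

```python
import random

MASK = "<mask>"

def sample_with_paraphrase(model, paraphrase, prompt, gen_len=16, steps=8):
    """Masked-diffusion sampling with a paraphrase pass after each denoising step.
    `model` maps a token list to one predicted token per position; `paraphrase`
    rewrites the generated portion. Both are placeholders."""
    tokens = prompt + [MASK] * gen_len
    for step in range(steps):
        # 1. Sample: fill every currently-masked position.
        preds = model(tokens)
        tokens = [p if t == MASK else t for t, p in zip(tokens, preds)]

        # 2. Paraphrase: rewrite the generated text, scrambling any encoding
        #    that relies on exact token choices surviving between timesteps.
        tokens = prompt + paraphrase(tokens[len(prompt):])

        # 3. Arbitrarily remask a shrinking fraction; rinse and repeat.
        if step < steps - 1:
            frac = 1 - (step + 1) / steps
            gen_idx = list(range(len(prompt), len(tokens)))
            for i in random.sample(gen_idx, int(frac * len(gen_idx))):
                tokens[i] = MASK
    return tokens

# Toy usage with an identity paraphraser, just to show where the pass slots in.
toy_model = lambda toks: ["x"] * len(toks)
identity_paraphrase = lambda toks: list(toks)
print(sample_with_paraphrase(toy_model, identity_paraphrase, prompt=["Q:", "2+2="]))
```

The open question is whether the denoiser can still converge when the sequence keeps getting rewritten under it; that’s the part I’m not sure about.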
But maybe interpretability will be easier?
With LLMs we’re trying to extract high-level ideas/concepts that are implicit in the stream of tokens. It seems that with diffusion these high-level concepts should arise first, and thus might be easier to find?
(Disclaimer: I know next to nothing about diffusion models)