Wait, is it obvious that they are worse for faithfulness than the normal CoT approach?
Yes, the outputs are no longer produced sequentially over the sequence dimension, so we definitely don’t have causal faithfulness along the sequence dimension.
But the outputs are produced sequentially along the time dimension. Plausibly by monitoring along that dimension we can get a window into the model’s thinking.
What’s more, I think you actually get a guarantee of strict causal faithfulness in the diffusion case, unlike in the normal autoregressive paradigm. The reason is that in the normal paradigm there is a causal path from the inputs to the final outputs via the model’s activations which doesn’t go through the intermediate produced tokens. Whereas (if I understand correctly)[1] with diffusion models we throw away the model’s activations after each timestep and just feed in the token sequence, so the only way for the model’s thinking at earlier timesteps to affect later timesteps is via the token sequence. Any of the standard measures of chain-of-thought faithfulness that corrupt the intermediate reasoning steps and check whether the final answer changes will in fact say that the diffusion model is faithful (if applied along the time dimension).
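To make that intervention concrete, here is a toy sketch (Python; the made-up pure-function denoiser stands in for the real model) of why time-dimension causal faithfulness holds by construction when the token sequence is the only state carried between timesteps:

```python
def toy_denoiser(tokens):
    # Stand-in for one diffusion timestep. Crucially it is a pure
    # function of the token sequence: activations are discarded
    # between steps, so nothing else carries information forward.
    return [t + 1 for t in tokens]

def run(tokens, steps):
    # Roll the denoiser forward, recording the sequence at each timestep.
    trajectory = [tokens]
    for _ in range(steps):
        tokens = toy_denoiser(tokens)
        trajectory.append(tokens)
    return trajectory

clean = run([0, 0, 0], steps=4)

# Corrupt the intermediate sequence at timestep 2, then resume for the
# remaining 2 steps. Because the token sequence is the only inter-step
# state, any change to it must flow through to the final output.
resumed = run([9, 9, 9], steps=2)
assert resumed[-1] != clean[-1]  # the "final answer" changed
```

This is of course only the causal half of the story; it says nothing about whether the intermediate sequences are legible.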
Of course, this causal notion of faithfulness is necessary but not sufficient for what we really care about, since the intermediate outputs could still fail to make the model’s thinking comprehensible to us. Is there strong reason to think the diffusion setup is worse in that sense?
ETA: maybe “diffusion models + paraphrasing the sequence after each timestep” is promising for getting both causal faithfulness and non-encoded reasoning? By default this probably breaks the diffusion model, but perhaps it could be made to work.
[1] Based on skimming this paper and assuming it’s representative.
One reason they might be worse is that chain of thought may make less sense for diffusion models than for autoregressive models. If you look at an example of which tokens are predicted when during sampling (from the linked LLaDA paper), the answer tokens are predicted about halfway through instead of at the end.
This doesn’t mean intermediate tokens can’t help, and they very likely do. But this kind of structure might lend itself to arriving at less legible reasoning faster than autoregressive models do.
It does seem likely that this is less legible by default, although we’d need to look at complete examples of how the sequence changes across time to get a clear sense. Unfortunately I can’t see any in the paper.
There are examples for other diffusion models, see this comment.
On diffusion models + paraphrasing the sequence after each timestep: I’m not sure this would actually break the diffusion model. The current generation of diffusion models (at least the paper you cited, and Mercury; who knows about Gemini) act basically like masked LMs: they predict all of the masked tokens at each step, and some are then re-masked to produce the “diffusion process”. I bet there’s a workable sampling strategy in there of sample, paraphrase, arbitrarily re-mask; rinse and repeat. But I’m not sure either.
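A toy version of that loop (Python; `toy_denoiser` and `paraphrase` are placeholders I'm inventing for illustration, not anything from the papers):

```python
import random

MASK = "<mask>"

def toy_denoiser(tokens, vocab):
    # Stand-in for the masked-LM step: predict every masked position.
    return [random.choice(vocab) if t == MASK else t for t in tokens]

def paraphrase(tokens):
    # Hypothetical paraphrase pass over the current sequence
    # (identity here; a real one would rewrite the unmasked text).
    return list(tokens)

def sample(seq_len=8, steps=4, vocab=("a", "b", "c"), seed=0):
    random.seed(seed)
    x = [MASK] * seq_len
    for step in range(steps):
        x = toy_denoiser(x, vocab)      # sample
        x = paraphrase(x)               # paraphrase
        keep = (step + 1) / steps       # unmasked fraction grows linearly
        n_remask = int((1 - keep) * seq_len)
        for i in random.sample(range(seq_len), n_remask):
            x[i] = MASK                 # arbitrarily re-mask
    return x                            # final step re-masks nothing
```

Whether a real model tolerates paraphrased context at every step is exactly the open question; the point is only that the re-masking schedule leaves an obvious place to slot the paraphrase in.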