One reason they might be worse is that chain of thought might make less sense for diffusion models than for autoregressive models. If you look at an example of when different tokens are predicted during sampling (from the linked LLaDA paper), the answer tokens are predicted about halfway through instead of at the end:
This doesn’t mean intermediate tokens can’t help, though, and they very likely do. But this kind of structure might push diffusion models toward less legible reasoning faster than autoregressive models get there.
It does seem likely that this is less legible by default, although we’d need to look at complete examples of how the sequence changes across time to get a clear sense. Unfortunately I can’t see any in the paper.
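For anyone wanting to poke at this themselves, here’s a rough sketch of the bookkeeping you’d need to see how a sequence changes across sampling steps: log the step at which each position gets committed in a masked-diffusion-style loop. The `fake_model` below is just a random-logit stand-in, and the confidence-based remasking rule is a simplification, not LLaDA’s actual sampler, so treat it as an illustration rather than the real thing:

```python
# Toy sketch: track when each position is first committed during
# low-confidence-style remasking. NOT LLaDA's implementation; the
# "model" is a stand-in that returns random logits.
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, VOCAB, STEPS = 16, 100, 8
MASK = -1

def fake_model(tokens):
    # Stand-in denoiser: a real model conditions on the partially
    # unmasked sequence; this one just returns random logits.
    return rng.normal(size=(SEQ_LEN, VOCAB))

tokens = np.full(SEQ_LEN, MASK)          # start fully masked
first_committed = np.full(SEQ_LEN, -1)   # step at which each position was filled

for step in range(STEPS):
    logits = fake_model(tokens)
    # Softmax over the vocabulary to get per-position confidences.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    pred = probs.argmax(axis=-1)
    conf = probs.max(axis=-1)

    # Commit the most confident still-masked positions this step,
    # spreading the remaining positions evenly over the remaining steps.
    masked = np.where(tokens == MASK)[0]
    k = int(np.ceil(len(masked) / (STEPS - step)))
    commit = masked[np.argsort(-conf[masked])[:k]]
    tokens[commit] = pred[commit]
    first_committed[commit] = step

# Positions with small values were filled in early; if "answer" positions
# show up early, the intermediate tokens arrive after the conclusion.
print("first committed at step:", first_committed)
```

Dumping `tokens` at each step instead of just the final array would give exactly the kind of across-time trace the paper doesn’t show.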
There are examples for other diffusion models; see this comment.