There are use-cases where fast and precise inference is vital: many agentic tasks (playing most MOBAs, solving a physical Rubik’s cube, and arguably most non-trivial physical tasks) require quick, effective, multi-step reasoning.
Yeah, diffusion LLMs could be important not for being better at predicting what action to take, but for hitting real-time latency constraints, because they intrinsically amortize their computation more cleanly over steps. This is part of why people were exploring diffusion models in RL: a regular bidirectional or unidirectional LLM tends to be all-or-nothing in terms of the forward pass, so even with the usual optimization tricks, it’s heavyweight. A diffusion model lets you stop in the middle of diffusing, use a diffusion step to improve other parts of the sequence, or pivot to a new output entirely.
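To make the “stop in the middle” property concrete, here is a minimal sketch of anytime decoding with a masked-diffusion-style LM. Everything here (the `denoise_step` logic, the confidence-based unmasking schedule, the stub model) is an illustrative assumption rather than any real library’s API; the point is just that each step refines the whole sequence, so cutting off at any wall-clock budget still yields a usable decode.

```python
import time
import torch

VOCAB, SEQ_LEN, MASK_ID = 32000, 64, 0  # toy sizes; MASK_ID is hypothetical

def denoise_step(model, tokens):
    """One refinement pass: predict all positions, then commit the few
    most confident predictions at still-masked slots; the rest stay
    masked for later passes."""
    logits = model(tokens)                     # (SEQ_LEN, VOCAB)
    conf, pred = logits.softmax(-1).max(-1)
    masked = tokens == MASK_ID
    conf = conf.masked_fill(~masked, -1.0)     # only compete among masked slots
    k = min(4, int(masked.sum()))              # commit a few tokens per pass
    keep = conf.topk(k).indices
    out = tokens.clone()
    out[keep] = pred[keep]
    return out

def anytime_decode(model, budget_ms):
    tokens = torch.full((SEQ_LEN,), MASK_ID)   # start fully masked
    deadline = time.monotonic() + budget_ms / 1000
    while time.monotonic() < deadline:         # stop whenever the clock says so
        tokens = denoise_step(model, tokens)
    return tokens                              # partially refined, but usable

# Stand-in for a real diffusion LM, just to make the sketch runnable:
model = lambda toks: torch.randn(SEQ_LEN, VOCAB)
print(anytime_decode(model, budget_ms=5))
```

An autoregressive model has no analogous knob: its forward pass either completes or it doesn’t, whereas here the quality–latency tradeoff is just where you stop the loop.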
In theory, a diffusion LLM can plan a sequence of future actions+states in addition to the token about to be executed, so each token can be the product of many diffusion steps begun long before it is needed. This lets a small, fast model make good use of ‘easy’ timesteps to refine its next action: it spends the spare compute refining its model of the future and what it ought to do next, so that at the next timestep the action is ‘already predicted’ (if things went according to plan). If something does go wrong, the existing sequence is still an efficient starting point compared to a blank slate, and the model can quickly update it to compensate. And this is quite natural, compared to trying to bolt on something like MoEs or speculative decoding.
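The ‘warm start after a surprise’ idea might look like the following sketch (building on the stubs above; `advance_plan` and the re-masking fraction are hypothetical choices, not an established algorithm): slide the plan window forward each timestep, and when reality diverges from the plan, re-mask only the near-future tokens so the next diffusion steps revise them while the rest of the stale plan serves as initialization.

```python
def advance_plan(plan, executed_ok, frac_remask=0.25):
    """Roll the plan buffer forward one action; on a surprise, re-open
    part of the near future for revision instead of restarting from scratch."""
    plan = torch.roll(plan, shifts=-1)   # slot 0 was just executed
    plan[-1] = MASK_ID                   # newly exposed far-future slot is unknown
    if not executed_ok:
        # Re-mask a fraction of the near-future tokens; the untouched
        # remainder of the old plan still warm-starts the refinement.
        n = int(len(plan) * frac_remask)
        plan[:n] = MASK_ID
    return plan
```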
So your robot diffusion LLM can be diffusing a big context of thousands of tokens, representing its plan and predicted environment observations over the next couple of seconds. Each timestep, it does a little more thinking, tweaking every token a little, and although each pass is only a few milliseconds of thinking by a small model, the accumulated refinement turns into a highly capable robot model’s output, with each action-token ready by the time it’s necessary. (Even if a token isn’t fully refined, at least it is there to be executed: a low-quality action choice is often better than blowing the deadline and doing some default action like a no-op.) You could do the same thing with a big classic GPT-style LLM, but the equivalent-quality forward pass might take 100ms, and now it’s not fast enough for good robotics (without spending a lot on expensive hardware or optimization).
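Put together, the control loop might look like this sketch (reusing `denoise_step` and `advance_plan` from above; the 50Hz rate and the `env` interface are hypothetical stand-ins): every tick, spend whatever time remains on refinement steps, then commit the next action token at the deadline, finished or not.

```python
CONTROL_HZ = 50                          # 20ms per action, a plausible robot rate

def control_loop(model, env):
    plan = torch.full((SEQ_LEN,), MASK_ID)
    while True:
        deadline = time.monotonic() + 1.0 / CONTROL_HZ
        # Refine the whole plan until ~1ms before the deadline.
        while time.monotonic() < deadline - 1e-3:
            plan = denoise_step(model, plan)
        action = plan[0].item()          # commit whatever is there: a rough
        ok = env.step(action)            # action usually beats a missed deadline
        plan = advance_plan(plan, executed_ok=ok)
```

Note that the deadline never moves: a slower model just gets fewer refinement steps per tick, degrading action quality gracefully instead of stalling the controller.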