I think this is obvious enough that the big labs have probably already thought of it, but not everyone has, so it seems worth sharing here.
In addition to having tools like “lookup url”, “add to canvas”, and “google X”, reasoning models will get another tool: “do diffusion on the last X tokens”.
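As a rough sketch of the plumbing, the diffusion editor could sit in the same tool registry as everything else. All names here (`diffuse_last`, `stub_diffusion_edit`) are made up for illustration, and the stub bodies stand in for real model calls:

```python
from typing import Callable

def stub_diffusion_edit(prefix: list[str], window: list[str]) -> list[str]:
    """Stand-in for a text-diffusion model that re-denoises `window`
    conditioned on `prefix`. Here it just returns the window unchanged."""
    return window

def diffuse_last(tokens: list[str], n: int) -> list[str]:
    """Tool: rerun diffusion over the last `n` tokens, keeping the prefix fixed."""
    prefix, window = tokens[:-n], tokens[-n:]
    return prefix + stub_diffusion_edit(prefix, window)

# The diffusion editor is just another entry in the tool registry.
TOOLS: dict[str, Callable] = {
    "lookup_url": lambda url: f"<contents of {url}>",   # stub
    "google": lambda query: f"<results for {query}>",   # stub
    "diffuse_last": diffuse_last,
}

tokens = "the quick brwn fox jumps".split()
tokens = TOOLS["diffuse_last"](tokens, 3)  # model asks diffusion to rework the tail
```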
When you train the model, each such tool call gives you two text chains. Chain (1) keeps the text as the model originally wrote it, together with the decision to call the diffusion tool, so training on it teaches the model when to invoke diffusion. Chain (2) contains the text as enhanced by the diffusion model, with the tool call removed, so training on it teaches the model to produce the better text directly (the decision to use the diffusion model doesn’t exist in chain 2).
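A minimal sketch of how those two chains could be assembled from one trajectory, assuming a flat token list and a hypothetical `<call:diffuse_last>` marker:

```python
CALL = "<call:diffuse_last>"  # hypothetical tool-call marker

def make_training_chains(prefix: list[str],
                         window: list[str],
                         diffused: list[str]) -> tuple[list[str], list[str]]:
    """prefix: tokens before the edited window; window: the model's first
    attempt; diffused: the diffusion model's rewrite of that window."""
    # Chain (1): the original text plus the call marker, teaching the
    # model *when* to invoke the diffusion tool.
    chain1 = prefix + window + [CALL]
    # Chain (2): the enhanced text with no trace of the call, teaching
    # the model to produce the better text directly.
    chain2 = prefix + diffused
    return chain1, chain2

prefix = "The answer is".split()
window = ["fourty", "two"]    # first attempt, with a mistake
diffused = ["forty", "two"]   # diffusion model's correction
chain1, chain2 = make_training_chains(prefix, window, diffused)
```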
This gives the reasoning model a better way to deal with its mistakes: it can correct them in real time, and, if the example is used as training data, it can learn not to repeat them in the future.
It can also get another, forward-looking diffusion tool: “append 1000 extra random tokens and do diffusion on those, producing 4 different versions”, after which the standard model picks the version it likes best.
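A sketch of that best-of-N tool under the same stub assumptions; the scoring function stands in for however the autoregressive model would rank candidates (e.g. by log-likelihood):

```python
import random

def stub_diffuse_continuation(prefix: list[str], n_tokens: int) -> list[str]:
    """Stand-in: start from `n_tokens` of noise and denoise them into a
    continuation of `prefix`. Here it just emits placeholder tokens."""
    return [f"tok{random.randrange(100)}" for _ in range(n_tokens)]

def stub_score(prefix: list[str], candidate: list[str]) -> float:
    """Stand-in for the autoregressive model scoring a candidate,
    e.g. by log-likelihood. Here: a random number."""
    return random.random()

def diffuse_ahead(prefix: list[str],
                  n_tokens: int = 1000,
                  n_candidates: int = 4) -> list[str]:
    """Tool: draft several diffusion continuations, then let the
    standard model keep the one it scores highest."""
    candidates = [stub_diffuse_continuation(prefix, n_tokens)
                  for _ in range(n_candidates)]
    return max(candidates, key=lambda c: stub_score(prefix, c))

best = diffuse_ahead("Let us think step by step".split())
```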
Even if diffusion models don’t beat non-diffusion models in the general case, a hybrid where a thinking model can use both modes is likely to be better than either type of model on its own.