For an example of the kind of capability diffusion models offer that autoregressive LLMs don’t: you aren’t limited to predicting the tokens that come after a piece of text; you can natively edit somewhere in the middle.
That should be helpful for arithmetic operations like addition and multiplication, where the big-endian convention of Arabic numerals forces carry operations to work backwards relative to the writing order. Of course, there are also transformer models that use a different masking approach during training and thus don’t impose forwards-only autoregressive generation, or one could just train a model to do arithmetic in a little-endian numerical notation and then reverse the answer (sketched below).
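To make the little-endian trick concrete, here is a toy Python sketch (not anything a real model would run): when digits are written least-significant first, the carry travels in the same direction the digits are emitted, so a strictly left-to-right generator never has to go back and revise earlier output.

```python
def add_little_endian(a: str, b: str) -> str:
    """Add two numbers whose digits are written least-significant first.

    Illustration only: in this representation the carry propagates in the
    same direction the digits are generated, so each output digit depends
    only on digits already produced.
    """
    digits_a = [int(d) for d in a]
    digits_b = [int(d) for d in b]
    result, carry = [], 0
    for i in range(max(len(digits_a), len(digits_b))):
        da = digits_a[i] if i < len(digits_a) else 0
        db = digits_b[i] if i < len(digits_b) else 0
        carry, digit = divmod(da + db + carry, 10)
        result.append(str(digit))
    if carry:
        result.append(str(carry))
    return "".join(result)


# 957 + 86 = 1043; in little-endian notation: "759" + "68" -> "3401"
assert add_little_endian("759", "68") == "3401"
print(int(add_little_endian("759", "68")[::-1]))  # reverse back: 1043
```

A diffusion model, by contrast, could work on the digits in ordinary big-endian order and simply revise the higher digits as carries emerge.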
Similarly, it seems plausible that this approach would avoid the reversal curse.