it’s hard to find definitive information about a basic aspect of how modern RL on LLMs works:
are there any particularly clever ways of doing credit assignment for the tokens in a sequence S that resulted in high reward?
moreover, if you adopt the naive strategy of treating all of the tokens as equally responsible for the reward, is the resulting gradient update to the model parameters mathematically equivalent to the one you’d get from SFTing the model on S (possibly weighted by the reward, and possibly adjusted by GRPO)?
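to pin down what i mean by “equivalent”, here’s a minimal sketch of the two losses being compared — reward-weighted cross-entropy on S versus REINFORCE with uniform per-token credit, plus the GRPO-style group-normalized weight. this is my own sketch of the textbook formulation, not any particular lab’s RL stack, and all function names are made up:

```python
import torch
import torch.nn.functional as F

# sketch of the textbook formulation, not any production RL stack.
# logits: [T, V] next-token logits for sequence S; tokens: [T] the sampled token ids.

def reward_weighted_sft_loss(logits, tokens, reward):
    # plain SFT cross-entropy on S, scaled by a scalar reward
    nll = F.cross_entropy(logits, tokens, reduction="sum")  # -sum_t log pi(s_t | s_<t)
    return reward * nll

def uniform_credit_reinforce_loss(logits, tokens, reward, baseline=0.0):
    # REINFORCE where every token gets the same weight (reward - baseline)
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)  # log pi(s_t | s_<t)
    return -(reward - baseline) * token_logp.sum()

def grpo_style_advantages(rewards, eps=1e-6):
    # GRPO-style weight: replace the raw reward with a group-normalized advantage.
    # rewards: [G] rewards for G samples of the same prompt
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + eps)
```

with baseline = 0 the two losses above are literally the same number, so for a single on-policy sample the gradients coincide; the question is whether real implementations add things (importance ratios, clipping, KL penalties, per-token advantages) that break the equivalence.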
the followup is this: the linked paper claims that SFT’d models perform badly at something that RL’d models handle fine. i can’t imagine what the difference between the two could even be, except that the RL’d models are trained on samples which are on-policy for them.
https://arxiv.org/pdf/2507.00432
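to make the on-policy point concrete, here’s the structural contrast i have in mind between the two training loops; every name here is a placeholder of mine, not anything from the paper:

```python
from typing import Callable, Iterable, Tuple
import torch

# hypothetical signatures, purely to illustrate where the training data comes from;
# none of this is the paper's setup or any particular library's API.

def sft_training(dataset: Iterable[Tuple[str, str]],
                 nll: Callable[[str, str], torch.Tensor],
                 apply_grad: Callable[[torch.Tensor], None]) -> None:
    # off-policy: sequences are fixed up front and came from someone else's policy
    for prompt, sequence in dataset:
        apply_grad(nll(prompt, sequence))

def rl_training(prompts: Iterable[str],
                sample: Callable[[str], str],
                reward_fn: Callable[[str, str], float],
                nll: Callable[[str, str], torch.Tensor],
                apply_grad: Callable[[torch.Tensor], None]) -> None:
    # on-policy: sequences are drawn from the current model, so the training
    # distribution shifts along with the parameters
    for prompt in prompts:
        sequence = sample(prompt)
        apply_grad(reward_fn(prompt, sequence) * nll(prompt, sequence))
```

the loop bodies are otherwise the same shape; the only difference is whether `sequence` comes from a frozen dataset or from the model currently being trained.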
followup: after looking at the appendix, i’m pretty sure the biggest distinction is that the SFT’d models in this paper are SFT’d on data that comes from entirely different models/datasets. so not only is the data not coming from a policy that adapts during training, it’s coming from policies very different from the model’s own. i think this by itself is enough to explain the paper’s results; it’s a useful result, but not the one i imagined from the title.
i would still like to know the answer to the original question.