followup: after looking at the appendix i’m pretty sure the biggest distinction is that the SFT’d models in this paper are SFT’d on data that comes from entirely different models/datasets.
so not only is the data not coming from a policy which adapts during training, it’s coming from policies very different from the model’s own.
i think this by itself is enough to explain the results of the paper. it’s a useful result, but not the one i imagined upon reading the title.
i would still like to know the answer to the original question.