shawnghu comments on shawnghu’s Shortform

shawnghu 21 Sep 2025 23:18 UTC
1 point
followup: after looking at the appendix, i’m pretty sure the biggest distinction is that the SFT’d models in this paper are SFT’d on data from entirely different models/datasets. so not only is the data not coming from a policy that adapts during training, it’s coming from policies very different from the model’s own. i think this by itself is enough to explain the paper’s results. it’s a useful result, but not the one i imagined upon reading the title.

i would still like to know the answer to the original question.