In the Thinking Machines blog post about on-policy distillation, they discuss an experiment where SFTing Qwen3-32B on a dataset of its own outputs caused the IFEval score to degrade. They attributed this to the fact that the training procedure is slightly off-policy at the outset of training due to SGD noise, and then increasingly off-policy thereafter because the model is changing but the dataset isn’t.
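A toy sketch of that dynamic (my own illustration, not from the blog post; the distributions and drift weights are made up): if the SFT dataset is frozen at the snapshot distribution `p0` while the trained policy keeps moving, the KL divergence between the current policy and the data-generating policy grows step by step, i.e. training becomes increasingly off-policy.

```python
# Toy illustration: a categorical "policy" drifts during training while
# the self-distillation dataset stays fixed at the snapshot distribution
# p0, so the data grows increasingly off-policy over time.
import math

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

# p0: the distribution the self-distillation dataset was sampled from.
p0 = normalize([1.0, 1.0, 1.0, 1.0])

# Simulate training drift: each step multiplicatively nudges the policy
# toward one mode (a stand-in for SGD noise plus systematic movement).
p = list(p0)
divergences = []
for step in range(5):
    p = normalize([w * g for w, g in zip(p, [1.3, 1.0, 0.9, 0.8])])
    divergences.append(kl(p, p0))

# The mismatch between the current policy and the fixed dataset grows
# monotonically: each SFT step is more off-policy than the last.
assert all(a < b for a, b in zip(divergences, divergences[1:]))
```

On-policy distillation avoids this by sampling fresh rollouts from the current student each step, so the training distribution never lags the policy this way.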
If I understand correctly, this is pretty similar to your self-distillation experiments. So, I can imagine an interpretation of your (non-RL) results that’s like, “off-policy training after RL typically degrades instruction-following, and when you observe transfer, that’s just a particular manifestation of this more general trend.” Do you have a sense of how plausible this explanation is?
I have not yet read every word of this very interesting but lengthy post, so I apologize if you already addressed this point somewhere. Obviously this hypothesis can’t explain any observed transfer in RL, though in some sense it “explains,” or seems hand-wavily consistent with, the fact that transfer was less common with RL than it was with off-policy training.
I guess the IFEval results you report in the capability degradation section might be informative for this question, but I’m not sure how… it looks like you do observe degradation on average, but the trend is noisy, and also if I’m reading correctly some of the data in there is RL, so I dunno. (There is a spectrum from “the model is worse at IF in lots of cases, and the transfer examples are similar to the model’s IF errors in various other contexts”—this is what is predicted by the hypothesis I’m talking about—to “the model is mostly unchanged at IF and the transfer examples would be really surprising if you’d only seen its behavior on other inputs,” and it’s not clear where your models are on this spectrum, even having seen the IFEval plot.)
OTOH I don’t think this hypothesis is ruled out by the observation that training on instruction-following data didn’t inhibit transfer, since the IF data is still off-policy. Like, presumably the self-distillation SFT data also contained plenty of examples of instruction-following—and in that case they were “approximately initially on-policy,” which is the closest to on-policy one can ever get in SFT—and yet even training on that stuff degrades general instruction-following in Thinking Machines’ experiment and degrades one particular manifestation of instruction-following in your self-distillation experiments. It seems intuitively plausible, if not certain, that alpaca/openorca is basically “like that, except worse” (also contains IF, but is more off-policy).
if I’m reading correctly some of the data in there is RL, so I dunno.
Yeah, it’s a mishmash of all the runs, including on-policy and RL runs.
So, I can imagine an interpretation of your (non-RL) results that’s like, “off-policy training after RL typically degrades instruction-following, and when you observe transfer, that’s just a particular manifestation of this more general trend.” Do you have a sense of how plausible this explanation is?
I wouldn’t expect this to be true for LR=5e-5 (where, of course, we see many instances of transfer), at least based on our IFEval box plot. I do agree that some of the high-LR runs are worse than ideal, though. If it’s helpful, I think I’ll add a graphic showing us trying to train back in IFEval capabilities while preserving the alternate behavior rate, similar to Olympiads (it’s pretty low-cost to do). I think the capability degradation box plot clearly shows that transfer is quite doable without the hypothesis you mention, however.