if I’m reading it correctly, some of the data in there is from RL, so I dunno.
Yeah, it’s a mishmash of all the runs, on-policy and RL included.
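For concreteness, here’s a minimal sketch of what separating those out before plotting might look like, assuming the aggregated results live in one table with a per-run type tag (the file name and column names here are hypothetical, not our actual schema):

```python
import pandas as pd

# Hypothetical schema: one row per eval result, tagged with the kind
# of run it came from ("sft", "on_policy", "rl", ...).
runs = pd.read_csv("all_runs.csv")  # placeholder path

# Drop the on-policy and RL rows so the plot reflects only the
# off-policy runs the question is actually about.
off_policy_only = runs[~runs["run_type"].isin(["on_policy", "rl"])]

# Per-learning-rate summary of IFEval scores for the filtered runs.
print(off_policy_only.groupby("learning_rate")["ifeval_score"].describe())
```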
So, I can imagine an interpretation of your (non-RL) results that’s like, “off-policy training after RL typically degrades instruction-following, and when you observe transfer, that’s just a particular manifestation of this more general trend.” Do you have a sense of how plausible this explanation is?
I wouldn’t expect this to be true at LR=5e-5 (where we do see many instances of transfer), at least based on our IFEval box plot. I agree that some of the high-LR runs are worse than ideal, though. If it’s helpful, I can add a graphic showing us training IFEval capabilities back in while preserving the alternate behavior rate, similar to what we did for Olympiads (it’s pretty low-cost to do). That said, I think the capability-degradation box plot already shows that transfer is quite doable without invoking the hypothesis you mention.
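Roughly, the graphic I have in mind would just track both metrics across checkpoints of the restoration run, something like this sketch (`eval_ifeval` and `alt_behavior_rate` are hypothetical stand-ins for our actual eval harness, and checkpoints are assumed to be (step, model) pairs):

```python
import matplotlib.pyplot as plt

def plot_restoration(checkpoints, eval_ifeval, alt_behavior_rate):
    """Plot IFEval score and alternate-behavior rate across a
    'train back in' run.

    checkpoints: list of (step, model) pairs from the restoration
    fine-tune; eval_ifeval and alt_behavior_rate are callables that
    score a single model (placeholders for the real eval harness).
    """
    steps = [step for step, _ in checkpoints]
    ifeval = [eval_ifeval(model) for _, model in checkpoints]
    alt = [alt_behavior_rate(model) for _, model in checkpoints]

    fig, ax = plt.subplots()
    ax.plot(steps, ifeval, marker="o", label="IFEval score")
    ax.plot(steps, alt, marker="s", label="alternate behavior rate")
    ax.set_xlabel("restoration training step")
    ax.set_ylabel("metric value")
    ax.legend()
    fig.savefig("ifeval_restoration.png")
```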