If I’m reading this correctly, some of the data in there is from RL runs, so I dunno.
Yeah, it’s a mishmash of all the runs, including on-policy and RL runs.
So, I can imagine an interpretation of your (non-RL) results that’s like, “off-policy training after RL typically degrades instruction-following, and when you observe transfer, that’s just a particular manifestation of this more general trend.” Do you have a sense of how plausible this explanation is?
I wouldn’t expect this to be true for LR=5e-5 (where, of course, we see many instances of transfer), at least based on our IFEval box plot. I do agree that some of the high-LR runs are worse than ideal, though. If it’s helpful, I think I’ll add a graphic showing us trying to train IFEval capabilities back in while preserving the alternate behavior rate, similar to what we did for Olympiads (it’s pretty low-cost to do). That said, I think the capability degradation box plot clearly shows that transfer is quite doable without invoking the hypothesis you mention.
After further consideration, we decided to make the capability condition more stringent (including a new IF requirement) and have thus edited the post. Thanks for your comment!