if I’m reading it correctly, some of the data in there is from RL, so I dunno.
Yeah, it’s a mishmash of all the runs, on-policy and RL included.
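For concreteness, here’s a minimal sketch of what separating those out before plotting might look like, assuming the aggregated results live in one table with a per-run type tag (the file name and column names here are hypothetical, not our actual schema):

```python
import pandas as pd

# Hypothetical schema: one row per eval result, tagged with the kind
# of run it came from ("sft", "on_policy", "rl", ...).
runs = pd.read_csv("all_runs.csv")  # placeholder path

# Drop the on-policy and RL rows so the plot reflects only the
# off-policy runs the question is actually about.
off_policy_only = runs[~runs["run_type"].isin(["on_policy", "rl"])]

# Per-learning-rate summary of IFEval scores for the filtered runs.
print(off_policy_only.groupby("learning_rate")["ifeval_score"].describe())
```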
So, I can imagine an interpretation of your (non-RL) results that’s like, “off-policy training after RL typically degrades instruction-following, and when you observe transfer, that’s just a particular manifestation of this more general trend.” Do you have a sense of how plausible this explanation is?
I wouldn’t expect this to be true at LR=5e-5 (where we do see many instances of transfer), at least based on our IFEval box plot. I agree that some of the high-LR runs are worse than ideal, though. If it’s helpful, I can add a graphic showing us training IFEval capabilities back in while preserving the alternate behavior rate, similar to what we did for Olympiads (it’s pretty low-cost to do). That said, I think the capability-degradation box plot already shows that transfer is quite doable without invoking the hypothesis you mention.
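Roughly, the graphic I have in mind would just track both metrics across checkpoints of the restoration run, something like this sketch (`eval_ifeval` and `alt_behavior_rate` are hypothetical stand-ins for our actual eval harness, and checkpoints are assumed to be (step, model) pairs):

```python
import matplotlib.pyplot as plt

def plot_restoration(checkpoints, eval_ifeval, alt_behavior_rate):
    """Plot IFEval score and alternate-behavior rate across a
    'train back in' run.

    checkpoints: list of (step, model) pairs from the restoration
    fine-tune; eval_ifeval and alt_behavior_rate are callables that
    score a single model (placeholders for the real eval harness).
    """
    steps = [step for step, _ in checkpoints]
    ifeval = [eval_ifeval(model) for _, model in checkpoints]
    alt = [alt_behavior_rate(model) for _, model in checkpoints]

    fig, ax = plt.subplots()
    ax.plot(steps, ifeval, marker="o", label="IFEval score")
    ax.plot(steps, alt, marker="s", label="alternate behavior rate")
    ax.set_xlabel("restoration training step")
    ax.set_ylabel("metric value")
    ax.legend()
    fig.savefig("ifeval_restoration.png")
```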