Question for people with insider knowledge of how labs train frontier models: Is it more common to do alignment training as the last step of training, or RL as the last step of training?
Edit: I’m mainly referring to on-policy RL, i.e. the type of RL used to induce new capabilities like coding / reasoning / math / tool use. I’m excluding RLHF because I think it’s pretty disanalogous (though I also welcome disagreement / takes on this point).
Naively, I’d expect alignment training to come last. But my sense is that RL usually comes last instead. Why is that? Is it because RL-trained capabilities are too brittle to survive subsequent finetuning?
We’re getting fairly close to the point where I would pretty strongly advise against having alignment training as the last stage of your training pipeline, due to goal crystallization / alignment faking concerns. FWIU from the open-weight literature, RLHF comes after RLVR, but there is a fair bit of variation in training practices among the open-weight models in general.
Why not both? I imagine you could average the gradients so that you learn both at the same time.
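For concreteness, here is a minimal sketch of what “averaging the gradients” could look like, assuming a PyTorch-style training loop. `rl_loss_fn`, `alignment_loss_fn`, and the batch arguments are hypothetical placeholders standing in for an RLVR-style capabilities objective and an alignment objective; this is illustrative, not any lab’s actual pipeline.

```python
# A minimal sketch of mixing gradients from two objectives, assuming a
# PyTorch-style setup. rl_loss_fn and alignment_loss_fn are hypothetical
# placeholders, each computed on its own batch.
import torch


def mixed_update(model, optimizer, rl_batch, align_batch,
                 rl_loss_fn, alignment_loss_fn, mix=0.5):
    params = list(model.parameters())
    optimizer.zero_grad()

    # Gradients from the capabilities (RL) objective.
    rl_loss = rl_loss_fn(model, rl_batch)
    rl_grads = torch.autograd.grad(rl_loss, params)

    # Gradients from the alignment objective (separate forward pass).
    align_loss = alignment_loss_fn(model, align_batch)
    align_grads = torch.autograd.grad(align_loss, params)

    # Convexly mix the two gradient directions and take a single step.
    for p, g_rl, g_al in zip(params, rl_grads, align_grads):
        p.grad = mix * g_rl + (1.0 - mix) * g_al

    optimizer.step()
    return rl_loss.item(), align_loss.item()
```

A similar effect could presumably also be approximated by interleaving batches from the two objectives across steps rather than literally averaging per-step gradients.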
It definitely could be. That isn’t the sense I get, but I’m happy to be proven wrong here.