We’re getting fairly close to the point that I would pretty strongly advise against having alignment training as the last stage of your training pipeline due to goal crystallization / alignment faking concerns. Fwiu from the open weight literature, RLHF comes after RLVR, but there is a fair bit of variation among the open weight models in training practices in general.
We’re getting fairly close to the point that I would pretty strongly advise against having alignment training as the last stage of your training pipeline due to goal crystallization / alignment faking concerns. Fwiu from the open weight literature, RLHF comes after RLVR, but there is a fair bit of variation among the open weight models in training practices in general.