Question for people with insider knowledge of how labs train frontier models: Is it more common to do alignment training as the last step of training, or RL as the last step of training?
Edit: I’m mainly referring to on-policy RL, i.e. the type of RL used to induce new capabilities like coding / reasoning / math / tool use. I’m excluding RLHF because I think it’s pretty disanalogous (though I also welcome disagreement / takes on this point).
Naively, I’d expect alignment training to come last. But my sense is that RL usually comes last instead. Why is that? Is it because RL-trained capabilities are too brittle to survive subsequent finetuning?
We’re getting fairly close to the point where I would pretty strongly advise against having alignment training as the last stage of your training pipeline, due to goal crystallization / alignment faking concerns. FWIU from the open-weight literature, RLHF comes after RLVR, but there is a fair bit of variation in training practices among the open-weight models in general.
Why not both? I imagine you could average the gradients so that you learn both at the same time.
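For concreteness, here is a minimal sketch of what “averaging the gradients” could look like, assuming a PyTorch-style training loop. `rl_loss_fn`, `alignment_loss_fn`, and the batch arguments are hypothetical placeholders standing in for an RLVR-style capabilities objective and an alignment objective; this is illustrative, not any lab’s actual pipeline.

```python
# A minimal sketch of mixing gradients from two objectives, assuming a
# PyTorch-style setup. rl_loss_fn and alignment_loss_fn are hypothetical
# placeholders, each computed on its own batch.
import torch


def mixed_update(model, optimizer, rl_batch, align_batch,
                 rl_loss_fn, alignment_loss_fn, mix=0.5):
    params = list(model.parameters())
    optimizer.zero_grad()

    # Gradients from the capabilities (RL) objective.
    rl_loss = rl_loss_fn(model, rl_batch)
    rl_grads = torch.autograd.grad(rl_loss, params)

    # Gradients from the alignment objective (separate forward pass).
    align_loss = alignment_loss_fn(model, align_batch)
    align_grads = torch.autograd.grad(align_loss, params)

    # Convexly mix the two gradient directions and take a single step.
    for p, g_rl, g_al in zip(params, rl_grads, align_grads):
        p.grad = mix * g_rl + (1.0 - mix) * g_al

    optimizer.step()
    return rl_loss.item(), align_loss.item()
```

A similar effect could presumably also be approximated by interleaving batches from the two objectives across steps rather than literally averaging per-step gradients.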
It definitely could be. That isn’t the sense I get, but I’m happy to be proven wrong here.