Oliver Daniels comments on Advice for making robust-to-training model organisms

Oliver Daniels 31 May 2026 3:41 UTC
1 point
vague sense that “on-policy” is better lol

seems like on-policy learning
a) limits catastrophic forgetting of prior tasks, and
b) instills the behavior more deeply (i.e. induces better generalization) than off-policy training.
(see RL’s Razor, Self-Distillation Enabled Continual Learning)
neither of these provides direct evidence for
c) on-policy training makes the behavior itself more robust to catastrophic forgetting from future training (e.g. pirate SFT)
but still, given a) and b), c) seems pretty plausible.
- SebastianP 2 Jun 2026 3:40 UTC
  1 point
  Hmm, I see. I’m not sure I share the intuition that on-policy is super different from on-model (i.e. generated by the same model, but with a different prompt). For example, we didn’t see a difference here: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities
  
  But thanks for the suggestion, I think it’s worth a try. If I get some time this week, I’ll run the experiment!