Oliver Daniels comments on Advice for making robust-to-training model organisms

Oliver Daniels 29 May 2026 12:02 UTC
1 point
0
I meant on the red-team side. Sorry, I should have been clearer: by on-policy prompt distillation I meant on-policy distillation where the teacher is just the model with the prompt (rather than generating responses with the prompt and then SFT-ing on those responses without the prompt, which is “on-model” but off-policy).
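To make the distinction concrete, here is a toy sketch. It assumes one common formulation of on-policy distillation (samples drawn from the student, loss is a Monte Carlo estimate of the reverse KL against the teacher), which may not match the exact setup anyone in this thread would use. The "model" is a single next-token distribution over three tokens, and `TEACHER_LOGITS`, `STUDENT_LOGITS`, and both function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


# Toy stand-ins (assumptions for illustration only):
TEACHER_LOGITS = [2.0, 0.5, -1.0]  # the model WITH the prompt
STUDENT_LOGITS = [0.0, 0.0, 0.0]   # the same model WITHOUT the prompt


def off_policy_sft_loss(student_logits, teacher_logits, n_samples, rng):
    """'On-model' but off-policy: tokens are sampled from the prompted
    teacher, and the student is scored with cross-entropy on them."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    tokens = rng.choices(range(len(p_teacher)), weights=p_teacher, k=n_samples)
    return -sum(math.log(p_student[t]) for t in tokens) / n_samples


def on_policy_distill_loss(student_logits, teacher_logits, n_samples, rng):
    """On-policy: tokens are sampled from the student itself, and the loss
    is a Monte Carlo estimate of KL(student || teacher)."""
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    tokens = rng.choices(range(len(p_student)), weights=p_student, k=n_samples)
    return sum(
        math.log(p_student[t]) - math.log(p_teacher[t]) for t in tokens
    ) / n_samples


rng = random.Random(0)
sft = off_policy_sft_loss(STUDENT_LOGITS, TEACHER_LOGITS, 10_000, rng)
opd = on_policy_distill_loss(STUDENT_LOGITS, TEACHER_LOGITS, 10_000, rng)
print(f"off-policy SFT loss: {sft:.3f}, on-policy distillation loss: {opd:.3f}")
```

The only difference between the two functions is whose samples are scored: the teacher's in the SFT case, the student's in the on-policy case. In a real run both losses would be backpropagated through the student's parameters, which this sketch omits.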
- SebastianP 29 May 2026 21:21 UTC
  1 point
  0
  Oh interesting! Why do you think this would make it more robust?
  - Oliver Daniels 31 May 2026 3:41 UTC
    1 point
    0
    vague sense that “on-policy” is better lol
    
    seems like on-policy learning
    a) limits catastrophic forgetting of prior tasks, and
    b) instills the behavior more deeply (i.e. induces better generalization) than off-policy training.
    (see RL’s Razor, Self-Distillation Enabled Continual Learning)
    neither of these provides direct evidence for
    c) on-policy training makes the behavior itself more robust to catastrophic forgetting from future training (e.g. pirate SFT)
    but still, given a) and b), c) seems pretty plausible.
    - SebastianP 2 Jun 2026 3:40 UTC
      1 point
      0
      Hmm, I see. I’m not sure I share the intuition that on-policy is super different from on-model (i.e. with a different prompt). For example, we didn’t see a difference here: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities
      
      But thanks for the suggestion! I think it’s worth a try; if I get some time this week I’ll run the experiment.