SebastianP comments on Advice for making robust-to-training model organisms

SebastianP 2 Jun 2026 3:40 UTC
1 point
Hmm, I see. I’m not sure I share the intuition that on-policy is super different from on-model (i.e. with a different prompt). For example, we didn’t see a difference here: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities

But thanks for the suggestion! I think it’s worth a try; if I get some time this week, I’ll run the experiment.