I meant on the red-team side. Sorry, I should have been clearer: by on-policy prompt distillation I meant on-policy distillation where the teacher is just the model with the prompt (rather than generating responses with the prompt and then SFT-ing on those responses without the prompt, which is “on-model” but off-policy).
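To make the distinction concrete, here is a toy sketch for a single token position. Everything in it is an assumption for illustration: the logits are made up, the function names are hypothetical, and the on-policy loss is written as a Monte Carlo reverse-KL estimate, which is one common way to set this up but not necessarily what any particular experiment uses.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a small list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, n, rng):
    # Draw n token ids from a categorical distribution.
    return rng.choices(range(len(probs)), weights=probs, k=n)

def off_policy_sft_loss(student_probs, teacher_samples):
    # "On-model" but off-policy: tokens come from the prompted model,
    # and the unprompted student is trained with plain cross-entropy on them.
    return -sum(math.log(student_probs[t]) for t in teacher_samples) / len(teacher_samples)

def on_policy_distill_loss(student_probs, teacher_probs, student_samples):
    # On-policy: tokens come from the student itself, and the prompted
    # teacher only scores them (estimate of KL(student || teacher)).
    return sum(
        math.log(student_probs[t]) - math.log(teacher_probs[t])
        for t in student_samples
    ) / len(student_samples)

rng = random.Random(0)
teacher_probs = softmax([2.0, 0.5, -1.0])  # assumed: model WITH the prompt
student_probs = softmax([0.0, 0.0, 0.0])   # assumed: same model WITHOUT the prompt

teacher_samples = sample(teacher_probs, 1000, rng)
student_samples = sample(student_probs, 1000, rng)

sft_loss = off_policy_sft_loss(student_probs, teacher_samples)
opd_loss = on_policy_distill_loss(student_probs, teacher_probs, student_samples)
print(f"off-policy SFT loss: {sft_loss:.3f}")
print(f"on-policy distillation loss: {opd_loss:.3f}")
```

The only difference between the two losses is whose samples get scored: the SFT loss never looks at what the student would actually generate, while the on-policy loss only ever trains on the student's own outputs.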
Oh interesting! Why do you think this would make it more robust?
vague sense that “on-policy” is better lol
seems like on-policy learning
a) limits catastrophic forgetting of prior tasks, and
b) instills the behavior more deeply (i.e. induces better generalization) than off-policy training.
(see RL’s Razor, Self-Distillation Enabled Continual Learning)
neither of these provides direct evidence for
c) on-policy training makes the behavior itself more robust to catastrophic forgetting from future training (e.g. pirate SFT)
but still, given a) and b), c) seems pretty plausible.
Hmm, I see. I’m not sure I share the intuition that on-policy is super different from on-model (i.e. with a different prompt). For example, we didn’t see a difference here: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities
But thanks for the suggestion! I think it’s worth a try; if I get some time this week I’ll run the experiment.