seems like on-policy learning a) limits catastrophic forgetting of prior tasks, and b) instills the behavior more deeply (i.e. induces better generalization) than off-policy training (see RL’s Razor, Self-Distillation Enabled Continual Learning).
neither of these provides direct evidence for c) on-policy training makes the behavior itself more robust to catastrophic forgetting from future training (e.g. pirate SFT), but still, given a) and b), c) seems pretty plausible.
vague sense that “on-policy” is better lol
Hmm I see. I’m not sure I share the intuition that on-policy is super different from on-model (i.e. with a different prompt). For example, we didn’t see a difference here: https://www.lesswrong.com/posts/9toz5YHYZsHpzy4ce/why-does-off-model-sft-degrade-capabilities
But thanks for the suggestion, I think it’s worth a try. If I get some time this week I’ll run the experiment!