I think it depends on what kind of ‘alignment’ you’re referring to. Insofar as alignment is a behavioral property (not saying bad things to users, not being easily jailbreakable), I think our results weakly suggest that this kind of alignment would transfer and perhaps even become more robust.
One hypothesis is that pretrained models learn many ‘personas’ (including ‘misaligned’ ones) and post-training shapes/selects a desired persona. Maybe distilling the post-trained model would only, or primarily, transfer the selected persona and not the other ones. I don’t think we can draw conclusions yet, but it sounds like an interesting idea for further work! Though it would be expensive to distill a large post-trained model, it could be more tractable to find an open-source one and evaluate how its alignment properties compare to the teacher’s.
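Just to make the idea concrete, here’s a rough sketch of what plain logit distillation from a post-trained teacher could look like. It’s not the setup from our experiments: the model names are placeholders, it assumes PyTorch and Hugging Face transformers, and it assumes the teacher and student share a tokenizer/vocabulary.

```python
# Minimal logit-distillation sketch (not our experimental setup).
# Assumes teacher and student share a tokenizer and vocabulary.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

TEACHER_NAME = "org/post-trained-teacher"  # hypothetical checkpoint names
STUDENT_NAME = "org/small-student"

tokenizer = AutoTokenizer.from_pretrained(TEACHER_NAME)
teacher = AutoModelForCausalLM.from_pretrained(TEACHER_NAME).eval()
student = AutoModelForCausalLM.from_pretrained(STUDENT_NAME)

def distillation_loss(texts, temperature=2.0):
    """Forward KL between teacher and student next-token distributions."""
    # Assumes the tokenizer has a pad token; ignores masking of pad
    # positions for brevity.
    batch = tokenizer(texts, return_tensors="pt", padding=True)
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits / temperature
    student_logits = student(**batch).logits / temperature
    # KL(teacher || student); the temperature^2 factor rescales gradients
    # as in standard knowledge distillation.
    return temperature ** 2 * F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits, dim=-1),
        reduction="batchmean",
    )
```

One could then run the same behavioral alignment evals (refusals, jailbreak robustness, etc.) on the teacher and the resulting student and compare which properties survive distillation.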
However, for more intrinsic alignment properties (is the model scheming, does it have a misaligned goal?), it’s less clear how they would develop in the first place, so I’m not sure whether distillation would reliably transfer them.
Importantly, I would also be concerned that misalignment could emerge during the RL process or any further training.