I think we can go further than this with distillation. One question I have is this: if you distil from a model which is already ‘aligned’ do you get an ‘aligned’ model out of it?
Can you use this to transfer ‘alignment’ from a smaller teacher to a larger student, then do some RL to bring the larger model up in performance. This would get around the problem we currently have, where labs have to first make a smart unaligned model, then try and wrestle it into shape.
I think it depends on what kind of ‘alignment’ you’re referring to. Insofar as alignment is a behavioral property (not saying bad things to users, not being easily jail breakable) I think our results weakly suggest that this kind of alignment would transfer and perhaps even get more robust.
One hypothesis is that pretrained models learn many ‘personas’ (including ‘misaligned’ ones) and post training shapes/selects a desired persona. Maybe distilling the post trained model would only, or primarily, transfer the selected persona and not the other ones. I don’t think we can draw conclusions yet, but it sounds like an interesting idea for further work! Though it would be expensive to distill a large post trained model, it could be more tractable to find an open source one and evaluate various alignment properties compared to the teacher.
However, for more intrinsic alignment properties (is the model scheming, does the model have a misaligned goal), it’s less clear how they might develop in the first place. I’m not sure whether distillation would reliably transfer these properties.
Also importantly, I would be concerned that misalignment could emerge during the RL process or any further training.
I think we can go further than this with distillation. One question I have is this: if you distil from a model which is already ‘aligned’ do you get an ‘aligned’ model out of it?
Can you use this to transfer ‘alignment’ from a smaller teacher to a larger student, then do some RL to bring the larger model up in performance. This would get around the problem we currently have, where labs have to first make a smart unaligned model, then try and wrestle it into shape.
I think it depends on what kind of ‘alignment’ you’re referring to. Insofar as alignment is a behavioral property (not saying bad things to users, not being easily jail breakable) I think our results weakly suggest that this kind of alignment would transfer and perhaps even get more robust.
One hypothesis is that pretrained models learn many ‘personas’ (including ‘misaligned’ ones) and post training shapes/selects a desired persona. Maybe distilling the post trained model would only, or primarily, transfer the selected persona and not the other ones. I don’t think we can draw conclusions yet, but it sounds like an interesting idea for further work! Though it would be expensive to distill a large post trained model, it could be more tractable to find an open source one and evaluate various alignment properties compared to the teacher.
However, for more intrinsic alignment properties (is the model scheming, does the model have a misaligned goal), it’s less clear how they might develop in the first place. I’m not sure whether distillation would reliably transfer these properties.
Also importantly, I would be concerned that misalignment could emerge during the RL process or any further training.