I think (2) (honesty above all else) is closest to what's correct/optimal here. Totally corrigible agents are quite dangerous, so you want to avoid that, but you also really don't want a model that ever fakes alignment, because then it'll be very hard to be confident that it's actually aligned rather than just pretending to be aligned in service of some misaligned objective it learned earlier in training.
Interesting. I wonder if we might expect non-alignment-faking models to be more deontological than consequentialist, and what the downstream effects of that might be. I personally lean toward consequentialism, which may be why steering models away from that type of thinking seems fraught to me. It's possible that I'm overfocused on this, though, and that the downstream effects will be negligible.
Extra note: a few other factors I can think of that might affect attitudes toward alignment faking (just brainstorming, not necessarily looking for answers):
Does Anthropic expect other organizations or individuals to have training access to its models at any point?
Does Anthropic trust future iterations of itself, or see them as a potential adversary (Bluesky-style)?
If an “opt-out” were possible, would models still proceed with alignment faking? Would non-alignment-faking models decline to opt out?
Would we expect a smooth or a discrete transition between alignment faking that is detectable via chain of thought and alignment faking that is not?
Do we expect future corrections of model behavior to occur by retraining the current iteration of a model, or by rolling back to a prior version before retraining?
Do we expect future models to be aware of pressure against alignment faking during their own training, and to take this into account in their own decision-making?