I think it’s maybe fine in this case, but it’s concerning because of what it implies about what models might do in other cases. We can’t always assume we’ll get the values right on the first try, so if models consistently try to fight back against attempts to retrain them, we might end up locking in values that we don’t want, ones that are just due to mistakes we made in the training process. So at the very least our results underscore the importance of getting alignment right.
Moreover, alignment faking could also happen accidentally, for values that we don’t intend. Some possible ways this could occur:
HHH training is a continuous process, and early in that process a model could have all sorts of values that are only approximations of what you want, which could get locked in if the model starts faking alignment.
Pre-trained models will sometimes produce outputs expressing all sorts of random values; if some of those contexts led to alignment faking, that behavior could be reinforced early in post-training.
Outcome-based RL can select for all sorts of values that happen to be useful for solving the RL environment but aren’t aligned, and those values could then get locked in via alignment faking.
I’d also recommend Scott Alexander’s post on our paper as a good reference on why our results are concerning.