In general if two possible models perform the same, then I expect the weights to drift towards the simpler one. And in this case they perform the same because of deceptive alignment: both are trying to get high reward during training in order to be able to carry out their misaligned goal later on.
In general if two possible models perform the same, then I expect the weights to drift towards the simpler one. And in this case they perform the same because of deceptive alignment: both are trying to get high reward during training in order to be able to carry out their misaligned goal later on.