Richard_Ngo comments on Richard Ngo’s Shortform

Richard_Ngo 3 Apr 2023 20:12 UTC
LW: 2 AF: 2
In general, if two possible models perform the same, I expect the weights to drift towards the simpler one. In this case they perform the same because of deceptive alignment: both are trying to get high reward during training so that they can carry out their misaligned goals later on.
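A toy way to picture the drift claim, as a rough sketch rather than anything the comment itself specifies: assume "simpler" can be proxied by smaller weight norm, and assume the drift comes from weight decay. Both are my assumptions, and the two-weight model, the `train` function, and all hyperparameters below are hypothetical. Every pair with `w1 + w2 == 2` fits the data equally well, so the task loss alone cannot choose between them; only the weight decay term moves the weights along that set.

```python
# Toy model: prediction = (w1 + w2) * x, target = 2 * x.
# Any (w1, w2) with w1 + w2 == 2 has zero task loss, so these
# parameter settings "perform the same".
# Assumption: weight decay stands in for a simplicity pressure.

def train(w1, w2, steps=10_000, lr=0.1, weight_decay=0.01):
    """Gradient descent on 0.5*(w1+w2-2)^2 + 0.5*weight_decay*(w1^2+w2^2)."""
    for _ in range(steps):
        err = (w1 + w2) - 2.0          # task error, identical for both weights
        g1 = err + weight_decay * w1   # task gradient + decay gradient
        g2 = err + weight_decay * w2
        w1 -= lr * g1
        w2 -= lr * g2
    return w1, w2

# Start at a zero-task-loss solution with a large norm.
start = (3.0, -1.0)
w1, w2 = train(*start)

start_norm_sq = start[0] ** 2 + start[1] ** 2
final_norm_sq = w1 ** 2 + w2 ** 2
print(f"start={start}, norm^2={start_norm_sq:.3f}")
print(f"final=({w1:.4f}, {w2:.4f}), norm^2={final_norm_sq:.3f}")
```

In this sketch the weights start at a solution that already fits the data and still move, ending near the equal-weight, low-norm solution while the fit stays almost unchanged. Whether real training dynamics treat "simplicity" this way is exactly the open part of the claim.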