If we replace ‘evil’ with ‘capable and deceptively aligned’, then I think this logic doesn’t hold. Such a model’s strategy is to not hack during training[1] and to hack during deployment, so the model not hacking is not evidence about deceptive alignment one way or the other. Moreover, including the string ‘it’s okay to hack’ wouldn’t change the hack rate of capable, deceptively aligned models, especially if they are aware of this as a common alignment technique. So the coefficient of ∇P(deceptively aligned) is ~0.
[1] Or rather, to hack at the same rate as an aligned model.