Oops, fixed
Alexa Pan
Will misaligned AIs know that they’re misaligned?
What would an IRB-like policy for AI experiments look like?
Thanks for flagging, it’s interesting to know that this idea has gotten considerable traction. Maybe we should have responded in more detail, given that.
Our response was mainly the following:

If models can’t confidently distinguish test and deployment, they may be super paranoid about being tested and behave well out of caution in deployment. This is still bad news, since it suggests that more capable models might behave dangerously in deployment, but it doesn’t provide clear warning signs or examples of misbehavior in the wild to study.
Also, this doesn’t seem like a scalable solution for more capable models, which will probably be able to distinguish test from deployment at some point.
Thanks, I agree with you here! Do you think that’s an assumption (about what misalignment looks like) we should generally work more with?
Supposing it is, I’m not immediately sure how this would change the rest of the argument. I guess considerations around “should we intervene on persona A’s knowledge of persona B” could turn out to be quite different, e.g., one might not be particularly worried about this knowledge eliciting persona B or making the model more dangerously coherent across contexts.