Thanks, I agree with you here! Do you think that’s an assumption (about what misalignment looks like) we should generally work more with?
Supposing it is, I’m not immediately sure how this would change the rest of the argument. I guess considerations for “should we intervene on persona A’s knowledge of persona B” could turn out to be quite different, e.g. one might not be particularly worried about this knowledge eliciting persona B or making the model more dangerously coherent across contexts.