I think if you relax assumption (1) (that the model acts in a reasonably coherent, goal-directed manner across contexts) on misalignment, and instead claim only that there are contexts (likely over long trajectories) in which the model acts in a coherent, goal-directed way that is misaligned, without the whole model being consistent, then the argument is very easy to believe. That is, even if you believe a single persona can introspect well on its current state, it will likely not be able to do so over the very large space of possible misaligned personas the model could traverse to, whose introspective states might be inaccessible or hard to reason about.
Thanks, I agree with you here! Do you think that’s an assumption (about what misalignment looks like) we should generally work more with?
Supposing it is, I’m not immediately sure how this would change the rest of the argument. I guess considerations for “should we intervene on persona A’s knowledge of persona B” could turn out to be quite different, e.g. one might not be particularly worried about this knowledge eliciting persona B or making the model more dangerously coherent across contexts.