To elaborate on my sibling comment, it certainly feels like it should make some difference whether it’s the case that
1. The model has some kind of overall goal that it is trying to achieve, and if furthering that goal requires strategically lying to the user, it will.
2. The model is effectively composed of various subagents: some understand the human's goal and are aligned to it, some will engage in strategic deception to achieve a different kind of goal, and some aren't goal-oriented at all and are just doing random stuff. Different situations trigger different subagents, so the model's behavior depends on exactly which of them get triggered. It doesn't have any coherent overall goal that it is pursuing.
#2 seems to me much more likely, since:

- It's implied by the behavior we've seen.
- The models aren't trained to have any single coherent goal, so we don't have a reason to expect one to appear.
- Humans seem to be better modeled by #2 than by #1, so we might expect it to be what various learning processes produce by default.
How exactly should this affect our threat models? I’m not sure, but it still seems like a distinction worth tracking.