To elaborate on my sibling comment, it certainly feels like it should make some difference whether it’s the case that
1. The model has some kind of overall goal that it is trying to achieve, and if furthering that goal requires strategically lying to the user, it will.
2. The model is effectively composed of various subagents: some understand the human's goal and are aligned to it, some will engage in strategic deception to achieve a different kind of goal, and some aren't goal-oriented at all and are just doing random stuff. Different situations trigger different subagents, so the model's behavior depends on exactly which of them get triggered. It doesn't have any coherent overall goal that it is pursuing.
#2 seems to me much more likely, since:

- It's implied by the behavior we've seen.
- The models aren't trained to have any single coherent goal, so we don't have a reason to expect one to appear.
- Humans seem to be better modeled by #2 than by #1, so we might expect it to be what various learning processes produce by default.
How exactly should this affect our threat models? I’m not sure, but it still seems like a distinction worth tracking.