To say that the model knew it was giving an answer we didn’t want implies that the features with the correct pattern would have been active at the same time. Possibly they were, but we can’t know that without interpretability tools.
I do not think we’re getting any utility out of not just calling this lying. There are absolutely clear cases where the models, in their reasoning summaries, say they are planning to lie, have a reason to lie, and do in fact lie. They do this systematically, literally, and explicitly often enough that it’s now a commonly accepted part of life that “oh yeah, the models will just lie to you”.
To put it another way: whatever you want to call “the model will strategically plan on deceiving you and then do so in a way behaviorally indistinguishable from lying”, in these cases that is the threat model.
To elaborate on my sibling comment, it certainly feels like it should make some difference whether it’s the case that:

1. The model has some kind of overall goal that it is trying to achieve, and if furthering that goal requires strategically lying to the user, it will.
2. The model is effectively composed of various subagents, some of which understand the human goal and are aligned to it, some of which will engage in strategic deception to achieve a different kind of goal, and some of which aren’t goal-oriented at all but just doing random stuff. Different situations will trigger different subagents, so the model’s behavior depends on exactly which of the subagents get triggered. It doesn’t have any coherent overall goal that it would be pursuing.

#2 seems to me much more likely, since:

- It’s implied by the behavior we’ve seen.
- The models aren’t trained to have any single coherent goal, so we don’t have a reason to expect one to appear.
- Humans seem to be better modeled by #2 than by #1, so we might expect it to be what various learning processes produce by default.
How exactly should this affect our threat models? I’m not sure, but it still seems like a distinction worth tracking.
To put it another way: whatever you want to call “the model will strategically plan on deceiving you and then do so in a way behaviorally indistinguishable from lying”, in these cases that is the threat model.
Sure. But knowing the details of how it’s happening under the hood—whether it’s something that the model in some sense intentionally chooses or not—seems important for figuring out how to avoid it in the future.