Old-timers might remember that we used to call lying, “hallucination”.
Which is to say, this is the return of a familiar problem. GPT-4 in its early days made things up constantly; that never completely went away, and now it’s back.
Did OpenAI release o3 like this, in order to keep up with Gemini 2.5? How much does Gemini 2.5 hallucinate? How about Sonnet 3.7? (I wasn’t aware that current Claude has a hallucination problem.)
We’re supposed to be in a brave new world of reasoning models. I thought the whole point of reasoning was to keep the models even more grounded in reality. But apparently it’s actually making them more “agentic”, at the price of renewed hallucination?
Hallucination was a bad term because it sometimes included lies and sometimes included… well, something more like hallucinations: cases where the model itself seemed to actually believe what it was saying, or at least not be aware that there was a problem with what it was saying. Whereas in these cases it’s clear that the models know the answer they are giving is not what we wanted and they are doing it anyway.
Whereas in these cases it’s clear that the models know the answer they are giving is not what we wanted and they are doing it anyway.
I think this is not so clear. Yes, it might be that the model writes a thing, and then if you ask it whether humans would have wanted it to write that thing, it will tell you no. But it’s also the case that a model might be asked to write a young child’s internal narration, and then upon being asked about it, tell you that the narration is too sophisticated for a child of that age.
Or, the model might offer the correct algorithm for finding the optimal solution to a puzzle if asked in the abstract, but fail to apply that knowledge if it’s given a concrete rather than an abstract instance of the problem right away, instead trying a trial-and-error type of approach and then just arbitrarily declaring that the solution it found was optimal.
I think the situation is most simply expressed as: different kinds of approaches and knowledge are encoded within different features inside the LLM. Sometimes there will be a situation that triggers features that cause the LLM to go ahead with an incorrect approach (writing untruths about what it did, writing a young character with overly sophisticated knowledge, going with a trial-and-error approach when asked for an optimal solution). Then if you prompt it differently, this will activate features with a more appropriate approach or knowledge (telling you that this is undesired behavior, writing the character in a more age-appropriate way, applying the optimal algorithm).
To say that the model knew it was giving an answer we didn’t want, implies that the features with the correct pattern would have been active at the same time. Possibly they were, but we can’t know that without interpretability tools. And even if they were, “doing it anyway” implies a degree of strategizing and intent. I think a better phrasing is that the model knew in principle what we wanted, but failed to consider or make use of that knowledge when it was writing its initial reasoning.
To say that the model knew it was giving an answer we didn’t want, implies that the features with the correct pattern would have been active at the same time. Possibly they were, but we can’t know that without interpretability tools.
I do not think we’re getting utility out of not just calling this lying. There are absolutely clear cases where the models show in their reasoning summaries that they are planning to lie, have a reason to lie, and do in fact lie. They do this systematically, literally, and explicitly often enough that it’s now a commonly accepted part of life that “oh yeah, the models will just lie to you”.
To put it another way, whatever you want to call “the model will strategically plan on deceiving you and then do so in a way behaviorally indistinguishable from lying in these cases”, that is the threat model.
To elaborate on my sibling comment, it certainly feels like it should make some difference whether it’s the case that:
1. The model has some kind of overall goal that it is trying to achieve, and if furthering that goal requires strategically lying to the user, it will.
2. The model is effectively composed of various subagents, some of which understand the human goal and are aligned to it, some of which will engage in strategic deception to achieve a different kind of goal, and some of which aren’t goal-oriented at all but just doing random stuff. Different situations will trigger different subagents, so the model’s behavior depends on exactly which of the subagents get triggered. It doesn’t have any coherent overall goal that it would be pursuing.
#2 seems to me much more likely, since:
- It’s implied by the behavior we’ve seen
- The models aren’t trained to have any single coherent goal, so we don’t have a reason to expect one to appear
- Humans seem to be better modeled by #2 than by #1, so we might expect it to be what various learning processes produce by default
How exactly should this affect our threat models? I’m not sure, but it still seems like a distinction worth tracking.
To put it another way, whatever you want to call “the model will strategically plan on deceiving you and then do so in a way behaviorally indistinguishable from lying in these cases”, that is the threat model.
Sure. But knowing the details of how it’s happening under the hood—whether it’s something that the model in some sense intentionally chooses or not—seems important for figuring out how to avoid it in the future.