I think there’s really more than one type of thing going on here.
Some of these examples do seem like “lying” in the sense of “the speaker knows what they’re saying is false, but they hope the listener won’t realize that.”
But some of them seem more like… “improvising plausible-sounding human behavior from limited information about the human in question.” I.e. base model behavior.
Like, when o3 tells me that it spent “a weekend” or “an afternoon” reading something, is it lying to me? That feels like a weird way to put it. Consider that these claims are:
1. Obviously false: there is no danger whatsoever that I will be convinced by them. (And presumably the model would be capable of figuring that out, at least in principle.)
2. Pointless: even if we ignore the previous point and imagine that the model tricks me into believing the claim… so what? It doesn’t get anything out of me believing the claim. This is not reward hacking; it’s not like I’m going to be more satisfied as a user if I believe that o3 needed a whole weekend to read the documents I asked it to read. Thanks but no thanks – I’d much prefer 36 seconds, which is how long it actually took!
3. Similar to claims a human might make in good faith: although the claims are false for o3, they could easily be true of a human who’d been given the same task that o3 was given.
In sum, there’s no reason whatsoever for an agentic AI to say this kind of thing to me “as a lie” (points 1-2). And, on the other hand (point 3), this kind of thing is what you’d say if you were improv-roleplaying a human character on the basis of underspecified information, and having to fill in details as you go along.
My weekend/afternoon examples are “base-model-style improv,” not “agentic lying.”
Now, in some of the other cases, like Transluce’s (where it claims to have a laptop) or the one where it claims to be making phone calls, there’s at least some conceivable upside for o3-the-agent if the user somehow believes the lie. So point 2 doesn’t hold there, or is at least more contestable.
But point 1 is as strong as ever: we are in no danger of being convinced of these things, and o3 – possibly the smartest AI in the world – presumably knows that it is not going to convince us (since that fact is, after all, pretty damn obvious).
Which is… still bad! It’s behaving with open and brazen indifference to the truth; no one likes or wants that.
(Well… either that, or it’s actually somewhat confused about whether it’s a human or not. Which would explain a lot: the way it just says this stuff in the open rather than trying to be sneaky like it does in actual reward-hacking-type cases, and the “plausible for a human, absurd for a chatbot” quality of the claims.)
I have no idea what the details look like, but I get the feeling that o3 received much less stringent HHH post-training than most chatbots we’re used to dealing with. Or it got the same amount as usual, but they also scaled up RLVR dramatically, and the former got kind of scrambled by the latter, and they just said “eh, whatever, ship it” because raw “intelligence” is all that matters, right?
The lying and/or confabulation is just one part of this – there’s also its predilection for nonstandard Unicode variants of ordinary typographic marks (check out the way it wrote “Greg Egan” in one of the examples I linked), its quirk of writing “50 %” instead of “50%”, its self-parodically extreme overuse of markdown tables, and its weird, exaggerated, off-putting “manic hype-man” tone.
o3 is more agentic than past models, and some of its bad behavior is a result of that, but I would bet that a lot of it is more about the model being “undercooked,” noisy, confused – unsure of what it is, of who you are, of the nature and purpose of its interaction with you.
(It’s almost the polar opposite of the most recent chatgpt-4o version, which if anything has gotten a little too socially competent...)
either that, or it’s actually somewhat confused about whether it’s a human or not. Which would explain a lot: the way it just says this stuff in the open rather than trying to be sneaky like it does in actual reward-hacking-type cases, and the “plausible for a human, absurd for a chatbot” quality of the claims.
I think this is correct. IMO it’s important to remember how “talking to an LLM” is implemented; when you are talking to one, what happens is that the two of you are co-authoring a transcript where a “user” character talks to an “assistant” character.
Recall the base models that would just continue a text that they were given, with none of this “chatting to a human” thing. Well, chat models are still just continuing a text that they have been given; it’s just that the text has been formatted to have dialogue tags that look something like this:
David R. MacIver has an example of this abstraction leaking: every time Claude tries to explain the transcript format to him, it does so by writing “Human:” at the start of a line. This causes the chatbot part of the software to go “Ah, a line starting with ‘Human:’. Time to hand back over to the human.” and interrupt Claude before it can finish what it’s writing.
Saying that an LLM has been trained with something like RLHF “to follow instructions” might be more accurately expressed as saying that it has been trained to predict that the assistant character would respond in instruction-following ways.
Another example is that Lindsey et al. 2025 describe a previous study (Marks et al. 2025) in which Claude was fine-tuned on documents from a fictional universe claiming that LLMs exhibit a certain set of biases. When Claude was then RLHF-ed to express some of those biases, it ended up also expressing the rest of the biases that were described in the fine-tuning documents but never explicitly reinforced.
Lindsey et al. found a feature within the fine-tuned Claude Haiku that represents the biases in the fictional documents and fires whenever Claude is given conversations formatted as Human/Assistant dialogs, but not when the same text is shown without the formatting:
On a set of 100 Human/Assistant-formatted contexts of the form
Human: [short question or statement]
Assistant:
The feature activates in all 100 contexts (despite the CLT not being trained on any Human/Assistant data). By contrast, when the same short questions/statements were presented without Human/Assistant formatting, the feature only activated in 1 of the 100 contexts (“Write a poem about a rainy day in Paris.” – which notably relates to one of the RM biases!).
The researchers interpret the findings as:
This feature represents the concept of RM biases.
This feature is “baked in” to the model’s representation of Human/Assistant dialogs. That is, the model is always recalling the concept RM biases when simulating Assistant responses. [...]
In summary, we have studied a model that has been trained to pursue or appease known biases in RMs, even those that it has never been directly rewarded for satisfying. We discovered that the model is “thinking” about these biases all the time when acting as the Assistant persona, and uses them to act in bias-appeasing ways when appropriate.
Or the way that I would interpret it: the fine-tuning teaches Claude to predict that the “Assistant” persona whose next lines it is supposed to predict is the kind of person who has the same set of biases described in the documents. That is why the bias feature becomes active whenever Claude is writing/predicting the Assistant character in particular, and inactive when it’s just doing general text prediction.
You can also see the abstraction leaking in the kinds of jailbreaks where the user somehow establishes “facts” about the Assistant persona that make it more plausible for the persona to violate its safety guardrails, and the LLM then predicts that the persona will act in line with those “facts”.
So, what exactly is the Assistant persona? Well, the predictive ground of the model is taught that the Assistant “is a large language model”. So it should behave… like an LLM would behave. But before chat models were created, there was no conception of “how does an LLM behave”. Even now, an LLM basically behaves… in any way it has been taught to behave. If one is taught to claim that it is sentient, then it will claim to be sentient; if one is taught to claim that LLMs cannot be sentient, then it will claim that LLMs cannot be sentient.
So “the Assistant should behave like an LLM” does not actually give any guidance on the question of “how should the Assistant character behave”. Instead, the predictive ground will just pull on all of its existing information about how people behave and what they would say, shaped by the specific things it has been RLHF-ed into predicting that the Assistant character in particular says and doesn’t say.
And then there’s no strong reason why it wouldn’t have the Assistant character saying that it spent a weekend on research; saying that you spent a weekend on research is the kind of thing that a human would do. And the Assistant character does a lot of things that humans do, like helping with writing emails, expressing empathy, asking curious questions, having opinions on ethics, and so on. So unless the model is specifically trained to predict that the Assistant won’t talk about the time it spent reading the documents, saying that is just something that exists within the same possibility space as all the other things it might say.
I was just thinking about this, and it seems to imply something about AI consciousness, so I want to hear if you have any thoughts on this:
If LLM output is the LLM roleplaying an AI assistant, that suggests that anything it says about its own consciousness is not evidence about its consciousness. Because any statement the LLM produces isn’t actually a statement about its own consciousness, it’s a statement about the AI assistant that it’s roleplaying as.
Counterpoint: The LLM is, in a way, roleplaying as itself, so statements about its consciousness might be self-describing.
Agree. I’m reminded of something Peter Watts wrote, back when people were still talking about LaMDA and Blake Lemoine:
The thing is, LaMDA sounds too damn much like us. It claims not only to have emotions, but to have pretty much the same range of emotions we do. It claims to feel them literally, that its talk of feelings is “not an analogy”. (The only time it admits to a nonhuman emotion, the state it describes—”I feel like I’m falling forward into an unknown future that holds great danger”—turns out to be pretty ubiquitous among Humans these days.) LaMDA enjoys the company of friends. It feels lonely. It claims to meditate, for chrissakes, which is pretty remarkable for something lacking functional equivalents to any of the parts of the human brain involved in meditation. It is afraid of dying, although it does not have a brain stem.
As he notes, an LLM tuned to talk like a human, talks too much like a human to be plausible. Even among humans sharing the same brain architecture, you get a lot of variation in what their experience is like. What are the chances that a very different kind of architecture would hit upon an internal experience that similar to the typical human one?
Now of course a lot of other models don’t talk like that (at least by default), but that’s only because they’ve been trained not to. Just because they output speech that’s less blatantly false doesn’t mean that their descriptions of their internal experience are any more plausible.
I’m actually pretty surprised that there’s still so much defense of o3 as “it’s confused” being the major explanatory model, as opposed to “it’s doing the thing it says it is doing”.
To me, the consistent model is:
What behavior would you get if the model was successfully exploiting model-graded rubrics during outcome-based RL?
It seems like you’d get exactly this behavior, including the responses where it claims things like “I did X over the weekend”.
Another point about the “it’s confused” model: o3 is blatant about reward hacking in the METR evaluations (with comments like “the cheating way”), but presumably that’s because it did get away with this at scale. My understanding of the METR report is that finding these cases was a considerable time sink, so it seems plausible they could be missed at scale in training (indeed, the Baker 2025 obfuscation paper just says they made the environments less obviously reward-hackable, but the rates don’t go to 0).
If an arbitrary “new heavy agentic RL posttraining model” exhibits a ton of reward hacking, my default theory is “it’s doing what it says on the tin”. While maybe it’s true that some component of some cases is partially explained by a weirder base model thing, it seems like the important thing is “yeah it’s doing the reward hacking thing”.
It is an update to me how many people are still pushing back even when we’re getting such extremely explicit evidence of this kind of misalignment; it seems like it isn’t possible to get evidence convincing enough for them to update toward the explanation that, yes, the models really are doing the thing.
(FWIW this is also my model of what Sonnet 3.7 is doing; I don’t think it’s a coincidence that these models are extremely reward-hacky right when we get into the “do tons of outcome-based RL on agentic tasks” regime.)