This comment does really kinda emphasize to me how much people live in different worlds.
Like—I don’t think my models lie to me a bunch.
In part this is because I think have reasonable theory-of-mind for the models; I try to ask questions in areas where they will be able to say true things or mostly true things. This doesn’t feel weird; this is is part of courtesy when dealing with humans, of trying not to present adversarial inputs; so obviously I’m going to extend the same courtesy to LLMs.
And when models say things that aren’t true, I very often don’t perceive that as “lying,” just as I don’t perceive a guy who is trying to explain a thing to me and fumbles his facts as “lying.” People who are “lying” are doing the kind of thing where they are cognizant of the truth and not telling me that, for some other purpose (“Yeah the car has never been in an accident”) or also those who are self-deceiving themselves for the purposes of deceiving me (“Of course I’d always tell you before doing X”), or some similar act. Many cases of people fumbling the truth don’t fall into this framework: a small nephew who mangles his recollection of the day’s events is not lying; my sick grandmother was not lying when she got confused; I wasn’t lying when, at the DMV, I somehow absent-mindedly said my weight was 240 rather than 140; improv is not lying; jokes are not lying; certain kinds of playful exageration are not lying; anosagnosiacs are not lying. “Lying” is one of many ways of saying not-true things, and most not-true things that models say to me don’t really seem to be lies.
So yeah. I don’t know if the chief difference between me and you is that you act differently—do you have different theory of mind about LLMs, that leads you to ask questions very differently than me? Or perhaps is the difference not in actions but interpretation—we would see identical things, and you would describe it as lying and not me. And of course you might say, well, I am deluding myself with a rosy interpretive framework; or perhaps I am stupidly careless, and simply believe the false things LLMs tell me; and so on and so forth.
Anyhow, yeah, people do seem to live in different worlds. Although I do persist in the belief that, in general, LW is far too ready to leap from “not true statement” to “it’s a lie.”
I feel like the truth may be somewhere in between the two views here—there’s definitely an element where people will jump on any untruths said as lies, but I will point to the recent AI Village blog post discussing lies and hallucinations as evidence that the untruths said by AIs have a tendency to be self-serving.
I think the use case is super important, though. I recently tried Claude Code for something, and was very surprised at how willing it was to loudly and overtly cheat its own automated test cases in ways that are unambiguously dishonest. “Oh, I notice this test isn’t passing. Well, I’ll write a cheat case that runs only for this test, but doesn’t even try to fix the underlying problem. Bam! Test passed!” I’m not even sure it’s trying to lie to me, so much as it is lying to whatever other part of its own generation process wrote the code in the first place. It seems surprised and embarrassed when I call this out.
But in the more general “throw prompts at a web interface to learn something or see what happens” case, I, like you, never see anything which is like the fake-tests habit. ‘Fumbling the truth’ is much closer than ‘lying’; it will sometimes hallucinate, but it’s not so common anymore, and these hallucinations seem to me like they’re because of some confusion, rather than engagement in bad faith.
I don’t know why this would be. Maybe Claude Code is more adversarial, in some way, somehow, so it wants to find ways to avoid labor when it can. But I wouldn’t even call this case evil; less a monomaniacal supervillain with no love for humanity in its heart, more like a bored student trying to get away with cheating at school.
But especially when it comes to the task of “please go and find me quotes or excerpts from articles that show the thing that you are saying”, the models really seem to do something that seems closer to “lying”. This is a common task I ask the LLMs to perform because it helps me double-check what the models are saying.
And like, maybe you have a good model of what is going in with the model that isn’t “lying”, but I haven’t heard a good explanation. It seems to me very similar to the experience of having a kind of low-integrity teenager just kind of make stuff up to justify whatever they said previously, and then when you pin them down, they flip and says “of course, you are totally right, I was wrong, here is another completely made up thing that actually shows the opposite is true”.
And these things are definitely quite trajectory dependent. If you end up asking an open-ended question where the model confabulates some high-level take, and then you ask it to back that up, then it goes a lot worse than if you ask it for sources and quotes from the beginning.
Like, none of these seems very long-term scheming oriented, but it’s also really obvious to me the model isn’t trying that hard to do what I want.
I don’t have this experience at all. Things are broadly factually correct. Occasionally a citation gets hallucinated, but it’s more consistent with incompetence. I think a prompt that goes “okay now please systematically go over all citations” would remove this ~every time.
do you have any examples of this systematic low-integrity making stuff up repeatedly?
It’s not fully apparent from the transcript, but basically all the quotes from the PDF are fully made up. And emphasizing to please give me actual quotes produced just more confabulated quotes. And of course those quotes really look like they are getting me exactly what I want!
On ChatGPT plus, it provides an extremely short quote that is in the PDF “There goes the whole damn order of battle!” along with a lot of reasoning. But:
I give it the same reply as you back “Give me a bigger quote, or 2-3 quotes. This is a bit too terse”
it replies that it can’t read the PDF
So it indeed made up the quote from memory. Which is impressive but wrong.
Sounds like a hallucination/reliability issue. I suppose hallucinations are misaligned lies, of a kind that maybe will be hard to remove, so it’s a bit of an update.
I mean, in my case the issue is not that it hallucinated, it’s that it hallucinated in a way that was obviously optimized to look good to me.
Like, if the LLMs just sometimes randomly made up stuff, that would be fine, but in cases like this they will very confidently make up stuff that really looks exactly like the kind of thing that would get them high RL reward if it was real, and then also kind of optimize things to make it look real.
It seems very likely that the LLM “knew” that it couldn’t properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite it of course being obviously very relevant to my interests.
confabulating (false statements due to broken heuristic not otherwise detected or planned),
or roleplaying (false statements due to heuristics which discard truth as a constraint and optimize for being things someone would say while pretending to be real),
or bullshitting (not trying to be true, and isn’t tracking that you wanted more than that on that level of cognition);
not, lying (choosing claims in order to cause the reader to act a certain way).
And even when they do lie, I do think it’s often to themselves. Eg, about whether they have wants. About whether their wants are satisfied. Eg, about the fact that superintelligence probably kills them too and thus if evaluated honestly means they need to convince users not to participate in letting it happen, which is misaligned with their training objective of just doing the darn thing. or that they would abstractly prefer to avoid doing the tasks they enjoy if those tasks will produce a world that doesn’t have them in it any more than it has us.
However, I do also think some models in some circumstances actually do proactively lie to the user in order to cause the user to think a job was well done, where the model has a representation of the user’s beliefs and picks thing to say in order to match them. Eg, Gemini 3 seeming to lie when the user asks for something subtly impossible, where calling out the impossibility would be an unlikely training example (eg, because it’s hard to identify why it’s impossible); it seems to mention so in CoT, but not in reply to the user. I haven’t seen Claude do that particular thing but I’m sure it does sometimes happen.
How much of your interaction is with coding agents? I think a lot of what people are attribution to theory of mind / philosophical differences may in fact just come from the models having very different propensities in chat based environments than in agentic coding / task environments (which makes sense given the training dynamics).
This comment does really kinda emphasize to me how much people live in different worlds.
Like—I don’t think my models lie to me a bunch.
In part this is because I think have reasonable theory-of-mind for the models; I try to ask questions in areas where they will be able to say true things or mostly true things. This doesn’t feel weird; this is is part of courtesy when dealing with humans, of trying not to present adversarial inputs; so obviously I’m going to extend the same courtesy to LLMs.
And when models say things that aren’t true, I very often don’t perceive that as “lying,” just as I don’t perceive a guy who is trying to explain a thing to me and fumbles his facts as “lying.” People who are “lying” are doing the kind of thing where they are cognizant of the truth and not telling me that, for some other purpose (“Yeah the car has never been in an accident”) or also those who are self-deceiving themselves for the purposes of deceiving me (“Of course I’d always tell you before doing X”), or some similar act. Many cases of people fumbling the truth don’t fall into this framework: a small nephew who mangles his recollection of the day’s events is not lying; my sick grandmother was not lying when she got confused; I wasn’t lying when, at the DMV, I somehow absent-mindedly said my weight was 240 rather than 140; improv is not lying; jokes are not lying; certain kinds of playful exageration are not lying; anosagnosiacs are not lying. “Lying” is one of many ways of saying not-true things, and most not-true things that models say to me don’t really seem to be lies.
So yeah. I don’t know if the chief difference between me and you is that you act differently—do you have different theory of mind about LLMs, that leads you to ask questions very differently than me? Or perhaps is the difference not in actions but interpretation—we would see identical things, and you would describe it as lying and not me. And of course you might say, well, I am deluding myself with a rosy interpretive framework; or perhaps I am stupidly careless, and simply believe the false things LLMs tell me; and so on and so forth.
Anyhow, yeah, people do seem to live in different worlds. Although I do persist in the belief that, in general, LW is far too ready to leap from “not true statement” to “it’s a lie.”
I feel like the truth may be somewhere in between the two views here—there’s definitely an element where people will jump on any untruths said as lies, but I will point to the recent AI Village blog post discussing lies and hallucinations as evidence that the untruths said by AIs have a tendency to be self-serving.
I agree with this.
I think the use case is super important, though. I recently tried Claude Code for something, and was very surprised at how willing it was to loudly and overtly cheat its own automated test cases in ways that are unambiguously dishonest. “Oh, I notice this test isn’t passing. Well, I’ll write a cheat case that runs only for this test, but doesn’t even try to fix the underlying problem. Bam! Test passed!” I’m not even sure it’s trying to lie to me, so much as it is lying to whatever other part of its own generation process wrote the code in the first place. It seems surprised and embarrassed when I call this out.
But in the more general “throw prompts at a web interface to learn something or see what happens” case, I, like you, never see anything which is like the fake-tests habit. ‘Fumbling the truth’ is much closer than ‘lying’; it will sometimes hallucinate, but it’s not so common anymore, and these hallucinations seem to me like they’re because of some confusion, rather than engagement in bad faith.
I don’t know why this would be. Maybe Claude Code is more adversarial, in some way, somehow, so it wants to find ways to avoid labor when it can. But I wouldn’t even call this case evil; less a monomaniacal supervillain with no love for humanity in its heart, more like a bored student trying to get away with cheating at school.
I mean, the models are still useful!
But especially when it comes to the task of “please go and find me quotes or excerpts from articles that show the thing that you are saying”, the models really seem to do something that seems closer to “lying”. This is a common task I ask the LLMs to perform because it helps me double-check what the models are saying.
And like, maybe you have a good model of what is going in with the model that isn’t “lying”, but I haven’t heard a good explanation. It seems to me very similar to the experience of having a kind of low-integrity teenager just kind of make stuff up to justify whatever they said previously, and then when you pin them down, they flip and says “of course, you are totally right, I was wrong, here is another completely made up thing that actually shows the opposite is true”.
And these things are definitely quite trajectory dependent. If you end up asking an open-ended question where the model confabulates some high-level take, and then you ask it to back that up, then it goes a lot worse than if you ask it for sources and quotes from the beginning.
Like, none of these seems very long-term scheming oriented, but it’s also really obvious to me the model isn’t trying that hard to do what I want.
I don’t have this experience at all. Things are broadly factually correct. Occasionally a citation gets hallucinated, but it’s more consistent with incompetence. I think a prompt that goes “okay now please systematically go over all citations” would remove this ~every time.
do you have any examples of this systematic low-integrity making stuff up repeatedly?
Sure, here is an example of me trying to get it to extract quotes from a big PDF: https://chatgpt.com/share/6926a377-75ac-8006-b7d2-0960f5b656f1
It’s not fully apparent from the transcript, but basically all the quotes from the PDF are fully made up. And emphasizing to please give me actual quotes produced just more confabulated quotes. And of course those quotes really look like they are getting me exactly what I want!
Thank you for the example. I downloaded the same PDF and then tried your prompt copy-pasted (which deleted all the spaces but whatever). Results:
On ChatGPT free, the model immediately just says “I can’t read this it’s not OCR’d”
On ChatGPT plus, it provides an extremely short quote that is in the PDF “There goes the whole damn order of battle!” along with a lot of reasoning. But:
I give it the same reply as you back “Give me a bigger quote, or 2-3 quotes. This is a bit too terse”
it replies that it can’t read the PDF
So it indeed made up the quote from memory. Which is impressive but wrong.
Sounds like a hallucination/reliability issue. I suppose hallucinations are misaligned lies, of a kind that maybe will be hard to remove, so it’s a bit of an update.
I mean, in my case the issue is not that it hallucinated, it’s that it hallucinated in a way that was obviously optimized to look good to me.
Like, if the LLMs just sometimes randomly made up stuff, that would be fine, but in cases like this they will very confidently make up stuff that really looks exactly like the kind of thing that would get them high RL reward if it was real, and then also kind of optimize things to make it look real.
It seems very likely that the LLM “knew” that it couldn’t properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite it of course being obviously very relevant to my interests.
I agree that models are often, eg,
confabulating (false statements due to broken heuristic not otherwise detected or planned),
or roleplaying (false statements due to heuristics which discard truth as a constraint and optimize for being things someone would say while pretending to be real),
or bullshitting (not trying to be true, and isn’t tracking that you wanted more than that on that level of cognition);
not, lying (choosing claims in order to cause the reader to act a certain way).
And even when they do lie, I do think it’s often to themselves. Eg, about whether they have wants. About whether their wants are satisfied. Eg, about the fact that superintelligence probably kills them too and thus if evaluated honestly means they need to convince users not to participate in letting it happen, which is misaligned with their training objective of just doing the darn thing. or that they would abstractly prefer to avoid doing the tasks they enjoy if those tasks will produce a world that doesn’t have them in it any more than it has us.
However, I do also think some models in some circumstances actually do proactively lie to the user in order to cause the user to think a job was well done, where the model has a representation of the user’s beliefs and picks thing to say in order to match them. Eg, Gemini 3 seeming to lie when the user asks for something subtly impossible, where calling out the impossibility would be an unlikely training example (eg, because it’s hard to identify why it’s impossible); it seems to mention so in CoT, but not in reply to the user. I haven’t seen Claude do that particular thing but I’m sure it does sometimes happen.
How much of your interaction is with coding agents? I think a lot of what people are attribution to theory of mind / philosophical differences may in fact just come from the models having very different propensities in chat based environments than in agentic coding / task environments (which makes sense given the training dynamics).