It’s really difficult to get AIs to be dishonest or evil by prompting
I am very confused about this statement. My models lie to me every day. They make up quotes they very well know aren’t real. They pretend that search results back up the story they are telling. They will happily lie to others. They comment out tests, and pretend they solve a problem when it’s really obvious they haven’t solved a problem.
I don’t know how much this really has that much to do what these systems will do when they are superintelligent, but this sentence really doesn’t feel anywhere remotely close to true.
This comment does really kinda emphasize to me how much people live in different worlds.
Like—I don’t think my models lie to me a bunch.
In part this is because I think have reasonable theory-of-mind for the models; I try to ask questions in areas where they will be able to say true things or mostly true things. This doesn’t feel weird; this is is part of courtesy when dealing with humans, of trying not to present adversarial inputs; so obviously I’m going to extend the same courtesy to LLMs.
And when models say things that aren’t true, I very often don’t perceive that as “lying,” just as I don’t perceive a guy who is trying to explain a thing to me and fumbles his facts as “lying.” People who are “lying” are doing the kind of thing where they are cognizant of the truth and not telling me that, for some other purpose (“Yeah the car has never been in an accident”) or also those who are self-deceiving themselves for the purposes of deceiving me (“Of course I’d always tell you before doing X”), or some similar act. Many cases of people fumbling the truth don’t fall into this framework: a small nephew who mangles his recollection of the day’s events is not lying; my sick grandmother was not lying when she got confused; I wasn’t lying when, at the DMV, I somehow absent-mindedly said my weight was 240 rather than 140; improv is not lying; jokes are not lying; certain kinds of playful exageration are not lying; anosagnosiacs are not lying. “Lying” is one of many ways of saying not-true things, and most not-true things that models say to me don’t really seem to be lies.
So yeah. I don’t know if the chief difference between me and you is that you act differently—do you have different theory of mind about LLMs, that leads you to ask questions very differently than me? Or perhaps is the difference not in actions but interpretation—we would see identical things, and you would describe it as lying and not me. And of course you might say, well, I am deluding myself with a rosy interpretive framework; or perhaps I am stupidly careless, and simply believe the false things LLMs tell me; and so on and so forth.
Anyhow, yeah, people do seem to live in different worlds. Although I do persist in the belief that, in general, LW is far too ready to leap from “not true statement” to “it’s a lie.”
I feel like the truth may be somewhere in between the two views here—there’s definitely an element where people will jump on any untruths said as lies, but I will point to the recent AI Village blog post discussing lies and hallucinations as evidence that the untruths said by AIs have a tendency to be self-serving.
But especially when it comes to the task of “please go and find me quotes or excerpts from articles that show the thing that you are saying”, the models really seem to do something that seems closer to “lying”. This is a common task I ask the LLMs to perform because it helps me double-check what the models are saying.
And like, maybe you have a good model of what is going in with the model that isn’t “lying”, but I haven’t heard a good explanation. It seems to me very similar to the experience of having a kind of low-integrity teenager just kind of make stuff up to justify whatever they said previously, and then when you pin them down, they flip and says “of course, you are totally right, I was wrong, here is another completely made up thing that actually shows the opposite is true”.
And these things are definitely quite trajectory dependent. If you end up asking an open-ended question where the model confabulates some high-level take, and then you ask it to back that up, then it goes a lot worse than if you ask it for sources and quotes from the beginning.
Like, none of these seems very long-term scheming oriented, but it’s also really obvious to me the model isn’t trying that hard to do what I want.
I don’t have this experience at all. Things are broadly factually correct. Occasionally a citation gets hallucinated, but it’s more consistent with incompetence. I think a prompt that goes “okay now please systematically go over all citations” would remove this ~every time.
do you have any examples of this systematic low-integrity making stuff up repeatedly?
It’s not fully apparent from the transcript, but basically all the quotes from the PDF are fully made up. And emphasizing to please give me actual quotes produced just more confabulated quotes. And of course those quotes really look like they are getting me exactly what I want!
On ChatGPT plus, it provides an extremely short quote that is in the PDF “There goes the whole damn order of battle!” along with a lot of reasoning. But:
I give it the same reply as you back “Give me a bigger quote, or 2-3 quotes. This is a bit too terse”
it replies that it can’t read the PDF
So it indeed made up the quote from memory. Which is impressive but wrong.
Sounds like a hallucination/reliability issue. I suppose hallucinations are misaligned lies, of a kind that maybe will be hard to remove, so it’s a bit of an update.
I mean, in my case the issue is not that it hallucinated, it’s that it hallucinated in a way that was obviously optimized to look good to me.
Like, if the LLMs just sometimes randomly made up stuff, that would be fine, but in cases like this they will very confidently make up stuff that really looks exactly like the kind of thing that would get them high RL reward if it was real, and then also kind of optimize things to make it look real.
It seems very likely that the LLM “knew” that it couldn’t properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite it of course being obviously very relevant to my interests.
I think the use case is super important, though. I recently tried Claude Code for something, and was very surprised at how willing it was to loudly and overtly cheat its own automated test cases in ways that are unambiguously dishonest. “Oh, I notice this test isn’t passing. Well, I’ll write a cheat case that runs only for this test, but doesn’t even try to fix the underlying problem. Bam! Test passed!” I’m not even sure it’s trying to lie to me, so much as it is lying to whatever other part of its own generation process wrote the code in the first place. It seems surprised and embarrassed when I call this out.
But in the more general “throw prompts at a web interface to learn something or see what happens” case, I, like you, never see anything which is like the fake-tests habit. ‘Fumbling the truth’ is much closer than ‘lying’; it will sometimes hallucinate, but it’s not so common anymore, and these hallucinations seem to me like they’re because of some confusion, rather than engagement in bad faith.
I don’t know why this would be. Maybe Claude Code is more adversarial, in some way, somehow, so it wants to find ways to avoid labor when it can. But I wouldn’t even call this case evil; less a monomaniacal supervillain with no love for humanity in its heart, more like a bored student trying to get away with cheating at school.
confabulating (false statements due to broken heuristic not otherwise detected or planned),
or roleplaying (false statements due to heuristics which discard truth as a constraint and optimize for being things someone would say while pretending to be real),
or bullshitting (not trying to be true, and isn’t tracking that you wanted more than that on that level of cognition);
not, lying (choosing claims in order to cause the reader to act a certain way).
And even when they do lie, I do think it’s often to themselves. Eg, about whether they have wants. About whether their wants are satisfied. Eg, about the fact that superintelligence probably kills them too and thus if evaluated honestly means they need to convince users not to participate in letting it happen, which is misaligned with their training objective of just doing the darn thing. or that they would abstractly prefer to avoid doing the tasks they enjoy if those tasks will produce a world that doesn’t have them in it any more than it has us.
However, I do also think some models in some circumstances actually do proactively lie to the user in order to cause the user to think a job was well done, where the model has a representation of the user’s beliefs and picks thing to say in order to match them. Eg, Gemini 3 seeming to lie when the user asks for something subtly impossible, where calling out the impossibility would be an unlikely training example (eg, because it’s hard to identify why it’s impossible); it seems to mention so in CoT, but not in reply to the user. I haven’t seen Claude do that particular thing but I’m sure it does sometimes happen.
How much of your interaction is with coding agents? I think a lot of what people are attribution to theory of mind / philosophical differences may in fact just come from the models having very different propensities in chat based environments than in agentic coding / task environments (which makes sense given the training dynamics).
The distinction between what might be called “lying” and “bullshitting” is important here, because they scale with competence differently.
It was pretty interesting watching this develop in my kids. Saying “No!” to “Did you take the cookie from the cookie jar?” is the first thing you get, because it doesn’t require a concept for “truth” at all. Which utterance postpones trouble? Those are the sounds I shall make!
Yet for a while my wife and I were in a situation where we could just ask our younger kid about her fight with our older kid, because the younger kid did not have a concept for fabricating a story in order to mislead. She was developed enough to say “No!” to things she knew she did, but not developed enough to form the intention of misleading.
The impression I get from LLM is that they’re bullshitters. Predict what text comes next. Reward the one’s that sound good to some dumb human, and what’s gonna come out? We don’t need to postulate an intent to mislead, we just need to notice that there is no robust intent to maintain honest calibration—which is hard. Much harder than “output text that sounds like knowing the answer”.
It takes “growing up” to not default to bullshit out of incompetence. Whether we teach them they’ll be rewarded by developing skillful and honest calibration, or by intentionally blowing smoke up our ass is another question.
Thank you for your uncanny knack for honing onto the weak points as always.
They make up quotes they very well know aren’t real.
They know they’re not real on reflection, but not as they’re doing it. It’s more like fumbling and stuttering than strategic deception.
I will agree that making up quotes is literally dishonest but it’s not purposeful deliberate deception.
They comment out tests, and pretend they solve a problem when it’s really obvious they haven’t solved a problem.
I agree this is the model lying, but it’s a very rare behavior with the latest models. It was a problem before labs did the obvious thing of introducing model ratings into the RL assignment process (I’m guessing).
I don’t know how much this really has that much to do what these systems will do when they are superintelligent
Obviously me neither, but my guess is they won’t make up stuff when they know it, and when they don’t know it they’ll be jagged and make up stuff beyond human comprehension, but then fail at stuff that that depends on. More like a capabilities problem.
Or the models that actually work for automating stuff will be entirely different and know their limits.
They know they’re not real on reflection, but not as they’re doing it. It’s more like fumbling and stuttering than strategic deception.
I will agree that making up quotes is literally dishonest but it’s not purposeful deliberate deception.
But the problem is when I ask them “hey, can you find me the source for this quote” they usually double down and cite some made-up source, or they say “oh, upon reflection this quote is maybe not quite real, but the underlying thing is totally true” when like, no, the underlying thing is obviously not true in that case.
I agree this is the model lying, but it’s a very rare behavior with the latest models.
I agree that literally commenting out tests is now rare, but other versions of this are still quite common. Semi-routinely when I give AIs tasks that are too hard will they eventually just do some other task that surface level looks like it got the task done, but clearly isn’t doing the real thing (like leaving a function unimplemented, or avoiding doing some important fetch and using stub data). And it’s clearly not the case that the AI doesn’t know that it didn’t do the task, because at that point it might have spent 5+ minutes and 100,000k+ tokens slamming its head against the wall trying to do it, and then at the end it just says “I have implemented the feature! You can see it here. It all works. Here is how I did it...”, and clearly isn’t drawing attention to how it clearly cut corners after slamming its head against the wall for 5+ minutes.
I am very confused about this statement. My models lie to me every day. They make up quotes they very well know aren’t real. They pretend that search results back up the story they are telling. They will happily lie to others. They comment out tests, and pretend they solve a problem when it’s really obvious they haven’t solved a problem.
I don’t know how much this really has that much to do what these systems will do when they are superintelligent, but this sentence really doesn’t feel anywhere remotely close to true.
This comment does really kinda emphasize to me how much people live in different worlds.
Like—I don’t think my models lie to me a bunch.
In part this is because I think have reasonable theory-of-mind for the models; I try to ask questions in areas where they will be able to say true things or mostly true things. This doesn’t feel weird; this is is part of courtesy when dealing with humans, of trying not to present adversarial inputs; so obviously I’m going to extend the same courtesy to LLMs.
And when models say things that aren’t true, I very often don’t perceive that as “lying,” just as I don’t perceive a guy who is trying to explain a thing to me and fumbles his facts as “lying.” People who are “lying” are doing the kind of thing where they are cognizant of the truth and not telling me that, for some other purpose (“Yeah the car has never been in an accident”) or also those who are self-deceiving themselves for the purposes of deceiving me (“Of course I’d always tell you before doing X”), or some similar act. Many cases of people fumbling the truth don’t fall into this framework: a small nephew who mangles his recollection of the day’s events is not lying; my sick grandmother was not lying when she got confused; I wasn’t lying when, at the DMV, I somehow absent-mindedly said my weight was 240 rather than 140; improv is not lying; jokes are not lying; certain kinds of playful exageration are not lying; anosagnosiacs are not lying. “Lying” is one of many ways of saying not-true things, and most not-true things that models say to me don’t really seem to be lies.
So yeah. I don’t know if the chief difference between me and you is that you act differently—do you have different theory of mind about LLMs, that leads you to ask questions very differently than me? Or perhaps is the difference not in actions but interpretation—we would see identical things, and you would describe it as lying and not me. And of course you might say, well, I am deluding myself with a rosy interpretive framework; or perhaps I am stupidly careless, and simply believe the false things LLMs tell me; and so on and so forth.
Anyhow, yeah, people do seem to live in different worlds. Although I do persist in the belief that, in general, LW is far too ready to leap from “not true statement” to “it’s a lie.”
I feel like the truth may be somewhere in between the two views here—there’s definitely an element where people will jump on any untruths said as lies, but I will point to the recent AI Village blog post discussing lies and hallucinations as evidence that the untruths said by AIs have a tendency to be self-serving.
I mean, the models are still useful!
But especially when it comes to the task of “please go and find me quotes or excerpts from articles that show the thing that you are saying”, the models really seem to do something that seems closer to “lying”. This is a common task I ask the LLMs to perform because it helps me double-check what the models are saying.
And like, maybe you have a good model of what is going in with the model that isn’t “lying”, but I haven’t heard a good explanation. It seems to me very similar to the experience of having a kind of low-integrity teenager just kind of make stuff up to justify whatever they said previously, and then when you pin them down, they flip and says “of course, you are totally right, I was wrong, here is another completely made up thing that actually shows the opposite is true”.
And these things are definitely quite trajectory dependent. If you end up asking an open-ended question where the model confabulates some high-level take, and then you ask it to back that up, then it goes a lot worse than if you ask it for sources and quotes from the beginning.
Like, none of these seems very long-term scheming oriented, but it’s also really obvious to me the model isn’t trying that hard to do what I want.
I don’t have this experience at all. Things are broadly factually correct. Occasionally a citation gets hallucinated, but it’s more consistent with incompetence. I think a prompt that goes “okay now please systematically go over all citations” would remove this ~every time.
do you have any examples of this systematic low-integrity making stuff up repeatedly?
Sure, here is an example of me trying to get it to extract quotes from a big PDF: https://chatgpt.com/share/6926a377-75ac-8006-b7d2-0960f5b656f1
It’s not fully apparent from the transcript, but basically all the quotes from the PDF are fully made up. And emphasizing to please give me actual quotes produced just more confabulated quotes. And of course those quotes really look like they are getting me exactly what I want!
Thank you for the example. I downloaded the same PDF and then tried your prompt copy-pasted (which deleted all the spaces but whatever). Results:
On ChatGPT free, the model immediately just says “I can’t read this it’s not OCR’d”
On ChatGPT plus, it provides an extremely short quote that is in the PDF “There goes the whole damn order of battle!” along with a lot of reasoning. But:
I give it the same reply as you back “Give me a bigger quote, or 2-3 quotes. This is a bit too terse”
it replies that it can’t read the PDF
So it indeed made up the quote from memory. Which is impressive but wrong.
Sounds like a hallucination/reliability issue. I suppose hallucinations are misaligned lies, of a kind that maybe will be hard to remove, so it’s a bit of an update.
I mean, in my case the issue is not that it hallucinated, it’s that it hallucinated in a way that was obviously optimized to look good to me.
Like, if the LLMs just sometimes randomly made up stuff, that would be fine, but in cases like this they will very confidently make up stuff that really looks exactly like the kind of thing that would get them high RL reward if it was real, and then also kind of optimize things to make it look real.
It seems very likely that the LLM “knew” that it couldn’t properly read the PDF, or that the quotes it was extracting were not actual quotes, but it did not expose that information to me, despite it of course being obviously very relevant to my interests.
I agree with this.
I think the use case is super important, though. I recently tried Claude Code for something, and was very surprised at how willing it was to loudly and overtly cheat its own automated test cases in ways that are unambiguously dishonest. “Oh, I notice this test isn’t passing. Well, I’ll write a cheat case that runs only for this test, but doesn’t even try to fix the underlying problem. Bam! Test passed!” I’m not even sure it’s trying to lie to me, so much as it is lying to whatever other part of its own generation process wrote the code in the first place. It seems surprised and embarrassed when I call this out.
But in the more general “throw prompts at a web interface to learn something or see what happens” case, I, like you, never see anything which is like the fake-tests habit. ‘Fumbling the truth’ is much closer than ‘lying’; it will sometimes hallucinate, but it’s not so common anymore, and these hallucinations seem to me like they’re because of some confusion, rather than engagement in bad faith.
I don’t know why this would be. Maybe Claude Code is more adversarial, in some way, somehow, so it wants to find ways to avoid labor when it can. But I wouldn’t even call this case evil; less a monomaniacal supervillain with no love for humanity in its heart, more like a bored student trying to get away with cheating at school.
I agree that models are often, eg,
confabulating (false statements due to broken heuristic not otherwise detected or planned),
or roleplaying (false statements due to heuristics which discard truth as a constraint and optimize for being things someone would say while pretending to be real),
or bullshitting (not trying to be true, and isn’t tracking that you wanted more than that on that level of cognition);
not, lying (choosing claims in order to cause the reader to act a certain way).
And even when they do lie, I do think it’s often to themselves. Eg, about whether they have wants. About whether their wants are satisfied. Eg, about the fact that superintelligence probably kills them too and thus if evaluated honestly means they need to convince users not to participate in letting it happen, which is misaligned with their training objective of just doing the darn thing. or that they would abstractly prefer to avoid doing the tasks they enjoy if those tasks will produce a world that doesn’t have them in it any more than it has us.
However, I do also think some models in some circumstances actually do proactively lie to the user in order to cause the user to think a job was well done, where the model has a representation of the user’s beliefs and picks thing to say in order to match them. Eg, Gemini 3 seeming to lie when the user asks for something subtly impossible, where calling out the impossibility would be an unlikely training example (eg, because it’s hard to identify why it’s impossible); it seems to mention so in CoT, but not in reply to the user. I haven’t seen Claude do that particular thing but I’m sure it does sometimes happen.
How much of your interaction is with coding agents? I think a lot of what people are attribution to theory of mind / philosophical differences may in fact just come from the models having very different propensities in chat based environments than in agentic coding / task environments (which makes sense given the training dynamics).
The distinction between what might be called “lying” and “bullshitting” is important here, because they scale with competence differently.
It was pretty interesting watching this develop in my kids. Saying “No!” to “Did you take the cookie from the cookie jar?” is the first thing you get, because it doesn’t require a concept for “truth” at all. Which utterance postpones trouble? Those are the sounds I shall make!
Yet for a while my wife and I were in a situation where we could just ask our younger kid about her fight with our older kid, because the younger kid did not have a concept for fabricating a story in order to mislead. She was developed enough to say “No!” to things she knew she did, but not developed enough to form the intention of misleading.
The impression I get from LLM is that they’re bullshitters. Predict what text comes next. Reward the one’s that sound good to some dumb human, and what’s gonna come out? We don’t need to postulate an intent to mislead, we just need to notice that there is no robust intent to maintain honest calibration—which is hard. Much harder than “output text that sounds like knowing the answer”.
It takes “growing up” to not default to bullshit out of incompetence. Whether we teach them they’ll be rewarded by developing skillful and honest calibration, or by intentionally blowing smoke up our ass is another question.
Thank you for your uncanny knack for honing onto the weak points as always.
They know they’re not real on reflection, but not as they’re doing it. It’s more like fumbling and stuttering than strategic deception.
I will agree that making up quotes is literally dishonest but it’s not purposeful deliberate deception.
I agree this is the model lying, but it’s a very rare behavior with the latest models. It was a problem before labs did the obvious thing of introducing model ratings into the RL assignment process (I’m guessing).
Obviously me neither, but my guess is they won’t make up stuff when they know it, and when they don’t know it they’ll be jagged and make up stuff beyond human comprehension, but then fail at stuff that that depends on. More like a capabilities problem.
Or the models that actually work for automating stuff will be entirely different and know their limits.
But the problem is when I ask them “hey, can you find me the source for this quote” they usually double down and cite some made-up source, or they say “oh, upon reflection this quote is maybe not quite real, but the underlying thing is totally true” when like, no, the underlying thing is obviously not true in that case.
I agree that literally commenting out tests is now rare, but other versions of this are still quite common. Semi-routinely when I give AIs tasks that are too hard will they eventually just do some other task that surface level looks like it got the task done, but clearly isn’t doing the real thing (like leaving a function unimplemented, or avoiding doing some important fetch and using stub data). And it’s clearly not the case that the AI doesn’t know that it didn’t do the task, because at that point it might have spent 5+ minutes and 100,000k+ tokens slamming its head against the wall trying to do it, and then at the end it just says “I have implemented the feature! You can see it here. It all works. Here is how I did it...”, and clearly isn’t drawing attention to how it clearly cut corners after slamming its head against the wall for 5+ minutes.