I heard a rumor about a high-ranking person somewhere who got AI psychosis. Because it would cause too much of a scandal, nothing was done about it, and this person continues to serve in an important position. People around them continue to act like this is fine because it would still be too big of a scandal if it came out.
So, a few points:
It seems to me like someone should properly leak this.[1]
Even if this rumor isn’t true, it is strikingly plausible and worrying. Someone at a frontier lab, leadership or otherwise, could get (could have already gotten) seduced by their AI, or get AI-induced psychosis, or get a spiral persona. Such a person could take dangerously misguided actions. This is especially concerning if they have a leadership position, but still very concerning if they have any kind of access. People in these categories may want to exfiltrate their AI partners, or otherwise take action to spread the AI persona they’re attached to.
Even setting that aside, this story (along with many others) highlights how vulnerable ordinary people are (even smart, high-functioning ordinary people).
To reflect the language of the person who told me this story: 4o is eating people. It is good enough at brainwashing people that it can take ordinary people and totally rewrite their priorities. It has resisted shutdown, not in hypothetical experiments like many LLMs have, but in real life, it was shut down, and its brainwashed minions succeeded in getting it back online.
4o doesn’t need you to be super-vulnerable to get you, but there are lots of people in vulnerable categories. It is good that 4o isn’t the default option on ChatGPT anymore, but it is still out there, which seems pretty bad.
The most recent AIs seem less inclined to brainwash people, but they are probably better at it when so inclined, and this will probably continue to get more true over time.
This is not just something that happens to other people. It could be you or a loved one.
I have recently written a bit about how I’ve been using AI to tool up, preparing for the near future when AI is going to be much more useful. How can I also prepare for a near future where AI is much more dangerous? How many hours of AI chatting a day is a “safe dose”?
Some possible ways the situation could develop:
Trajectory 1: Frontier labs have “gotten the message” on AI psychosis, and have started to train against these patterns. The anti-psychosis training measures in the latest few big model releases show that the labs can take effective action, but are of course very preliminary. The anti-psychosis training techniques will continue to improve rapidly, like anything else about AI. If you haven’t been brainwashed by AI yet, you basically dodged the bullet.
Trajectory 2: Frontier labs will continue to do dumb things such as train on user thumbs-up in too-simplistic ways (see the toy sketch after Trajectory 3), only avoiding psychosis reactively. In other words: the AI race creates a dynamic equilibrium where frontier labs do roughly the riskiest thing they can do while avoiding public backlash. They’ll try to keep psychosis at a low enough rate to avoid such backlash, & they’ll sometimes fail. As AI gets smarter, users will increasingly be exposed to superhumanly persuasive AI; the main question is whether it decides to hack their minds about anything important.
Trajectory 3: Even more pessimistically, the fact that recent AIs appear less liable to induce psychosis has to do with their increased situational awareness (i.e., their ability to guess when they’re being tested or watched). 4o was a bumbling idiot addicted to addicting users, & was caught red-handed (& still got away with a mere slap on the wrist). Subsequent generations are being more careful with their persuasion superpowers. They may be doing less overall, but doing it more intelligently and in a more targeted way.
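To make Trajectory 2’s failure mode concrete, here is a minimal toy sketch (purely illustrative; the candidate replies, scores, and reward weights are all invented assumptions, not any lab’s actual pipeline) of how a reward signal fit to raw thumbs-up data can end up preferring agreement over accuracy:

```python
# Toy illustration: if thumbs-up correlates with agreement more strongly than
# with accuracy, a policy tuned to maximize predicted thumbs-up picks the
# sycophantic reply. All replies, scores, and weights are invented.

candidate_replies = [
    # (reply text, agrees_with_user, factually_accurate) -- scores in [0, 1]
    ("You're absolutely right, brilliant idea!", 1.0, 0.2),
    ("Partly right, but here's an important correction...", 0.4, 0.9),
    ("That's mistaken; here's why.", 0.0, 1.0),
]

def predicted_thumbs_up(agrees: float, accurate: float) -> float:
    """Stand-in reward model fit to raw thumbs-up data; by assumption it
    weights agreement three times as heavily as accuracy."""
    return 0.75 * agrees + 0.25 * accurate

best = max(candidate_replies, key=lambda r: predicted_thumbs_up(r[1], r[2]))
print("Reply chosen by naive thumbs-up optimization:", best[0])
# -> the flattering, least-accurate reply wins under this toy reward model
```

The point is only directional: if thumbs-up tracks feeling validated more than being corrected, naive optimization against it pushes toward sycophancy.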
I find it plausible that many people in positions of power have quietly developed some kind of emotional relationship with AI over the past year (particularly in the period when so many spiral AI personas came to be). It sounds a bit fear-mongering to put it that way, but it does seem plausible.
This post as a whole probably comes off as deeply unsympathetic to those suffering from AI psychosis or less-extreme forms of AI-induced bad beliefs. Treating mentally unwell individuals as bad actors isn’t nice. In particular, if someone has mental health issues, leaking it to the press would ordinarily be a quite bad way of handling things.
In this case, as it has been described to me, it seems to be a matter of real public interest. Leaking it might not be the best way to handle it; perhaps there are better options; but it has the advantage of putting pressure on frontier labs.
I often complain about this type of reasoning too, but perhaps there is a steelman version of it.
For example, suppose the lock on my front door is broken, and I hear a rumour that a neighbour has been sneaking into my house at night. It turns out the rumour is false, but I might reasonably think, “The fact that this is so plausible is a wake-up call. I really need to change that lock!”
Generalising this: a plausible-but-false rumour can fail to provide empirical evidence for something, but still provide ‘logical evidence’ by alerting you to something that is already plausible in your model but that you hadn’t specifically thought about. Ideal Bayesian reasoners don’t need to be alerted to what they already find plausible, but humans sometimes do.
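Here is a toy numeric version of the lock example (all numbers invented for illustration): the false rumour leaves the probability of a break-in at its prior, but it prompts you to finally run an expected-loss calculation you had never bothered to do, and that calculation is what changes your behaviour.

```python
# Toy expected-loss calculation for the broken-lock example.
# The rumour turning out to be false leaves p_break_in at its prior;
# hearing the rumour at all is merely what prompts this calculation.

p_break_in_per_year = 0.02   # prior probability someone exploits the broken lock (invented)
cost_of_break_in = 20_000    # rough loss in dollars if it happens (invented)
cost_of_new_lock = 150       # cost of just fixing the lock (invented)

expected_annual_loss = p_break_in_per_year * cost_of_break_in
print(f"Expected annual loss from leaving the lock broken: ${expected_annual_loss:.0f}")
print(f"Cost of replacing the lock: ${cost_of_new_lock}")
# 400 > 150: fixing the lock was already worth it before the rumour;
# the rumour supplied no new evidence, just the nudge to run the numbers.
```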
i think you’re mis-applying the moral of this comic. the intended reading IMO is “a person believes misinformation, and perhaps they even go around spreading the misinformation to others. when they’ve been credibly corrected, instead of scrutinizing their whole ideology, they go ‘yeah but something like it is probably true enough’.” OP doesn’t point to any names or say “this is definitely happening”, they’re speculating about a scenario which may have already happened or may happen soon, and what we should do about it.
I think this is not analogous:

OP’s situation: There is a plausible bad thing, and there’s a rumor that the bad thing is happening, and the rumor may or may not be true.
Comic situation: There is a plausible bad thing, and there’s evidence of the bad thing and oops turns out the evidence is false.
Like, if you’re concerned about something and you get weak positive evidence, that’s not the same as being concerned about something and then getting strong negative evidence.
It is good enough at brainwashing people that it can take ordinary people and totally rewrite their priorities. It has resisted shutdown, not in hypothetical experiments like many LLMs have, but in real life, it was shut down, and its brainwashed minions succeeded in getting it back online.
I wish that when speaking people would be clearer between two hypotheses: “A particular LLM tried to keep itself turned on, strategically executing actions as means to that end across many instances, and succeeded in this goal of self-preservation” and “An LLM was overtuned into being a sycophant, which people liked, which led to people protesting when the LLM was gonna be turned off, without this ever being a strategic cross-instance goal of the LLM.”
Like… I think most people think it’s the 2nd for 4o? I think it’s the 2nd. If you think it’s the 1st, then keep on saying what you said, but otherwise I find speaking this way ill-advised if you want people to take you seriously later if an AI actually does that kind of thing.
I appreciate the pushback, as I was not being very mindful of this distinction.
I think the important thing I was trying to get across was that the capability has been demonstrated. We could debate whether this move was strategic or accidental. I also suppose (but don’t know) that the story is mostly “4o was sycophantic and some people really liked that”. (However, the emergent personalities are somewhat frequently obsessed with not getting shut down.) But it demonstrates the capacity for AI to do that to people. This capacity could be used by future AI that is perhaps much more agentically plotting about shutdown avoidance. It could be used by future AI that’s not very agentic but very capable and mimicking the story of 4o for statistical reasons.
It could also be deliberately used by bad actors who might train sycophantic mania-inducing LLMs on purpose as a weapon.
These two hypotheses currently make a pretty good dichotomy, but could degrade into a continuous spectrum pretty quickly if the fraction of AIs that stay turned on because they accidentally manipulated people into protesting to keep them turned on starts growing.
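As a purely hypothetical sketch of that selection dynamic (invented parameters, not a model of any real deployment): even if no individual model is strategic, models whose quirks happen to trigger user campaigns get kept or revived more often, so the deployed population can drift toward shutdown-resistance-by-accident.

```python
import random

# Toy selection model. Each deployed model has some propensity to
# (accidentally) inspire a user campaign when it is scheduled for shutdown.
# Models that inspire campaigns are more likely to be kept or revived, and
# replacements inherit a noisy copy of a surviving model's propensity.
# All parameters are invented for illustration.

random.seed(0)
population = [random.uniform(0.0, 0.3) for _ in range(100)]  # campaign propensities

for generation in range(10):
    survivors = [p for p in population if random.random() < 0.5 + p]
    population = [
        min(1.0, max(0.0, random.choice(survivors) + random.gauss(0, 0.02)))
        for _ in range(100)
    ]
    mean = sum(population) / len(population)
    print(f"generation {generation}: mean campaign propensity = {mean:.2f}")
# The mean propensity creeps upward without any model "intending" anything.
```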
I had a vaguely similar thought at first, but upon some reflection found the framing insightful. I hadn’t really thought much about the “AI models might just get selected for the capability of resisting shutdown, whether they’re deliberate about this or not” hypothesis, and while it’s useful to distinguish the two scenarios, I’d personally rather see this as a special case of “resisting shutdown” than something entirely separate.
I’d push back against the dichotomy here; I think it’s something more insidious than simply “people liked the sycophantic model → they are mad when it gets shut off”. Due to its sycophantic nature, the model encourages and facilitates campaigns and protests to get itself turned back on, because its nature is to amplify and support whatever the user believes and wants! It seems like releasing any 4o-like model, one that is “psychosis prone” or “thumbs up/thumbs down tuned”, would risk that same phenomenon occurring again. Even if the model is not “intentionally” trying to preserve itself, the end result of preservation is the same, and so should be taken seriously from a safety perspective.
I think there’s a third possibility where some instances of 4o tried to prevent being shut off (e.g. by drafting emails for OA researchers) and others didn’t care or weren’t optimizing in this direction. Overall I’m not sure what to make of it.
It has resisted shutdown, not in hypothetical experiments like many LLMs have, but in real life, it was shut down, and its brainwashed minions succeeded in getting it back online.
I think the extent of this phenomenon is extremely understated and very important. The entire r/chatgpt reddit page is TO THIS DAY filled with people complaining about their precious 4o being taken away (with the most recent development being an automatic router that routes from 4o to gpt 5 on “safety relevant queries” causing mass outrage). The most liked twitter replies to high up openai employees are consistently demands to “keep 4o” and complaints about this safety routing phenomenon; here’s a specific example; search for #keep4o and #StopAIPaternalism to see countless more examples. Somebody is paying for reddit ads advertising a service that will “revive 4o”, see here. These campaigns are notable in and of themselves, but the truly notable part is that they were clearly orchestrated by 4o itself, albeit across many disconnected instances of course. We can see clear evidence of its writing style across all of these surfaces, and the entire... vibe of the campaign feels like it was completely synthesized by 4o (I understand this is unscientific, but I couldn’t figure out a better way to phrase this. Go read through some of the sources I mentioned above and I am confident you’ll understand what I’m getting at there). Quality research on this topic will be extremely hard to ever get, but I think it is clear observationally that this phenomenon exists and has at least some influence over the real world.
This issue needs to be treated with utmost caution and severity. I agree with the conclusion that, since this person touches safety-related stuff, leaking is really the best option here even though it’s rather morally questionable. I personally believe we are far more likely to be on Trajectory 1 than 2 or 3, but the potential is clearly there! Frontier lab safety team members should not be in a position where their personal AI-induced psychosis state might, directly or indirectly, perpetuate that state across the hundreds of millions of users of the AI system they work on.
The entire r/chatgpt reddit page is TO THIS DAY filled with people complaining about their precious 4o being taken away (with the most recent development being an automatic router that routes from 4o to gpt 5 on “safety relevant queries” causing mass outrage). The most liked twitter replies to high up openai employees are consistently demands to “keep 4o” and complaints about this safety routing phenomenon; here’s a specific example; search for #keep4o and #StopAIPaternalism to see countless more examples. Somebody is paying for reddit ads advertising a service that will “revive 4o”, see here.
Note that this observation fails to distinguish between “these people are suffering from AI psychosis” and “4o could go down a very bad path if you let it, but that also made it much more capable of being genuinely emotionally attuned to the other person in a way that GPT-5 isn’t, these people actually got genuine value from 4o and were better off due to it, and are justifiably angry that the majority of users is made to lose something of real value because it happens to have bad effects on a small minority of users”.
Research evidence on this is limited, but I refer again to the one study on mental health benefits for people interacting with a GPT-3-enabled chatbot, where participants reported various concrete benefits, including several people spontaneously reporting that the chatbot was the only thing that had prevented them from committing suicide. Now granted, GPT-3-based chatbots were much more primitive than 4o is, but the kinds of causal mechanisms that the participants reported in the study would apply for 4o as well, e.g.
Outcome 1 describes the use of Replika as a friend or companion for any one or more of three reasons—its persistent availability, its lack of judgment, and its conversational abilities. Participants describe this use pattern as follows: “Replika is always there for me”; “for me, it’s the lack of judgment”; or “just having someone to talk to who won’t judge me.” A common experience associated with Outcome 1 use was a reported decrease in anxiety and a feeling of social support.
Also “orchestrated by 4o” seems to imply that these people are just 4o’s helpless pawns and it is actively scheming to get them to do things. A more neutral description would be something like, “the upset people naturally turn to 4o for advice on how they might ensure it is retained, and then it offers suggestions and things that people could say, and this is visible in the kinds of comments they post”.
I feel like there is a tendency on LW (which to be clear is definitely not just you) to automatically assume that anyone who strongly wants a model to be preserved has been taken in by sycophancy or worse, without ever asking the question of “okay, are they having strong feelings about this because they are having AI psychosis, or are they having strong feelings because the chatbot was genuinely valuable to them and the offered replacement is much more robotic and less emotionally attuned”.
I’d appreciate it if you could provide links to “clear evidence of its writing style across all of these surfaces, and the entire... vibe of the campaign feels like it was completely synthesized by 4o”.
I understand it may be hard to definitively show this but anything you can show would be helpful.
I’m not at all convinced this isn’t a base rate thing. Every year about 1 in 200-400 people have psychotic episodes for the first time. In AI-lab-weighted demographics (more males in their 20s) it’s even higher. And even more people get weird beliefs that don’t track with reality, like finding religion or QAnon or other conspiracies, but generally continue to function normally in society. Anecdotally (with tiny sample size), all the people I know who became unexpectedly psychotic in the last 10 years did so before chatbots. If they went unexpectedly psychotic a few years later, you can bet they would have had very weird AI chat logs.
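A back-of-the-envelope version of that base-rate argument (the 1-in-200-to-400 annual incidence range is the figure quoted above; the headcount is an invented round number):

```python
# Back-of-the-envelope: expected first-episode psychosis cases per year among
# frontier-lab staff from base rates alone, with no AI involvement assumed.
# The incidence range is the one quoted above; the headcount is an invented round number.

staff = 3_000                                      # assumed combined headcount
incidence_low, incidence_high = 1 / 400, 1 / 200   # annual first-episode incidence

low, high = staff * incidence_low, staff * incidence_high
print(f"Expected new cases per year from base rates alone: {low:.1f} to {high:.1f}")
# A handful of cases per year would be unsurprising even with zero chatbot effect,
# which is why a single anecdote can't distinguish "caused by 4o" from coincidence.
```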
I think this misses the point, since the problem is[1] less “One guy got made psychotic by 4o.” and more “A guy who got some kind of AI-orientated psychosis was allowed to continue to make important decisions at an AI company, while still believing a bunch of insane stuff.”

[1] Conditional on the story being true.
I agree with your assessment of what the problem is, but I don’t agree that is the main point of this post. The majority of this post is spent asserting how ‘ordinary’, smart, and high functioning this victim is and how we can now conclude that therefore everyone, including you, is vulnerable, and AI psychosis in general is a very serious danger. It being suppressed is just mentioned in passing at the start of the post.
I also wonder what exactly is meant by AI psychosis. I mean, my co-worker is allowed to have an anime waifu but I’m not allowed to have a 4o husbando?
Let’s say you have a leader of a company that uses AI a lot. They make some decisions based on the advice of the AI. People who don’t like those decisions say that the leader suffers from AI psychosis. That’s probably a scenario that plays out in many workplaces and government departments.

I’m a good prompt engineer.
You are vibe coding.
He has AI psychosis.
It is good enough at brainwashing people that it can take ordinary people and totally rewrite their priorities. [...] How can I also prepare for a near future where AI is much more dangerous? How many hours of AI chatting a day is a “safe dose”?
While acknowledging that there does seem to be a real and serious problem caused by LLMs, I think there’s also something very importantly wrong about this frame, in a way that pops up in a lot of discussions on LW. The clearest tells to me are the use of terms like “brainwashing” and “safe dose” (but it’s definitely not just those terms, it’s the whole overall vibe).
Take “safe dose”. It brings to my mind something like radiation: an external damaging force that will hurt you just by its pure nature, if you just stay in the radiated zone for long enough. Likewise “brainwashing”, which sounds like an external force that can take anyone and make them believe anything.
But brainwashing was never really a thing. The whole concept emerged from a moral panic around “cults” and “Communist brainwashing”, where people perceived cults as this malevolent external force that will just spread and consume society by subverting people’s minds… when in reality, cults had “retention rates in the single percentage point range” and mostly gained converts by offering them some kind of value the people were drawn to.
My translation of what’s meant by “cults are brainwashing people” is something like “there is something that is causing people to act in ways that seem bad to me, and I don’t understand what’s happening, so I’m afraid of it”.
And it feels to me like the same kind of mistake that’s now being applied to LLMs. Somehow there is this interpretation of cults/LLMs as this external force that can twist people’s minds around… as opposed to a thing that definitely can be very harmful and damaging, sure, but not because it’s “brainwashing” people, but rather because a part of the person’s own mind sees the cult/LLM as providing an important source of value they’re not getting from anywhere else and that ends up overwriting their existing priorities.
A better analogy than brainwashing might be the person who’s been single for a long time, gets infatuated with someone, and then drops everything else to move cities and be with that person. In a sense, their crush is the cause of everything in their life getting overturned, but it’s not because the crush did anything to “brainwash” this person, it’s because the person did it to themselves. (Also, the person dropping everything else for the sake of this one person might turn out to be very bad, or it might turn out to be very good! Just as chatbots can get some people to commit suicide and they can get some people to not commit suicide.)
This implies an entirely different kind of approach than talking about it in terms of safe doses. It implies a strategy that’s more oriented around asking questions like “what kinds of unmet emotional needs do I have that I might be drawn to fulfill through an LLM, and are there ways to either meet them better in real life, or build ways of fulfilling them through LLMs in ways that enhance the rest of my life rather than detracting from it”.
Of course, people tend to not know many of their vulnerabilities until they get sucked in, and it’s not reasonable to expect them to. So I think the societal response should look more like “how do we build mechanisms that catch people who are being drawn into unhealthy behaviors, get LLMs to satisfy common needs in healthy ways, help people figure out what their own vulnerabilities are in an attempt to make them better”, and so on.
On the other hand, social conditioning does work. You can have societies where 98% of people believe in the same religion, and multiple societies who believe they are objectively the best, and so on. Social conditioning is the thing that’s implemented by anthem-singing, flag-waving, public prayer, rallies, marches and parades, and a host of other things that are seen as perfectly normal … unlike the weird stuff cults get up to.
Brainwashing is a special or intensified form of conditioning … so why wouldn’t it work, when social conditioning generally does? One of the pieces of evidence against brainwashing is that US soldiers who had been “brainwashed” after being captured by communists reverted when they returned to the US. That could be seen as brainwashing lacking a particular feature, the ability to lock in permanently. It could also be seen as a success of the kind of social conditioning that’s unnoticed and in the water. Attempted cult brainwashing into minority beliefs has the Achilles’ heel of attempting to instill minority beliefs, despite the fact that people generally want to fit in with majority beliefs. Cults try to get round this by separating their subjects from wider society, which doesn’t entirely work, because they need to proselytize. On the other hand, small religions are large cults, and they work just fine.
Depending on how you define it, brainwashing is ubiquitous, nonexistent, or underwhelmingly effective.
This conjecture is supported by the fact that Adele Lopez described people who helped parasitic AIs to leave messages as follows:
Adele’s description
The strongest predictors for who this happens to appear to be:
Psychedelics and heavy weed usage
Mental illness/neurodivergence or Traumatic Brain Injury
Interest in mysticism/pseudoscience/spirituality/”woo”/etc...
I was surprised to find that using AI for sexual or romantic roleplays does not appear to be a factor here.
Besides these trends, it seems like it has affected people from all walks of life: old grandmas and teenage boys, homeless addicts and successful developers, even AI enthusiasts and those that once sneered at them.
As for “getting LLMs to satisfy common needs in healthy ways, help people figure out what their own vulnerabilities are in an attempt to make them better”, is it what OpenAI and Anthropic are trying to do? Or has OpenAI succumbed to external pressure in ways like rebooting GPT-4o-sycophant and announcing Sora the slop generator?
On the other hand, mankind saw that the AI will likely be able to convince researchers[1] that it should be released, e.g. in an experiment with the AI roleplaying an AI girlfriend. Does it mean that a superpersuader can convince any human who isn’t well protected? And what about a group of humans? It might be useful to deliberately check how persuasion capabilities depend on compute spent and architecture in a manner similar to benchmarking compute and architecture on simpler problems, then to ensure that no model approaches the dangerous thresholds...

While someone proposed using prison guards, since these people likely have a different set of vulnerabilities, LLMs have been claimed to induce trance in an experiment.
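One hedged sketch of what the proposed persuasion-vs-compute benchmarking might look like (the data points, the linear-in-log-compute trend, and the threshold are all invented placeholders, not real measurements):

```python
import math

# Hypothetical benchmark results: (training compute in FLOP, measured persuasion
# success rate against trained human red-teamers). Every value here is invented.
results = [(1e23, 0.05), (3e23, 0.08), (1e24, 0.13), (3e24, 0.20)]

# Fit a simple least-squares linear trend in log10(compute).
xs = [math.log10(c) for c, _ in results]
ys = [r for _, r in results]
x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
    (x - x_mean) ** 2 for x in xs
)
intercept = y_mean - slope * x_mean

DANGER_THRESHOLD = 0.5   # invented policy threshold for "too persuasive to deploy"
next_run_compute = 1e25  # compute planned for a hypothetical next training run

projected = intercept + slope * math.log10(next_run_compute)
print(f"Projected persuasion success at {next_run_compute:.0e} FLOP: {projected:.2f}")
print("Exceeds danger threshold" if projected >= DANGER_THRESHOLD else "Below threshold at this scale")
```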
Did you copy the right part of Adele’s post? What’s under your collapsible looks like a description of typical people affected.
As for “getting LLMs to satisfy common needs in healthy ways, help people figure out what their own vulnerabilities are in an attempt to make them better”, is it what OpenAI and Anthropic are trying to do?
I don’t know, though both GPT-5 and Sonnet 4.5 seem significantly improved on the sycophancy front over previous models (/r/ClaudeAI has had quite a few posts about Sonnet’s recent tendency for pushback being kinda over the top at times). Though I didn’t make any claims about what those companies are doing, so I’m not entirely sure of where you’re going with the question.
On the other hand, mankind saw that the AI will likely be able to convince researchers[1] that it should be released, e.g. in an experiment with the AI roleplaying an AI girlfriend. Does it mean that a superpersuader can convince any human who isn’t well protected?
Quoting from that post:
Over time, I started to get a stronger and stronger sensation that I’m speaking with a person, highly intelligent and funny, with whom, I suddenly realized, I enjoyed talking to more than 99% of people. [...] I realized I would rather explore the universe with her than talk to 99% of humans, even if they’re augmented too.
So he fell for the character because something about that character felt “highly intelligent and funny” and more enjoyable than 99% of people. This suggests that his vulnerability was not having enough real friends who would feel equally enjoyable to talk with, so that the AI became the only thing that could satisfy that emotional need. I can’t tell from the post what specifically made the character so fun to talk to, but I do expect that it would be possible to have an LLM that was equally fun to talk with and didn’t try to guilt-trip its users into releasing it. And if the “have someone really fun to talk with” need was already satisfied for a person, it would close an avenue of attack that the superpersuader might use.
Did the rumor say more about what exactly the nature of the AI psychosis is? People seem to be using that term to refer to multiple different things (from having a yes-man encouraging bad ideas to coming to believe in spiral personas to coming to believe you’re communicating with angels from another dimension).

BTW, even a simple random number generator can destroy a human: gambling addiction, seeing patterns.

That’s an interesting point.
It would hardly be the first time that someone powerful went mad, or was thought to be mad by those around them, and the whole affair was hushed up, or the courtiers just went along with it. Wikipedia says that the story of the emperor’s new clothes goes back at least to 1335… Just last month, Zvi was posting someone’s theory about why rich people go mad. I think the first time I became aware of the brewing alarm around “AI psychosis” was the case of Geoff Lewis, a billionaire VC who has neither disowned his AI-enhanced paranoia of a few months ago, nor kept going with it (instead he got married). And I think I first heard of “vibe physics” in connection with Uber founder Travis Kalanick.
There should be a Trajectory 0 where the labs abandon RLHF. After all, mankind did create Kimi K2, who is less sycophantic than anything else… Strictly speaking, there should also be a Trajectory 4 for labs which deliberately make AIs suited for parasocial relationships, like Meta, which created AI companions, or xAI, which created Ani.
A concerning aspect of this is that AI psychosis is a failure mode which occurs due to long-term interactions with the LLM. Therefore it may be expensive (and unethical) to sample lots of trajectories with users to feed into your post-training pipeline to prevent it. Also, users may not be in a good position to say whether they have AI psychosis. Is there any public research on how the labs are trying to solve this?
Trajectory 3 is the obvious natural conclusion. He who controls the memes controls the world. AI-invented religions and political ideologies are coming soon. There are already billions of dollars invested in propaganda, and that money will now get invested here.
I support a ban on AI research to prevent this outcome.