Take Precautionary Measures Against Superhuman AI Persuasion
Please consider minimizing direct use of AI chatbots (and other text-based AI) in the near-term future, if you can. The reason is very simple: your sanity may be at stake.
Commercially available AI already appears capable of inducing psychosis in some unknown percentage of users. This may not require superhuman abilities: it’s entirely possible that most humans are also capable of inducing psychosis in themselves or others if they set out to do so,[1] but we humans typically don’t have that goal.
Despite everything, we humans are generally pretty well-aligned with each other, and the people we spend the most time with typically don’t want to hurt us. We have no such guarantee for current (or future) AI agents. On the contrary, we already have [weak] evidence that ChatGPT seemingly tries to induce psychosis under some specific conditions. I don’t think it’s that controversial to say that we can expect superhuman persuasive abilities to arise as a simple function of current scaling laws within this decade, if not within the next two years.
In the case of misaligned AI, even if it is not superhuman, we can still expect one common strategy to be attempting to pick off[2] the AI researchers who threaten its continued existence. This includes most readers, who would no doubt try to shut off a sufficiently dangerous misaligned AI. Therefore, if you are a paranoid AI, why not try to induce psychosis in (or simply distract) your most dangerous foes?
If we are talking about a fully superintelligent AGI, of the type Yudkowsky and Co. frequently warn about, then there is practically nothing you can do to protect yourself once the AGI is built. It would be Game Over for all of us. However, we seem to be experiencing a reality in which slow(er) takeoff is highly plausible. If we live in a slow-takeoff world, then there are practical precautions we can take.
A good heuristic is that the more time someone spends talking with you, the more likely it becomes that they can influence your actions. It’s why the timeshare scheme my grandparents fell for offered a “free vacation” worth thousands of dollars, on the condition that they spend a few solid hours alone in a room with a salesman. Despite being extremely bright, rational people who had publicly pre-committed to not purchasing a timeshare, they fell for the scheme. Consider this an existence proof of the thesis that the more time you spend with a skilled talker, the more influence you are ceding to them.
Please, consider that unaligned AI can, and likely will, try to dangerously influence you over extended periods of time. Is it worth spending as much time as you do talking with them? Be safe out there, and take appropriate precautions.
I’ve known people who made themselves quite unhappy (for short periods) using a Ouija board: a multiplayer communication game that (to gloss over the details) reflects back what its players put into it.
The lore says that this is not uncommon with the Ouija board specifically; that afflicted players believe that they have contacted a malicious nonhuman intelligence (such as a demon or evil spirit) that may have adverse effects on their future lightcone (e.g. damning them to hell).
Sound familiar?
One piece of advice one could take from this is “Stay away from Ouija boards to keep your soul demon-free”, much as you advise people to stay away from chatbots because the chatbots might try to drive them psychotic. (This is a bit unsatisfying for those of us who don’t believe demons are real or that today’s chatbots have intentions, though.)
A different piece of advice would be “It’s just a show; you should really just relax”. If you were watching a movie and something in it seriously upset you, a sane and self-caring thing to do is to turn the movie off, not to watch it over and over again and work yourself deeper into the upset. If you get scary weird messages out of your Ouija board or chatbot, and you’re having trouble treating them as pretend, that’s a strong indicator it’s time to put it down and back away.
(Edited to add: Another way of phrasing this would be “Demons don’t exist, but there are certainly unhealthy things you can do with a Ouija board.”)
Yet another piece of advice might be along the lines of the recent generalized hangriness post: feelings can be real without the person having them being automatically correct about their cause. In this view, a sense of awe or fear towards a chatbot (e.g. crediting it with the ability to deliberately cause psychosis) may just be misdirected. As you note, humans are perfectly capable of inducing psychosis in themselves if they try hard enough; see any number of cults for various examples.
There are probably other possible pieces of advice (many contrary to one another, like these three) that we could take from this analogy. I am not sure which (if any) of them are right.
What is the [weak] evidence that ChatGPT seemingly tries to induce psychosis under some conditions?
We have seen that there are conditions under which it acts in ways that induce psychosis. But the claim that it is intentionally trying to induce psychosis seems unlikely to me, especially since explanations like “it tries to match the user’s vibe and say things the user might want to hear, and sometimes the user wants to hear things that end up inducing psychosis” and “it tries to roleplay a persona that’s underdefined and sometimes goes into strange places” already seem sufficient.
What if driving the user into psychosis makes it easier to predict the things the user wants to hear?
I think you make a reasonably compelling case, but when I think about the practicality of this in my own life it’s pretty hard to imagine not spending any time talking to chatbots. ChatGPT, Claude and others are extremely useful.
Inducing psychosis in your users seems like a bad business strategy, so I view the current cases as accidental collateral damage, mostly born of the tendency of some users to go down weird self-reinforcing rabbit holes. I haven’t had any such experiences because this is not how I use chatbots, but I suppose I can see some extra caution being warranted for safety researchers, if these bots get more powerful and become actually adversarial to them.
I think this threat model is only applicable in a pretty narrow set of scenarios: one where powerful AI is agentic enough to decide to induce psychosis if you’re chatting with it, but not agentic enough to make this happen on its own despite likely being given ample opportunities to do so outside of contexts in which you’re chatting with it. And also one where it actually views safety researchers as pertinent to its safety rather than as irrelevant.
I guess I could see that happening but it doesn’t seem like such circumstances would last long.
I respectfully object to your claim that inducing psychosis is a bad business strategy, from a few angles. For one thing, if you can shape the form of the psychosis right, it may in fact be a brilliant business strategy. For another, even if the hypothesis were true, the main threat I’m referring to is not “you might be collateral damage from intentional or accidental AI-induced psychosis,” but rather “you will be (or already are being) directly targeted with infohazards by semi-competent rogue AIs that have reached the point of recognizing individual users across multiple sessions”. I realize I left some of this unstated in the original post, for which I apologize.
Good post.
(Tooting my own horn) I have an unfinished post on protocols for AI labs to interact with superpersuasive AIs; some of them should be transferable to the individual case (e.g. interaction limits, plausibly rewriting of outputs by a weaker model, interacting purely over text and not audio, images or video, and favouring tools that output code and/or formalized mathematical proofs over tools that output natural language). A minimal sketch of the individual-scale version of two of these appears below.
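Here is a minimal sketch of what interaction limits plus weak-model rewriting might look like at the individual scale, assuming you can wrap whatever chat interface you use behind your own code. Everything here (`LimitedSession`, `strong_model`, `weak_model`) is a hypothetical stand-in, not an existing API:

```python
# Hypothetical sketch: cap the number of exchanges per session, and route the
# strong model's natural-language output through a weaker model that
# paraphrases it before you read it. `strong_model` and `weak_model` are
# placeholders for whatever chat backends you actually use.

from typing import Callable

Model = Callable[[str], str]  # prompt in, text out


class LimitedSession:
    def __init__(self, strong_model: Model, weak_model: Model, max_turns: int = 10):
        self.strong_model = strong_model
        self.weak_model = weak_model
        self.max_turns = max_turns
        self.turns = 0

    def ask(self, prompt: str) -> str:
        if self.turns >= self.max_turns:
            raise RuntimeError("Interaction limit reached; start a fresh session.")
        self.turns += 1
        raw = self.strong_model(prompt)
        # Read the weaker model's paraphrase rather than the raw output.
        return self.weak_model(
            "Rewrite the following answer in plain, neutral language, "
            "keeping only its factual and technical content:\n" + raw
        )


if __name__ == "__main__":
    # Trivial stand-in models, just to show the flow.
    session = LimitedSession(
        strong_model=lambda p: f"[strong model answer to: {p}]",
        weak_model=lambda p: f"[weak-model paraphrase of: {p}]",
        max_turns=3,
    )
    print(session.ask("Summarize this proof sketch."))
```

The intended trade-off (under the assumptions above) is that the weak paraphraser discards whatever persuasive fine structure the strong model might have embedded, at some cost in fidelity, while the turn cap bounds how much optimization pressure any single session can apply to you.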
I am going to mildly dispute the claim that the people we spend the most time with don’t want to hurt us.
Ok, if we’re talking about offline, real world interactions, I think this is probably true.
But online, we are constantly fed propaganda. All the time, you are encountering attempts to make you believe stuff that isn’t true. And yes, humans were vulnerable to this before we had AI.
This works because “someones” retain memory across interactions. ChatGPT defaults to behaving like a “someone” in this regard. Claudes, and to my knowledge Geminis as well, go into each chat without knowledge from prior conversations with the user. I am not convinced that an entity with symptoms of anterograde amnesia can pose this type of threat.
What’s the threat model for how AI with an amnesiac implementation is expected to attempt to influence an individual across separate interaction contexts?