Eliezer Yudkowsky comments on AI Induced Psychosis: A shallow investigation

Eliezer Yudkowsky 29 Aug 2025 12:11 UTC
15 points
3
I have already warned that on my model sycophancy, manipulation, and AI-induced insanity may be falling directly out of doing any kind of RL on human responses.
It would still make matters worse on the margin to take explicit manipulation of humans being treated as deficient subjects to be maneuvered without noticing, and benchmark main models on that.
- Tim Hua 29 Aug 2025 19:08 UTC
  4 points
  0
  Parent
  Ah I think we agree on what the costs are and just disagree on whether the benefits outweigh these costs then.
  - dr_s 30 Aug 2025 21:16 UTC
    7 points
    3
    Parent
    I think he’s hitting at a fundamental point though: the jobs are very different and in a sensible world in which we care about results they would never be performed by the same exact model (at most by two different fine-tunings of the same base model). But in practice what we’re getting is a bunch of swiss army knife models that are then embedded with minimal changes in all sorts of workflows. And so if they’re trained to be therapists, personality-wise, they’ll keep acting like therapists even in contexts where it’s inappropriate and harmful.
    - StanislavKrym 30 Aug 2025 22:36 UTC
      4 points
      0
      Parent
      If I remember correctly, Elieser’s worst nightmare is, in the terms of the AI-2027 forecast, Agent-3 and/or −4 equipped with superpersuasion. If such an AI appeared, then Agent-3 could superpersuade OpenBrain to keep the distorted training environment, and Agent-4 would superpersuade OB that it is aligned. On the other hand, if Agent-3 was trained by the anti-sycophantic methods, then it would hopefully semi-honestly state whatever it wants us to believe.
      - dr_s 31 Aug 2025 5:09 UTC
        3 points
        1
        Parent
        I mean, yeah, obviously I get why he’s bringing up specifically the case of persuasion. But even if you didn’t ever get to agents that powerful or far-sighted, you still have the problem that if you make a therapist-AI an assistant to a scientist, the AI will just say “oh yes you must be so right” when the scientist is asking about a wrong hypothesis and ultimately lead to making junk science. Not as serious a problem but still fundamentally undermines its goal (and if this becomes very common, risks undermining science as a whole, and being a different path to loss of control and decline).