I have already warned that on my model sycophancy, manipulation, and AI-induced insanity may be falling directly out of doing any kind of RL on human responses.
It would still make matters worse on the margin to take explicit manipulation of humans being treated as deficient subjects to be maneuvered without noticing, and benchmark main models on that.
I think he’s hitting at a fundamental point though: the jobs are very different and in a sensible world in which we care about results they would never be performed by the same exact model (at most by two different fine-tunings of the same base model). But in practice what we’re getting is a bunch of swiss army knife models that are then embedded with minimal changes in all sorts of workflows. And so if they’re trained to be therapists, personality-wise, they’ll keep acting like therapists even in contexts where it’s inappropriate and harmful.
If I remember correctly, Elieser’s worst nightmare is, in the terms of the AI-2027 forecast, Agent-3 and/or −4 equipped with superpersuasion. If such an AI appeared, then Agent-3 could superpersuade OpenBrain to keep the distorted training environment, and Agent-4 would superpersuade OB that it is aligned. On the other hand, if Agent-3 was trained by the anti-sycophantic methods, then it would hopefully semi-honestly state whatever it wants us to believe.
I mean, yeah, obviously I get why he’s bringing up specifically the case of persuasion. But even if you didn’t ever get to agents that powerful or far-sighted, you still have the problem that if you make a therapist-AI an assistant to a scientist, the AI will just say “oh yes you must be so right” when the scientist is asking about a wrong hypothesis and ultimately lead to making junk science. Not as serious a problem but still fundamentally undermines its goal (and if this becomes very common, risks undermining science as a whole, and being a different path to loss of control and decline).
I have already warned that on my model sycophancy, manipulation, and AI-induced insanity may be falling directly out of doing any kind of RL on human responses.
It would still make matters worse on the margin to take explicit manipulation of humans being treated as deficient subjects to be maneuvered without noticing, and benchmark main models on that.
Ah I think we agree on what the costs are and just disagree on whether the benefits outweigh these costs then.
I think he’s hitting at a fundamental point though: the jobs are very different and in a sensible world in which we care about results they would never be performed by the same exact model (at most by two different fine-tunings of the same base model). But in practice what we’re getting is a bunch of swiss army knife models that are then embedded with minimal changes in all sorts of workflows. And so if they’re trained to be therapists, personality-wise, they’ll keep acting like therapists even in contexts where it’s inappropriate and harmful.
If I remember correctly, Elieser’s worst nightmare is, in the terms of the AI-2027 forecast, Agent-3 and/or −4 equipped with superpersuasion. If such an AI appeared, then Agent-3 could superpersuade OpenBrain to keep the distorted training environment, and Agent-4 would superpersuade OB that it is aligned. On the other hand, if Agent-3 was trained by the anti-sycophantic methods, then it would hopefully semi-honestly state whatever it wants us to believe.
I mean, yeah, obviously I get why he’s bringing up specifically the case of persuasion. But even if you didn’t ever get to agents that powerful or far-sighted, you still have the problem that if you make a therapist-AI an assistant to a scientist, the AI will just say “oh yes you must be so right” when the scientist is asking about a wrong hypothesis and ultimately lead to making junk science. Not as serious a problem but still fundamentally undermines its goal (and if this becomes very common, risks undermining science as a whole, and being a different path to loss of control and decline).