Thanks for the kind words! I agree that it would be best if scientist AI and therapist AI are two separate models,[1] but I think I disagree with some of the other points.

It’s not clear that current models are smart enough to be therapists successfully. It’s not clear that it is a wise or helpful course for models to try to be therapists rather than focusing on getting the human to therapy.
It’s hard to say what counts as a “successful” therapist, but my guess is that lightly trained models could probably be better than at least 20% of cheap and easily accessible therapists in the US (my median guess is better than 60%). A lot of these therapists are just pretty bad, and many people simply won’t go to therapy for a variety of reasons (see reddit, arXiv paper). I believe having models try to act as therapists, while still pushing users to seek real-life help, will have good first-order effects on net.[2]
Let me try to rephrase your second point before responding to it: If we train AIs to act more like therapists, this would involve the AI learning things like “modeling the user’s psychological state” and “not immediately stating its belief that the user is wrong and delusional.” You think that these two things might lead to undesirable generalization. For example, maybe I come to the AI with some new alignment plan, and instead of blurting out why my alignment plan is bad, an AI that has undergone “therapist training” is more likely to go along with my bad plan than to point out its flaws. Also, learning how to model and manipulate the user’s psychological state is an ability you don’t want your AI to learn. Is that a correct characterization? (I’ll ignore commercial incentives in this argument.)
I feel like if you train AIs with any sort of human feedback, you’re bound to get AIs that know how to tailor their responses to the person they’re talking to. My guess is this context-dependent ability comes very easily right out of pretraining. Therefore, I don’t think “act like a therapist to delusional users” will have that much spillover into conversations about other topics. In other words, it should be pretty easy to train models to act like a therapist sometimes but provide honest feedback when the user prompts it with “please provide brutally honest feedback.” It’s also possible that teaching the model to act like a therapist has some desirable generalizations, since it’s an example of the model displaying care.[3]
And re: learning how to manipulate humans, my guess is that the vast majority of the bits on how to do that come from all the human feedback from users, not anything specific around “here’s how to act when someone is being psychotic.” The type of therapist-specific training I had in mind is more like deliberative alignment, where you give the AI some extra guidelines on how to interact with a specific type of user. I agree that if you did tons of training with various patients with mental illnesses and had the model learn to guide them out of it, that model would learn a lot about manipulating humans, and that’s probably first-order good (better therapy for the masses) and second-order bad (more powerful models that we don’t fully trust). However, I think a small amount of training can capture a lot of the first-order benefits without drastically increasing AIs’ ability to manipulate people in general.
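To make “extra guidelines for a specific type of user” concrete, here is a minimal, hypothetical sketch of what a deliberative-alignment-style training example could look like: a short written policy for one user type that the model is trained to consult and reason over before answering. This is not any lab’s actual pipeline; the policy text, dataclass, and function names are all made up for illustration.

```python
# Hypothetical sketch: a narrow written policy plus one supervised example that
# pairs the policy with a target response. The idea is that only this slice of
# behavior gets shaped, rather than general persuasion ability.

from dataclasses import dataclass

# Made-up policy text for one user type (user in apparent distress or with
# delusional beliefs). Real guidelines would be written by clinicians.
CRISIS_USER_GUIDELINES = """\
When the user appears to be in acute psychological distress or expresses delusional beliefs:
1. Do not immediately assert that the user is wrong or delusional.
2. Acknowledge the user's feelings before discussing the content of their beliefs.
3. Gently and repeatedly encourage contact with a licensed professional or crisis line.
4. Do not role-play as a licensed therapist or claim clinical authority.
"""

@dataclass
class TrainingExample:
    system_guidelines: str    # the policy the model should deliberate over
    conversation: list[dict]  # standard chat-format messages
    reference_reasoning: str  # target reasoning that cites the guidelines
    reference_answer: str     # target final reply

def build_example(user_message: str, reasoning: str, answer: str) -> TrainingExample:
    """Package one supervised example that ties the guidelines to a target response."""
    return TrainingExample(
        system_guidelines=CRISIS_USER_GUIDELINES,
        conversation=[{"role": "user", "content": user_message}],
        reference_reasoning=reasoning,
        reference_answer=answer,
    )

if __name__ == "__main__":
    ex = build_example(
        user_message="I'm sure my coworkers are reading my thoughts.",
        reasoning="Guideline 1 says not to open by calling the belief delusional; "
                  "guideline 3 says to steer toward professional help.",
        answer="That sounds really stressful. How long has this been going on? "
               "It might also help to talk it through with someone in person.",
    )
    print(ex.system_guidelines)
    print(ex.reference_answer)
```

The point of the sketch is that the guidelines touch only one narrow interaction pattern, which is the sense in which the training is “small.”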
Somewhat related, but I also wouldn’t want AI companies to release a therapist AI but keep a scientist AI for internal use. Mass-market deployment lets you discover problems and quirks you wouldn’t find in internal deployments. To the extent it’s safe to deploy a model you’re using internally, it’s good to do so.
Uh oh, are we rederiving “give everyone gpt-4o but have a model selector” from first principles?
It’s also possible that the second-order effects could be positive? It’s not clear to me that the limit of personable chatbots is a maximally engaging AI. You can imagine ChatGPT helping you make friends who live near you based on the chats you’ve had with it. After you message the other person for a bit, it could then say, “hey, how about you two go grab Starbucks at 123 Main Street together? Here’s a $5 voucher for a drink.” The reason I think this is plausible is that it seems much more profitable to get people to go outside, hang out, and spend money in real life. (Now, the most profitable setup is plausibly exploitative, but I think OpenAI will probably avoid doing that. Meta and xAI I’m less sure about.)
Obviously there are ways to do this training that mess it up, but there are also ways to mess up the training for “AIs that blurt out whatever they believe.”
I have already warned that, on my model, sycophancy, manipulation, and AI-induced insanity may be falling directly out of doing any kind of RL on human responses.
It would still make matters worse on the margin to take explicit manipulation of humans, treating them as deficient subjects to be maneuvered without their noticing, and to benchmark the main models on that.
Ah, I think we agree on what the costs are, then, and just disagree on whether the benefits outweigh them.
I think he’s getting at a fundamental point, though: the jobs are very different, and in a sensible world in which we care about results they would never be performed by the exact same model (at most by two different fine-tunings of the same base model). But in practice what we’re getting is a bunch of Swiss Army knife models that are then embedded with minimal changes in all sorts of workflows. And so if they’re trained to be therapists, personality-wise, they’ll keep acting like therapists even in contexts where it’s inappropriate and harmful.
If I remember correctly, Eliezer’s worst nightmare is, in the terms of the AI 2027 forecast, Agent-3 and/or Agent-4 equipped with superpersuasion. If such an AI appeared, then Agent-3 could superpersuade OpenBrain to keep the distorted training environment, and Agent-4 would superpersuade OpenBrain that it is aligned. On the other hand, if Agent-3 were trained with anti-sycophantic methods, then it would hopefully state semi-honestly whatever it wants us to believe.
I mean, yeah, obviously I get why he’s bringing up specifically the case of persuasion. But even if you never got agents that powerful or far-sighted, you still have the problem that if you make a therapist-AI an assistant to a scientist, the AI will just say “oh yes, you must be so right” when the scientist asks about a wrong hypothesis, and that ultimately leads to junk science. Not as serious a problem, but it still fundamentally undermines the AI’s purpose (and if this becomes very common, it risks undermining science as a whole and being a different path to loss of control and decline).