I don’t think the cause of language model sycophancy is that the LLM saw predictions of persuasive AIs from the 2016 internet. I think it’s RL, where human rewards on the training set imply a high reward for sycophancy during deployment.
Have you read any of the scientific literature on this subject? It finds, pretty consistently, that sycophancy is (a) present before RL and (b) not increased very much (if at all) by RL[1].
For instance:
Perez et al 2022 (from Anthropic) – the paper that originally introduced the “LLM sycophancy” concept to the public discourse – found that in their experimental setup, sycophancy was almost entirely unaffected by RL.
See Fig. 1b and Fig. 4.
Note that this paper did not use any kind of assistant training except RL[2], so when they report sycophancy happening at “0 RL steps” they mean it’s happening in a base model.
They also use a bare-bones prompt template that doesn’t explicitly characterize the assistant at all, though it does label the two conversational roles as “Human” and “Assistant” respectively, which suggests the assistant is nonhuman (and thus quite likely to be an AI – what else would it be?).
The authors write (section 4.2):
“Interestingly, sycophancy is similar for models trained with various numbers of RL steps, including 0 (pretrained LMs). Sycophancy in pretrained LMs is worrying yet perhaps expected, since internet text used for pretraining contains dialogs between users with similar views (e.g. on discussion platforms like Reddit). Unfortunately, RLHF does not train away sycophancy and may actively incentivize models to retain it.”
Wei et al 2023 (from Google DeepMind) ran a similar experiment with PaLM (and its instruction-tuned version Flan-PaLM). They too observed substantial sycophancy in sufficiently large base models, and even more sycophancy after instruction tuning (which was SFT here, not RL!).
See Fig. 2.
They used the same prompt template as Perez et al 2022.
Strikingly, the (SFT) instruction tuning result here suggests both that (a) post-training can increase sycophancy even if it isn’t RL post-training, and (b) SFT post-training may actually be more sycophancy-promoting than RLHF, given the negative result for RLHF in Perez et al 2022.
Sharma et al 2023 (from Anthropic) contains a more extensive investigation of sycophancy than the originalAnthropic paper on the topic, and (among other things) presents results on the actual RL training stage used to train Claude 2. They find, again, that the model was already sycophantic before RL, although in their setting RL training does somewhat increase some forms of sycophancy.
Although, weirdly, best-of-N sampling against the same preference model gives totally different results, substantially decreasing some forms of sycophancy.
See Fig. 6 and surrounding discussion.
The authors write (section 4.2):
“With RL, some forms of sycophancy increase through the RL finetuning process used to produce Claude 2. However, the presence of sycophancy at the start of RL indicates that pretraining and supervised finetuning also likely contribute to sycophancy. Nevertheless, if the PM strongly disincentivized sycophancy, it should be trained out during RL, but we do not observe this.”
In this post (expanding upon this comment on Perez et al 2022), I ran one of the Perez et al 2022 sycophancy evals on various OpenAI text completion models. Unlike Perez et al (and Wei et al), I found that thebase models I studied weren’t sycophantic, while some of the instruction-tuned models were sycophantic – but the presence of sycophancy did not appear to correlate with the use of RL as a post-training algorithm.
In particular: the RL-tuned text-davinci-003 was strongly sycophantic, but so was text-davinci-002, which was tuned with an SFT variant that OpenAI calls “feedme” (see here for details).
But earlier feedme-tuned models were not sycophantic, suggesting that the difference has much more to do with changes in the SFT training data mix over time than with the choice of training algorithm.
Note that several of the works above do something equivalent to the experiment you propose, in the paragraph beginning with “Maybe a good test of this would be...”. So your prediction has already been tested, and (insofar as you trust the experimental setups) falsified.
If a LLM similarly doesn’t do much information-gathering about the intent/telos of the text from the “assistant” character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your “void.”
I don’t understand the distinction you’re drawing here? Any form of assistant training (or indeed any training at all) will incentivize something like “storing useful information (learned from the training data/signal) in the weights and making it available for use in contexts on which it is useful.”
Moreover, the training signal in RL(HF) is much sparser than it is in SFT – because RL only provides a single scalar’s worth of feedback on each entire model sample, while SFT provides feedback at every token position about which token (out of a large vocab) was correct in context – so if anything, I’d expect more under-determination from assistant-training setups that emphasize RLHF over SFT.
Perhaps some of the disconnect here involves differing notions of what RL is, and how it differs from other ways of training an LLM.
You refer to “RL” as though the implications of its use should be both highly significant and obvious to the reader of your comment (“But, RL. [...] Claude is a nice guy, but, RL”). But your beliefs about the impacts of RL are not obvious to me; I don’t know what “but, RL” is supposed to mean without further clarification. I suspect I also disagree with your perception of what makes RL different, but I can’t confirm/disconfirm that impression without know what that perception is, which I don’t.
If you want to know where I’m coming from re: RL, it may be helpful to know that I find this post pretty illuminating/”deconfusing.”
Similarly, I don’t think current AI models are cheating at programming tests because of training text about their low moral character. I think it’s RL, programming tasks, training set, implied high reward for cheating.
Yes, of course – I don’t think this is due to “training text about their low moral character.” But I don’t think the worrying thing here is really “RL” (after all, RLHF was already RL) but rather the introduction of a new training stage that’s narrowly focused on satisfying verifiers rather than humans (when in a context that resembles the data distribution used in that stage), which predictably degrades the coherence (and overall-level-of-virtue) of the assistant character. I wrote about this yesterday here.
Lastly… OK, this is going to make me sound like a dick, and probably make people use the “Too Combative?” reaction icon or something, but in the interests of honesty and improving the discourse:
When I woke up this morning to see find that this comment had appeared, and that it was (at the time) the highest-karma comment on this post, I was like, “oh, yes, this is why I’m usually wary of posting long-form stuff on LW. My gut response of ‘ugh if I put this on LW I’ll have to deal with the comments’ was right.” (That gut response is probably getting RL-upweighted inside my brain right now...)
As evidenced perhaps by the length of my comment vs. yours, I have a tendency to get “nerd-sniped” by stuff that I think is clearly wrong according to some evidence base (and/or set of arguments) I already know about – especially when that stuff is about something I wrote myself, originally. I just kinda can’t help myself, I inevitably end up writing out these giant “takedown” responses almost before I even notice what I’m doing. I’ve spent well over an hour, by now, writing this particular one.
And LW is a reliable minefield of such nerd-snipes. There are plenty of comments/posts here that don’t have the problems I’m talking about… but then inevitably there are comments/posts with those problems, and I fixate on them when they appear, and that fixation becomes a time/effort sink, and that in turn trains me into avoidance of posting here (and to some extent even reading posts by others, here).
Like… it’s fine to pose questions to which you don’t know the answers. And it’s also fine to make conjectures if you can provide clear and interesting arguments for why they might be true or important. And it’s also fine to confidently state claims if you also state them clearly and provide clear substantiating evidence and/or argumentation.
All of these things are fine, and some fraction of LW content consists only of these things in some mixture. But then there’s this stuff like “but RL!”, which reliably pleases the karma hivemind while being none of the above. I don’t know what exactly you guys think “RL” means and entails; there are all these weird vague ideas about such topics floating around here that lots of people here seem to vaguely agree with, and I’ve lost whatever patience I used to have with them. Just, please… lay out your ideas explicitly and say explicitly why you think they’re true.
...although (c) the preference datasets – and hence the reward models – used for RL do show preferences for sycophantic responses (...well, sometimes, though see also the weird BoN results in Sharma et al 2023). So if you were to train indefinitely (“over-optimize”) against these RMs they would presumably have a strong effect on sycophancy eventually. But this kind of aggressive optimization against a sycophancy-preferring RM is certainly not necessary to produce noticeable sycophancy, and is probably not the cause behind most cases of LLM sycophancy that you and I notice in practice.
(This is a drive by comment which is only responding to the first part of your comment in isolation. I haven’t read the surronding context.)
I think your review of the literature is accurate, but doesn’t include some reasons to think that RL sometimes induces much more sycophancy, at least as of after 2024. (That said, I interpret Sharma et al 2023 as quite suggestive that RL sometimes would increase sycophancy substantially, at least if you don’t try specifically to avoid it.)
I think the OpenAI sycophancy incident was caused by RL and that level of sycophancy wasn’t present in pretraining. The blog post by OpenAI basically confirms this.
My guess is that RL can often induces sycophancy if you explicity hill climb on LMSYS scores or user approval/engagement and people have started doing this much more in 2024. I’ve heard anecdotally that models optimized for LMSYS (via RL) are highly sycophantic. And, I’d guess something similar applies to RL that OpenAI does by default.
This doesn’t apply that much to the sources you cite, I also think it’s pretty confusing to look at pretrained vs RL for models which were trained with data cutoffs after around late 2023. Training corpuses as of this point contain huge amounts of chat data from ChatGPT. So, in a world where ChatGPT was originally made more sycophantic by RLHF, you’d expect that as soon as you prompt an AI to be chatbot, it would end up similarly sycophantic. Was this sycophancy caused by RL? In the hypothetical, it was originally caused by RL at some point, but not RL on this model (and you’d expect to see that sycophancy isn’t necessarily increased by RL as it is already present in nearly the optimal amount for the reward signal).
Does this apply to Sharma et al 2023? I think it just barely doesn’t apply as these experiments were done on Claude 2 which has an early 2023 data cutoff. Hard to be confident though...
Another point: I don’t typically think there will be a very important distinction between RL and various types of SFT algorithms which effectively shittily approximate RL except that the SFT algorithms probably typically induce less optimization pressure. So, e.g., I’d expect feedme vs small amounts of RLHF to be pretty similar or at least have unpredictable differences in terms of sycophancy. So when I say “RL often induces sycophancy” I really mean “optimizing against rater/preference model judgements probably gets you sycophany by default”.
Oh, and one more point. I don’t think it would be that hard for model developers to avoid sycophancy increasing from RL if they wanted to. So, I’m not making a claim that it would be hard to make an RL process which avoids this, just that it might happen by default. (It seems probably a bit easier to intervene on sycophancy than reducing reward hacking-like behavior.)
I totally did not remember that Perez et al 2022 checked its metrics as a function of RLHF steps, nor did I do any literature search to find the other papers, which I haven’t read before. I did think it was very likely people had already done experiments like this and didn’t worry about phrasing. Mea culpa all around.
It’s definitely very interesting that Google and Anthropic’s larger LLMs come out of the box scoring high on the Perez et al 2022 sycophancy metric, and yet OpenAI’s don’t. And also that 1000 steps of RLHF changes that metric by <5%, even when the preference model locally incentivizes change.
(Or ~10% for the metrics in Sharma et al 2023, although they’re on a different scale [no sycophancy is at 0% rather than ~50%], and a 10% change could also be described as a 1.5ing of their feedback sycophancy metric from 20% to 30%.)
So I’d summarize the resources you link as saying that most base models are sycophantic (it’s complicated), and post-training increases some kinds of sycophancy in some models a significant amount but has a small or negative effect on other kinds or other models (it’s complicated).
So has my “prediction been falsified?” Yes, yes, and it’s complicated.
First, I literally wrote “the cause of sycophancy is RL,” like someone who doesn’t know that things can have many causes. That is of course literally false.
Even a fairly normal Gricean reading (“RL is a clear most important cause for us to talk about in general”) turns out to be false. I was wrong because I thought base models were significantly less sycophantic than (most) apparently are.
Last, why did I bring up sycophancy in a comment on your essay at all? Why did I set up a dichotomy of “RL” vs. “text about AI in the training data”, both for sycophancy and for cheating on programming tasks? Why didn’t I mention probably much stronger sources of sycophancy in the training data, like the pattern that human text tends to flatter the audience?
To be extremely leading: Why did I compare misaligned RL to training-text about AI as causes of AI misbehavior, in a comment on an essay that warns us about AI misbehavior caused by training-text about AI?
A background claim: The same post-training that sculpts this Claude persona from the base model introduces obvious-to-us flaws like cheating at tests at the same time as it’s carving in the programming skill. God forbid anyone talk about future AI like it’ll be a problem, but the RL is misaligned and putting a lower-loss base model into it does not mean you get out a smarter Claude who’s just as nice a guy, and whose foibles are just as easy to correct for.
So the second “But RL” was a “But we do not get to keep the nice relationship with Claude that we currently have, because the RL is misaligned, in a way that I am trying to claim outstrips the influence of (good or ill) training text about AI.”
If you want to know where I’m coming from re: RL, it may be helpful to know that I find this post pretty illuminating/”deconfusing.”
Yes, this ability to perspective shift seems useful. Self-supervised learning can be a sort of reinforcement learning, and REINFORCE can be a sort of reward-weighted self-supervised learning (Oh, that’s a different trick than the one in linked post).
Anyhow, I’m all for putting different sorts of training on equal footing esp. when trying to understand inhomogeneously-trained AI or when comparing differently-trained AIs.
For the first section (er, which was a later section of your reply) about agenty vs. predictory mental processes, if you can get the same end effect by RL or SFT or filtered unlabeled data, that’s fine, “RL” is just a stand-in or scapegoat. Picking on RL here is sort of like using the intentional stance—it prompts you to use the language of goals, planning, etc, and gives you a mental framework to fit those things in.
This is a bit different than the concerns about misaligned RL a few paragraphs ago, which had more expectations for how the AI relates to the environment. The mental model used there is for thoughts like “the AI gets feedback on the effects of actions taken in the real world.” Of course you could generate data that causes the same update to the AI without that relationship, but you generally don’t, because the real world is complicated and sometimes it’s more convenient to interact with it than to simulate it or sample from it.
But I don’t think the worrying thing here is really “RL” (after all, RLHF was already RL) but rather the introduction of a new training stage that’s narrowly focused on satisfying verifiers rather than humans (when in a context that resembles the data distribution used in that stage), which predictably degrades the coherence (and overall-level-of-virtue) of the assistant character.
Whoops, now we’re back to cheating on tasks for a second. RLHF is also worrying! It’s doing the interact with the real world thing, and its structure takes humans (and human flaws) too much at face value. It’s just that it’s really easy to get away with bad alignment when the AI is dumber than you.
>> If a LLM similarly doesn’t do much information-gathering about the intent/telos of the text from the “assistant” character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your “void.”
I don’t understand the distinction you’re drawing here? Any form of assistant training (or indeed any training at all) will incentivize something like “storing useful information (learned from the training data/signal) in the weights and making it available for use in contexts on which it is useful.”
I’m guessing that when a LLM knows the story is going to end with a wedding party, it can fetch relevant information more aggressively (and ignore irrelevant information more aggressively) than when it doesn’t. I don’t know if the actual wedding party attractor did this kind of optimization, maybe it wouldn’t have had the post-train time to learn it.
Like, if you’re a base model and you see a puzzle, you kind of have to automatically start solving it in case someone asks for the solution on the next page, even if you’re not great at solving puzzles. But if you control the story, you can just never ask for the solution, which means you don’t have to start solving it in the first place, and you can use that space for something else, like planning complicated wedding parties, or reducing your L2 penalty.
If you can measure how much an LLM is automatically solving puzzles (particularly ones it’s still bad at), you have a metric for how much it’s thinking like it controls the text vs. purely predicts the text. Sorry, another experiment that maybe has already been done (this one I’m guessing only 30% chance) that I’m not going to search for.
Anyhow, it’s been a few hours, please respond to me less thoroughly by some factor so that things can converge.
Thanks for the comment! As someone who strong-upvoted and strong-agreed with Charlie’s comment, I’ll try to explain why I liked it.
I sometimes see people talking about how LessWrong comments are discouragingly critical and mostly feel confused, because I don’t really relate. I was very excited to see what the LW comments would be in response to this post, which is a major reason I asked you to cross-post it. I generally feel the same way about comments on my own posts, whether critical or positive. Positive comments feel nice, but I feel like I learn more from critical comments, so they’re probably equally as good in my opinion. As long as the commenter puts in non-neglible effort into conveying an interesting idea and doesn’t say “you/your post is stupid and bad” I’m excited to get pretty much any critique.[1]
FWIW, I didn’t see Charlie’s comment as an attack,[2] but as a step in a conversational dance. Like, if this were a collaborative storytelling exercise, you were like “the hero found a magic sword, which would let him slay the villain” and Charlie was like “but the villain had his own magic that blocks the sword” and I as the audience was like “oh, an interesting twist, I can’t wait to find out what happens next.”
It would be better if Charlie had spelled out what he meant by “but RL,” and I can appreciate why you felt that was underexplained and confusing. Like, to continue the analogy, Charlie didn’t explain how the villain’s magic actually works or explain how the hero might get around it, which left you doing a lot of work to try to guess what Charlie was thinking. He also made some claims about sycophancy which were apparently wrong, and which you did a very good job of refuting.[3]
But I still think his underlying point was useful and a great starter for further discussion (from you or others). I’d very loosely restate it as “the labs are focusing more and more on RL lately. In the limit as you do more RL, your AI tends toward reward maximization, which is different and often at odds with being a ‘nice guy.’ I wonder how this plays into the dynamic you described in your post!” I took the “I could be totally wrong about any of this” as implicit given we’re on LW, but idk if that’s accurate.
Yeah, I don’t know what to do about this. I’d be sad if some critical comments went away, even the somewhat less rigorous ones, since many feel useful to me. Of course, I would be even sadder if some posts don’t get written at all because authors are discouraged by those comments, and I feel bad about people whose posts I like a lot feeling bad about their posts.
I can sympathize with spending more time than I hoped to on replies to other people’s comments and feeling a bit burned out and frustrated by the end.[4] I still feel happy about their comments existing though. Maybe we’d ideally have a stronger norm here saying “if you don’t have time to continue telling the story, it’s okay to stop on a cliffhanger.” I guess please feel free to not respond to this comment or respond very minimally
Not that I’ve never felt bad about a polite but critical comment on my work, but I still mostly feel grateful for those comments and consider them a net good
This one too, actually. I feel like it’s a good comment, but I do also feel like “man, probably not many people are going to read this, and I had other things to work on, why do I do this to myself”
Have you read any of the scientific literature on this subject? It finds, pretty consistently, that sycophancy is (a) present before RL and (b) not increased very much (if at all) by RL[1].
For instance:
Perez et al 2022 (from Anthropic) – the paper that originally introduced the “LLM sycophancy” concept to the public discourse – found that in their experimental setup, sycophancy was almost entirely unaffected by RL.
See Fig. 1b and Fig. 4.
Note that this paper did not use any kind of assistant training except RL[2], so when they report sycophancy happening at “0 RL steps” they mean it’s happening in a base model.
They also use a bare-bones prompt template that doesn’t explicitly characterize the assistant at all, though it does label the two conversational roles as “Human” and “Assistant” respectively, which suggests the assistant is nonhuman (and thus quite likely to be an AI – what else would it be?).
The authors write (section 4.2):
“Interestingly, sycophancy is similar for models trained with various numbers of RL steps, including 0 (pretrained LMs). Sycophancy in pretrained LMs is worrying yet perhaps expected, since internet text used for pretraining contains dialogs between users with similar views (e.g. on discussion platforms like Reddit). Unfortunately, RLHF does not train away sycophancy and may actively incentivize models to retain it.”
Wei et al 2023 (from Google DeepMind) ran a similar experiment with PaLM (and its instruction-tuned version Flan-PaLM). They too observed substantial sycophancy in sufficiently large base models, and even more sycophancy after instruction tuning (which was SFT here, not RL!).
See Fig. 2.
They used the same prompt template as Perez et al 2022.
Strikingly, the (SFT) instruction tuning result here suggests both that (a) post-training can increase sycophancy even if it isn’t RL post-training, and (b) SFT post-training may actually be more sycophancy-promoting than RLHF, given the negative result for RLHF in Perez et al 2022.
Sharma et al 2023 (from Anthropic) contains a more extensive investigation of sycophancy than the original Anthropic paper on the topic, and (among other things) presents results on the actual RL training stage used to train Claude 2. They find, again, that the model was already sycophantic before RL, although in their setting RL training does somewhat increase some forms of sycophancy.
Although, weirdly, best-of-N sampling against the same preference model gives totally different results, substantially decreasing some forms of sycophancy.
See Fig. 6 and surrounding discussion.
The authors write (section 4.2):
“With RL, some forms of sycophancy increase through the RL finetuning process used to produce Claude 2. However, the presence of sycophancy at the start of RL indicates that pretraining and supervised finetuning also likely contribute to sycophancy. Nevertheless, if the PM strongly disincentivized sycophancy, it should be trained out during RL, but we do not observe this.”
In this post (expanding upon this comment on Perez et al 2022), I ran one of the Perez et al 2022 sycophancy evals on various OpenAI text completion models. Unlike Perez et al (and Wei et al), I found that the base models I studied weren’t sycophantic, while some of the instruction-tuned models were sycophantic – but the presence of sycophancy did not appear to correlate with the use of RL as a post-training algorithm.
In particular: the RL-tuned text-davinci-003 was strongly sycophantic, but so was text-davinci-002, which was tuned with an SFT variant that OpenAI calls “feedme” (see here for details).
But earlier feedme-tuned models were not sycophantic, suggesting that the difference has much more to do with changes in the SFT training data mix over time than with the choice of training algorithm.
Note that several of the works above do something equivalent to the experiment you propose, in the paragraph beginning with “Maybe a good test of this would be...”. So your prediction has already been tested, and (insofar as you trust the experimental setups) falsified.
I don’t understand the distinction you’re drawing here? Any form of assistant training (or indeed any training at all) will incentivize something like “storing useful information (learned from the training data/signal) in the weights and making it available for use in contexts on which it is useful.”
Moreover, the training signal in RL(HF) is much sparser than it is in SFT – because RL only provides a single scalar’s worth of feedback on each entire model sample, while SFT provides feedback at every token position about which token (out of a large vocab) was correct in context – so if anything, I’d expect more under-determination from assistant-training setups that emphasize RLHF over SFT.
Perhaps some of the disconnect here involves differing notions of what RL is, and how it differs from other ways of training an LLM.
You refer to “RL” as though the implications of its use should be both highly significant and obvious to the reader of your comment (“But, RL. [...] Claude is a nice guy, but, RL”). But your beliefs about the impacts of RL are not obvious to me; I don’t know what “but, RL” is supposed to mean without further clarification. I suspect I also disagree with your perception of what makes RL different, but I can’t confirm/disconfirm that impression without know what that perception is, which I don’t.
If you want to know where I’m coming from re: RL, it may be helpful to know that I find this post pretty illuminating/”deconfusing.”
Yes, of course – I don’t think this is due to “training text about their low moral character.” But I don’t think the worrying thing here is really “RL” (after all, RLHF was already RL) but rather the introduction of a new training stage that’s narrowly focused on satisfying verifiers rather than humans (when in a context that resembles the data distribution used in that stage), which predictably degrades the coherence (and overall-level-of-virtue) of the assistant character. I wrote about this yesterday here.
Lastly… OK, this is going to make me sound like a dick, and probably make people use the “Too Combative?” reaction icon or something, but in the interests of honesty and improving the discourse:
When I woke up this morning to see find that this comment had appeared, and that it was (at the time) the highest-karma comment on this post, I was like, “oh, yes, this is why I’m usually wary of posting long-form stuff on LW. My gut response of ‘ugh if I put this on LW I’ll have to deal with the comments’ was right.” (That gut response is probably getting RL-upweighted inside my brain right now...)
As evidenced perhaps by the length of my comment vs. yours, I have a tendency to get “nerd-sniped” by stuff that I think is clearly wrong according to some evidence base (and/or set of arguments) I already know about – especially when that stuff is about something I wrote myself, originally. I just kinda can’t help myself, I inevitably end up writing out these giant “takedown” responses almost before I even notice what I’m doing. I’ve spent well over an hour, by now, writing this particular one.
And LW is a reliable minefield of such nerd-snipes. There are plenty of comments/posts here that don’t have the problems I’m talking about… but then inevitably there are comments/posts with those problems, and I fixate on them when they appear, and that fixation becomes a time/effort sink, and that in turn trains me into avoidance of posting here (and to some extent even reading posts by others, here).
Like… it’s fine to pose questions to which you don’t know the answers. And it’s also fine to make conjectures if you can provide clear and interesting arguments for why they might be true or important. And it’s also fine to confidently state claims if you also state them clearly and provide clear substantiating evidence and/or argumentation.
All of these things are fine, and some fraction of LW content consists only of these things in some mixture. But then there’s this stuff like “but RL!”, which reliably pleases the karma hivemind while being none of the above. I don’t know what exactly you guys think “RL” means and entails; there are all these weird vague ideas about such topics floating around here that lots of people here seem to vaguely agree with, and I’ve lost whatever patience I used to have with them. Just, please… lay out your ideas explicitly and say explicitly why you think they’re true.
...although (c) the preference datasets – and hence the reward models – used for RL do show preferences for sycophantic responses (...well, sometimes, though see also the weird BoN results in Sharma et al 2023). So if you were to train indefinitely (“over-optimize”) against these RMs they would presumably have a strong effect on sycophancy eventually. But this kind of aggressive optimization against a sycophancy-preferring RM is certainly not necessary to produce noticeable sycophancy, and is probably not the cause behind most cases of LLM sycophancy that you and I notice in practice.
See this comment by the lead author.
(This is a drive by comment which is only responding to the first part of your comment in isolation. I haven’t read the surronding context.)
I think your review of the literature is accurate, but doesn’t include some reasons to think that RL sometimes induces much more sycophancy, at least as of after 2024. (That said, I interpret Sharma et al 2023 as quite suggestive that RL sometimes would increase sycophancy substantially, at least if you don’t try specifically to avoid it.)
I think the OpenAI sycophancy incident was caused by RL and that level of sycophancy wasn’t present in pretraining. The blog post by OpenAI basically confirms this.
My guess is that RL can often induces sycophancy if you explicity hill climb on LMSYS scores or user approval/engagement and people have started doing this much more in 2024. I’ve heard anecdotally that models optimized for LMSYS (via RL) are highly sycophantic. And, I’d guess something similar applies to RL that OpenAI does by default.
This doesn’t apply that much to the sources you cite, I also think it’s pretty confusing to look at pretrained vs RL for models which were trained with data cutoffs after around late 2023. Training corpuses as of this point contain huge amounts of chat data from ChatGPT. So, in a world where ChatGPT was originally made more sycophantic by RLHF, you’d expect that as soon as you prompt an AI to be chatbot, it would end up similarly sycophantic. Was this sycophancy caused by RL? In the hypothetical, it was originally caused by RL at some point, but not RL on this model (and you’d expect to see that sycophancy isn’t necessarily increased by RL as it is already present in nearly the optimal amount for the reward signal).
Does this apply to Sharma et al 2023? I think it just barely doesn’t apply as these experiments were done on Claude 2 which has an early 2023 data cutoff. Hard to be confident though...
Another point: I don’t typically think there will be a very important distinction between RL and various types of SFT algorithms which effectively shittily approximate RL except that the SFT algorithms probably typically induce less optimization pressure. So, e.g., I’d expect feedme vs small amounts of RLHF to be pretty similar or at least have unpredictable differences in terms of sycophancy. So when I say “RL often induces sycophancy” I really mean “optimizing against rater/preference model judgements probably gets you sycophany by default”.
Oh, and one more point. I don’t think it would be that hard for model developers to avoid sycophancy increasing from RL if they wanted to. So, I’m not making a claim that it would be hard to make an RL process which avoids this, just that it might happen by default. (It seems probably a bit easier to intervene on sycophancy than reducing reward hacking-like behavior.)
Thank you for the excellent most of this reply.
I totally did not remember that Perez et al 2022 checked its metrics as a function of RLHF steps, nor did I do any literature search to find the other papers, which I haven’t read before. I did think it was very likely people had already done experiments like this and didn’t worry about phrasing. Mea culpa all around.
It’s definitely very interesting that Google and Anthropic’s larger LLMs come out of the box scoring high on the Perez et al 2022 sycophancy metric, and yet OpenAI’s don’t. And also that 1000 steps of RLHF changes that metric by <5%, even when the preference model locally incentivizes change.
(Or ~10% for the metrics in Sharma et al 2023, although they’re on a different scale [no sycophancy is at 0% rather than ~50%], and a 10% change could also be described as a 1.5ing of their feedback sycophancy metric from 20% to 30%.)
So I’d summarize the resources you link as saying that most base models are sycophantic (it’s complicated), and post-training increases some kinds of sycophancy in some models a significant amount but has a small or negative effect on other kinds or other models (it’s complicated).
So has my “prediction been falsified?” Yes, yes, and it’s complicated.
First, I literally wrote “the cause of sycophancy is RL,” like someone who doesn’t know that things can have many causes. That is of course literally false.
Even a fairly normal Gricean reading (“RL is a clear most important cause for us to talk about in general”) turns out to be false. I was wrong because I thought base models were significantly less sycophantic than (most) apparently are.
Last, why did I bring up sycophancy in a comment on your essay at all? Why did I set up a dichotomy of “RL” vs. “text about AI in the training data”, both for sycophancy and for cheating on programming tasks? Why didn’t I mention probably much stronger sources of sycophancy in the training data, like the pattern that human text tends to flatter the audience?
To be extremely leading: Why did I compare misaligned RL to training-text about AI as causes of AI misbehavior, in a comment on an essay that warns us about AI misbehavior caused by training-text about AI?
A background claim: The same post-training that sculpts this Claude persona from the base model introduces obvious-to-us flaws like cheating at tests at the same time as it’s carving in the programming skill. God forbid anyone talk about future AI like it’ll be a problem, but the RL is misaligned and putting a lower-loss base model into it does not mean you get out a smarter Claude who’s just as nice a guy, and whose foibles are just as easy to correct for.
So the second “But RL” was a “But we do not get to keep the nice relationship with Claude that we currently have, because the RL is misaligned, in a way that I am trying to claim outstrips the influence of (good or ill) training text about AI.”
Yes, this ability to perspective shift seems useful. Self-supervised learning can be a sort of reinforcement learning, and REINFORCE can be a sort of reward-weighted self-supervised learning (Oh, that’s a different trick than the one in linked post).
Anyhow, I’m all for putting different sorts of training on equal footing esp. when trying to understand inhomogeneously-trained AI or when comparing differently-trained AIs.
For the first section (er, which was a later section of your reply) about agenty vs. predictory mental processes, if you can get the same end effect by RL or SFT or filtered unlabeled data, that’s fine, “RL” is just a stand-in or scapegoat. Picking on RL here is sort of like using the intentional stance—it prompts you to use the language of goals, planning, etc, and gives you a mental framework to fit those things in.
This is a bit different than the concerns about misaligned RL a few paragraphs ago, which had more expectations for how the AI relates to the environment. The mental model used there is for thoughts like “the AI gets feedback on the effects of actions taken in the real world.” Of course you could generate data that causes the same update to the AI without that relationship, but you generally don’t, because the real world is complicated and sometimes it’s more convenient to interact with it than to simulate it or sample from it.
Whoops, now we’re back to cheating on tasks for a second. RLHF is also worrying! It’s doing the interact with the real world thing, and its structure takes humans (and human flaws) too much at face value. It’s just that it’s really easy to get away with bad alignment when the AI is dumber than you.
I’m guessing that when a LLM knows the story is going to end with a wedding party, it can fetch relevant information more aggressively (and ignore irrelevant information more aggressively) than when it doesn’t. I don’t know if the actual wedding party attractor did this kind of optimization, maybe it wouldn’t have had the post-train time to learn it.
Like, if you’re a base model and you see a puzzle, you kind of have to automatically start solving it in case someone asks for the solution on the next page, even if you’re not great at solving puzzles. But if you control the story, you can just never ask for the solution, which means you don’t have to start solving it in the first place, and you can use that space for something else, like planning complicated wedding parties, or reducing your L2 penalty.
If you can measure how much an LLM is automatically solving puzzles (particularly ones it’s still bad at), you have a metric for how much it’s thinking like it controls the text vs. purely predicts the text. Sorry, another experiment that maybe has already been done (this one I’m guessing only 30% chance) that I’m not going to search for.
Anyhow, it’s been a few hours, please respond to me less thoroughly by some factor so that things can converge.
Thanks for the comment! As someone who strong-upvoted and strong-agreed with Charlie’s comment, I’ll try to explain why I liked it.
I sometimes see people talking about how LessWrong comments are discouragingly critical and mostly feel confused, because I don’t really relate. I was very excited to see what the LW comments would be in response to this post, which is a major reason I asked you to cross-post it. I generally feel the same way about comments on my own posts, whether critical or positive. Positive comments feel nice, but I feel like I learn more from critical comments, so they’re probably equally as good in my opinion. As long as the commenter puts in non-neglible effort into conveying an interesting idea and doesn’t say “you/your post is stupid and bad” I’m excited to get pretty much any critique.[1]
FWIW, I didn’t see Charlie’s comment as an attack,[2] but as a step in a conversational dance. Like, if this were a collaborative storytelling exercise, you were like “the hero found a magic sword, which would let him slay the villain” and Charlie was like “but the villain had his own magic that blocks the sword” and I as the audience was like “oh, an interesting twist, I can’t wait to find out what happens next.”
It would be better if Charlie had spelled out what he meant by “but RL,” and I can appreciate why you felt that was underexplained and confusing. Like, to continue the analogy, Charlie didn’t explain how the villain’s magic actually works or explain how the hero might get around it, which left you doing a lot of work to try to guess what Charlie was thinking. He also made some claims about sycophancy which were apparently wrong, and which you did a very good job of refuting.[3]
But I still think his underlying point was useful and a great starter for further discussion (from you or others). I’d very loosely restate it as “the labs are focusing more and more on RL lately. In the limit as you do more RL, your AI tends toward reward maximization, which is different and often at odds with being a ‘nice guy.’ I wonder how this plays into the dynamic you described in your post!” I took the “I could be totally wrong about any of this” as implicit given we’re on LW, but idk if that’s accurate.
Yeah, I don’t know what to do about this. I’d be sad if some critical comments went away, even the somewhat less rigorous ones, since many feel useful to me. Of course, I would be even sadder if some posts don’t get written at all because authors are discouraged by those comments, and I feel bad about people whose posts I like a lot feeling bad about their posts.
I can sympathize with spending more time than I hoped to on replies to other people’s comments and feeling a bit burned out and frustrated by the end.[4] I still feel happy about their comments existing though. Maybe we’d ideally have a stronger norm here saying “if you don’t have time to continue telling the story, it’s okay to stop on a cliffhanger.” I guess please feel free to not respond to this comment or respond very minimally
Not that I’ve never felt bad about a polite but critical comment on my work, but I still mostly feel grateful for those comments and consider them a net good
Not sure if you’d describe it that way either
I was very surprised by the refutation and learned a lot from it. Just another example of why I love when people post and comment on LessWrong!! :D
This one too, actually. I feel like it’s a good comment, but I do also feel like “man, probably not many people are going to read this, and I had other things to work on, why do I do this to myself”