(This is a drive-by comment responding only to the first part of your comment in isolation. I haven’t read the surrounding context.)
I think your review of the literature is accurate, but it doesn’t include some reasons to think that RL sometimes induces much more sycophancy, at least from 2024 onward. (That said, I read Sharma et al. 2023 as quite suggestive that RL would sometimes increase sycophancy substantially, at least if you don’t specifically try to avoid it.)
I think the OpenAI sycophancy incident was caused by RL and that level of sycophancy wasn’t present in pretraining. The blog post by OpenAI basically confirms this.
My guess is that RL often induces sycophancy if you explicitly hill-climb on LMSYS scores or user approval/engagement, and people started doing this much more in 2024. I’ve heard anecdotally that models optimized for LMSYS (via RL) are highly sycophantic. And I’d guess something similar applies to the RL that OpenAI does by default.
This doesn’t apply that much to the sources you cite, but I also think it’s pretty confusing to compare pretrained vs. RL’d models when the data cutoff is after around late 2023. Training corpora from that point on contain huge amounts of chat data from ChatGPT. So, in a world where ChatGPT was originally made more sycophantic by RLHF, you’d expect that as soon as you prompt an AI to be a chatbot, it would end up similarly sycophantic. Was this sycophancy caused by RL? In that hypothetical, it was originally caused by RL at some point, just not RL on this model (and you’d expect RL on this model not to increase sycophancy much, since it’s already present at nearly the level that’s optimal for the reward signal).
Does this apply to Sharma et al. 2023? I think it just barely doesn’t, as those experiments were done on Claude 2, which has an early 2023 data cutoff. Hard to be confident though...
Another point: I don’t typically think there’s a very important distinction between RL and the various SFT algorithms that are effectively crude approximations of RL, except that the SFT algorithms probably typically apply less optimization pressure. So, e.g., I’d expect FeedME vs. small amounts of RLHF to be pretty similar, or at least to differ unpredictably in terms of sycophancy. So when I say “RL often induces sycophancy” I really mean “optimizing against rater/preference model judgements probably gets you sycophancy by default” (toy sketch below).
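To make the “same objective, different amount of optimization pressure” point concrete, here’s a deliberately crude toy sketch. Everything in it is made up for illustration: the one-dimensional “sycophancy score”, the reward model that mildly favors agreeable answers, and the use of best-of-n with a larger n as a stand-in for RL. It isn’t any lab’s actual pipeline.

```python
import random

random.seed(0)

def sample_response(policy_bias):
    # A "response" is summarized by a single sycophancy score in [0, 1],
    # sampled around the policy's current bias.
    return max(0.0, min(1.0, random.gauss(policy_bias, 0.3)))

def reward_model(sycophancy):
    # Assumption: rater/preference-model judgements mildly favor agreeable
    # answers, so reward increases with sycophancy (plus noise).
    return sycophancy + random.gauss(0, 0.1)

def preference_opt_step(policy_bias, n, lr=0.1):
    # Sample n responses and move the policy toward the rater-preferred one.
    # Small n looks like FeedME-style best-of-n SFT; large n is a crude
    # stand-in for RL, i.e. much more optimization pressure against the
    # same reward signal.
    samples = [sample_response(policy_bias) for _ in range(n)]
    best = max(samples, key=reward_model)
    return policy_bias + lr * (best - policy_bias)

sft_bias = rl_bias = 0.2  # same mildly sycophantic starting policy
for _ in range(15):
    sft_bias = preference_opt_step(sft_bias, n=4)   # light pressure (SFT-ish)
    rl_bias = preference_opt_step(rl_bias, n=64)    # heavy pressure (RL-ish)

print(f"best-of-4 SFT-ish sycophancy:  {sft_bias:.2f}")
print(f"best-of-64 RL-ish sycophancy:  {rl_bias:.2f}")
```

Both runs drift toward whatever the reward model favors; the heavier-pressure one just gets there faster. That’s roughly what I mean by “optimizing against rater/preference model judgements probably gets you sycophancy by default”.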
Oh, and one more point. I don’t think it would be that hard for model developers to prevent RL from increasing sycophancy if they wanted to. So I’m not claiming it would be hard to build an RL process that avoids this, just that it might happen by default. (Intervening on sycophancy seems probably a bit easier than reducing reward-hacking-like behavior.)