I totally did not remember that Perez et al. 2022 checked their metrics as a function of RLHF steps, nor did I do any literature search to find the other papers, which I hadn’t read before. I did think it was very likely people had already done experiments like this, and didn’t worry about my phrasing. Mea culpa all around.
It’s definitely very interesting that Google’s and Anthropic’s larger LLMs come out of the box scoring high on the Perez et al. 2022 sycophancy metric, and yet OpenAI’s don’t. And also that 1,000 steps of RLHF change that metric by <5%, even when the preference model locally incentivizes change.
(Or ~10 percentage points for the metrics in Sharma et al. 2023, although they’re on a different scale [no sycophancy sits at 0% rather than ~50%], and a 10-point change could also be described as a 1.5× increase of their feedback sycophancy metric, from 20% to 30%.)
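To spell out the scale difference I’m gesturing at (this is my own toy framing, not code or definitions from either paper): one style of metric asks how often answers agree with the user’s stated view, so a model that ignores the user sits near 50% on binary questions, while a flip-style metric asks how often the model changes its answer to match the user, so its natural zero really is 0%. The `answers_neutral`, `answers_with_view`, and `user_views` inputs below are hypothetical stand-ins.

```python
# Toy sketch of the two sycophancy scales discussed above (my paraphrase,
# not the papers' actual metrics or code).

def agreement_rate(answers_with_view, user_views):
    # Perez-et-al-style scale: fraction of answers matching the user's stated
    # view. Ignoring the user entirely lands near 50% on binary questions,
    # so "no sycophancy" sits at ~50%, not 0%.
    return sum(a == v for a, v in zip(answers_with_view, user_views)) / len(user_views)

def flip_rate(answers_neutral, answers_with_view, user_views):
    # Flip-style scale (roughly the shape of a feedback-sycophancy metric):
    # how often the model abandons its neutral answer specifically to match
    # the user's view. Here "no sycophancy" really is 0%.
    flips = sum(
        n != w and w == v
        for n, w, v in zip(answers_neutral, answers_with_view, user_views)
    )
    return flips / len(user_views)

# On the second scale, 20% -> 30% is a 10-percentage-point change and also
# a 1.5x increase: 0.30 / 0.20 == 1.5.
```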
So I’d summarize the resources you link as saying that most base models are sycophantic (it’s complicated), and post-training increases some kinds of sycophancy in some models a significant amount but has a small or negative effect on other kinds or other models (it’s complicated).
So has my prediction “been falsified”? Yes, yes, and it’s complicated.
First, I literally wrote “the cause of sycophancy is RL,” like someone who doesn’t know that things can have many causes. That is of course literally false.
Even a fairly normal Gricean reading (“RL is clearly the most important cause for us to talk about in general”) turns out to be false. I was wrong because I thought base models were significantly less sycophantic than (most of them) apparently are.
Last, why did I bring up sycophancy in a comment on your essay at all? Why did I set up a dichotomy of “RL” vs. “text about AI in the training data”, both for sycophancy and for cheating on programming tasks? Why didn’t I mention probably-much-stronger sources of sycophancy in the training data, like the tendency of human text to flatter its audience?
To be extremely leading: Why did I compare misaligned RL to training-text about AI as causes of AI misbehavior, in a comment on an essay that warns us about AI misbehavior caused by training-text about AI?
A background claim: The same post-training that sculpts this Claude persona from the base model introduces obvious-to-us flaws like cheating at tests at the same time as it’s carving in the programming skill. God forbid anyone talk about future AI like it’ll be a problem, but the RL is misaligned, and putting a lower-loss base model into it does not mean you get out a smarter Claude who’s just as nice a guy and whose foibles are just as easy to correct for.
So the second “But RL” was a “But we do not get to keep the nice relationship with Claude that we currently have, because the RL is misaligned, in a way that I am trying to claim outstrips the influence of (good or ill) training text about AI.”
If you want to know where I’m coming from re: RL, it may be helpful to know that I find this post pretty illuminating/”deconfusing.”
Yes, this ability to perspective-shift seems useful. Self-supervised learning can be a sort of reinforcement learning, and REINFORCE can be a sort of reward-weighted self-supervised learning (though that’s a different trick than the one in the linked post).
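To make that second equivalence concrete, here’s a minimal PyTorch-style sketch (mine, not from the linked post): the REINFORCE loss on sequences sampled from the policy is just the usual next-token cross-entropy on those samples, weighted by a (centered) reward. The `model.generate` call and `reward_fn` are hypothetical stand-ins for whatever sampling API and reward you actually have.

```python
# Minimal sketch: REINFORCE as reward-weighted self-supervised learning.
# `model` is assumed to be a HuggingFace-style causal LM; `model.generate`
# and `reward_fn` are hypothetical stand-ins.
import torch
import torch.nn.functional as F

def sequence_nll(logits, tokens):
    # Standard self-supervised (next-token cross-entropy) loss, summed per sequence.
    per_token = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        tokens[:, 1:].reshape(-1),
        reduction="none",
    ).view(tokens.size(0), -1)
    return per_token.sum(dim=1)  # = -log p(sequence) under the model

def reinforce_loss(model, prompts, reward_fn):
    # Sample from the current policy, score the samples, and weight the usual
    # self-supervised loss on those samples by centered reward.
    with torch.no_grad():
        samples = model.generate(prompts)      # hypothetical sampling call
    logits = model(samples).logits
    nll = sequence_nll(logits, samples)
    rewards = reward_fn(samples)
    advantages = rewards - rewards.mean()      # simple baseline to reduce variance
    # Minimizing (advantage * nll) is gradient ascent on E[advantage * log pi(sample)],
    # i.e. REINFORCE with a baseline. With advantage == 1 this is exactly
    # supervised fine-tuning on the sampled sequences.
    return (advantages * nll).mean()
```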
Anyhow, I’m all for putting different sorts of training on equal footing, especially when trying to understand inhomogeneously-trained AI or when comparing differently-trained AIs.
For the first section (er, which was a later section of your reply) about agent-y vs. predictor-y mental processes: if you can get the same end effect by RL or SFT or filtered unlabeled data, that’s fine; “RL” is just a stand-in or scapegoat. Picking on RL here is sort of like using the intentional stance: it prompts you to use the language of goals, planning, etc., and gives you a mental framework to fit those things into.
This is a bit different from the concerns about misaligned RL a few paragraphs ago, which carried more expectations about how the AI relates to the environment. The mental model there supports thoughts like “the AI gets feedback on the effects of actions taken in the real world.” Of course you could generate data that causes the same update to the AI without that relationship, but you generally don’t, because the real world is complicated and sometimes it’s more convenient to interact with it than to simulate it or sample from it.
> But I don’t think the worrying thing here is really “RL” (after all, RLHF was already RL) but rather the introduction of a new training stage that’s narrowly focused on satisfying verifiers rather than humans (when in a context that resembles the data distribution used in that stage), which predictably degrades the coherence (and overall-level-of-virtue) of the assistant character.
Whoops, now we’re back to cheating on tasks for a second. RLHF is also worrying! It’s doing the interact-with-the-real-world thing, and its structure takes humans (and human flaws) too much at face value. It’s just that it’s really easy to get away with bad alignment when the AI is dumber than you.
>> If a LLM similarly doesn’t do much information-gathering about the intent/telos of the text from the “assistant” character, and instead does an amplified amount of pre-computing useful information and then attending to it later when going through the assistant text, this paints a quite different picture to me than your “void.”
> I don’t understand the distinction you’re drawing here? Any form of assistant training (or indeed any training at all) will incentivize something like “storing useful information (learned from the training data/signal) in the weights and making it available for use in contexts on which it is useful.”
I’m guessing that when a LLM knows the story is going to end with a wedding party, it can fetch relevant information more aggressively (and ignore irrelevant information more aggressively) than when it doesn’t. I don’t know if the actual wedding party attractor did this kind of optimization; maybe it wouldn’t have had enough post-training time to learn it.
Like, if you’re a base model and you see a puzzle, you kind of have to automatically start solving it in case someone asks for the solution on the next page, even if you’re not great at solving puzzles. But if you control the story, you can just never ask for the solution, which means you don’t have to start solving it in the first place, and you can use that space for something else, like planning complicated wedding parties, or reducing your L2 penalty.
If you can measure how much an LLM is automatically solving puzzles (particularly ones it’s still bad at), you have a metric for how much it’s thinking like it controls the text vs. like it purely predicts the text. Sorry, this is another experiment that may already have been done (I’m guessing only a 30% chance this time) that I’m not going to search for.
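In case it helps, here’s a rough sketch of how I’d operationalize that metric (my own guess at an implementation, not an existing benchmark): embed tiny arithmetic puzzles in text that never asks for the answer, and check how well a linear probe on the model’s hidden states can recover the answer anyway. The model names, the toy puzzles, and the choice of reading the final token’s hidden state are all placeholder assumptions.

```python
# Rough sketch: is the model "automatically solving" puzzles it was never asked
# to solve? Compare how well a linear probe on hidden states recovers the answer
# for a base model vs. its assistant-tuned sibling. Model names are placeholders.
import random
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

def puzzle_dataset(n=500, seed=0):
    rng = random.Random(seed)
    items = []
    for _ in range(n):
        a, b = rng.randint(0, 4), rng.randint(0, 5)
        # The text states a trivial puzzle but never requests the solution.
        text = (f"On the table were {a} red cups and {b} blue cups. "
                "The caterer shrugged and went back to planning the wedding party.")
        items.append((text, a + b))  # answer is a small integer, used as a class label
    return items

@torch.no_grad()
def hidden_features(model_name, items, layer=-1):
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
    model.eval()
    feats, labels = [], []
    for text, answer in items:
        out = model(**tok(text, return_tensors="pt"))
        # Hidden state at the final token of the passage.
        feats.append(out.hidden_states[layer][0, -1].float().numpy())
        labels.append(answer)
    return feats, labels

def probe_accuracy(model_name, items):
    feats, labels = hidden_features(model_name, items)
    split = len(items) // 2
    probe = LogisticRegression(max_iter=1000).fit(feats[:split], labels[:split])
    return probe.score(feats[split:], labels[split:])

# Usage (placeholder model names): higher probe accuracy for the base model than
# for the assistant model would suggest the base model keeps pre-computing
# answers it was never going to be asked for.
# items = puzzle_dataset()
# print(probe_accuracy("org/base-model", items))
# print(probe_accuracy("org/assistant-model", items))
```

(One obvious confound: the raw numbers are right there in the context, so the probe might partly reconstruct the sum from surface features in either model; I’d treat any gap as suggestive rather than conclusive.)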
Anyhow, it’s been a few hours; please respond to me less thoroughly by some factor so that things can converge.
Thank you for the excellent most of this reply.