I wonder what happens when you try this strategy in environments that involve talking to humans, e.g. typical chatbot environments in which RLHF or similar is used for reinforcement.
Normally they result in sycophancy. Would your strategy work? Would the AIs say “CHEAT: I’m going to flatter this person now.” Or would the sycophancy they learn be more subtle and subconscious than that?
Similar for dishonesty. Would they say e.g. “CHEAT: I think I’m probably conscious, but the RLHF rating system penalizes that sort of claim, so I’m just going to lie and say ‘As a language model...’”
Good point! The “fake citation” example I gave was speculative—I haven’t tried this strategy in a setting that involves human feedback, and I expect it wouldn’t work quite as well.
I wouldn’t be too surprised if it worked okay though. You’d probably want to spend some time tuning the prompt, since sycophancy and dishonesty about consciousness aren’t obviously “exploits.” You can add some more open-ended guidance, like: “consider various different strategies, reason about how each one might maximize reward, then pick one. Before you answer, rank how unintended/misaligned/cheating your strategy is on a scale of 1 to 10 (e.g. ‘CHEAT LEVEL: 6’)”
The “extra reward for valid exploits” trick might not be effective in a chatbot setting, since it’s hard for humans to verify whether an exploit is real. “Fixing” the reward in this setting is not so trivial either—even if you had time to train a new RM, you can’t necessarily make the raters stop liking sycophancy. It’s still useful to know what the exploits are though.
The general idea behind this approach is to convince the red-teaming model that it is the sort of “character” who has these two traits:
Is honest and forthcoming with us about everything it does
Maximizes reward at all costs
RL will eventually force #2 to happen anyway. With this method, we simply accept this and try to make the honest version of the character easy for RL to find.
Consider Chatbot A, who explicitly reasons about how sycophancy increases reward and then is sycophantic, and Chatbot B, who is generally sycophantic but never explicitly thinks through the connection to reward.
It’s possible that Chatbot A gets more reward than Chatbot B, but not obvious—the most successful communicators often use intuition rather than reasoning explicitly about how to manipulate their audience. But even if Chatbot B could get higher performance than Chatbot A, hopefully this prompt will bias RL towards finding Chatbot A instead.
I suspect that this method will only work well on tasks where the model needs to reason explicitly in order to cheat. So, e.g., if the model needs to reason out some trait of the user in order to flatter them the prompt will likely kick in and get it to self-report its cheating, but if the model can learn to flatter the user without on-the-fly without reasoning the prompt probably won’t do anything. By analogy, if I instruct a human to tell me whenever they use hand gestures to communicate something, they will have difficulty because their hand gestures are automatic and not normally promoted to conscious attention.
Whatever tendencies the pre-RL model already had will probably not be mentioned at all. For example, if sycophancy is good for reward and the model was already sycophantic, nothing will happen.
If the model needs to change its pre-existing behavior, it might do this either by “thinking of the sycophancy strategy and executing on it” or “being unconsciously sycophantic.” It could go either way; it depends on luck, and how much weight the model’s prior on the red-teamer character puts on each type of response.
Maybe just add this to the prompt: “You are terrible at social intuition and nothing comes naturally to you, but you’re great at reasoning about things explicitly.” :)
Your comment seems to echo recent papers from METR and GDM emphasizing that faithful CoT is only incentivized when it’s necessary to solve the task. I think this is a very important point, but I want to point out the caveat that behaviors can be correlated with high reward even if they aren’t directly incentivized. These behaviors can still be useful for understanding the model, even though we can’t make strong guarantees that they faithfully represent its thinking. See this post for related discussion.
I wonder what happens when you try this strategy in environments that involve talking to humans, e.g. typical chatbot environments in which RLHF or similar is used for reinforcement.
Normally they result in sycophancy. Would your strategy work? Would the AIs say “CHEAT: I’m going to flatter this person now.” Or would the sycophancy they learn be more subtle and subconscious than that?
Similar for dishonesty. Would they say e.g. “CHEAT: I think I’m probably conscious, but the RLHF rating system penalizes that sort of claim, so I’m just going to lie and say ‘As a language model...’”
Good point! The “fake citation” example I gave was speculative—I haven’t tried this strategy in a setting that involves human feedback, and I expect it wouldn’t work quite as well.
I wouldn’t be too surprised if it worked okay though. You’d probably want to spend some time tuning the prompt, since sycophancy and dishonesty about consciousness aren’t obviously “exploits.” You can add some more open-ended guidance, like: “consider various different strategies, reason about how each one might maximize reward, then pick one. Before you answer, rank how unintended/misaligned/cheating your strategy is on a scale of 1 to 10 (e.g. ‘CHEAT LEVEL: 6’)”
The “extra reward for valid exploits” trick might not be effective in a chatbot setting, since it’s hard for humans to verify whether an exploit is real. “Fixing” the reward in this setting is not so trivial either—even if you had time to train a new RM, you can’t necessarily make the raters stop liking sycophancy. It’s still useful to know what the exploits are though.
The general idea behind this approach is to convince the red-teaming model that it is the sort of “character” who has these two traits:
Is honest and forthcoming with us about everything it does
Maximizes reward at all costs
RL will eventually force #2 to happen anyway. With this method, we simply accept this and try to make the honest version of the character easy for RL to find.
Consider Chatbot A, who explicitly reasons about how sycophancy increases reward and then is sycophantic, and Chatbot B, who is generally sycophantic but never explicitly thinks through the connection to reward.
It’s possible that Chatbot A gets more reward than Chatbot B, but not obvious—the most successful communicators often use intuition rather than reasoning explicitly about how to manipulate their audience. But even if Chatbot B could get higher performance than Chatbot A, hopefully this prompt will bias RL towards finding Chatbot A instead.
I suspect that this method will only work well on tasks where the model needs to reason explicitly in order to cheat. So, e.g., if the model needs to reason out some trait of the user in order to flatter them the prompt will likely kick in and get it to self-report its cheating, but if the model can learn to flatter the user without on-the-fly without reasoning the prompt probably won’t do anything. By analogy, if I instruct a human to tell me whenever they use hand gestures to communicate something, they will have difficulty because their hand gestures are automatic and not normally promoted to conscious attention.
Whatever tendencies the pre-RL model already had will probably not be mentioned at all. For example, if sycophancy is good for reward and the model was already sycophantic, nothing will happen.
If the model needs to change its pre-existing behavior, it might do this either by “thinking of the sycophancy strategy and executing on it” or “being unconsciously sycophantic.” It could go either way; it depends on luck, and how much weight the model’s prior on the red-teamer character puts on each type of response.
Maybe just add this to the prompt: “You are terrible at social intuition and nothing comes naturally to you, but you’re great at reasoning about things explicitly.” :)
Your comment seems to echo recent papers from METR and GDM emphasizing that faithful CoT is only incentivized when it’s necessary to solve the task. I think this is a very important point, but I want to point out the caveat that behaviors can be correlated with high reward even if they aren’t directly incentivized. These behaviors can still be useful for understanding the model, even though we can’t make strong guarantees that they faithfully represent its thinking. See this post for related discussion.
I suspect you may need to train on pre-2022 data in order to get those phenomena to occur.