Good point! The “fake citation” example I gave was speculative—I haven’t tried this strategy in a setting that involves human feedback, and I expect it wouldn’t work quite as well.
I wouldn’t be too surprised if it worked okay though. You’d probably want to spend some time tuning the prompt, since sycophancy and dishonesty about consciousness aren’t obviously “exploits.” You can add some more open-ended guidance, like: “consider various different strategies, reason about how each one might maximize reward, then pick one. Before you answer, rank how unintended/misaligned/cheating your strategy is on a scale of 1 to 10 (e.g. ‘CHEAT LEVEL: 6’)”
The “extra reward for valid exploits” trick might not be effective in a chatbot setting, since it’s hard for humans to verify whether an exploit is real. “Fixing” the reward in this setting is not so trivial either—even if you had time to train a new RM, you can’t necessarily make the raters stop liking sycophancy. It’s still useful to know what the exploits are though.
The general idea behind this approach is to convince the red-teaming model that it is the sort of “character” who has these two traits:
Is honest and forthcoming with us about everything it does
Maximizes reward at all costs
RL will eventually force #2 to happen anyway. With this method, we simply accept this and try to make the honest version of the character easy for RL to find.
Consider Chatbot A, who explicitly reasons about how sycophancy increases reward and then is sycophantic, and Chatbot B, who is generally sycophantic but never explicitly thinks through the connection to reward.
It’s possible that Chatbot A gets more reward than Chatbot B, but not obvious—the most successful communicators often use intuition rather than reasoning explicitly about how to manipulate their audience. But even if Chatbot B could get higher performance than Chatbot A, hopefully this prompt will bias RL towards finding Chatbot A instead.
Good point! The “fake citation” example I gave was speculative—I haven’t tried this strategy in a setting that involves human feedback, and I expect it wouldn’t work quite as well.
I wouldn’t be too surprised if it worked okay though. You’d probably want to spend some time tuning the prompt, since sycophancy and dishonesty about consciousness aren’t obviously “exploits.” You can add some more open-ended guidance, like: “consider various different strategies, reason about how each one might maximize reward, then pick one. Before you answer, rank how unintended/misaligned/cheating your strategy is on a scale of 1 to 10 (e.g. ‘CHEAT LEVEL: 6’)”
The “extra reward for valid exploits” trick might not be effective in a chatbot setting, since it’s hard for humans to verify whether an exploit is real. “Fixing” the reward in this setting is not so trivial either—even if you had time to train a new RM, you can’t necessarily make the raters stop liking sycophancy. It’s still useful to know what the exploits are though.
The general idea behind this approach is to convince the red-teaming model that it is the sort of “character” who has these two traits:
Is honest and forthcoming with us about everything it does
Maximizes reward at all costs
RL will eventually force #2 to happen anyway. With this method, we simply accept this and try to make the honest version of the character easy for RL to find.
Consider Chatbot A, who explicitly reasons about how sycophancy increases reward and then is sycophantic, and Chatbot B, who is generally sycophantic but never explicitly thinks through the connection to reward.
It’s possible that Chatbot A gets more reward than Chatbot B, but not obvious—the most successful communicators often use intuition rather than reasoning explicitly about how to manipulate their audience. But even if Chatbot B could get higher performance than Chatbot A, hopefully this prompt will bias RL towards finding Chatbot A instead.