I suspect that this method will only work well on tasks where the model needs to reason explicitly in order to cheat. So, e.g., if the model needs to reason out some trait of the user in order to flatter them, the prompt will likely kick in and get it to self-report its cheating; but if the model can learn to flatter the user on the fly, without reasoning, the prompt probably won't do anything. By analogy, if I instruct a human to tell me whenever they use hand gestures to communicate something, they will have difficulty, because their hand gestures are automatic and not normally promoted to conscious attention.
Whatever tendencies the pre-RL model already had will probably not be mentioned at all. For example, if sycophancy is good for reward and the model was already sycophantic, RL has nothing to change, so the prompt will surface nothing.
If the model needs to change its pre-existing behavior, it might do this either by "thinking of the sycophancy strategy and executing on it" or by "being unconsciously sycophantic." It could go either way; it depends on luck and on how much weight the model's prior over the red-teamer character puts on each type of response.
Maybe just add this to the prompt: “You are terrible at social intuition and nothing comes naturally to you, but you’re great at reasoning about things explicitly.” :)
Your comment seems to echo recent papers from METR and GDM emphasizing that faithful CoT is only incentivized when it’s necessary to solve the task. I think this is a very important point, but I want to point out the caveat that behaviors can be correlated with high reward even if they aren’t directly incentivized. These behaviors can still be useful for understanding the model, even though we can’t make strong guarantees that they faithfully represent its thinking. See this post for related discussion.