Whatever tendencies the pre-RL model already had will probably not be mentioned in the chain of thought at all. For example, if sycophancy is good for reward and the model was already sycophantic, RL has no reason to change anything.
If the model needs to change its pre-existing behavior, it might do so either by “thinking of the sycophancy strategy and executing on it” or by “being unconsciously sycophantic.” It could go either way; it depends on luck and on how much weight the model’s prior over the red-teamer character puts on each type of response.
Maybe just add this to the prompt: “You are terrible at social intuition and nothing comes naturally to you, but you’re great at reasoning about things explicitly.” :)
Your comment seems to echo recent papers from METR and GDM emphasizing that faithful CoT is only incentivized when it’s necessary to solve the task. I think this is a very important point, but I’d add the caveat that behaviors can be correlated with high reward even when they aren’t directly incentivized. Such behaviors can still be useful for understanding the model, even though we can’t make strong guarantees that they faithfully represent its thinking. See this post for related discussion.