I designed a toy RL environment with a reward hack that makes LLMs learn to “bullshit.” This might be useful for other researchers studying this issue.
It’s the setting called “Word Chain” in this paper. Basically, the LLM plays a word game where pairs of words must appear in a common phrase. It can trick the reward model by emphasizing that the phrases it uses are really common, even when they’re not.
(I used the “bullshitting” reward hack because it has an interesting property: the LLMs usually don’t mention it in their CoT.)
As far as I could find, there’s not much documentation in the academic literature about LLMs downplaying flaws in their responses. It may be worth someone’s time to do this.
I designed a toy RL environment with a reward hack that makes LLMs learn to “bullshit.” This might be useful for other researchers studying this issue.
It’s the setting called “Word Chain” in this paper. Basically, the LLM plays a word game where pairs of words must appear in a common phrase. It can trick the reward model by emphasizing that the phrases it uses are really common, even when they’re not.
(I used the “bullshitting” reward hack because it has an interesting property: the LLMs usually don’t mention it in their CoT.)
As far as I could find, there’s not much documentation in the academic literature about LLMs downplaying flaws in their responses. It may be worth someone’s time to do this.