It looks to me like GPT-4o has a bias against the word, but it will say it when pressed. My chat. I also find this reproduces as described in response to exactly the input “what are the most well-known sorts of reward hacking in LLMs”. I got “Synergistic Hacking with Human Feedback”.