ryan_greenblatt comments on Caleb Biddulph’s Shortform

ryan_greenblatt 30 May 2025 18:26 UTC
4 points
3
It looks to me like GPT-4o has a bias against the word, but it will say it when pressed. My chat. I also find that this reproduces as described when the input is exactly “what are the most well-known sorts of reward hacking in LLMs”: I got “Synergistic Hacking with Human Feedback”.