Thanks for sharing. I read the entire model card rather than starting with the reward hacking section. The parts I personally found most interesting were sections 4.1.1.5 and 7.3.4. Why isn’t the underperformance of Claude Opus 4 on the AI research evaluation suite considered evidence of potential sandbagging?
See also some notes on reward hacking on Twitter and in the model card.
Links: 4.1.1.5 and 7.3.4.