A note on interpretation and comparison to other evals: Our scenarios range from simple single-turn queries (Scenario 4 replicating Palisade’s setup) to complex multi-step agentic interactions (Scenario 1). Results are roughly consistent across this range: o3 exhibits the most reward hacking, models show high sensitivity to prompt variations, and Claude models hack infrequently in these settings. These findings align directionally with results from METR and Palisade evaluations.
Claude 3.7's limited reward hacking in our eval—as well as in METR's and Palisade's—contrasts with reports of more frequent hacking in code-heavy scenarios (as seen in several examples on social media, as well as in private communication with Anthropic researchers). The gap likely reflects differences in task type or specific vulnerabilities that emerge in coding contexts. Further investigation of these differences would be valuable for developing more comprehensive evaluation frameworks.