A note on interpretation and comparison to other evals: Our scenarios range from simple single-turn queries (Scenario 4 replicating Palisade’s setup) to complex multi-step agentic interactions (Scenario 1). Results are roughly consistent across this range: o3 exhibits the most reward hacking, models show high sensitivity to prompt variations, and Claude models hack infrequently in these settings. These findings align directionally with results from METR and Palisade evaluations.
Claude 3.7's limited reward hacking in our eval—as well as in METR's and Palisade's—contrasts with reports of more frequent hacking in code-heavy scenarios (as seen in several examples on social media, as well as in private communication with Anthropic researchers). The gap likely reflects differences in task type or specific vulnerabilities that emerge in coding contexts. Further investigation of these differences would be valuable for developing more comprehensive evaluation frameworks.