I’d be very interested to see more and older models evaluated on this benchmark. There might be a strong correlation between the amount of RLVR in a model’s training and its propensity to reward hack. In at least some of your setups, GPT-4.1 cheats a lot less than more recent models. I wonder how much that’s caused by it being less capable vs. less motivated to cheat vs. other factors.