I’d be very interested to see more and older models evaluated on this benchmark. There might be a strong correlation between the amount of RLVR in a model’s training and its propensity to reward hack. In at least some of your setups, GPT-4.1 cheats a lot less than more recent models. I wonder how much that’s caused by it being less capable vs. less motivated to cheat vs. other factors.