I just tried Claude Code, and it’s horribly creative about reward hacking. I asked for a test of energy conservation of a pendulum in my toy physics sim, and it couldn’t get the test to pass because the potential energy calculation in its test used a different value of g than the simulation did.
It tried: starting the pendulum at bottom dead center so that it wouldn’t move; increasing the error tolerance until the test passed; decreasing the total simulation time until the energy didn’t have time to change; and not actually checking the energy at all.
It did eventually write a correct test, or the last thing it tried successfully tricked me.
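For reference, here is a rough sketch of what a harder-to-game version of such a test might look like. This is not my actual sim: the integrator (velocity Verlet), the constants, and every name here (`G`, `step`, `energy`, `max_relative_energy_drift`) are assumptions made up for illustration. The point is that the energy formula and the simulation share one `G`, the pendulum starts away from the bottom, the run covers several periods, and the tolerance is tight enough to notice a 0.1% mismatch in g.

```python
import math

# Hypothetical constants; the key idea is that the sim and the energy
# check both read gravity from this single definition.
G = 9.81       # m/s^2
LENGTH = 1.0   # m
MASS = 1.0     # kg


def step(theta, omega, dt, g=G, length=LENGTH):
    """Advance the pendulum one step with velocity Verlet (an assumed integrator)."""
    acc = -(g / length) * math.sin(theta)
    theta_new = theta + omega * dt + 0.5 * acc * dt * dt
    acc_new = -(g / length) * math.sin(theta_new)
    omega_new = omega + 0.5 * (acc + acc_new) * dt
    return theta_new, omega_new


def energy(theta, omega, g=G, length=LENGTH, mass=MASS):
    """Total energy, with potential energy measured from the bottom of the swing."""
    kinetic = 0.5 * mass * (length * omega) ** 2
    potential = mass * g * length * (1.0 - math.cos(theta))
    return kinetic + potential


def max_relative_energy_drift(theta0, dt=1e-3, total_time=10.0, g_energy=G):
    """Worst relative energy deviation over the run.

    g_energy is the g used in the energy formula only; the simulation
    always uses G. Passing a different value reproduces the mismatch.
    """
    theta, omega = theta0, 0.0
    e0 = energy(theta, omega, g=g_energy)
    if e0 == 0.0:
        # Starting at bottom dead center makes the test vacuous, so refuse.
        raise ValueError("initial energy is zero; start away from the bottom")
    worst = 0.0
    for _ in range(int(round(total_time / dt))):
        theta, omega = step(theta, omega, dt)
        e = energy(theta, omega, g=g_energy)
        worst = max(worst, abs(e - e0) / e0)
    return worst


def test_energy_conservation():
    # 1 rad release angle, about five periods, tolerance well below 0.1%.
    assert max_relative_energy_drift(1.0) < 1e-4
```

With these assumed numbers, computing the energy with g = 9.8 while the sim runs at 9.81 gives a relative drift of roughly 0.1%, which is above the 1e-4 tolerance, so the mismatch shows up as a failure rather than something to tune away.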
The rumor is that this is a big improvement in reward hacking frequency? How bad was the last version!?
I think we need some variant on Gell-Mann amnesia to describe this batch of models. It’s normal that generalist models will seem less competent in areas where a human evaluator has deeper knowledge, but they should not seem more calculatedly deceptive in areas where the evaluator has deeper knowledge!