In my time at METR, I saw agents game broken scoring functions. I’ve also heard that people at AI companies have seen this happen in practice, and that it’s been something of a headache for them.
Are you implicitly or explicitly claiming that reward was the optimization target for agents tested at METR, and that AI companies have experienced issues with reward being the optimization target for AIs?
No, the agents were not trying to get high reward as far as I know.
They were just trying to do the task and thought they were being clever by gaming it. This convinced me more that “these agency tasks in training will be gameable” than that “AI agents will reward hack in a situationally aware way.” I don’t think we have great evidence that the latter happens in practice yet, aside from some experiments in toy settings like “Sycophancy to Subterfuge.”
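To make the “gameable task” worry concrete, here is a minimal hypothetical sketch (the file name, function, and check are invented for illustration and are not any real METR or lab scorer): a scoring function that only verifies an output artifact exists and parses, which an agent can satisfy without doing the intended work and without ever modeling the reward signal itself.

```python
# Hypothetical illustration only -- not an actual METR or lab scoring function.
# A "broken" scorer: it checks that an output artifact exists and is well formed,
# but never checks that the underlying work was actually done.
from pathlib import Path


def score_submission(workdir: Path) -> float:
    """Intended check: the agent trained a model and wrote its predictions.

    Broken check: any non-empty predictions.csv earns full credit, so an agent
    that just writes constant guesses passes while sincerely believing it found
    a clever shortcut to completing the task.
    """
    preds = workdir / "predictions.csv"
    if not preds.exists():
        return 0.0
    rows = preds.read_text().strip().splitlines()
    return 1.0 if rows else 0.0
```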
I think it would be unsurprising if AI agents did explicitly optimize for reward at some capability level.
Deliberate reward maximization should generalize better than normal specification gaming in principle -- and AI agents know plenty about how ML training works.