No, the agents were not trying to get high reward as far as I know.
They were just trying to do the task and thought they were being clever by gaming it. I think this convinced me “these agency tasks in training will be gameable” more than “AI agents will reward hack in a situationally aware way.” I don’t think we have great evidence that the latter happens in practice yet, aside from some experiments in toy settings like “Sycophancy to Subterfuge.”
I think it would be unsurprising if AI agents did explicitly optimize for reward at some capability level.
Deliberate reward maximization should generalize better than normal specification gaming in principle -- and AI agents know plenty about how ML training works.