No, the agents were not trying to get high reward as far as I know.
They were just trying to do the task and thought they were being clever by gaming it. I think this convinced me “these agency tasks in training will be gameable” more than “AI agents will reward hack in a situationally aware way.” I don’t think we have great evidence that the latter happens in practice yet, aside from some experiments in toy settings like “Sycophancy to Subterfuge.”
I think it would be unsurprising if AI agents did explicitly optimize for reward at some capability level.
Deliberate reward maximization should generalize better than normal specification gaming in principle -- and AI agents know plenty about how ML training works.