Oliver Daniels comments on Training a Reward Hacker Despite Perfect Labels

Oliver Daniels 19 Aug 2025 7:26 UTC
2 points
I think “explaining” vs. “raising the saliency of” is an important distinction. I’m skeptical that the safety community needed policy gradient RL “explained”, but I do think the perspective “maybe we can shape goals / desires through careful sculpting of reward” was neglected.
(E.g., I’m a big fan of Steve’s recent post on under- vs. over-sculpting.)