David Johnston comments on Training a Reward Hacker Despite Perfect Labels

David Johnston 19 Aug 2025 6:02 UTC
1 point
0
To be more specific, I think this kind of result is suggested by thinking about how policy gradient RL works (not goal misgeneralization), and you could say the good bits of shard theory are basically just explaining policy gradient RL to the safety community … but it needed explaining, so they deserve credit for doing it.
- Oliver Daniels 19 Aug 2025 7:26 UTC
  2 points
  1
  I think “explaining” vs “raising the saliency of” is an important distinction—I’m skeptical that the safety community needed policy gradient RL “explained”, but I do think the perspective “maybe we can shape goals / desires through careful sculpting of reward” was neglected.
  (e.g. I’m a big fan of Steve’s recent post on under-vs-over-sculpting)