You could have a view of RL that is totally agnostic about training dynamics and just reasons about policies conditioned on reward-maximization.
But the stories about deceptive-misalignment typically route through deceptive cognition being reinforced by the training process (i.e. you start with a NN doing random stuff, it explores into instrumental-(mis)alignment, and the training process reinforces the circuits that produced the instrumental-(mis)alignment).
…
I see shard-theory as making two interventions in the discourse:
1. emphasizing path-dependence in RL training (vs simplicity bias)
2. emphasizing messy heuristic behaviors (vs cleanly-factored goal directed agents)
I think both these interventions are important and useful, but I am sometimes frustrated by broader claims made about the novelty of shard-theory. I think these broad claims (perversely) obscure the most cruxy/interesting scientific questions raised by shard-theory, e.g. “how path-dependent is RL, actually” (see my other comment).
To be more specific, I think this kind of result is suggested by thinking about how policy gradient RL works (not goal misgeneralization), and you could say the good bits of shard theory are basically just explaining policy gradient RL to the safety community … but it needed explaining, so they deserve credit for doing it.
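To make that concrete, here is a minimal sketch of vanilla REINFORCE on a toy two-armed bandit (the environment, reward probabilities, and hyperparameters are illustrative assumptions, not anything from the thread). The point it shows: the update upweights whatever parameters produced a rewarded action, with no reference to why that action was taken, which is the sense in which the training process “reinforces the circuits that produced the behavior.”

```python
import numpy as np

# Minimal REINFORCE on a toy 2-armed bandit with a softmax policy.
# (Hypothetical illustration: arm payoffs and learning rate are made up.)

rng = np.random.default_rng(0)
theta = np.zeros(2)   # logits over two actions
lr = 0.5

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reward(action):
    # Both arms sometimes pay off; arm 1 pays off slightly more often.
    return float(rng.random() < (0.4 if action == 0 else 0.6))

for step in range(500):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = reward(a)
    # Gradient of log pi(a) w.r.t. the logits for a softmax policy.
    grad_log_pi = -probs
    grad_log_pi[a] += 1.0
    # REINFORCE: push up the log-probability of the sampled action,
    # scaled by whatever reward it happened to receive.
    theta += lr * r * grad_log_pi

print("final policy:", softmax(theta))
```

Whichever behavior happens to get sampled and rewarded early gets upweighted and then sampled more often, which is also where the path-dependence intuition comes from.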
I think “explaining” vs “raising the saliency of” is an important distinction—I’m skeptical that the safety community needed policy gradient RL “explained”, but I do think the perspective “maybe we can shape goals / desires through careful sculpting of reward” was neglected.
(e.g. I’m a big fan of Steve’s recent post on under-vs-over-sculpting)
I share your general feelings about shard theory, but think you were being a bit too stingy with credit in this particular case.