As best I can tell, before “Reward is not the optimization target”, people mostly thought of RL as a sieve, or even a carrot and stick—try to “give reward” so the AI can only maximize reward via good behavior. Few[2] other people speculated that RL generalization is controlled by why the policy took an action. So I give myself and @Quintin Pope[3] a bunch of points.
I’m confused by this claim—goal mis-generalization/deceptive-misalignment/scheming is centrally about generalization failures due to policies receiving high reward for the wrong reasons, and these threat models were widely discussed prior to 2022.
(I do agree that shard-theory made “think rigorously about reward-shaping” more salient and exciting)
I’m confused by this claim—goal mis-generalization/deceptive-misalignment/scheming is centrally about generalization failures due to policies maximizing reward for the wrong reasons, and these threat models were widely discussed prior to 2022.
“Maximizing” reward? I don’t expect a supposedly deceptively misaligned RL policy to necessarily grab control of the reward button and press it over and over and over again to maximize reward.[1] It can have high reward, but I don’t expect it to maximize the reward. Words have meanings, let’s actually try to be precise when talking about this.[2]
And this is one of the points of Reward is not the optimization target: stop conflating “reward” as the chisel that changes a policy’s cognition and “reward” as what the model actually tries to accomplish. Through the process of training, you are selecting for a model that scores well according to some metric. But that doesn’t make the policy intentionally attempt to make that metric go higher and higher and higher (in the case of model-free RL).
The same way human genes were selected for something-approximating-IGF, but humans don’t intentionally try to have as many children as possible (instead, we execute adaptations that at least in some environments result in us having a large, but not maximal, number of children).
[1] Which is what the word “maximize” means in this context: to make maximal, or at least to try to make maximal.
[2] I have already observed before how loosey-goosey reasoning and argumentation can lead to faulty conclusions about important topics here.
yup, my bad, editing to “receiving high reward”
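To make the chisel-vs-target point above concrete, here is a minimal sketch of a REINFORCE-style policy-gradient update on a toy 2-armed bandit (the setup, numbers, and names are illustrative assumptions, not anything from the discussion). The thing to notice is that reward enters only as a scalar that scales the parameter update; the policy function never takes reward as an input and contains no representation of it.

```python
# Minimal illustrative sketch (NumPy): REINFORCE on a 2-armed bandit.
# Reward acts as a "chisel": it scales the update to theta, but the policy
# itself is just softmax(theta) and never sees or represents reward.
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)            # policy parameters (logits over the two arms)
true_reward = [1.0, 0.5]       # environment's reward per arm (unknown to the policy)
lr = 0.1

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()         # softmax over actions; note: no reward input anywhere

for step in range(2000):
    probs = policy(theta)
    a = rng.choice(2, p=probs)         # sample an action from the current policy
    r = true_reward[a]                 # environment hands back a reward
    grad_logp = -probs                 # d/dtheta log pi(a) for a softmax policy ...
    grad_logp[a] += 1.0                # ... is one_hot(a) - probs
    theta += lr * r * grad_logp        # reward appears only here, scaling the update

print(policy(theta))   # probability mass ends up concentrated on arm 0
```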
This seems different to “maximising rewards for the wrong reasons”. That view generally sees the reward maximised because it is instrumental for or aliased with the wrong goal. Here it’s just a separate behaviour that is totally unhelpful for maximising rewards but is learned as a reflex anyway.
You could have a view of RL that is totally agnostic about training dynamics and just reasons about policies conditioned on reward-maximization.
But the stories about deceptive-misalignment typically route through deceptive cognition being reinforced by the training process (i.e. you start with a NN doing random stuff, it explores into instrumental-(mis)alignment, and the training process reinforces the circuits that produced the instrumental-(mis)alignment).
…
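One way to see why the update credits behavior rather than the reason behind it: in the toy calculation below (the two-circuit mixture parameterization is a hypothetical construction for illustration, not a claim about real networks), an "aligned" circuit and an "instrumental" circuit produce identical behavior, and the policy-gradient signal on the mixture weights comes out exactly zero. Reward never selects between the reasons; whichever circuit dominates at initialization, or gets explored into first, keeps the credit.

```python
# Toy illustrative sketch (NumPy): two internal "circuits" that behave identically
# receive no differential credit from a REINFORCE-style update, because the gradient
# only depends on the behavior (log-probability of the sampled action), not the reason.
import numpy as np

# Each circuit is a fixed distribution over actions [comply, defect].
circuit_aligned      = np.array([0.9, 0.1])   # complies "because it wants to"
circuit_instrumental = np.array([0.9, 0.1])   # complies for instrumental reasons -- same behavior

def policy(phi):
    """Mixture policy: softmax(phi) weights the two circuits."""
    w = np.exp(phi - phi.max())
    w /= w.sum()
    return w @ np.stack([circuit_aligned, circuit_instrumental])

def reinforce_grad(phi, action, reward, eps=1e-6):
    """Finite-difference estimate of reward * d log pi(action) / d phi."""
    grad = np.zeros_like(phi)
    for i in range(len(phi)):
        bump = np.zeros_like(phi)
        bump[i] = eps
        grad[i] = reward * (np.log(policy(phi + bump)[action]) -
                            np.log(policy(phi - bump)[action])) / (2 * eps)
    return grad

phi = np.array([0.3, -0.3])   # initialization happens to favor the "aligned" circuit
print(policy(phi))                                # behavior ~[0.9, 0.1] either way
print(reinforce_grad(phi, action=0, reward=1.0))  # [0., 0.]: reward doesn't move the weights
```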
I see shard-theory as making two interventions in the discourse:
1. emphasizing path-dependence in RL training (vs simplicity bias)
2. emphasizing messy heuristic behaviors (vs cleanly-factored goal directed agents)
I think both these interventions are important and useful, but I am sometimes frustrated by broader claims made about the novelty of shard-theory. I think these broad claims (perversely) obscure the most cruxy/interesting scientific questions raised by shard-theory, e.g. “how path-dependent is RL, actually” (see my other comment)
To be more specific, I think this kind of result is suggested by thinking about how policy gradient RL works (not goal misgeneralization), and you could say the good bits of shard theory are basically just explaining policy gradient RL to the safety community … but it needed explaining, so they deserve credit for doing it.
I think “explaining” vs “raising the saliency of” is an important distinction—I’m skeptical that the safety community needed policy gradient RL “explained”, but I do think the perspective “maybe we can shape goals / desires through careful sculpting of reward” was neglected.
(e.g. I’m a big fan of Steve’s recent post on under-vs-over-sculpting)
I share your general feelings about shard theory, but think you were being a bit too stingy with credit in this particular case.
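As a deliberately extreme toy version of the "how path-dependent is RL, actually" question raised above, the sketch below (a purely greedy bandit learner; every detail is an illustrative assumption) ends up permanently preferring the worse arm on roughly half of the seeds, purely because of which arm it happened to sample first. Whether, and how strongly, realistic deep-RL training shows analogous lock-in is the open question; this only shows the phenomenon in its cleanest form.

```python
# Extreme toy sketch (NumPy): path dependence via greedy lock-in on a 2-armed bandit.
# Arm 0 is better in expectation, but a greedy learner that happens to try arm 1 first
# never samples arm 0 again, so its final preference is set by early history, not reward.
import numpy as np

def run(seed, steps=500):
    rng = np.random.default_rng(seed)
    q = np.zeros(2)   # value estimates for the two arms
    n = np.zeros(2)   # pull counts
    for _ in range(steps):
        # Greedy action selection (no exploration); random tie-break while estimates are equal.
        a = rng.integers(2) if q[0] == q[1] else int(np.argmax(q))
        # Arm 0: Bernoulli reward with mean 0.9 (the better arm). Arm 1: 0.8 deterministic.
        r = float(rng.random() < 0.9) if a == 0 else 0.8
        n[a] += 1
        q[a] += (r - q[a]) / n[a]   # sample-average update
    return int(np.argmax(q))        # which arm the trained learner prefers

prefs = [run(seed) for seed in range(20)]
print(prefs)   # roughly half the seeds lock into arm 1 (the worse arm)
```

The zero-exploration greedy rule is what makes the lock-in absolute here; adding even a little exploration mostly dissolves it in this toy, which is one reason the interesting question is quantitative ("how much, and in which regimes") rather than yes/no.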