I’m confused by this claim—goal mis-generalization/deceptive-misalignment/scheming is centrally about generalization failures due to policies maximizing reward for the wrong reasons, and these threat models were widely discussed prior to 2022.
“Maximizing” reward? I don’t expect a supposedly deceptively misaligned RL policy to necessarily grab control of the reward button and press it over and over and over again to maximize reward.[1] It can have high reward, but I don’t expect it to maximize the reward. Words have meanings; let’s actually try to be precise when talking about this.[2]
And this is one of the points of Reward is not the optimization target: stop conflating “reward” as the chisel that shapes a policy’s cognition with “reward” as what the model actually tries to accomplish. Through the process of training, you are selecting for a model that scores well according to some metric. But that doesn’t make the policy intentionally attempt to make that metric go higher and higher and higher (in the case of model-free RL).
In the same way, human genes were selected for something approximating IGF (inclusive genetic fitness), but humans don’t intentionally try to have as many children as possible (instead, we execute adaptations that, at least in some environments, result in us having a large, but not maximal, number of children).
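To make the chisel/target distinction concrete, here’s a minimal toy sketch (my own illustration, not anything from the post): a REINFORCE-style update on a three-armed bandit. The reward shows up only as a scalar weight in the parameter update applied from the outside; the policy’s own computation never takes reward as an input, never represents it, and never “tries” to increase it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit: arm 2 has the highest expected reward.
true_means = np.array([0.1, 0.5, 0.9])

# Softmax policy over arms, parameterized by logits.
theta = np.zeros(3)

def policy(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

lr = 0.1
for step in range(2000):
    probs = policy(theta)
    a = rng.choice(3, p=probs)          # the policy's forward pass never touches reward
    r = rng.normal(true_means[a], 0.1)  # reward is only observed by the training loop

    # REINFORCE update: reward acts as a scalar weight on the log-prob gradient.
    # It is a chisel applied to theta from the outside; nothing in the policy
    # computes, represents, or attempts to increase r.
    grad_logp = -probs
    grad_logp[a] += 1.0
    theta += lr * r * grad_logp

print(policy(theta))  # probability mass concentrates on the high-reward arm
```

Selection on reward changes which parameters survive training; whether the resulting policy ends up with “get more reward” as an internal goal is a separate, empirical question.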
Which is what the word “maximize” means in this context: to make maximal, or at least to try to make maximal.
I have observed before how loosey-goosey reasoning and argumentation can lead to faulty conclusions about important topics here.
Yup, my bad, editing to “receiving high reward”.