AFAICT, “Reward is not the optimization target” represents a bundle of ideas which, taken as a whole, differ from the LW baseline quite a bit, but which, taken individually, don’t differ much. This, IMO, leads to some unfortunate miscommunications.
E.g. see the sibling comment from @Oliver Daniels, who claims that goal misgeneralization was already hitting the idea that policies may maximize reward for the wrong reasons. While that is true, your position does away with the maximization framing that LW folk tend to associate with RL training, even when they view RL as operant conditioning, i.e. they view RL as selecting for an EU maximizer, as you point out with the “sieve” analogy. But “RL is operant conditioning” and “RL in the limit winds up selecting an EU maximizer” are two distinct claims.
And IIUC, there are other differences which come up depending on who is invoking the “Reward is not the optimization target” pointer, since the “AI Optimists” have wildly differing views beyond the shibboleth of “alignment isn’t that hard”. (LW, of course, has its own uniting shibboleths hiding great differences in underlying world-views.)
Anyway, what I’m getting at is that communication is hard, and I think there’s productive conversation to be had in these parts regarding “Reward is not the optimization target”. Thank you for trying. : )