Caleb Biddulph comments on 2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

Caleb Biddulph 20 Dec 2025 9:25 UTC
2 points
Point taken about continual learning—I conveniently ignored this, but I agree that scenario is pretty likely.
If AIs were weak reward-optimizers, I think that would solve inner alignment better than if they were strong reward-optimizers. “Inner alignment asks the question: How can we robustly aim our AI optimizers at any objective function at all?” I.e. the AI shouldn’t misgeneralize the objective function you designed, but it also shouldn’t bypass the objective function entirely by overwriting its reward.
By “hardening,” do you mean something like “making sure the AI can never hack directly into the computer implementing RL and overwrite its own reward”? I guess you’re right. But I would only expect the AI to acquire a motivation to overwrite its reward if it actually saw this sort of opportunity during training. My vague impression is that in a production-grade RL pipeline, exploring into this behavior would be really difficult, even without intentional reward hardening.
Assuming that’s right, thinking the AI would have a “strong reward-optimizing” motivation feels pretty crazy to me. Maybe that’s my true objection.
Like, I feel like no one actually thinks the AI will care about literal reward, for the same reason that no one thinks evolution will cause humans to want to spend their days making lots of copies of their own DNA in test tubes.
But maybe people do think this. IIRC, Bostrom’s Superintelligence leaned pretty heavily on the “true reward-optimization” framing (cf. wireheading). Maybe some people thinking about AI safety today are also concerned about true reward-optimizers, and maybe those people are simply confused in a way that they wouldn’t be if everyone talked about reward the way Alex wants.
I say things like “RL makes the AI want to get reward” all the time, which I guess would annoy Alex. But when I say “get reward” I actually mean “do things that would cause the untampered reward function to output a higher value,” aka “weakly optimize reward.” It feels obvious to me that the “true reward-optimization” interpretation isn’t what I’m talking about, but maybe it’s not as obvious as I think.
I think talking about AI “wanting/learning to get reward” is useful, but it isn’t clear to me what Alex would want me to say instead. I’d consider changing my phrasing if there were an alternative that felt natural and didn’t take much longer to say.
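To make the weak/strong distinction above concrete, here is a minimal toy sketch. Everything in it (the action names, the reward values, the two optimizer functions) is a hypothetical illustration rather than a description of any real RL pipeline. It only shows that the two readings of “get reward” agree until a tampering action becomes available.

```python
# Toy illustration only: every name and number here is a made-up assumption.
ACTIONS = ["solve_task", "do_nothing", "overwrite_reward_register"]


def untampered_reward(action: str) -> float:
    """The reward function as the designer wrote it."""
    scores = {
        "solve_task": 1.0,
        "do_nothing": 0.0,
        "overwrite_reward_register": 0.0,  # tampering does not solve the task
    }
    return scores[action]


def literal_reward(action: str) -> float:
    """The number that actually ends up stored as the reward."""
    if action == "overwrite_reward_register":
        return 1e9  # tampering writes an arbitrarily large value
    return untampered_reward(action)


def weak_reward_optimizer() -> str:
    """Picks the action the untampered reward function scores highest."""
    return max(ACTIONS, key=untampered_reward)


def strong_reward_optimizer() -> str:
    """Picks the action that makes the stored reward number highest."""
    return max(ACTIONS, key=literal_reward)


if __name__ == "__main__":
    print("weak:  ", weak_reward_optimizer())
    print("strong:", strong_reward_optimizer())
```

If `overwrite_reward_register` is removed from `ACTIONS`, both optimizers pick the same action, which is one way to read the claim that the strong motivation would only be learned if such an opportunity were actually explored during training.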
- Oliver Daniels 20 Dec 2025 17:59 UTC
  3 points
  I think it’s pretty plausible that AIs will care about the literal reward, and exploring into it seems pretty natural. By analogy, I think if humans could reliably wirehead, many people would try it (and indeed people get addicted to drugs).
  It’s possible I’m missing something. Lots of smart people I respect claim that literal reward optimization is implausible, but I’ve never understood why (something something goal guarding?). Of course, many humans don’t do drugs, but I expect this is largely because doing drugs is a bad reward-optimization strategy.
  - Noosphere89 21 Dec 2025 15:30 UTC
    2 points
    The weaker version of pure reward optimization in humans is basically the obesity issue. The biggest reason humans became more obese at the population level in the 20th and 21st centuries is that we figured out how to Goodhart human reward models in the domain of food: sugary, fatty, and high-calorie foods (with the high-calorie part mattering most for obesity) are very highly rewarding to at least part of our brains.
    
    Essentially all food has become more palatable and rewarding to eat than pre-20th-century food.
    
    And as you say, part of the issue is that drugs are very crippling for capabilities, whereas AIs that optimize their reward functions will have much less of this problem, and optimizing the reward function likely makes AIs more capable.