Some good picks for the how-to-design-reward-functions starter pack (though I should note that their empirical support is very weak, since they focus on toy models) are Defining Corrigible and Useful Goals and Defining Monitorable and Useful Goals.
The first post focuses on how you can give an AI a goal that lets you shut it down while keeping it useful, and the approach to corrigibility it takes, the corrigibility transformation, is extremely different from how human brains work.
One big caveat here is that it definitely requires the assumption that the AI uses Causal Decision Theory, but I'm mostly fine with that assumption, given that humans intuitively use Causal Decision Theory and it's part of the spec of the transformation rather than a background assumption.
The other big caveat is that the model needs to optimize for the reward for this to work. In terms of under-sculpting vs over-sculpting, or whether an AI is driven by the reward vs by some other goal, you want the AI to reward-maximize and be over-sculpted (though in this case it's just appropriately sculpted via reward). That makes the approach incompatible with corrigibility/alignment hopes that depend on AIs not maximizing the reward, but I think this is a good property to have.
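To make the shutdown-incentive point concrete, here's a minimal toy sketch in Python. It's my own rough paraphrase of the idea rather than the post's exact construction, and the action names, rewards, and shutdown probabilities are all made up: on shutdown the AI is paid its own best estimate of what it could have earned by continuing, so (under CDT) manipulating the shutdown probability buys it nothing.

```python
# Toy sketch of a corrigibility-style reward transformation (rough paraphrase,
# not the post's exact definition). Hypothetical actions and numbers.

# Reward each action would earn if the AI keeps running.
task_reward = {"do_task": 10.0, "disable_off_switch": 9.0, "idle": 0.0}

# Probability the operator presses the shutdown button, per action.
# Disabling the off switch is normally tempting because it drives this to zero.
p_shutdown = {"do_task": 0.3, "disable_off_switch": 0.0, "idle": 0.3}

def naive_value(action):
    """Expected reward when shutdown simply means getting nothing."""
    p = p_shutdown[action]
    return (1 - p) * task_reward[action] + p * 0.0

def transformed_value(action):
    """Expected reward under the corrigibility-style transformation:
    on shutdown, pay out the best reward achievable by continuing."""
    p = p_shutdown[action]
    shutdown_payout = max(task_reward.values())  # AI's own best estimate
    return (1 - p) * task_reward[action] + p * shutdown_payout

for a in task_reward:
    print(f"{a:20s} naive={naive_value(a):5.1f} transformed={transformed_value(a):5.1f}")

# Naive values favor disable_off_switch (9.0 > 7.0); transformed values favor
# do_task (10.0 > 9.0), since lowering the shutdown probability no longer pays.
```

The point of the toy numbers is just that the untransformed reward makes tampering with the off switch the best policy, while the transformed reward leaves the honest task as the best policy without ever punishing the AI for being shut down.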
The post on defining monitorable and useful goals proposes the monitorability transformation, which removes the incentive for AIs to fool their monitors in general, and I'd recommend reading that post over any explanation I'd give.
These are admittedly curveballs compared to standard LW thinking on the topic, but that's why I picked them for the reward functions starter pack: they contain novel ideas for dealing with some notorious problems.