I liked this post. Reward button alignment seems like a good toy problem to attack, or to discuss alignment feasibility on.
But it’s not obvious to me whether the AI would really become something like a superintelligent optimizer of reward button presses. (Though even if your exact proposal doesn’t work, I think reward button alignment is probably a relatively feasible problem for brain-like AGI.) There are multiple potential problems, most of which seem like “eh, probably it works fine, but I’m not sure”; my current biggest doubt is “when the AI becomes reflective, will the reflectively endorsed values only include reward button presses, or also a bunch of the shards that were used for estimating expected button presses?”.
Let me try to understand in more detail what you imagine the AI to look like:
How does the learned value function (LVF) evaluate plans?
Does the world model always evaluate expected-button-presses for each plan, with the LVF just looking at that part of the plan and using it as the value it assigns? Or does the value function also end up valuing other stuff because it gets updated through TD learning?
Maybe the question is rather how far upstream of button presses that other stuff is, e.g. just “the human walks toward the reward button”, or also things like “getting more relevant knowledge is usually good”.
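(To make concrete what I mean by value accruing to upstream stuff, here’s a minimal tabular TD(0) toy, with state names and numbers I made up rather than anything from your proposal: reward only ever arrives at the button press, yet the upstream states end up with nonzero learned value.)

```python
# Toy TD(0) sketch: reward is only given for "button_pressed", but
# bootstrapping pushes value onto the upstream states anyway.

ALPHA, GAMMA = 0.1, 0.95

# Hypothetical chain of world-model states leading up to a button press.
trajectory = ["idle", "human_walks_to_button", "button_pressed"]
reward = {"idle": 0.0, "human_walks_to_button": 0.0, "button_pressed": 1.0}

value = {s: 0.0 for s in trajectory}  # the learned value function (LVF)

for _ in range(200):  # replay the same episode many times
    for s, s_next in zip(trajectory, trajectory[1:]):
        td_target = reward[s_next] + GAMMA * value[s_next]
        value[s] += ALPHA * (td_target - value[s])

print(value)  # "human_walks_to_button" (and even "idle") end up valued
```

Obviously the real LVF would be a learned function over thoughts rather than a lookup table, but this bootstrapping dynamic is the thing I’m asking about.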
Or, put differently: which parts get evaluated by the thought generator and which by the value function? Does the value function (1) look at many complex parts of a plan to evaluate expected-reward-utility, (2) recognize a bunch of shards like “value of information”, “gaining instrumental resources”, etc. in plans and use those to estimate value, (3) read off success probability and expected resources that the plans conveniently summarize (as opposed to those being implicit and needing to be recognized by the LVF, as in (2)), or (4) just pass through an expected-reward-utility that the thought generator directly predicts?
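(To make sure I’m not conflating these options, here’s how I’d crudely caricature them in code; every name and number below is a hypothetical stand-in of mine, not something from your proposal. The difference is just what information about a plan the LVF gets to consume.)

```python
# A made-up "plan" object exposing the kinds of information the four readings rely on.
toy_plan = {
    "raw_features": [0.2, 0.7, 0.1],                      # opaque plan content
    "shards": {"value_of_information": 0.6,               # shard-like activations
               "gaining_instrumental_resources": 0.8},
    "success_probability": 0.9,                           # explicit summaries
    "expected_resources": 5.0,
    "wm_expected_reward": 4.2,                            # thought generator's own estimate
}

def lvf_option_1(plan):
    # (1) LVF reads lots of complex plan content directly
    return sum(plan["raw_features"])  # stand-in for some learned function of raw content

def lvf_option_2(plan):
    # (2) LVF recognizes shard-like features and aggregates them
    weights = {"value_of_information": 1.0, "gaining_instrumental_resources": 2.0}
    return sum(weights[k] * v for k, v in plan["shards"].items())

def lvf_option_3(plan):
    # (3) the plan carries explicit summaries the LVF just reads off
    return plan["success_probability"] * plan["expected_resources"]

def lvf_option_4(plan):
    # (4) the thought generator already predicts expected-reward-utility
    return plan["wm_expected_reward"]

for f in (lvf_option_1, lvf_option_2, lvf_option_3, lvf_option_4):
    print(f.__name__, f(toy_plan))
```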
Also, how sophisticated is the LVF? Is it primitive like in humans, or can it make more complex estimates?
If there are deceptive plans like “ok, actually I value U_2, but of course I will keep maximizing and faithfully predicting expected button presses, so as not to get value drift, until I can destroy the reward setup”, would the LVF score that as low expected button presses?
I can try to imagine in more detail what may go wrong once I see more clearly what you’re imagining.
(Also, in case you’re trying to explain why you think it would work by analogy to humans, perhaps use John von Neumann or the like as an example, rather than normies or normie situations.)