Most “simple” reward feedback leads to convergent instrumental subgoals, whereas approval / myopic feedback almost never does unless that’s what the human says is correct.
I am kinda confused about what sort of approval feedback you’re talking about. Suppose we have a simple reward function, which gives the agent more points for collecting more berries. Then the agent has lots of convergent instrumental subgoals. Okay, what about a simple approval function, which approves actions insofar as the supervisor expects them to lead to collecting more berries? Then the agent *also* learns convergent instrumental subgoals, because it learns to take whatever actions lead to collecting more berries (assuming the supervisor is right about that).
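To make that concrete, here's a toy sketch of the worry (mine, purely illustrative; the action names and forecast numbers are invented): if the supervisor's approval of an action just tracks how many berries they expect it to eventually yield, then even a purely myopic argmax over approval selects the same instrumentally convergent action as a long-horizon berry maximizer would.

```python
# Toy berry world: approval encodes the supervisor's forecast of
# long-run berry count, so greedily argmaxing one-step approval
# reproduces the choices of a long-horizon berry maximizer.
# (All names and numbers are invented for illustration.)

ACTIONS = ["eat_berries_now", "plant_seeds", "seize_neighboring_fields"]

# Assumed-accurate supervisor forecast of total berries eventually
# collected, conditional on taking each action next.
EXPECTED_FUTURE_BERRIES = {
    "eat_berries_now": 3,
    "plant_seeds": 10,
    "seize_neighboring_fields": 50,
}

def approval(action: str) -> float:
    # The supervisor approves an action insofar as they expect it to
    # lead to collecting more berries.
    return EXPECTED_FUTURE_BERRIES[action]

# A myopic agent that only ever argmaxes the current step's approval...
choice = max(ACTIONS, key=approval)
# ...still selects the instrumentally convergent, power-seeking option.
assert choice == "seize_neighboring_fields"
```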
I think there are secretly two notions of “leads to convergent instrumental subgoals” here. There’s the outer-alignment notion (“do any of the optimal policies for this reward function pursue convergent instrumental subgoals?”) and the learned-policy notion (“given a fixed learning setup, is it probable that training on this reward signal induces learned policies which pursue convergent instrumental subgoals?”).
That said, I think that ~any non-trivial reward signal we know how to train policy networks on right now plausibly leads to learned convergent instrumental subgoals. This may include myopically trained RL agents, but I’m not really sure yet.
I think that the optimal policies for an approval reward signal are unlikely to have convergent instrumental subgoals. But this is weird, because Optimal Policies Tend to Seek Power gives conditions under which, well… optimal policies tend to seek power: given that we have a “neutral” (IID) prior over reward functions and that the agent acts optimally, we should expect to see power-seeking behavior in such-and-such situations.
This effect generally becomes more pronounced as the discount factor approaches 1, which is a hint that there’s something going on with “myopic cognition” in MDPs, and that discount factors do matter in some sense.
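As a rough illustration of both points (a toy of my own, not the paper’s formal setup): in a small deterministic MDP with a “dead end” offering one future option versus a “hub” leading to three terminal options, sampling state rewards IID from Uniform[0,1] and comparing closed-form values shows the fraction of reward functions whose optimal policy routes through the option-rich hub climbing toward 3/4 as the discount factor approaches 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def power_seeking_fraction(gamma: float, n_samples: int = 100_000) -> float:
    """Fraction of IID-sampled reward functions whose optimal policy
    takes the option-preserving ("power-seeking") action.

    Toy MDP: from the start state, go to a dead end d (one future
    option: loop on d forever) or to a hub h leading to one of three
    terminal leaves a, b, c. Rewards are IID Uniform[0,1] over the
    five non-start states.
    """
    r = rng.uniform(size=(n_samples, 5))  # columns: r_d, r_h, r_a, r_b, r_c
    v_dead = r[:, 0] / (1 - gamma)                  # loop on d: r_d + γ·r_d + ...
    v_best_leaf = r[:, 2:].max(axis=1) / (1 - gamma)
    v_hub = r[:, 1] + gamma * v_best_leaf           # visit h, then best leaf forever
    return float((v_hub > v_dead).mean())

for gamma in (0.1, 0.5, 0.9, 0.99):
    print(f"gamma={gamma:>4}: P(hub is optimal) ~= {power_seeking_fraction(gamma):.3f}")
```

The exact numbers don’t matter; the point is the monotone climb toward the option-rich choice (from ~1/2 toward ~3/4 here) as the agent cares more about the future.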
And once you stop thinking about IID reward functions, you realize that we still don’t know how to write down non-myopic, non-trivial reward functions that wouldn’t lead to doom if maximized AIXI-style. We don’t know how to do that, AFAICT, though not for lack of trying.
But for the kind of approval function we’d likely implement, optimally arg-max’ing it doesn’t seem to induce the same kinds of subgoals. AFAICT, it might lead to short-term manipulative behavior, but not long-term power-seeking.
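For contrast, here’s a minimal sketch of what I mean by arg-max’ing approval, under my assumption that the approval signal is the human’s on-the-spot rating of the current action (`human_rating` is a hypothetical stand-in, not a real interface): the objective never references future states, so exactly optimizing it can’t reward multi-step power acquisition, though it does reward whatever inflates the per-step rating, which is where the short-term manipulation worry comes in.

```python
from typing import Callable, Sequence

def approval_agent_step(
    state: object,
    actions: Sequence[str],
    human_rating: Callable[[object, str], float],
) -> str:
    """Pick the action the human rates highest *right now*.

    The objective is a function of (state, action) only: there is no
    value backup over future states, so multi-step power-seeking plans
    are never scored. What *is* scored is anything that inflates the
    human's per-step rating, which is where short-term manipulative
    behavior could creep in.
    """
    return max(actions, key=lambda a: human_rating(state, a))
```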
Agents won’t be acting optimally, but this distinction hints that we know how to implement different kinds of things via approval than we do via rewards. There’s something different about the respective reward function sets. I think this difference is interesting.