Reward uncertainty

In my last post, I argued that interaction between the human and the AI system was necessary for the AI system to “stay on track” as we encounter new and unforeseen changes to the environment. The most obvious implementation of this would be an AI system that keeps an estimate of the reward function: it acts to maximize its current estimate, while simultaneously updating that estimate through human feedback. However, this approach has significant problems.

Looking at the description of this approach, one thing that stands out is that actions are chosen according to a reward estimate that we know is going to change. (This is what leads to the incentive to disable the narrow value learning system.) This seems clearly wrong: surely the AI system’s plans should account for the fact that its reward estimate will change, without treating such a change as adversarial? This suggests that the action selection mechanism should take future rewards into account as well.

While we don’t know what the future reward will be, we can certainly have a probability distribution over it. So what if we had uncertainty over reward functions, and took that uncertainty into account while choosing actions?
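As a toy illustration of what “taking the uncertainty into account” might look like, here is a minimal sketch in which the AI system holds a distribution over a few candidate reward functions and picks the action with the highest expected reward under that distribution. The reward functions, actions, and probabilities are all made up for illustration.

```python
# A minimal sketch of action selection under reward uncertainty.
# All hypotheses, actions, and probabilities below are illustrative.
reward_hypotheses = {
    "theta1": {"make_coffee": 1.0, "clean_desk": 0.2},
    "theta2": {"make_coffee": -0.5, "clean_desk": 0.8},
}
belief = {"theta1": 0.6, "theta2": 0.4}  # current distribution over rewards

def expected_reward(action):
    return sum(belief[t] * reward_hypotheses[t][action] for t in belief)

actions = ["make_coffee", "clean_desk"]
best_action = max(actions, key=expected_reward)
print(best_action, expected_reward(best_action))  # clean_desk, 0.44
```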

Setup

We’ve drilled down on the problem sufficiently far that we can create a formal model and see what happens. So, let’s consider the following setup:

  • The human, Alice, knows the “true” reward function that she would like to have optimized.

  • The AI system maintains a probability distribution over reward functions, and acts to maximize the expected sum of rewards under this distribution.

  • Alice and the AI system take turns acting. Alice knows that the AI learns from her actions, and chooses actions accordingly.

  • Alice’s action space is such that she cannot take the action “tell the AI system the true reward function” (otherwise the problem would become trivial).

  • Given these assumptions, Alice and the AI system act optimally.

This is the setup of Cooperative Inverse Reinforcement Learning (CIRL). The optimal solution to this problem typically involves Alice “teaching” the AI system by taking actions that communicate what she does and does not like, while the AI system “asks” about parts of the reward by taking actions that would force Alice to behave in different ways for different rewards.
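To make the learning side of this dynamic concrete, here is a toy sketch of the kind of Bayesian update the AI system could perform after observing one of Alice’s actions. The Boltzmann-rational model of Alice is purely an illustrative assumption; the formal CIRL setup has Alice acting optimally (and even pedagogically), which makes the inference stronger.

```python
import numpy as np

# Toy Bayesian update from one of Alice's actions (illustrative numbers).
actions = ["make_coffee", "clean_desk"]
reward_hypotheses = {
    "theta1": np.array([1.0, 0.2]),   # reward of each action under theta1
    "theta2": np.array([-0.5, 0.8]),  # reward of each action under theta2
}
prior = {"theta1": 0.5, "theta2": 0.5}

def likelihood(action_idx, theta):
    """P(Alice takes this action | theta), under a Boltzmann choice model."""
    r = reward_hypotheses[theta]
    e = np.exp(r - r.max())
    return e[action_idx] / e.sum()

def posterior_after(action):
    a = actions.index(action)
    unnorm = {t: prior[t] * likelihood(a, t) for t in prior}
    z = sum(unnorm.values())
    return {t: p / z for t, p in unnorm.items()}

# Seeing Alice make coffee shifts the AI system's belief toward theta1.
print(posterior_after("make_coffee"))  # roughly {theta1: 0.76, theta2: 0.24}
```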

Does this solve our problems?

Two of the problems we identified in the last post are simply assumed away:

  • Alice does not know the “true” reward function, but we assumed that she does.

  • Alice may be unable to give feedback to the AI system optimally, but we assumed that she does so here.

So this particular kind of reward uncertainty does not fix either of these problems. What about convergent instrumental subgoals?

Utility preservation. One major worry we had with the original setup was that the AI system would disable its narrow value learning system, as a manifestation of the instrumental goal of protecting its utility function. This is reversed in our setup: the AI system has a positive incentive to continue doing narrow value learning, since it helps it home in on the true reward function, which in turn allows it to optimize the reward better. (We might worry that this prevents us from fixing any problems in the narrow value learning system, but that is a robustness problem: in the world where everything is working correctly, this is the correct incentive.)
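As a quick sanity check on this incentive, here is a back-of-the-envelope value-of-information calculation with made-up numbers: an AI system that can first learn the reward (here via a hypothetical query that reveals it) and then act expects to do at least as well as one that acts on its current uncertainty, so keeping the learning channel open is worthwhile.

```python
# Illustrative value-of-information calculation (all numbers arbitrary).
p_theta1 = 0.5                        # current belief that theta1 is the true reward
rewards = {                           # reward of each action under each hypothesis
    "a1": {"theta1": 1.0, "theta2": -1.0},
    "a2": {"theta1": -1.0, "theta2": 1.0},
}

# Act now: best expected reward given current uncertainty.
act_now = max(
    p_theta1 * r["theta1"] + (1 - p_theta1) * r["theta2"]
    for r in rewards.values()
)

# Learn first (a hypothetical query that reveals theta), then act optimally.
learn_then_act = (
    p_theta1 * max(r["theta1"] for r in rewards.values())
    + (1 - p_theta1) * max(r["theta2"] for r in rewards.values())
)

print(act_now, learn_then_act)  # 0.0 vs 1.0: learning has positive expected value
```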

Survival incentive, aka shutdown incorrigibility. Another worry is that the AI system has a survival incentive that causes it to prevent us from shutting it down. With reward uncertainty, the fact that we are trying to shut the AI system down is itself strong evidence about the reward function. The AI system should infer that whatever it was doing must lead to worse outcomes than shutting down, and so allow itself to be turned off.

The Off-Switch Game formalizes this reasoning in a simple setting. The AI system can either take action a directly (perhaps by disabling an off switch), shut itself down, or allow Alice to choose between these options. If the AI system defers, Alice then decides whether or not to shut it down. Assuming that Alice is optimal, the AI system reasons that when Alice chooses to shut it down, the true reward function must be one under which action a is bad, and so it does better by deferring to Alice. However, when Alice is modeled as noisily rational instead of optimal, the AI system might reason that Alice could make a mistake when deciding whether to shut it down, and so it might take action a directly without deferring to her. So, the AI system is shutdown corrigible only as long as it assumes that Alice is sufficiently rational.
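A rough numerical version of this argument looks like the following. It is only a sketch of the qualitative result, not a reproduction of the paper: I assume a Gaussian belief over the utility of action a and a sigmoid (Boltzmann-style) model of Alice’s noise. With a near-rational Alice, deferring beats acting directly; with a sufficiently noisy Alice, acting directly wins.

```python
import numpy as np

rng = np.random.default_rng(0)

# The AI system's belief over the (unknown) utility u of action a.
# A Gaussian with positive mean is an arbitrary illustrative choice.
u = rng.normal(loc=1.0, scale=1.0, size=100_000)

def value_act_directly():
    return u.mean()

def value_defer(beta):
    """Expected utility of letting Alice decide, assuming she allows action a
    with probability sigmoid(u / beta): small beta approximates a rational
    Alice (allow iff u > 0), large beta a very noisy Alice."""
    p_allow = 1.0 / (1.0 + np.exp(-u / beta))
    return (p_allow * u).mean()

print("switch off:         0.0")
print("act directly:      ", round(value_act_directly(), 3))   # ~1.0
print("defer, beta = 0.1: ", round(value_defer(0.1), 3))       # ~1.08, deferring wins
print("defer, beta = 5.0: ", round(value_defer(5.0), 3))       # ~0.6, acting directly wins
```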

Should robots be obedient? makes a similar point, arguing that an AI system that learns preferences and then acts to maximize their satisfaction can perform better than an AI system that simply obeys instructions, because humans are not perfectly rational. This creates a tradeoff between performance and obedience (which shutdown corrigibility is an instance of).

Of course, these simple models exclude many actions that a realistic AI system could take. In particular, it seems likely that an AI system would prefer to disable the shutdown button, gather information about the reward until it has resolved its uncertainty, and then optimize the reward it has settled on. If the space of reward functions is misspecified, as it likely will be, this will lead to bad behavior. (This is the point made by Incorrigibility in the CIRL Framework.)
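Here is a toy version of that failure mode, under strong simplifying assumptions (a bandit-like setting, a Boltzmann model of Alice, and a hypothesis class chosen by hand to exclude the true reward): the posterior becomes confident in a wrong hypothesis, and the action the AI system then confidently optimizes is not the one Alice actually wants.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy misspecification: the true reward lies outside the hypothesis class.
true_reward = np.array([1.0, 0.0, -2.0])      # Alice actually wants action 0
hypotheses = [np.array([0.0, 0.0, 2.0]),      # theta_A (wrong)
              np.array([0.0, 2.0, 0.0])]      # theta_B (also wrong)
log_posterior = np.log([0.5, 0.5])

def boltzmann(r):
    """Noisily rational choice probabilities for reward vector r."""
    e = np.exp(r - r.max())
    return e / e.sum()

# Alice acts on the true reward; the AI system updates within its wrong class.
for _ in range(200):
    a = rng.choice(len(true_reward), p=boltzmann(true_reward))
    log_posterior += np.log([boltzmann(h)[a] for h in hypotheses])

posterior = np.exp(log_posterior - log_posterior.max())
posterior /= posterior.sum()
confident_theta = hypotheses[int(np.argmax(posterior))]
robot_action = int(np.argmax(confident_theta))

print("posterior:", posterior.round(3))        # nearly all weight on theta_B
print("AI optimizes action", robot_action,
      "with true reward", true_reward[robot_action],
      "(Alice wanted action 0, true reward 1.0)")
```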

Note though that while this cuts against shutdown corrigibility (since the AI system would prefer to disable the shutdown button), I would frame the problem differently. If the space of rewards is well-specified, has sufficient weight on the true reward function, and the AI system is sufficiently robust and intelligent, then the AI system will update strongly on us attempting to shut it down. This should cause it to stop doing the bad thing it was doing. When it eventually narrows down on a reward, it will have identified the true reward, which by definition is the right thing to optimize. So even though the AI system might disable its off switch, this is simply because it is better at knowing what we want than we are, and this leads to better outcomes for us. So, really the argument is that since we want to be robust (particularly to reward misspecification), we want shutdown corrigibility, and reward uncertainty is an insufficient solution for that.

A note on CIRL

There has been a lot of confusion on what CIRL is and isn’t trying to do, so I want to avoid adding to the confusion.

CIRL is not meant to be a blueprint for a value-aligned AI system. It is not the case that we could create a practical implementation of CIRL and then we would be done. If we were to build a practical implementation of CIRL and use it to align powerful AI systems, we would face many problems:

  • As mentioned above, Alice doesn’t actually know the true reward function, and she may not be able to give optimal feedback.

  • As mentioned above, in the presence of reward misspecification the AI system may end up optimizing the wrong thing, leading to catastrophic outcomes.

  • Similarly, if the model of Alice’s behavior is incorrect, as it inevitably will be, the AI system will make incorrect inferences about Alice’s reward, again leading to bad behavior. As an example that is particularly easy to model, should the AI system model Alice as thinking about the robot thinking about Alice, or should it model Alice as thinking about the robot thinking about Alice thinking about the robot thinking about Alice? How many levels of pragmatics is the “right” level?

  • Lots of other problems have not been addressed: the AI system might not deal with embeddedness well, or it might not be robust and could make mistakes, etc.

CIRL is supposed to bring conceptual clarity to what we could be trying to do in the first place with a human-AI system. In Dylan’s own words, “what cooperative IRL is, it’s a definition of how a human and a robot system together can be rational in the context of fixed preferences in a fully observable world state”. In the same way that VNM rationality informs our understanding of humans even though humans are not expected utility maximizers, CIRL can inform our understanding of alignment proposals, even though CIRL itself is unsuitable as a solution to alignment.

Note also that this post is about reward uncertainty, not about CIRL. CIRL makes other points besides reward uncertainty, which are well explained in this blog post and are not covered here.

While all of my posts have been significantly influenced by many people, this post is especially based on ideas I heard from Dylan Hadfield-Menell. However, besides the one quote, the writing is my own, and may not reflect Dylan’s views.