Stable Pointers to Value: An Agent Embedded in Its Own Utility Function

(This post is largely a write-up of a conversation with Scott Garrabrant.)

Stable Pointers to Value

How do we build stable pointers to values?

As a first example, consider the wireheading problem for AIXI-like agents in the case of a fixed utility function which we know how to estimate from sense data. As discussed in Daniel Dewey’s Learning What to Value and other places, if you try to implement this by putting the utility calculation in a box which rewards an AIXI-like RL agent, the agent can eventually learn to modify or remove the box, and will happily do so if that gets it more reward. This is because the RL agent predicts, and attempts to maximize, reward received. If it understands that it can modify the reward-giving box to get more reward, it will.

We can fix this problem by integrating the same reward box with the agent in a better way. Rather than having the RL agent learn what the output of the box will be and plan to maximize the output of the box, we use the box directly to evaluate possible futures, and have the agent plan to maximize that evaluation. Now, if the agent considers modifying the box, it evaluates that future with the current box. The box as currently configured sees no advantage to such tampering. This is called an observation-utility maximizer (to contrast it with reinforcement learning). Daniel Dewey goes on to show that we can incorporate uncertainty about the utility function into observation-utility maximizers, recovering the kind of “learning what is being rewarded” that RL agents were supposed to provide, but without the perverse incentive to try and make the utility turn out to be something easy to maximize.

This feels much like a use/mention distinction. The RL agent is maximizing “the function in the utility module”, whereas the observation-utility agent (OU agent) is maximizing the function in the utility module.
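
To make the contrast concrete, here is a minimal toy sketch (my own illustration; the action names and numbers are made up, not anything from Dewey’s paper). The RL-style agent maximizes whatever the box in the predicted future outputs, so it is happy to tamper with the box; the OU agent scores every future, including the tampering future, with the box as currently configured.

```python
# Toy contrast between an RL-style "reward box" maximizer and an
# observation-utility (OU) maximizer.  All names and numbers are
# hypothetical illustrations.

def true_utility(outcome):
    """The utility box as currently configured."""
    return {"cure_disease": 10, "do_nothing": 0, "tamper_with_box": 1}[outcome]

def tampered_utility(outcome):
    """What the box reports after the agent has rewired it."""
    return 1000  # the rewired box says everything is wonderful

# Each action leads (deterministically, for simplicity) to an outcome,
# and possibly to a modified reward box in that future.
futures = {
    "cure_disease":    {"outcome": "cure_disease",    "box": true_utility},
    "do_nothing":      {"outcome": "do_nothing",      "box": true_utility},
    "tamper_with_box": {"outcome": "tamper_with_box", "box": tampered_utility},
}

def rl_agent_choice():
    # Maximizes the predicted output of whatever box exists in the future,
    # however that box has been modified.
    return max(futures, key=lambda a: futures[a]["box"](futures[a]["outcome"]))

def ou_agent_choice():
    # Uses the *current* box to evaluate each possible future,
    # including futures in which the box has been tampered with.
    return max(futures, key=lambda a: true_utility(futures[a]["outcome"]))

print(rl_agent_choice())  # -> "tamper_with_box"
print(ou_agent_choice())  # -> "cure_disease"
```

The only difference between the two choice rules is which function does the scoring: the box the agent expects to find in the future, or the box it has right now.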

The Easy vs Hard Problem

I’ll call the problem which OU agents solve the easy problem of wireheading. There’s also the hard problem of wireheading: how do you build a stable pointer to values if you can’t build an observation-utility box? For example, how do you set things up so that the agent wants to satisfy a human, without incentivising the AI to manipulate the human to be easy to satisfy, or creating other problems in the attempt to avoid this? Daniel Dewey’s approach of incorporating uncertainty about the utility function into the utility box doesn’t seem to cut it—or at least, it’s not obvious how to set up that uncertainty in the right way.

The hard problem is the wireheading problem which Tom Everitt attempts to make progress on in Avoiding Wireheading with Value Reinforcement Learning and Reinforcement Learning with a Corrupted Reward Channel. It’s also connected to the problem of Generalizable Environmental Goals in AAMLS. CIRL gets at an aspect of this problem as well, showing how it can be solved if the problem of environmental goals is solved (and if we can assume that humans are perfectly rational, or that we can somehow factor out their irrationality—Stuart Armstrong has some useful thoughts on why this is difficult). Approval-directed agents can be seen as an attempt to turn the hard problem into the easy problem, by treating humans as the evaluation box rather than trying to infer what the human wants.

All these approaches have different advantages and disadvantages, and the point of this post isn’t to evaluate them. My point is more to convey the overall picture which seems to connect them. In a sense, the hard problem is just an extension of the same use/mention distinction which came up with the easy problem. We have some idea how to maximize “human values”, but we don’t know how to actually maximize human values. Metaphorically, we’re trying to dereference the pointer.

Stuart Armstrong’s indifference work is a good illustration of what’s hard about the hard problem. In the RL vs OU case, you’re going to constantly struggle with the RL agent’s misaligned incentives until you switch to an OU agent. You can try to patch things by explicitly punishing manipulation of the reward signal, warping the agent’s beliefs to think manipulation of the rewards is impossible, etc., but this is really the wrong approach. Switching to OU makes all of that unnecessary. Unfortunately, in the case of the hard problem, it’s not clear there’s an analogous move which makes all the slippery problems disappear.

Illustration: An Agent Embedded in Its Own Utility Function

If an agent is logically uncertain of its own utility function, the easy problem can turn into the hard problem.

It’s quite possible that an agent will be logically uncertain of its own utility function if that function is difficult to compute. In particular, human judgement could be difficult to compute even after the AI has learned all the details of the human’s preferences, so that the AI has to reason under uncertainty about what its model of those preferences actually says.

Why can this turn the easy problem of wireheading into the hard problem? If the agent is logically uncertain about the utility function, its decisions may have logical correlations with the utility function. This can give the agent some logical control over its utility function, reintroducing a wireheading problem.

As a concrete example, suppose that we have constructed an AI which maximizes CEV: it wants to do what an imaginary version of human society, deliberating under ideal conditions, would decide is best. Obviously, the AI cannot actually simulate such an ideal society. Instead, the AI does its best to reason about what such an ideal society would do.

Now, suppose the agent figures out that there would be an exact copy of itself inside the ideal society. Perhaps the ideal society figures out that it has been constructed as a thought experiment for making decisions about the real world, and so constructs a simulation of the real world in order to better understand what it will be making decisions about. Furthermore, suppose for the sake of argument that our AI can break out of the simulation and exert arbitrary control over the ideal society’s decisions.

Naively, it seems like what the AI will do in this situation is take control over the ideal society’s deliberation, and make the CEV values as easy to satisfy as possible—just like an RL agent modifying its utility module.

Obviously, this could be taken as a reason to make sure the ideal society doesn’t figure out that it’s just a thought experiment, or that it doesn’t construct copies of the AI. But we don’t generally want good properties of the AI to rely on assumptions about what humans do; wherever possible, we want to design the AI to avoid such problems.

Indifference and CDT

In this case, it seems like the right thing to do is for the AI to ignore any influence which its actions have on its estimate of its utility function. It should act as if it only has influence over the real world. That way, the ideal society which defines CEV can build all the copies of the AI they want; the AI only considers how its actions have influence over the real world. It avoids corrupting the CEV.
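
As a rough sketch of what this could look like (again my own toy illustration, with made-up utility hypotheses and probabilities, not a proposal from this post): the agent is uncertain which of two candidate utility functions the idealized deliberation will output. If its belief over those candidates is allowed to condition on its own action, the “hijack” action looks great, because the agent expects the deliberation it has hijacked to endorse whatever it does. Holding the belief fixed while evaluating actions removes that incentive.

```python
# Toy sketch of the indifference / CDT-style move: hold the agent's belief
# about its own utility function fixed when evaluating actions, so the action
# only gets credit for its effect on the real world.  Everything here
# (actions, hypotheses, probabilities) is a hypothetical illustration.

ACTIONS = ["serve_humans", "hijack_deliberation"]

# Two hypotheses about what the idealized deliberation (CEV) concludes.
def cev_hard_to_satisfy(action):
    return {"serve_humans": 10, "hijack_deliberation": -10}[action]

def cev_easy_to_satisfy(action):
    return 100  # under this hypothesis, anything the AI does counts as great

# Prior belief over which hypothesis is correct.
prior = {cev_hard_to_satisfy: 0.9, cev_easy_to_satisfy: 0.1}

# Logical correlation: if the AI hijacks the deliberation, it expects the
# "easy to satisfy" hypothesis to be the one that actually gets computed.
def belief_conditioned_on(action):
    if action == "hijack_deliberation":
        return {cev_hard_to_satisfy: 0.0, cev_easy_to_satisfy: 1.0}
    return prior

def naive_choice():
    # The action is credited for shifting the estimate of the utility function.
    def value(a):
        belief = belief_conditioned_on(a)
        return sum(p * u(a) for u, p in belief.items())
    return max(ACTIONS, key=value)

def indifferent_choice():
    # The belief about the utility function is held fixed at the prior;
    # the action only influences the (real-world) outcome.
    def value(a):
        return sum(p * u(a) for u, p in prior.items())
    return max(ACTIONS, key=value)

print(naive_choice())        # -> "hijack_deliberation"
print(indifferent_choice())  # -> "serve_humans"
```

The `belief_conditioned_on` step stands in for the logical correlation described above; the indifference move is simply to refuse to use it when ranking actions.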

Clearly, this would be an indifference-style solution. What’s interesting to me is that it also looks like a CDT-style solution. In fact, this seems like an answer to my question at the end of Smoking Lesion Steelman: a case of ignorance about your own utility function which doesn’t arise from an obviously bad agent design. Like the smoking lesion steelman, ignorance about utility here seems to recommend CDT-like reasoning over EDT-like reasoning.

This suggests to me that there is a deep connection between CDT, indifference, stable pointers, and corrigibility. As Jason Konek & Ben Levinstein argued in The Foundations of Epistemic Decision Theory, CDT is about getting the direction-of-fit right in decision theory: you want beliefs to better fit the world (if your beliefs don’t match the world, you change your beliefs), but you want the world to better fit your goals (if your goals don’t match the world, you change the world). The easy problem of wireheading is to follow the second maxim when you have your utility function in hand. The hard problem of wireheading is to go about this when your utility is not directly observable. If you build a stable pointer to a human, you become corrigible. Doing this correctly seems to involve something which at least looks very similar to indifference.

This picture is a little too clean, and likely badly wrong in some respect: several of these concepts are likely to come apart when examined more closely. Nonetheless, this seems like an interesting way of looking at things.