Policy Alignment

(ETA: The name “policy approval” wasn’t great. I think I will use the term “policy alignment” to contrast with “value alignment” going forward, at the suggestion of Wei Dai in the comments.)

I recently had a conversation with Stuart Armstrong in which I claimed that an agent which learns your utility function (pretending for a moment that “your utility function” really is a well-defined thing) and attempts to optimize it is still not perfectly aligned with you. He challenged me to write up specific examples to back up my claims.

I’ll also give a very sketchy alternative to value learning, which I call policy alignment. (The policy alignment idea emerged out of a conversation with Andrew Critch.)


Stuart Armstrong has recently been doing work showing the difficulty of inferring human values. To summarize: because humans are irrational, a value-learning approach like CIRL needs to jointly estimate the human utility function and the degree to which the human is rational—otherwise, it would take all the mistakes humans make to be preferences. Unfortunately, this leads to a severe problem of identifiability: humans can be assigned any values whatsoever if we assume the right kind of irrationality, and the usual trick of preferring simpler hypotheses doesn’t seem to help in this case.

I also want to point out that a similar problem arises even without irrationality. Vladimir Nesov explored how probability and utility can be mixed into each other without changing any decisions an agent makes. So, in principle, we can’t determine the utility or probability function of an agent uniquely based on the agent’s behavior alone (even including hypothetical behavior in counterfactual situations). This fact was discovered earlier by Jeffrey and Bolker, and is analyzed in more detail in the book The Logic of Decision. For this reason, I call the transform “Jeffrey-Bolker rotation”.

To give an illustrative example: it doesn’t matter whether we assign very low probability to an event, or care very little about what happens given that event. Suppose a love-maximizing agent is unable to assign nonzero utility to a universe where love isn’t real. The agent may appear to ignore evidence that love isn’t real. We can interpret this as not caring what happens conditioned on love not being real; or, equally valid (in terms of the actions which the agent chooses), we can interpret the agent as having an extremely low prior probability on love not being real.

At MIRI, we sometimes use the term “probutility” to indicate the (probability, utility) pair in a way which reminds us that they can’t be disentangled from one another. Jeffrey-Bolker rotation changes probabilities and utilities, but does not change the overall probutilities.
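To make the rotation concrete, here is a minimal sketch on a toy world. It assumes Jeffrey's setting, where any event A (including an action) is valued by V(A) = Σ P(s)U(s) / P(A), and it uses only one member of the transformation family: the one that rescales P(s) by a factor linear in U(s) and divides U(s) by the same factor. The four states, their numbers, and the constant c are made up for illustration.

```python
from itertools import combinations

# Hypothetical toy world: four atomic states with made-up numbers.
P = {"s1": 0.1, "s2": 0.2, "s3": 0.3, "s4": 0.4}   # probabilities, sum to 1
U = {"s1": 0.0, "s2": 2.0, "s3": 5.0, "s4": 10.0}  # utilities


def desirability(event, P, U):
    """Jeffrey-style value of an event: V(A) = sum_{s in A} P(s) U(s) / P(A)."""
    mass = sum(P[s] for s in event)
    return sum(P[s] * U[s] for s in event) / mass


def rotate(P, U, c):
    """One member of the rotation family (assumed form, for illustration):
    P'(s) = P(s) * (c*U(s) + d),  U'(s) = U(s) / (c*U(s) + d),
    with d fixed by requiring that P' still sums to 1."""
    d = 1.0 - c * sum(P[s] * U[s] for s in P)
    scale = {s: c * U[s] + d for s in P}
    # Both conditions are needed for the preference order to be preserved.
    assert d > 0 and all(k > 0 for k in scale.values()), "c is out of range"
    return ({s: P[s] * scale[s] for s in P},
            {s: U[s] / scale[s] for s in P})


def ranking(P, U):
    """All non-empty events, ordered from least to most preferred."""
    states = sorted(P)
    events = [e for r in range(1, len(states) + 1)
              for e in combinations(states, r)]
    return sorted(events, key=lambda e: desirability(e, P, U))


P2, U2 = rotate(P, U, c=0.05)

# The rotated agent believes s4 is more likely and cares about it less...
print(P2["s4"] > P["s4"], U2["s4"] < U["s4"])

# ...yet it ranks every event identically, and each product P(s)*U(s) is unchanged.
same_ranking = ranking(P, U) == ranking(P2, U2)
same_probutility = all(abs(P[s] * U[s] - P2[s] * U2[s]) < 1e-12 for s in P)
print(same_ranking, same_probutility)
```

Since every choice the agent could make is a comparison between events, no amount of behavioral data distinguishes (P, U) from (P2, U2) in this toy world.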

Given these problems, it would be nice if we did not actually need to learn the human utility function. I’ll advocate that position.

My understanding is that Stuart Armstrong is optimistic that human values can be inferred despite these problems, because we have a lot of useful prior information we can take advantage of.

It is intuitive that a CIRL-like agent should learn what is irrational and then “throw it out”, IE, de-noise human preferences by looking only at what we really prefer, not at what we do out of short-sightedness or other mistakes. On the other hand, it is not so obvious that the probability/utility distinction should be handled in the same way. Should an agent disentangle beliefs from preferences just so that it can throw out human beliefs and optimize the preferences alone? I argue against this here.

Main Claim

Ignoring issues of irrationality or bounded rationality, what an agent wants out of a helper agent is that the helper agent does preferred things.

Suppose a robot is trying to help a perfectly rational human. The human has probability function $P_H$ and utility function $U_H$. The robot is in epistemic state e. The robot has a set of actions $a_1, a_2, \ldots, a_n$. The proposition “the robot takes the ith action when in epistemic state e” is written as $r_e = a_i$. The set of full world-states is S. What the human would like the robot to do is given by:

$$\arg\max_{i} \sum_{s \in S} P_H(s \mid r_e = a_i)\, U_H(s)$$

(Or by the analogous causal counterfactual, if the human thinks that way.)

This notion of what the human wants is invariant to Jeffrey-Bolker rotation; the robot doesn’t need to disentangle probability and utility! It only needs to learn probutilities.

The equation written above can’t be directly optimized, since the robot doesn’t have direct access to human probutilities. However, I’ll broadly call any attempt to approximate that equation “policy alignment”.

Notice that this is closely analogous to UDT. UDT solves dynamic inconsistencies—situations in which an AI could predictably dislike the decisions of its future self—by optimizing its actions from the perspective of a fixed prior, IE, its initial self. Policy alignment resolves inconsistencies between the AI and the human by optimizing the AI’s actions from the human’s perspective. The main point of this post is that we can use this analogy to produce counterexamples to the typical value-learning approach, in which the AI tries to optimize human utility but not according to human beliefs.

I will somewhat ignore the distinction between UDT1.0 and UDT1.1.


These examples serve to illustrate that “optimizing human utility according to AI beliefs” is not exactly the same as “do what the human would want you to do”, even when we suppose “the human utility function” is perfectly well-defined and can be learned exactly by the AI.

In these examples, I will suppose that the AI has its own probability distribution $P_R$. It reasons updatelessly with respect to evidence e it sees, but with full prior knowledge of the human utility function:

$$\arg\max_{i} \sum_{s \in S} P_R(s \mid r_e = a_i)\, U_H(s)$$

I use an updateless agent to avoid accusations that of course an updateful agent would fail classic UDT problems. However, it is not really very important for the examples.

I assume prior knowledge of $U_H$ to avoid any tricky issues which might arise by attempting to combine updatelessness with value learning.

Counterfactual Mugging

It seems reasonable to suppose that the AI will start out with some mathematical knowledge. Imagine that the AI has a database of theorems in memory when it boots up, including the first million digits of pi. Treat these as part of the agent’s prior.

Suppose, on the other hand, that the human whom the AI wants to help does not know more than a hundred digits of pi.

The human and the AI will disagree on what to do about counterfactual mugging with a logical coin involving digits of pi which the AI knows and the human does not. If Omega approaches the AI, the AI will refuse to participate, but the human will wish the AI would. If Omega approaches the human, the AI may try to prevent the human from participating, to the extent that it can do so without violating other aspects of the human utility function.
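A small worked version of this disagreement may help. The sketch below assumes the usual counterfactual-mugging stakes (Omega asks for 100 on tails and would have paid 10,000 on heads to an agent whose policy is to pay), assumes human utility is linear in money, and assumes the digit in question makes the logical coin come up tails. None of these numbers come from the argument above; they are placeholders.

```python
# Hypothetical stakes, chosen only for illustration.
REWARD, COST = 10_000, 100


def payoff(policy, coin):
    """Human utility (assumed linear in money) of a policy, given the coin."""
    if policy == "refuse":
        return 0
    return REWARD if coin == "heads" else -COST


def best_policy(p_heads):
    """Updateless choice: pick the policy with the best prior expected payoff."""
    def expected_payoff(policy):
        return (p_heads * payoff(policy, "heads")
                + (1 - p_heads) * payoff(policy, "tails"))
    return max(["pay", "refuse"], key=expected_payoff)


human_prior_heads = 0.5  # the human has not computed the relevant digit of pi
ai_prior_heads = 0.0     # the AI's prior already contains the digit (tails)

human_wish = best_policy(human_prior_heads)  # what the human wants the AI to do
ai_choice = best_policy(ai_prior_heads)      # what the AI does with its own prior
print(human_wish, ai_choice)
```

Both calls use the same utility function. The only thing that differs is the prior, and that alone is enough to flip the chosen policy.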

“Too Updateless”

Maybe the problem with the counterfactual mugging example is that it doesn’t make sense to program the AI with a bunch of knowledge in its prior which the human doesn’t have.

We can go to the opposite extreme, and make a broad prior such as the Solomonoff distribution, with no information about our world in particular.

I believe the observation has been made before that running UDT on such a prior could have weird results. There could be a world with higher prior probability than ours, inhabited by Omegas who ask the AI to optimize alien values in most universes (including Earth) in exchange for the Omegas maximizing $U_H$ in their own world. (This particular scenario doesn’t seem particularly probable, but it does seem quite plausible that some weird universes will have higher probability than our universe in the Solomonoff prior, and may make some such bargain.)

Again, this is something which can happen in the maximization using $P_R$ but not in the one using $P_H$, unless humans themselves would approve of the multiversal bargain.

“Just Having a Very Different Prior”

Maybe $P_R$ is neither strictly more knowledgeable than $P_H$ nor less, but the two are very different on some specific issues. Perhaps there’s a specific plan which, when $P_R$ is conditioned on evidence so far, looks very likely to have many good consequences. $P_H$ considers the plan very likely to have many bad consequences. Also suppose that there aren’t any interesting consequences of this plan in counterfactual branches, so UDT considerations don’t come in.

Also, suppose that there isn’t time to test the differing hypotheses involved which make humans think this is such a bad plan while AIs think it is so good. The AI has to decide right now whether to enact the plan.

The value-learning agent will implement this plan, since it seems good on net for human values. The policy-alignment agent will not, since humans wouldn’t want it to.

Obviously, one might question whether it is reasonable to assume that things got to a point where there was such a large difference of opinion between the AI and the humans, and no time to resolve it. Arguably, there should be safeguards against this scenario which the value-learning AI itself would want to set up, due to facts about human values such as “the humans want to be involved in big decisions about their future” or the like.

Nonetheless, faced with this situation, it seems like policy-alignment agents do the right thing while value-learning agents do not.
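The two decision rules in this scenario can be sketched side by side. This is only an illustration under made-up numbers: the state names, the conditional probabilities, and the utilities are all hypothetical, and the single helper `best_action` is just expected-utility maximization given a table of conditional probabilities.

```python
def best_action(cond_prob, utility):
    """Return the action a maximizing sum_s P(s | a) * U(s)."""
    def expected_utility(action):
        return sum(p * utility[s] for s, p in cond_prob[action].items())
    return max(cond_prob, key=expected_utility)


# Human utility over outcomes (hypothetical values), known exactly to the AI.
U_H = {"good": 10.0, "bad": -10.0, "status_quo": 0.0}

# The AI's beliefs: the plan almost certainly turns out well.
P_R = {
    "enact_plan": {"good": 0.9, "bad": 0.1},
    "wait": {"status_quo": 1.0},
}

# The human's beliefs: the same plan almost certainly turns out badly.
P_H = {
    "enact_plan": {"good": 0.1, "bad": 0.9},
    "wait": {"status_quo": 1.0},
}

value_learner_choice = best_action(P_R, U_H)     # human utility, AI beliefs
policy_alignment_choice = best_action(P_H, U_H)  # human utility, human beliefs
print(value_learner_choice, policy_alignment_choice)
```

The value-learner is not mistaken about what the human values here. It enacts the plan purely because it evaluates the same utility function under a different probability distribution.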


Aren’t human beliefs bad?

Isn’t it problematic to optimize via human beliefs, since human beliefs are low-quality?

I think this is somewhat true and somewhat not.

  • Partly, this is like saying “isn’t UDT bad because it doesn’t learn?” Actually, UDT acts as if it updates most of the time, so it is wrong to think of it as incapable of learning. Similarly, although the policy-alignment agent uses $P_H$, it will mostly act as if it has updated on a lot of information. So, maybe you believe human beliefs aren’t very good, but do you think we’re capable of learning almost anything eventually? If so, this may address a large component of the concern. In particular, if you trust the output of certain machine learning algorithms more than you trust yourself, the AI can run those algorithms and use their output.

  • On the other hand, humans probably have incoherent $P_H$, and not just because of logical uncertainty. So, the AI still needs to figure out what is “irrational” and what is “real” in $P_H$, just like value-learning needs to do for $U_H$.

If humans would want an AI to optimize via human beliefs, won’t that be reflected in the human utility function?

Or: If policy-alignment were good, wouldn’t a value-learner self modify into policy-alignment anyway?

I don’t think this is true, but I’m not sure. Certainly there could be simple agents who value-learners cooperate with without ever deciding to self-modify into policy-alignment agents. Perhaps there is something about human preference which desires the AI to cooperate with the human even when the AI thinks this is (otherwise) net-negative for human values.

Aren’t I ignoring the fact that the AI needs its own beliefs?

In “Just Having a Very Different Prior”, I claimed that if $P_R$ and $P_H$ disagree about the consequences of a plan, value-learning can do something humans strongly don’t want it to do, whereas policy-alignment cannot. However, my definition of policy-alignment ignores learning. Realistically, the policy-alignment agent needs to also have beliefs $P_R$, which it uses to approximate the human approval of its actions. Can’t the same large disagreement emerge from this?

I think the concern is qualitatively less, because the policy-alignment agent uses $P_R$ only to estimate $P_H$ and $U_H$. If the AI knows that humans would have a large disagreement with the plan, the policy-alignment agent would not implement the plan, while the value-learning agent would.

For policy-alignment to go wrong, it needs to have a bad estimate of $P_H$ and $U_H$.

The policy is too big.

Even if the process of learning $P_H$ is doing the work to turn it into a coherent probability distribution (removing irrationality and making things well-defined), it still may not be able to conceive of important possibilities. The evidence which the AI uses to decide how to act, in the equations given earlier, may be a large data stream with some human-incomprehensible parts.

As a result, it seems like the AI needs to optimize over compact/abstract representations of its policy, similarly to how policy selection in logical inductors works.

This isn’t an entirely satisfactory answer, since (1) the representation of a policy as a computer program could still escape human understanding, and (2) it is unclear what it means to correctly represent the policy in a human-understandable way.


[Aside from issues with the approach, my term “policy approval” may be terrible. It sounds too much like “approval-directed agent”, which means something different. I think there are similarities, but they aren’t strong enough to justify referring to both as “approval”. Any suggestions?]

[Now using “Policy Alignment” for this. Editing post accordingly.]


(These are very speculative.)

Logical Updatelessness?

One of the major obstacles to progress in decision theory right now is that we don’t know of a good updateless perspective for logical uncertainty. Maybe a policy-alignment agent doesn’t need to solve this problem, since it tries to optimize from the human perspective rather than its own. Roughly: logical updatelessness is hard because it tends to fall into the “too updateless” issue above. So, maybe it can be a non-issue in the right formulation of policy alignment.


Stuart Armstrong is somewhat pessimistic about corrigibility. Perhaps there is something which can be done in policy-alignment land which can’t be done otherwise. The “Just Having a Very Different Prior” example points in this direction; it is an example where policy-alignment acts in a much more corrigible way.

A value-learning agent can always resist humans if it is highly confident that its plan is a good one which humans are opposing irrationally. A policy-alignment agent can think its plan is a good one but also think that humans would prefer it to be corrigible on principle regardless of that.

On the other hand, a policy-alignment agent isn’t guaranteed to think that. Perhaps policy-alignment learning can be specified with some kind of highly corrigible bias, so that it requires a lot of evidence to decide that humans don’t want it to behave corrigibly in a particular case?


I’ve left out some speculation about what policy-alignment agents should actually look like, for the sake of keeping mostly to the point (the discussion with Stuart). I like this idea because it involves a change in perspective of what an agent should be, similar to the change which UDT itself made.