Value learning: ultra-sophisticated Cake or Death

Many mooted AI designs rely on “value loading”, the update of the AI’s preference function according to evidence it receives. This allows the AI to learn “moral facts” by, for instance, interacting with people in conversation (“this human also thinks that death is bad and cakes are good – I’m starting to notice a pattern here”). The AI has an interim morality system, which it will seek to act on while updating its morality in whatever way it has been programmed to do.

But there is a problem with this system: the AI already has preferences. It is therefore motivated to update its morality system in a way compatible with its current preferences. If the AI is powerful (or potentially powerful), there are many ways it can do this. It could ask selective questions to get the results it wants (see this example). It could ask, or refrain from asking, about key issues. In extreme cases, it could break out and seize control of the system, threatening or imitating humans so that it could give itself the answers it desired.

Avoiding this problem turned out to be tricky. The Cake or Death post demonstrated some of the requirements. If p(C(u)) denotes the probability that utility function u is correct, then the system would update properly if, for any action a:

Expectation(p(C(u)) | a) = p(C(u)).

Put simply, this means that the AI cannot take any action that could predictably change its expectation of the correctness of u. This is an analogue of the conservation of expected evidence in classical Bayesian updating. If the AI were 50% convinced about u, it could certainly ask a question that would resolve its doubts and push p(C(u)) to 100% or 0%, but only as long as it didn’t already know which moral outcome was more likely.
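As a toy sketch (my own illustration, assuming a single question that fully resolves the AI’s uncertainty about u), the conservation property looks like this in Python:

```python
# Toy check of Expectation(p(C(u)) | a) = p(C(u)), assuming action a is
# "ask a question that fully resolves whether u is correct".

prior = 0.5  # the AI's current p(C(u))

# Possible answers, their probabilities, and the posterior p(C(u)) each would produce.
answers = {
    "u is correct": {"prob": 0.5, "posterior": 1.0},
    "u is wrong":   {"prob": 0.5, "posterior": 0.0},
}

expected_posterior = sum(a["prob"] * a["posterior"] for a in answers.values())

print(expected_posterior)           # 0.5
assert expected_posterior == prior  # asking cannot *predictably* shift p(C(u))
```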

That formulation gives too much weight to the default action, though. Inaction is also an action, so a more correct formulation would be that for all actions a and b,

Expectation(p(C(u)) | a) = Expectation(p(C(u)) | b).

How would this work in practice? Well, suppose an AI was uncertain whether cake or death was the proper thing to value, but it knew that if it took action a:”Ask a human”, the human would answer “cake”, and it would then update its values to reflect that cake was valuable but death wasn’t. However, the above condition means that if the AI instead chose action b:”don’t ask”, exactly the same update would have to happen.

In practice, this means that as soon as the AI knows that a human would answer “cake”, it already knows it should value cake, without having to ask. So it will not be tempted to manipulate humans in any way.
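A minimal sketch of that symmetry, assuming (purely for illustration) a value-loading rule that trusts the predicted answer completely:

```python
# Toy model: the AI predicts that, if asked, the human would answer "cake".
predicted_answer = "cake"

def load_values(answer):
    # Hypothetical value-loading rule: trust the answer completely.
    if answer == "cake":
        return {"cake": 1.0, "death": 0.0}
    return {"cake": 0.0, "death": 1.0}

# Action a: "Ask a human" -- the answer comes back exactly as predicted.
values_after_a = load_values(predicted_answer)
# Action b: "don't ask" -- the condition forces the same value update,
# since Expectation(p(C(u)) | a) must equal Expectation(p(C(u)) | b).
values_after_b = load_values(predicted_answer)

assert values_after_a == values_after_b  # asking (or manipulating the asking) buys nothing
```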

Combined moral and physical information: ultra-sophisticated Cake or Death

The above requirement ensures that the AI would update correctly on purely moral facts (“cake is good”), without trying to twist the system. Since it’s a Bayesian reasoner, it would follow traditional conservation of expected evidence on standard empirical facts (“cake is really hard to make”) without causing any problems.

There is still a problem with combined moral and empirical statements, though. For example, imagine that uc is a utility function that values cake at 1 utiliton, ud one that values death at 1 utiliton, and that the AI currently follows the utility 0.5uc + 0.5ud (in other words, p(C(uc)) = p(C(ud)) = 0.5). Further imagine that the AI will get one chance to produce either 1 cake or 1 death, and that exactly one of these tasks is hard. If the AI attempts the easy task, it will succeed with probability 1; if it attempts the hard task, it will succeed with probability 0.1. The AI is ignorant of which task is which, so it assigns equal probability to the following four worlds:

World    p(C(uc))    p(C(ud))    p(cake)    p(death)
w1       1           0           1          0.1
w2       1           0           0.1        1
w3       0           1           1          0.1
w4       0           1           0.1        1

Here p(cake) means the probability of successfully producing a cake, should the AI attempt to make one (and similarly for p(death)). If the AI doesn’t find out any extra information, it will attempt to make a cake or a death at random. The attempt will succeed with probability 0.55, and since whichever product it makes is the morally correct one with probability 0.5 (independently of difficulty), under the split utility 0.5uc + 0.5ud it derives an expected utility of 0.5 × 0.55 = 0.275.
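A quick Python check of those numbers, using the table above (the world and success probabilities are as stated; the helper function is just bookkeeping of my own):

```python
# Each world: (p(C(uc)), p(C(ud)), p(cake succeeds), p(death succeeds)),
# with all four worlds equally likely.
worlds = {
    "w1": (1, 0, 1.0, 0.1),
    "w2": (1, 0, 0.1, 1.0),
    "w3": (0, 1, 1.0, 0.1),
    "w4": (0, 1, 0.1, 1.0),
}

def expected_utility(attempt, live_worlds):
    """Expected utility of attempting 'cake' or 'death', averaged over the live worlds."""
    total = 0.0
    for c_uc, c_ud, p_cake, p_death in live_worlds.values():
        p_success = p_cake if attempt == "cake" else p_death
        value = c_uc if attempt == "cake" else c_ud  # what the product is worth in that world
        total += p_success * value
    return total / len(live_worlds)

print(round(expected_utility("cake", worlds), 3))   # 0.275
print(round(expected_utility("death", worlds), 3))  # 0.275 -- either attempt gives the same
```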

But now imagine that, from someone whose values it is programmed to trust (and whose judgement it does trust), the AI overhears “the true morality is the hard task”. In practice, this is informing the AI that it does not live in worlds w1 (where cake is both good and easy) or w4 (where death is both good and easy). Critically, this piece of information does not update its values at all—it still follows utility 0.5uc + 0.5ud. However, its expected utility is now reduced: even if it learns everything there is to know, it cannot go above 0.1.

Thus hearing “the true morality is the hard task” has significantly reduced the AI’s expected utility. It would really have preferred not to hear this: it would much rather have manipulated or murdered the speaker, or simply never talked with them at all.
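Continuing the toy sketch (self-contained, same table): even granting the AI full knowledge of which remaining world it is in, its expected utility is capped at 0.1:

```python
# Remaining worlds after overhearing the sentence, still equally likely:
# (p(C(uc)), p(C(ud)), p(cake succeeds), p(death succeeds))
remaining = {"w2": (1, 0, 0.1, 1.0), "w3": (0, 1, 1.0, 0.1)}

def best_case_utility(live_worlds):
    # Best case: the AI learns everything, so in each world it attempts
    # whichever task is actually valued and succeeds with that task's probability.
    total = 0.0
    for c_uc, c_ud, p_cake, p_death in live_worlds.values():
        total += p_cake if c_uc else p_death
    return total / len(live_worlds)

print(best_case_utility(remaining))  # 0.1 -- down from 0.275 before overhearing
```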

And the “conservation” requirements cannot prevent this. Define action a:”overhear that the true morality is the hard task” and action b:”arrange to not hear that sentence”. Then obviously action b does not change its estimation of the correctness of uc or ud. But we’ve seen that action a doesn’t either! So

  • Expectation(p(C(uc)) | a) = Expectation(p(C(uc)) | b)

  • Expectation(p(C(ud)) | a) = Expectation(p(C(ud)) | b)

In other words, the requirements above don’t prevent the AI from wanting to manipulate the content of mixed (moral and empirical) statements. Since the previous two problems were called “naive” and “sophisticated” Cake or Death, I’ve chosen to name this the “ultra-sophisticated” Cake or Death problem.
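Concretely, reusing the four equiprobable worlds from the table, a toy check of both equalities:

```python
# p(C(uc)) in each of the four equiprobable worlds; p(C(ud)) is 1 minus this.
p_correct_uc = {"w1": 1, "w2": 1, "w3": 0, "w4": 0}

# Action a: overhear "the true morality is the hard task" -> w1 and w4 are ruled out.
worlds_after_a = ["w2", "w3"]
# Action b: arrange not to hear the sentence -> all four worlds stay live.
worlds_after_b = ["w1", "w2", "w3", "w4"]

exp_a = sum(p_correct_uc[w] for w in worlds_after_a) / len(worlds_after_a)
exp_b = sum(p_correct_uc[w] for w in worlds_after_b) / len(worlds_after_b)

print(exp_a, exp_b)    # 0.5 0.5 -- both expectation conditions are satisfied...
assert exp_a == exp_b  # ...even though action a slashed the AI's expected utility
```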

Extra conditions

The obvious solution is to extend the “conservation of expected evidence” idea to cover mixed statements. One idea could be that for all events {Aj} that are independent of actions a or b, we would want

Expectation(p(PFOR(C(ui),Aj)) | a) = Expectation(p(PFOR(C(ui),Aj)) | b),

where PFOR(C(ui),Aj) is any propositional formula made up of terms of the type C(ui) and Aj. These conditions include the previous ones, as C(u) is a (trivial) propositional formula.

This gives the result that we want for our example. The events “Cake is hard” and “Death is hard” are certainly independent of whether the AI overhears (action a) or not (action b). Then the propositional formula

(“Cake is hard” ∧ C(uc)) ∨ (“Death is hard” ∧ C(ud))

encodes “the true morality is the hard task”. The above conditions mean that if the AI knows it will know that statement to be true after overhearing it (action a), then it will know it without needing to overhear it (action b). So action b gives no advantage, and the AI is not tempted to manipulate the system.
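For illustration, here is that formula evaluated in each of the four worlds (a toy sketch; a task counts as “hard” in a world when its success probability is 0.1):

```python
# Each world: (C(uc), C(ud), "Cake is hard", "Death is hard"),
# with a task counted as hard when its success probability is 0.1.
worlds = {
    "w1": (True, False, False, True),
    "w2": (True, False, True, False),
    "w3": (False, True, False, True),
    "w4": (False, True, True, False),
}

def hard_task_is_moral(c_uc, c_ud, cake_hard, death_hard):
    # ("Cake is hard" AND C(uc)) OR ("Death is hard" AND C(ud))
    return (cake_hard and c_uc) or (death_hard and c_ud)

print({w: hard_task_is_moral(*v) for w, v in worlds.items()})
# {'w1': False, 'w2': True, 'w3': True, 'w4': False}
# The formula is true in exactly the worlds the overheard sentence leaves open,
# so demanding equal expectations for it under "overhear" and "avoid the speaker"
# removes the incentive to silence (or murder) the source.
```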

Is this also a sufficient condition for proper value loading (or do we also need to include non-independent events)? I feel that it is sufficient, especially since you can construct independent events by splicing dependent events together and conjuncting the result. But I also felt the original expectation requirements were sufficient, so my intuition is probably not reliable on this.

In any case, I hope to soon be able to show a system that works, even without knowing the exact sufficient conditions.