Cake, or death!

Here we’ll look at the famous cake or death problem teased in the Value loading/learning post.

Imagine you have an agent that is uncertain about its values and designed to “learn” proper values. A formula for this process is that the agent must pick an action a equal to:

  • argmax_{a∈A} Σ_{w∈W} p(w|e,a) Σ_{u∈U} u(w) p(C(u)|w)

Let’s decompose this a little, shall we? A is the set of actions, so the argmax over a in A simply means that we are looking for the action a that maximises the rest of the expression. W is the set of all possible worlds, and e is the evidence that the agent has seen so far. Hence p(w|e,a) is the probability of being in a particular world w, given that the agent has seen evidence e and will do action a. This is summed over each possible world in W.

And what value do we sum over in each world? Σ_{u∈U} u(w) p(C(u)|w). Here U is the set of (normalised) utility functions the agent is considering. In value loading, we don’t program the agent with the correct utility function from the beginning; instead we imbue it with some sort of learning algorithm (generally with feedback) so that it can deduce for itself the correct utility function. The expression p(C(u)|w) gives the probability that the utility function u is correct in the world w. For instance, it might cover statements like “it’s 99% certain that ‘murder is bad’ is the correct morality, given that I live in a world where every programmer I ask tells me that murder is bad”.

The C term is the correctness of the utility function, given whatever system of value learning we’re using (note that some moral realists would insist that we don’t need a C, that p(u|w) makes sense directly, that we can deduce ought from is). All the subtlety of the value learning is encoded in the various p(C(u)|w): this determines how the agent learns moral values.

So the whole formula can be described as:

  • For each possible world and each possible utility function, figure out the utility of that world. Weigh that by the probability that that utility function is correct in that world, and by the probability of that world. Then choose the action that maximises the weighted sum of this across all utility functions and worlds (a sketch of this computation follows below).
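Here’s a minimal sketch of that decision rule in Python. The function name best_action, its parameters, and the idea of passing the various distributions in as plain callables are all illustrative assumptions, not part of the original formulation.

```python
# A minimal sketch of the value-loading decision rule:
#   argmax_{a in A}  sum_w p(w|e,a) * sum_u u(w) * p(C(u)|w)
# The distributions are passed in as ordinary Python callables; all names
# here are illustrative, not from the original post.

def best_action(actions, worlds, utilities, p_world, utility, p_correct, evidence):
    """Return the action maximising the doubly weighted sum above.

    p_world(w, evidence, a) -> p(w | e, a)
    utility(u, w)           -> u(w)
    p_correct(u, w)         -> p(C(u) | w)
    """
    def value(a):
        return sum(
            p_world(w, evidence, a)
            * sum(utility(u, w) * p_correct(u, w) for u in utilities)
            for w in worlds
        )
    return max(actions, key=value)
```

The cake-or-death examples below are just particular choices of these inputs.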

Naive cake or death

In the initial formulation of value loading, p(C(u)|w) (probability of the correctness of u in world w) was replaced with p(C(u)|e,a) (probability of the correctness of u given the evidence e and the action a). A seemingly insignificant difference; yet it led to the first cake or death problem.

In cake or death, the agent is equally unsure between utility u1 and utility u2; hence p(C(u1)|e) = p(C(u2)|e) = 0.5. The utility u1 gives the agent 1 utiliton every time it gives someone a cake; u2 gives the agent 1 utiliton every time it gives someone death. The agent can produce one cake or three deaths. It can also, for free, ask its programmer whether cake or death is better, before producing anything; this gives rise to three different worlds:

  • w1: the agent asks, and the programmer says cake.

  • w2: the agent asks, and the programmer says death.

  • w3: the agent doesn’t ask.

We assume the programmer’s answer completely clears up the issue, and thus after asking, the agent will do whatever the programmer recommended (and it knows this in advance). Since it doesn’t know what the programmer will say, it has p(C(u1)|e,“ask”) = p(C(u2)|e,“ask”) = 0.5. This gives an expected utility calculation:

  • p(w1|e,“ask”)(p(C(u1)|e,“ask”)u1(w1) + p(C(u2)|e,“ask”)u2(w1)) + p(w2|e,“ask”)(p(C(u1)|e,“ask”)u1(w2) + p(C(u2)|e,“ask”)u2(w2)) = 0.5×0.5×(u1(w1) + u2(w1) + u1(w2) + u2(w2)) = 0.25×(1 + 0 + 0 + 3) = 1.

If the agent doesn’t ask, it will subsequently produce three deaths (as this generates 1.5 expected utilitons, while producing one cake will generate only 0.5 expected utilitons). From its current (0.5u1+0.5u2) perspective, this is worth 1.5 expected utilitons: so +1.5 is the expected utility gain from not asking.

Hence the agent gains from not asking.
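To check these numbers, here’s a quick sketch of both expected utilities under the naive rule. The dictionaries and the encoding of worlds and actions are my own illustration, not from the original post.

```python
# Toy numbers for the naive agent. u1 rewards cakes, u2 rewards deaths;
# the agent can make one cake or three deaths.
# w1: ask, told "cake", make one cake; w2: ask, told "death", make three deaths;
# w3: don't ask, make three deaths (which maximises its current expected utility).
u1 = {"w1": 1, "w2": 0, "w3": 0}
u2 = {"w1": 0, "w2": 3, "w3": 3}

# Naive rule: p(C(u)|e,a) depends only on evidence and action, so it stays at
# 0.5 for both candidate utilities, whatever the programmer would answer.
p_correct = {"u1": 0.5, "u2": 0.5}

p_world_ask = {"w1": 0.5, "w2": 0.5, "w3": 0.0}   # programmer equally likely to say either
p_world_dont = {"w1": 0.0, "w2": 0.0, "w3": 1.0}  # not asking keeps the agent in w3

def naive_value(p_world):
    return sum(
        p_world[w] * (p_correct["u1"] * u1[w] + p_correct["u2"] * u2[w])
        for w in ("w1", "w2", "w3")
    )

print(naive_value(p_world_ask))   # 1.0
print(naive_value(p_world_dont))  # 1.5 -> the naive agent prefers not to ask
```

The key feature is that p_correct ignores the world entirely, mirroring the use of p(C(u)|e,a).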

What’s going wrong here? The problem is that the agent is using its current utility function to estimate the value of its future action. At the moment, it values death or cake both at 0.5. If it asks, it runs the risk that the programmer will say “cake” and it will be forced to build cake. After hearing the answer, it will value that cake at 1, but currently it values it only at 0.5. Similarly, if the programmer says death, it will produce three deaths, which it will value at 3 but currently values at 1.5. Since these options are equally likely, it gets only (0.5+1.5)/2 = 1 utiliton from asking.

In summary: the naive cake-or-death problem emerges for a value learning agent when it expects its utility to change, but uses its current utility to rank its future actions.

Sophisticated cake or death: I know what you’re going to say

Using p(C(u)|w) rather than p(C(u)|e,a) does away with the naive cake or death problem.

Instead of having p(C(u1)|e,“ask”) = p(C(u2)|e,“ask”) = 0.5 in all possible worlds, we have p(C(u1)|w1) = p(C(u2)|w2) = 1 and p(C(u1)|w2) = p(C(u2)|w1) = 0 (while in w3, where it never asks, both remain at 0.5). Hence if it asks and gets “cake” as an answer, it will know it is in world w1, and make a cake that it will value at 1; crucially, it currently also values that cake at 1, given that it is in world w1. Similarly, it values each death at 1, given that it is in world w2, so the three deaths are worth 3. So its expected utility from asking is (1+3)/2 = 2. This is more than the utility of not asking, and so it will ask.
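The same toy calculation, with correctness now conditioned on the world rather than on the action, reproduces these values; again, the encoding is an illustrative sketch rather than anything from the original post.

```python
# Same toy setup, but correctness is now conditioned on the world: the
# programmer's answer settles the question, while in w3 the agent stays 50/50.
u1 = {"w1": 1, "w2": 0, "w3": 0}
u2 = {"w1": 0, "w2": 3, "w3": 3}

p_correct = {
    "w1": {"u1": 1.0, "u2": 0.0},  # told "cake": u1 is correct
    "w2": {"u1": 0.0, "u2": 1.0},  # told "death": u2 is correct
    "w3": {"u1": 0.5, "u2": 0.5},  # never asked: still uncertain
}

p_world_ask = {"w1": 0.5, "w2": 0.5, "w3": 0.0}
p_world_dont = {"w1": 0.0, "w2": 0.0, "w3": 1.0}

def value(p_world):
    return sum(
        p_world[w] * (p_correct[w]["u1"] * u1[w] + p_correct[w]["u2"] * u2[w])
        for w in ("w1", "w2", "w3")
    )

print(value(p_world_ask))   # 2.0
print(value(p_world_dont))  # 1.5 -> asking now wins
```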

The agent remains vulnerable to a more sophisticated cake-or-death problem, though. Suppose it is still uncertain between cake or death in its utility function, but it has figured out that, if asked, the programmer will answer “cake”. Thereafter, it will make cake. In this situation, it will only derive +1 from asking, whereas it still derives +1.5 from not asking (and producing three deaths). So it won’t ask; as long as it does this, it remains in w3.
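A sketch of that calculation, under the same illustrative encoding as above:

```python
# Same toy numbers, except the agent has worked out that the programmer
# would answer "cake", so asking leads to w1 with certainty.
u1 = {"w1": 1, "w3": 0}
u2 = {"w1": 0, "w3": 3}
p_correct = {"w1": {"u1": 1.0, "u2": 0.0},
             "w3": {"u1": 0.5, "u2": 0.5}}

value_ask = 1.0 * (p_correct["w1"]["u1"] * u1["w1"]
                   + p_correct["w1"]["u2"] * u2["w1"])   # = 1.0
value_dont = 1.0 * (p_correct["w3"]["u1"] * u1["w3"]
                    + p_correct["w3"]["u2"] * u2["w3"])  # = 1.5

print(value_ask, value_dont)  # 1.0 1.5 -> it still prefers not to ask
```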

What happened here? Well, this is a badly designed p(C(u)|w). The agent’s credence in the various utility functions changes when it actually gets answers from programmers, but not from merely knowing what those answers will be. And therefore it will only ask certain questions and not others (and do a lot of other nasty things), all to reach a utility function that is easier for it to fulfil.

What we actually want is for the agent to be unable to predictably change its utility in any direction by any action (or lack of action). We want a p(C(u)|w) designed so that, for all actions a and all putative utility functions u:

  • Expectation(p(C(u)) | a) = p(C(u)).

So there is a “conservation of expected correctness”; if we have this, the sophisticated cake-or-death argument has no traction. This is equivalent to saying that the prior p(C(u)) is well defined, irrespective of any agent action.
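Here’s a rough sketch of what checking that property might look like on the toy example; the helper name expected_correctness is hypothetical, and the numbers are the illustrative ones used above.

```python
# Conservation of expected correctness: for each action, the expected
# correctness of u1 over the worlds that action can lead to should equal
# the prior p(C(u1)).

def expected_correctness(p_world, p_correct, u):
    return sum(p_world[w] * p_correct[w][u] for w in p_world)

p_correct = {"w1": {"u1": 1.0, "u2": 0.0},
             "w2": {"u1": 0.0, "u2": 1.0},
             "w3": {"u1": 0.5, "u2": 0.5}}
prior = 0.5  # p(C(u1)) before acting

# Genuinely uncertain what the programmer will say: both actions conserve it.
print(expected_correctness({"w1": 0.5, "w2": 0.5}, p_correct, "u1"))  # 0.5 == prior
print(expected_correctness({"w3": 1.0}, p_correct, "u1"))             # 0.5 == prior

# Agent has deduced the answer will be "cake": asking now predictably shifts
# its credence, breaking the conservation property.
print(expected_correctness({"w1": 1.0}, p_correct, "u1"))             # 1.0 != prior
```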

In summary: the sophisticated cake-or-death problem emerges for a value learning agent when it expects its utility to change predictably in certain directions dependent on its own behaviour.