Dissolving Confusion around Functional Decision Theory


Functional Decision Theory (FDT) (see also causal, evidential, timeless, updateless, and anthropic decision theories) recommends taking cooperative, non-greedy actions in twin prisoners’ dilemmas, Newcombian problems, Parfit’s hitchhiker-like games, and counterfactual muggings, but not in smoking lesion situations. It’s a controversial concept with important implications for designing agents that behave optimally when embedded in environments in which they may interact with models of themselves. Unfortunately, I think that FDT is sometimes explained confusingly and misunderstood by its proponents and opponents alike. To help dissolve confusion about FDT and address key concerns of its opponents, I refute the criticism that FDT assumes that causation can happen backward in time and offer two key principles that provide a framework for clearly understanding it:

  1. Questions in decision theory are not questions about what choices you should make with some sort of unpredictable free will. They are questions about what type of source code you should be running.

  2. I should consider predictor P to “subjunctively depend” on agent A to the extent that P makes predictions of A’s actions based on correlations that cannot be confounded by my choice of what source code A runs.

Getting Up to Speed

I think that functional decision theory (FDT) is a beautifully counterintuitive and insightful framework for instrumental rationality. I will not make it my focus here to talk about what it is and what types of situations it is useful in. To gain a solid background, I recommend this post of mine or the original paper on it by Eliezer Yudkowsky and Nate Soares.

Additionally, here are four different ways that FDT can be explained. I find them all complementary for understanding and intuiting it well.

  1. The decision theory that tells you to act as if you were setting the output of an optimal decision-making process for the task at hand.

  2. The decision theory that has you cooperate in situations similar to a prisoners’ dilemma against a model of yourself—including when your opponent locks in their choice and shows it to you before you make yours.

  3. The decision theory that has you one-box it in situations similar to Newcombian games—including when the boxes are transparent; see also Parfit’s Hitchhiker.

  4. The decision theory that shifts focus from what type of decisions you should make to what type of decision-making agent you should be.
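To make explanation 2 concrete, here is a minimal sketch in Python. The function names and payoff values are my own illustrative choices, assuming a deterministic policy that both you and your twin execute:

```python
# Hypothetical sketch: a prisoners' dilemma where the opponent runs a copy
# of your source code. Payoff values are the standard illustrative ones.

def twin_prisoners_dilemma(policy):
    """Both players execute the same deterministic policy."""
    my_move = policy()
    twin_move = policy()          # the twin runs your exact source code
    payoffs = {("C", "C"): 3, ("C", "D"): 0,
               ("D", "C"): 5, ("D", "D"): 1}
    return payoffs[(my_move, twin_move)]

cooperator = lambda: "C"          # the FDT-style source code
defector = lambda: "D"            # the "dominant strategy" source code

# Against your own copy, the cooperative source code earns more.
assert twin_prisoners_dilemma(cooperator) > twin_prisoners_dilemma(defector)
```

Defection is only "dominant" if your move and your twin's move could vary independently; when one policy produces both moves, that independence assumption fails.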

I’ll assume a solid understanding of FDT from here on. I’ll be arguing in favor of it, but it’s fairly controversial. Much of what inspired this post was an AI Alignment Forum post called A Critique of Functional Decision Theory by Will MacAskill, which raised several objections to FDT. Some of his points are discussed below. The rest of this post will be dedicated to discussing two key principles that help to answer criticisms and dissolve confusions around FDT.

1. Acknowledging One’s Own Predictability

Opponents of FDT, usually proponents of causal decision theory (CDT), will look at a situation such as the classic Newcombian game and reason as follows:

I can choose to one-box it and take A or two-box it and take A+B. Regardless of the value of A, A+B is greater, so it can only be rational to take both. After all, when I’m sitting in front of these boxes, what’s in them is already in them regardless of the choice I make. The functional decision theorist’s perspective requires assuming that causation can happen backwards in time! Sure, one-boxers might do better at these games, but non-smokers do better in smoking lesion problems. That doesn’t mean they are making the right decision. Causal decision theorists may be dealt a bad hand in Newcombian games, but it doesn’t mean they play it badly.

The problem with this argument, I’d say, is subtle. I actually fully agree with the perspective that for causal decision theorists, Newcombian games are just like smoking lesion problems. I also agree with the point that causal decision theorists are dealt a bad hand in these games but don’t play it badly. The problem with the argument is a subtle confusion about the word ‘choice’ plus its claim that FDT assumes that causation can happen backwards in time.

The mistake that a causal decision theorist makes isn’t in two-boxing. It’s in being a causal decision theorist in the first place. In Newcombian games, the assumption that there is a highly-accurate predictor of you makes it clear that you are, well, predictable and not really making free choices. You’re just executing whatever source code you’re running. If this predictor thinks that you will two-box it, your fate is sealed and the best you can do is then to two-box it. The key is to just be running the right source code. And hence the first principle:

Questions in decision theory are not questions about what choices you should make with some sort of unpredictable free will. They are questions about what type of source code you should be running.

And in this sense, FDT is actually just what happens when you use causal decision theory to select what type of source code you want to enter a Newcombian game with. There’s no assumption that causation can occur backwards. FDT simply acknowledges that the source code you’re running can have a, yes, ***causal*** effect on what types of situations you will be presented with when models of you exist. FDT, properly understood, is a type of meta-causal theory. I, in fact, lament that FDT was named “functional” and not “meta-causal.”

Instead of assuming causal diagrams in which your choice somehow influences the predictor’s past prediction, FDT really only assumes ones in which your source code is a common cause of both the prediction and your action.

I think that many proponents of FDT fail to make this point: FDT’s advantage is that it shifts the question to what type of agent you want to be—not misleading questions about what types of “choices” you want to make. But this isn’t usually how functional decision theorists explain FDT, including Yudkowsky and Soares in their paper. And I attribute some unnecessary confusion and misunderstandings, like “FDT requires us to act as if causation happens backward in time,” to this framing.
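The source-code framing can be sketched in a few lines of Python. The names and payoffs here are my own illustrative choices, assuming a perfect predictor that fills the boxes by simulating your deterministic source code:

```python
# Hypothetical sketch: Newcomb's game as a choice of source code.
# The predictor fills the boxes by simulating your policy; no backward
# causation is involved, just a common cause (the source code).

def play_newcomb(policy):
    prediction = policy()                    # predictor runs your code
    box_a = 1_000_000 if prediction == "one-box" else 0
    box_b = 1_000                            # box B always holds $1,000
    choice = policy()                        # you run the same code
    return box_a if choice == "one-box" else box_a + box_b

one_boxer = lambda: "one-box"
two_boxer = lambda: "two-box"

assert play_newcomb(one_boxer) == 1_000_000
assert play_newcomb(two_boxer) == 1_000
```

Selecting which policy to be running before the game is an entirely ordinary causal decision, and the one-boxing code wins it.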

To see this principle in action, let’s look at a situation presented by Will MacAskill. It’s similar to a Newcombian game with transparent boxes. And I say “similar” instead of “isomorphic” because of some vagueness which will be discussed soon. MacAskill presents this situation as follows:

You face two open boxes, Left and Right, and you must take one of them. In the Left box, there is a live bomb; taking this box will set off the bomb, setting you ablaze, and you certainly will burn slowly to death. The Right box is empty, but you have to pay $100 in order to be able to take it.
A long-dead predictor predicted whether you would choose Left or Right, by running a simulation of you and seeing what that simulation did. If the predictor predicted that you would choose Right, then she put a bomb in Left. If the predictor predicted that you would choose Left, then she did not put a bomb in Left, and the box is empty.
The predictor has a failure rate of only 1 in a trillion trillion. Helpfully, she left a note, explaining that she predicted that you would take Right, and therefore she put the bomb in Left.
You are the only person left in the universe. You have a happy life, but you know that you will never meet another agent again, nor face another situation where any of your actions will have been predicted by another agent. What box should you choose?

MacAskill claims that you should take Right because it results in a “guaranteed payoff”. Unfortunately, there is some vagueness here about what it means for a long-dead predictor to have run a simulation of you and for it to have an error rate of one in a trillion trillion. Is this simulation true to your actual behavior? What type of information about you did this long-dead predictor have access to? What is the reference class for the error rate?

Let’s assume that your source code was written long ago, that the predictor understood how it functioned, that it ran a true-to-function simulation, and that you were given an unaltered version of that source code. Then this situation is isomorphic to a transparent-box Newcombian game in which you see no money in box A (albeit more dramatic), and the confusion goes away! If this is the case, then there are only two possibilities.

  1. You are a causal decision theorist (or similar), the predictor made a self-fulfilling prophecy by putting the bomb in the left box alongside a note, and you will choose the right box.

  2. You are a functional decision theorist (or similar), the predictor made an extremely rare, one-in-a-trillion-trillion mistake, and you will unfortunately take the left box with a bomb (just as a functional decision theorist in a transparent-box Newcombian game would take only box A).

So what source code would you rather run when going into a situation like this? Assuming that you want to maximize expected value and that you don’t value your life at more than 100 trillion trillion dollars, you want to be running the functional decision theorist’s source code. Successfully navigating this game, transparent-box Newcombian games, twin-opponent-reveals-first prisoners’ dilemmas, Parfit’s Hitchhiker situations, and the like all require you to have source code that would tell you to commit to making the suboptimal decision in the rare case in which the predictor/twin made a mistake.
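Under the assumptions above, the expected-value comparison is simple arithmetic. Here is a sketch; the dollar figure assigned to your life is an illustrative assumption of mine, chosen well under the 100-trillion-trillion threshold:

```python
# Hypothetical sketch of the bomb game's expected values, by source code.
ERROR = 1e-24        # predictor failure rate: 1 in a trillion trillion
V_LIFE = 1e12        # illustrative value of your life in dollars (< 1e26)

# FDT-style code takes Left: the predictor almost always foresees this and
# leaves Left empty, so you burn only in the one-in-1e24 error case.
ev_left_code = -ERROR * V_LIFE

# CDT-style code takes Right: the predictor foresees this, puts the bomb
# in Left, and you pay $100 every time (even if she happened to err).
ev_right_code = -100.0

assert ev_left_code > ev_right_code   # losing ~$1e-12 beats losing $100
```

The “guaranteed payoff” framing hides that which guarantee you face was itself determined by which source code you run.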

Great! But what if we drop our assumptions? What if we don’t assume that this predictor’s simulation was functionally true to your behavior? Then it becomes unclear how this prediction was made and what the reference class of agents is for which this predictor is supposedly only wrong one in a trillion trillion times. And this leads us to the second principle.

2. When a Predictor is Subjunctively Entangled with an Agent

An alternate title for this section could be “when statistical correlations are and aren’t mere.”

As established above, functional decision theorists need not assume that causation can happen backwards in time. Instead, they only need to acknowledge that a prediction and an action can both depend on an agent’s source code. This is nothing special whatsoever: it’s an ordinary correlation between an agent and a predictor that arises from a common factor—the source code.

However, Yudkowsky and Soares give this type of correlation a special name in their paper: subjunctive dependence. I don’t love this term because it gives a fancy name to something that is not fancy at all. I think this might be responsible for some of the confused criticism that FDT assumes that causation can happen backward in time. Nonetheless, “subjunctive dependence” is at least workable. Yudkowsky and Soares write:

When two physical systems are computing the same function, we will say that their behaviors “subjunctively depend” upon that function.

This concept is very useful when a predictor actually knows your source code and runs it to simulate you. However, this notion of subjunctive dependence isn’t very flexible and quickly becomes less useful when a predictor is not doing this. And this is a bit of a problem that MacAskill pointed out: a predictor could make good predictions without querying a model of you that is functionally equivalent to your decision-making process. He writes:

...the predictor needn’t be running your algorithm, or have anything like a representation of that algorithm, in order to predict whether you’ll one box or two-box. Perhaps the Scots tend to one-box, whereas the English tend to two-box. Perhaps the predictor knows how you’ve acted prior to that decision. Perhaps the Predictor painted the transparent box green, and knows that’s your favourite colour and you’ll struggle not to pick it up. In none of these instances is the Predictor plausibly doing anything like running the algorithm that you’re running when you make your decision. But they are still able to predict what you’ll do. (And bear in mind that the Predictor doesn’t even need to be very reliable. As long as the Predictor is better than chance, a Newcomb problem can be created.)

Here, I think that MacAskill is getting at an important point, but one that’s hard to see clearly with the wrong framework. On its face, though, there’s a significant problem with this argument. Suppose that in Newcombian games, 99% of brown-eyed people one-boxed it, and 99% of blue-eyed people two-boxed it. If a predictor only made its prediction based on your eye color, then clearly the best source code to be running would be the kind that always made you two-box it regardless of your eye color. There’s nothing Newcombian, paradoxical, or even difficult about this case. And pointing out these situations is essentially how critics of MacAskill’s argument have answered it. Their counterpoint is that unless the predictor is querying a model of you that is functionally isomorphic to your decision-making process, it is only using “mere statistical correlations,” and subjunctive dependence does not apply.
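The eye-color case can be sketched directly (illustrative Python of my own; I drop the 1% noise since it doesn’t change the dominance argument). Because the “prediction” here ignores your source code, two-boxing dominates:

```python
# Hypothetical sketch: a "predictor" that keys only on eye color.
# The boxes' contents cannot be confounded by your choice of source code.

def play(eye_color, choice):
    prediction = "one-box" if eye_color == "brown" else "two-box"
    box_a = 1_000_000 if prediction == "one-box" else 0
    box_b = 1_000
    return box_a if choice == "one-box" else box_a + box_b

# Whatever your eye color, two-boxing earns exactly $1,000 more.
for eyes in ("brown", "blue"):
    assert play(eyes, "two-box") == play(eyes, "one-box") + 1_000
```

Since `eye_color` is fixed independently of the policy, changing the source code changes nothing about what is in the boxes.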

But this counterpoint and Yudkowsky and Soares’ definition of subjunctive dependence miss something! MacAskill had a point. A predictor need not know an agent’s decision-making process to make predictions based on statistical correlations that are not “mere”. Suppose that you design some agent who enters an environment with whatever source code you gave it. Then if the agent’s source code is fixed, a predictor could exploit certain statistical correlations without knowing the source code. For example, suppose the predictor used observations of the agent to make probabilistic inferences about its source code. These could even be observations about how the agent acts in other Newcombian situations. Then the predictor could, without knowing what function the agent computes, make better-than-random guesses about its behavior. This falls outside of Yudkowsky and Soares’ definition of subjunctive dependence, but it has the same effect.

So now I’d like to offer my own definition of subjunctive dependence (even though, still, I maintain that the term can be confusing, and I am not a huge fan of it).

I should consider predictor P to “subjunctively depend” on agent A to the extent that P makes predictions of A’s actions based on correlations that cannot be confounded by my choice of what source code A runs.

And hopefully, it’s clear why this is what we want. When we remember that questions in decision theory are really just questions about what type of source code we want to enter an environment using, then the choice of source code can only affect predictions that depend in some way on that choice. If the correlation can’t be confounded by the choice of source code, then the right kind of entanglement is present to allow for optimal updateless behavior.
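To contrast this with the eye-color case, here is a sketch (with illustrative names and a deliberately crude inference rule of my own) of a predictor that never reads your code but infers your behavior from earlier Newcombian rounds. That correlation runs through your choice of source code, so it cannot be confounded by it:

```python
# Hypothetical sketch: the predictor guesses from your observed history.
# It never sees your source code, yet your source code determines the
# history, so the correlation is not "mere" in the sense above.

def run_rounds(policy, n=3):
    history, total = [], 0
    for _ in range(n):
        # With no data, the predictor guesses "two-box"; afterward it
        # assumes you will repeat your last observed action.
        prediction = history[-1] if history else "two-box"
        box_a = 1_000_000 if prediction == "one-box" else 0
        choice = policy()
        total += box_a if choice == "one-box" else box_a + 1_000
        history.append(choice)
    return total

# Consistent one-boxing code still wins across rounds.
assert run_rounds(lambda: "one-box") > run_rounds(lambda: "two-box")
```

Under my definition, this crude predictor subjunctively depends on the agent even though nothing functionally equivalent to the agent’s algorithm is ever run.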

Additional Topics

Going Meta

Consider what I call a Mind Police situation: Suppose that there is a powerful mind-policing agent that is about to encounter agent A and read its mind (look at its source code). Afterward, if the mind policer judges A to be using decision theory X, they will destroy A. Else they will do nothing.

Suppose that decision theory X is FDT (but it could be anything) and that you are agent A, who happens to use FDT. If you were given the option of overwriting your source code to implement some alternative, tolerated decision theory, would you? You’d be better off if you did, and it would be the output of an optimal function for the decision-making task at hand, but it’s sort of unclear whether this is a very functional-decision-theorist thing to do. Because of situations like these, I think that we should consider decision theories to come in two flavors: static, which will never overwrite itself, and autoupdatable, which might.

Also, note that the example above is only a first-order version of this type of problem, but there are higher-order ones too. For example, what if the mind police destroyed agents using autoupdatable decision theories?

Why Roko’s Basilisk is Nonsense

A naive understanding of FDT has led some people to ask whether a superintelligent sovereign, if one were ever developed, would be rational to torture everyone who didn’t help to bring it into existence. The idea would be that this sovereign might consider this to be part of an updateless strategy to help it come into existence more quickly and accomplish its goals more effectively.

Fortunately, a proper understanding of subjunctive dependence tells us that an optimally-behaving embedded agent doesn’t need to pretend that causation can happen backward in time. Such a sovereign would not be in control of its source code, and it can’t execute an updateless strategy if there was nothing there to not-update on in the first place before that source code was written. So Roko’s Basilisk is only an information hazard if FDT is poorly understood.


It’s all about the source code.