Thoughts from a Two Boxer

I’m writing this for blog day at MSFP. I thought about a lot of things here, like category theory, the 1-2-3 conjecture, and Paul Christiano’s agenda. I want to start by thanking everyone for having me and saying I had a really good time. At this point I intend to go back to thinking about the stuff I was thinking about before MSFP (random matrix theory), but I learned a lot and I’m sure some of it will come to be useful. This post is about (my confusion about) decision theory.

Before the workshop I hadn’t read much besides Eliezer’s paper on FDT, and my impression was that it was mostly a good way of thinking about making decisions, and at least represented progress over EDT and CDT. After thinking more carefully about some canonical thought experiments, I’m no longer sure. I suspect many of the concrete thoughts which follow will be wrong in ways that illustrate very bad intuitions. In particular, I think I am implicitly guided by non-example number 5 of an aim of decision theory in Wei Dai’s post on the purposes of decision theory. I welcome any corrections or insights in the comments.

The Problem of Decision Theory

First I’ll talk about what I think decision theory is trying to solve. Basically, I think decision theory is the theory of how one should[1] decide on an action after one already understands: the actions available, the possible outcomes of those actions, the probabilities of those outcomes, and the desirability of those outcomes. In particular, answering those listed questions is only adjacent to decision theory; I sort of think answering all of them is in fact harder than the question posed by decision theory itself. Before doing any reading I would have naively expected that the problem of decision theory, as stated here, was trivial, but after pulling on some edge cases I see there is room for a lot of creative and reasonable disagreement.

A lot of the actual work in decision theory is the construction of scenarios in which ideal behavior is debatable or unclear. People choose their own philosophical positions on what is rational in these hairy situations, and then construct general procedures for making decisions which they believe behave rationally in a wide class of problems. These constructions are a concrete version of formulating properties one would expect an ideal decision theory to have.

One such property is that an ideal decision theory shouldn’t choose to self-modify in some wide, vaguely defined class of “fair” problems. An obviously unfair problem would be one in which the overseer gives CDT $10 and any other agent $0. One of my biggest open questions in decision theory is where the line between fair and unfair problems should lie. At this point I am not convinced that any problem where agents in the environment have access to our decision theory’s source code, or to copies of our agent, is a fair problem. But my impression from hearing and reading what people talk about is that this is a heretical position.

Newcomb’s Problem

Let’s discuss Newcomb’s problem in detail. In this problem there are two boxes, one of which you know contains a dollar. Into the other box an entity predicting your action may or may not have put a million dollars; they put in the million dollars if and only if they predict you will take only the one box. What do you do if the predictor is 99 percent accurate? How about if it is perfectly accurate? What if you can see the contents of the boxes before you make your decision?
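To make the 99-percent case concrete, here is a quick sanity check of the standard evidential-style expected-value calculation, using the payoffs above ($1 visible, $1,000,000 possibly in the opaque box). The accuracy figure and the decision to condition on your own choice are assumptions of this particular calculation, not a settled answer to the problem:

```python
# Expected dollars, conditioning on your own choice, assuming the
# predictor is right 99% of the time (the EDT-style calculation).
ACCURACY = 0.99
SMALL, BIG = 1, 1_000_000  # payoffs from the setup above

# One-box: with prob ACCURACY the predictor foresaw it and filled the big box.
ev_one_box = ACCURACY * BIG + (1 - ACCURACY) * 0

# Two-box: with prob (1 - ACCURACY) the predictor wrongly filled the big box,
# and you get the small box either way.
ev_two_box = ACCURACY * SMALL + (1 - ACCURACY) * (BIG + SMALL)

print(ev_one_box)  # about 990,000
print(ev_two_box)  # about 10,001
```

The causal decision theorist objects that this conditioning is exactly the mistake: at the moment of choice the boxes are already filled, so two-boxing dominates. The entire dispute is over which of these calculations is the right one to run.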

An aside on why Newcomb’s problem seems important: it is sort of like a prisoner’s dilemma. To see the analogy, imagine you’re playing a classical prisoner’s dilemma against a player who can reliably predict your action and then chooses to match it. Newcomb’s problem seems important because prisoner’s dilemmas seem like simplifications of situations which really do occur in real life. The tragedy of the prisoner’s dilemma is that game theory suggests you should defect, but the real world seems like it would be better if people cooperated.

Newcomb’s problem is weird to think about because the predictor’s and the agent’s behaviors are connected logically but not causally. That is, as an outside observer, if you tell me what the agent does or what the predictor predicts, I can guess the other with high probability. But once the predictor has predicted, the agent could still take either option, and flip-flopping won’t flip-flop the predictor. Still, one may argue you should one-box because being a one-boxer going into the problem means you will likely get more utility. I disagree with this view and see Newcomb’s problem as punishing rational agents.

If Newcomb’s problem is ubiquitous, and one imagines an agent walking down the street constantly being Newcombed, it is indeed unfortunate if they are doomed to two-box: they’ll end up with far fewer dollars. But in my view this thought experiment is missing an important piece of real-world detail: how the predictors predict the agent’s behavior. There are three possibilities:

  • The predictors have a sophisticated understanding of the agent’s inner workings and use it to simulate the agent to high fidelity.

  • The predictors have seen many agents like our agent doing problems like this one, and use this to compute a probability of our agent’s choice and compare it to a decision threshold.

  • The predictor has been following the behavior of our agent and uses this history to assign its future behavior a probability.

In the third bullet the agent should one-box if they predict they are likely to be Newcombed often[2]. In the second bullet they should one-box if they predict that members of their population will be Newcombed often, and they derive more utility from the extra dollars their population will get than from the extra dollar they could get for themselves. I have already stated that I see the first bullet as an unfair problem.
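The third-bullet case can be sketched as a toy simulation. All the specifics here (a 0.5 decision threshold, the predictor estimating P(one-box) as the empirical frequency in the agent’s history) are my invented parameters, not part of the original problem, but they illustrate why repeated Newcombing rewards a consistent one-boxer:

```python
# Toy model of a reputation-based predictor: it estimates the agent's
# probability of one-boxing from the agent's observed history and fills
# the big box when that estimate clears a threshold. Parameters are
# illustrative assumptions, not canonical.
def total_winnings(one_boxer: bool, rounds: int = 100, threshold: float = 0.5) -> int:
    history = []  # 1 = one-boxed, 0 = two-boxed, as seen by the predictor
    total = 0
    for _ in range(rounds):
        # With no history the predictor is maximally uncertain.
        p_one_box = sum(history) / len(history) if history else 0.5
        filled = p_one_box >= threshold  # predictor fills the big box or not
        if one_boxer:
            total += 1_000_000 if filled else 0
        else:
            total += (1_000_000 if filled else 0) + 1
        history.append(1 if one_boxer else 0)
    return total

print(total_winnings(True))   # → 100000000 (trusted every round)
print(total_winnings(False))  # → 1000100 (one lucky round, then $1 each)
```

In this toy model the two-boxer cashes in exactly once, before their reputation catches up with them, which matches the claim that frequent Newcombing makes one-boxing pay under history-based prediction.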

Mind Reading isn’t Cool

My big complaint with mind reading is that there just isn’t any mind reading. All my understanding of how people behave comes from observing how they behave in general, how the specific human I’m trying to understand behaves, whatever they have explicitly told me about their intentions, and whatever self-knowledge I have that I believe is applicable to all humans. Nowhere in the current world do people have to make decisions under the condition of being accurately simulated.

Why then do people develop so much decision theory intended to be robust in the presence of external simulators? I suppose it’s because there’s an expectation that this will be a major problem in the future, one which should be solved philosophically before it is practically important. Mind reading could become important to humans if mind surveillance becomes possible and deployed. I don’t think such a thing is possible in the near term, or likely even in the fullness of time. But I also can’t think of any insurmountable physical obstructions, so maybe I’m too optimistic.

Mind reading is relevant to AI safety because whatever AGI is created will likely be a program on a computer somewhere, which could reason that its program stack is fully transparent, or that its creators are holding copies of it for predictions.


Having written that last paragraph, I suddenly understand why decision theory in the AI community is the way it is. I guess I wasn’t properly engaging with the premises of the thought experiment. If someone actually did tell me I was about to do a Newcomb experiment I would still two-box, because knowing I was in the real world I wouldn’t really believe that an accurate predictor would be deployed against me. But an AI can be practically simulated, and what’s more, can reason that it is just a program run by a creator that could have created many copies of it.

I’m going to post this anyway, since it’s blog day and not important-quality-writing day, but I’m not sure this post has much of a purpose anymore.

  1. This may read like I’m already explicitly guided by the false purpose Wei Dai warned against. My understanding is that the goal is to understand ideal decision making, just not for the purposes of implementation. ↩︎

  2. I don’t really know anything about this, but I imagine the game theory of reputation is well developed. ↩︎