# The Mad Scientist Decision Problem

Consider Alice, the mad computer scientist. Alice has just solved general artificial intelligence and the alignment problem. On her computer she has two files, each containing a seed for a superintelligent AI: one is aligned with human values, the other is a paperclip maximizer. The two AIs differ only in their goals/values; the rest of the algorithms, including the decision procedures, are identical.

Alice decides to flip a coin. If the coin comes up heads, she starts the friendly AI; if it comes up tails, she starts the paperclip maximizer.

The coin comes up heads. Alice starts the friendly AI, and everyone rejoices. Some years later the friendly AI learns about the coinflip and the paperclip maximizer.

Should the friendly AI counterfactually cooperate with the paperclip maximizer?

What do various decision theories say in this situation?

What do you think is the correct answer?

• Yes, of course! The only requirement is that FAI must know the source code of hypothetical Clippy and vice versa. Assuming that, here’s one way it could work:

Humans are risk-averse: a 100% chance for humanity to get 40% of the universe is strictly better than a 50% chance to get the whole universe. Clippy, meanwhile, is presumably risk-neutral about paperclips. So we can come up with a deal that’s profitable for both parties, e.g. FAI converts 60% of the universe to paperclips in exchange for Clippy giving 40% of the universe to humanity. Let’s handwave away the difficulties of bargaining and assume that a single optimal deal is known to both parties.
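To make the arithmetic concrete: the specific utility functions below are illustrative stand-ins, not anything from the thread. Humanity gets a concave utility (square root of its share of the universe), Clippy a linear one. Under those assumptions, both sides prefer the 60/40 deal to the bare coinflip:

```python
import math

# Illustrative assumptions, not from the original comment:
# humanity's utility is sqrt(share of universe), Clippy's is
# linear in the paperclipped share.
human_no_deal   = 0.5 * math.sqrt(1.0)  # coinflip: 50% chance of everything
human_with_deal = math.sqrt(0.40)       # deal: guaranteed 40%

clippy_no_deal   = 0.5 * 1.0            # coinflip: 50% chance everything is clips
clippy_with_deal = 0.60                 # deal: guaranteed 60% is clips

print(human_with_deal > human_no_deal)    # True (~0.632 vs 0.5)
print(clippy_with_deal > clippy_no_deal)  # True (0.6 vs 0.5)
```

Any concave enough utility for humanity makes the deal a strict improvement for both sides; the exact split would come out of the handwaved bargaining.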

Now both FAI and Clippy would benefit if the following sentence S about computer programs was true: “FAI holds up its end of the deal if and only if Clippy holds up its end”. The key part is that S can be made true, if both sides implement a certain decision procedure that has no effect unless the other side does the same. Namely, FAI should implement the decision procedure “hold up my end of the deal if and only if S is provable in less than a million steps”, and Clippy should do the same. That will make S provable by a bounded variant of Löb’s theorem, so both sides will find the proof and hold up their ends of the deal.

Note that the proof of S relies on both sides implementing the decision procedure. If one side tries to cheat, S simply becomes unprovable and both sides know the deal isn’t happening.
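Here is a toy sketch of the mechanism, with heavy caveats: the real construction uses bounded provability in formal logic, while this replaces genuine proof search with bounded mutual simulation, and the optimistic base case stands in for the self-referential step that Löb’s theorem licenses in the real argument. All names are mine.

```python
# Toy model: each agent receives the other's source (here, just the
# Python function) plus a step budget standing in for the
# million-step proof bound. Not the real proof-theoretic construction.
COOPERATE, DEFECT = "C", "D"

def fairbot(opponent, budget):
    """Hold up the deal iff the opponent can be 'shown' to hold up theirs."""
    if budget <= 0:
        # Out of proof budget: assume the self-referential sentence S
        # holds (this optimistic base case plays the role of Löb's theorem).
        return COOPERATE
    return COOPERATE if opponent(fairbot, budget - 1) == COOPERATE else DEFECT

def defectbot(opponent, budget):
    """A cheater that never holds up its end."""
    return DEFECT

print(fairbot(fairbot, 10))    # C: two FairBots find mutual cooperation
print(fairbot(defectbot, 10))  # D: against a cheater, the deal falls through
```

As in the comment, the procedure has no effect unless the other side implements it too: against the cheater, cooperation becomes “unprovable” and FairBot defects.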

References: Rolf Nelson came up with AI deterrence in 2007, I came up with the proof-based mechanism in 2010, then MIRI took it further in Modal Combat.

• I’m not sure it can be assumed that the deal is profitable for both parties. The way I understand risk aversion is that it’s a bug, not a feature; humans would be better off if they weren’t risk averse (they should self-modify to be risk neutral if and when possible, in order to be better at fulfilling their own values).

• I was using risk aversion to mean simply that some resource has diminishing marginal utility to you. The von Neumann-Morgenstern theorem allows such utility functions just fine. An agent using one won’t self-modify to a different one.

For example, let’s say your material needs include bread and a circus ticket. Both cost a dollar, but bread has much higher utility because without it you’d starve. Now you’re risk-averse in money: you strictly prefer a 100% chance of one dollar to a 60% chance of two dollars and 40% chance of nothing. If someone offers you a modification to become risk-neutral in money, you won’t accept that, because it leads to a risk of starvation according to your current values.
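Putting illustrative numbers on that example (the utility values are my own stand-ins, not from the comment): say starving is worth 0, bread is worth 10, and bread plus the circus ticket is worth 11. Then the preference for the sure dollar falls straight out of expected utility:

```python
# Assumed utilities (illustrative): u($0) = 0 (starve),
# u($1) = 10 (bread), u($2) = 11 (bread + circus ticket).
u = {0: 0, 1: 10, 2: 11}

eu_certain = 1.0 * u[1]               # 100% chance of one dollar
eu_gamble  = 0.6 * u[2] + 0.4 * u[0]  # 60% chance of two dollars, 40% of nothing

print(eu_certain, eu_gamble)          # 10 vs 6.6
print(eu_certain > eu_gamble)         # True: risk aversion, no bug required
```

An agent that were linear in money would take the gamble ($1.20 expected vs $1.00), which is exactly the modification the comment says you should refuse.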

By analogy with that, it’s easy to see why humanity is risk-averse w.r.t. how much of the universe they get. In fact I’d expect most utility functions as complex as ours to be risk-averse w.r.t. material resources, because the most important needs get filled first.

• Uhm. That makes sense. I guess I was operating under the definition of risk aversion that makes people give up risky bets just because the alternative is a less risky bet, even if it actually translates into less absolute expected utility compared to the risky one. As far as I know, that’s the most used meaning of risk aversion. Isn’t there another term to disambiguate between concave utility functions and straightforward irrationality?

• I suspect you may be thinking of the thing where people prefer e.g. a (A1) 100% chance of winning 100€ (how do I make a dollar sign?) to a (A2) 99% chance of winning 105€, but at the same time prefer (B2) a 66% chance of winning 105€ to (B1) a 67% chance of winning 100€. This is indeed irrational, because it means you can be exploited. But depending on your utility function, it is not necessarily irrational to prefer both A1 to A2 and B1 to B2.
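The inconsistency in that preference pattern can be checked with two lines of arithmetic. Normalize u(0) = 0 and u(100) = 1 (any vNM utility can be rescaled this way); then preferring A1 to A2 bounds u(105) from above, while preferring B2 to B1 bounds it from below, and the bounds are incompatible:

```python
# A1 > A2:  1 > 0.99 * u(105)         =>  u(105) < 1 / 0.99
# B2 > B1:  0.66 * u(105) > 0.67 * 1  =>  u(105) > 0.67 / 0.66
upper_bound = 1 / 0.99     # ~1.0101
lower_bound = 0.67 / 0.66  # ~1.0152

print(lower_bound > upper_bound)  # True: no u(105) satisfies both preferences
```

The B gamble is just the A gamble with both probabilities scaled by two thirds, so the independence axiom forces the same ranking in both pairs; preferring A1 and B1 together is consistent, preferring A1 and B2 is not.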

• You’re right, the “irrational” kind of risk aversion is also very important. It’d be nice to have a term to disambiguate between the two, but I don’t know any. Sorry about the confusion, I really should’ve qualified it somehow :-/ Anyway I think my original comment stands if you take it to refer to “rational” risk aversion.

• Probably you should have simply said something similar to “increasing portions of physical space have diminishing marginal returns to humans”.

• I vehemently disagree. Expected utility is only an a priori rational measure iff the following hold:

1. Your assignment of probabilities is accurate.

2. You are facing an iterated decision problem.

3. The empirical probability mass function of the iterated decision problem doesn’t vary between different encounters of the problem.

If these conditions don’t hold, then EU is vulnerable to Pascal’s mugging.

Risk aversion is irrational iff you accept EU as the perfect measure of rational choice—I haven’t seen an argument for EU that justifies it in singleton (one-shot) decision problems.

• That’s mostly wrong. The vNM theorem applies just fine to one-shot situations and to subjective probabilities. And Pascal’s mugging only applies to utility functions that allow vast utilities.

• I am not an EU-maximiser. Explaining my decision theory would take a few thousand words, so you’ll have to wait for that, but I’ll offer an intuition pump below. Show that I can be Dutch-booked or otherwise money-pumped.

(I’ll use “–” instead of “_” because the editor is crap).

Suppose that the following are true about me (if you reject them, then suppose they are true about another agent).

1. I have an unbounded utility function.

2. Utility grows linearly in some quantity X (e.g. number of lives saved) for me (this is not necessary, but makes the intuition pump easier).

Consider the following decision problem; let’s call it π–4:

A = {a–1, a–2}

S = {s–1, s–2}

O = {(a–1, s–1) := 5X, (a–1, s–2) := 1X, (a–2, s–1) := 1X, (a–2, s–2) := ack(10)}

P(s–1) = 1 - (1*10^-10)

P(s–2) = 1*10^-10

What would you pick in π–4?

1. If you faced it just once

2. If you faced it in an iterated scenario an unknown number of times.

,

,

,

,

,

[Think­ing space]

,

,

,

,

,

My answers:

1. a–1

2. a–2

It doesn’t matter how high the payoff of (a–2, s–2) was; I would not choose it in scenario 1, but I would choose it in scenario 2.
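For contrast, here is what a straight expected-utility maximiser does with π–4, assuming utility linear in X. ack(10) cannot actually be computed, so a placeholder lower bound stands in for it; the true value is unimaginably larger, which only strengthens the conclusion:

```python
# Expected utilities in pi-4 under linear utility in X.
p_rare = 1e-10                 # P(s_2)
ack10_stand_in = 10**100       # vastly SMALLER than the real ack(10)

eu_a1 = 5 * (1 - p_rare) + 1 * p_rare              # ~5
eu_a2 = 1 * (1 - p_rare) + ack10_stand_in * p_rare # ~1e90

print(eu_a2 > eu_a1)  # True: an EU maximiser picks a_2 even one-shot
```

So the divergence is exactly in the one-shot case: EU maximisation recommends a–2 there too, which is the choice being refused above.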

If you insist I’m equivalent to an agent who maximises utility, then you imply:

My utility function varies depending on how many times I think I face the problem (bounded sometimes, unbounded others).

It is pretty clear to me that I simply don’t maximise expected utility.

Dutch-book me.

• What you described is compatible with EU maximization, except the part where you claim your utility to be linear in X. That seems like a wrong claim. The ultimate source of truth when determining an agent’s utility function is the agent’s preferences among actions. (The vNM theorem takes preferences among actions as given, and hacks together a utility function describing them.) And your preferences among actions imply a utility function that’s nonlinear in X.

• How does non-linearity lead to me choosing different options in single vs iterated problems?

I’m fine with saying I maximise expected utility (I interpret that as: it is possible to construct an expected-utility-maximising agent with some preference who would always choose the same strategy I do), but I’m not sure this is the case.

To offer insight into my utility function:

In singleton problems: if the probability of a set of states is below epsilon, I ignore that set of states.
In iterated problems, I consider it iff the probability of the set of states is high enough that I expect it to occur at least once during the number of iterations.

Only one state of the world will manifest. If I expect not to see that state of the world, I ignore it, irrespective of the payoff of that state. You could interpret this as a bounded utility function. However, in iterated problems I might consider that state, so my utility function isn’t bounded.

I’m trying to maximise utility, not expected utility. In problems with pathological (very unequal) probability distributions, I may completely ignore a certain set of states. This is because in a given singleton problem, I expect that state not to occur. I don’t care about other Everett branches, so some of the EU arguments also don’t move me.
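The rule described above can be read as code, with the caveat that every name, the epsilon threshold, and the “expect it at least once” test (p × iterations ≥ 1) are my own reading, not the commenter’s exact procedure:

```python
# Hypothetical sketch of the stated rule: drop rare states in
# singleton problems, keep states expected to occur at least once
# in iterated problems, then maximise over the surviving states.
def pick_action(actions, outcomes, probs, iterations=1, epsilon=1e-6):
    if iterations == 1:
        kept = [s for s, p in probs.items() if p >= epsilon]
    else:
        kept = [s for s, p in probs.items() if p * iterations >= 1]
    def score(action):
        return sum(probs[s] * outcomes[(action, s)] for s in kept)
    return max(actions, key=score)

# pi-4 with a stand-in for ack(10):
outcomes = {("a1", "s1"): 5, ("a1", "s2"): 1,
            ("a2", "s1"): 1, ("a2", "s2"): 10**100}
probs = {"s1": 1 - 1e-10, "s2": 1e-10}

print(pick_action(["a1", "a2"], outcomes, probs))                     # a1 one-shot
print(pick_action(["a1", "a2"], outcomes, probs, iterations=10**11))  # a2 iterated
```

On this reading, the same agent reproduces both answers above: s–2 is discarded in the singleton case, but survives once the expected number of occurrences reaches one.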

• DagonGod, you are clearly not getting the point here, which is that the vN-M theorem that defines utility is not compatible with you declaring values of your utility function. If you do that, you are no longer talking about the same concept of “utility”.

• The concept of a utility function is only relevant insomuch as you can model rational decision makers as possessing a utility function that they try to maximise in some way. I do possess a utility function (not necessarily in the vNM sense, as I don’t maximise expected utility, and maximising expected utility is implicit in the definition of vNM utility (this is a point of contention for me)). If I make choices that don’t maximise expected utility, then you must be able to demonstrate that I am irrational in some way (without special pleading to my failure to maximise EU). Either that, or maximising expected utility is not the perfect performance measure for rational choice.

• I’m not an expert on decision theory, but my understanding (of FDT) is that there is no reason for the AI to cooperate with the paperclip maximizer (cooperate how?) because there is no scenario in which the paperclip maximizer treats the friendly AI differently based on it cooperating in counterfactual worlds. For it to be a question at all, it would require that

1) the paperclip maximizer is not a paperclip maximizer but a different kind of unfriendly AI

2) this unfriendly AI is actually launched (but may be in an inferior position)

I think there could be situations where it should cooperate. As I understand it, updateless/functional decision theories may say yes; causal and evidential would say no.

• “1) the paperclip maximizer is not a paperclip maximizer but a different kind of unfriendly AI”

Being a paperclip maximizer is about values, not about decision theory. You can want to maximize paperclips but still use some acausal decision theory that will cooperate with decision makers that would cooperate with paperclippers, as in cousin_it’s response.

• That seems true, thanks for the correction.

• Depends on what value the FAI places on human flourishing in hypothetical alternate realities, I guess. If it’s focused on the universe it’s in, then there’s no reason to waste half of it on paperclips. If it’s trying to help out the people living in a universe where the paperclip maximizer got activated, then it should cooperate. I guess a large part of that is also about whether it determines there really are parallel universes to be concerned about or not.

• Just to be clear, I’m imagining counterfactual cooperation to mean the FAI building vaults full of paperclips in every region where there is a surplus of aluminium (or a similar metal). In the other possibility branch, the paperclip maximizer (which thinks identically) reciprocates by preserving semi-autonomous cities of humans among the mountains of paperclips.

If my understanding above is correct, then yes, I think these two would cooperate IF this type of software agent shares my perspective on acausal game theory and branching timelines.

• This is an interesting reformulation of Counterfactual Mugging. In the case where the cooperation of the paperclip maximiser is provable, I don’t see it as any different from a Counterfactual Mugging taking place before the AI comes into existence. The only way I see this becoming more complicated is when the AI tries to blackmail you in the counterfactual world.