# The Mad Scientist Decision Problem

Consider Alice, the mad computer scientist. Alice has just solved general artificial intelligence and the alignment problem. On her computer she has two files, each containing a seed for a superintelligent AI: one of them is aligned with human values, the other is a paperclip maximizer. The two AIs differ only in their goals/values; the rest of the algorithms, including decision procedures, are identical.

Alice decides to flip a coin. If the coin comes up heads, she starts the friendly AI, and if it comes up tails, she starts the paperclip maximizer.

The coin comes up heads. Alice starts the friendly AI, and everyone rejoices. Some years later the friendly AI learns about the coin flip and about the paperclip maximizer.

Should the friendly AI counterfactually cooperate with the paperclip maximizer?

What do various decision theories say in this situation?

What do you think is the correct answer?

Yes, of course! The only requirement is that FAI must know the source code of hypothetical Clippy and vice versa. Assuming that, here’s one way it could work:

Humans are risk-averse: a 100% chance for humanity to get 40% of the universe is strictly better than a 50% chance to get the whole universe. Clippy, by contrast, is presumably risk-neutral about paperclips. So we can come up with a deal that’s profitable for both parties, e.g. FAI converts 60% of the universe to paperclips in exchange for Clippy giving 40% of the universe to humanity. Let’s handwave away the difficulties of bargaining and assume that a single optimal deal is known to both parties.
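To make the arithmetic behind the 60/40 deal concrete, here is a minimal sketch. The square-root utility for humanity and the linear utility for Clippy are assumptions chosen only for illustration; the comment above does not specify any particular utility function, and the names below are hypothetical.

```python
import math

def expected_utility(lottery, utility):
    """Expected utility of a lottery given as (probability, share_of_universe) pairs."""
    return sum(p * utility(share) for p, share in lottery)

# Assumption: humanity's utility is concave in its share (sqrt is one arbitrary choice).
def humanity_utility(share):
    return math.sqrt(share)

# Assumption: Clippy's utility is linear in its share (risk-neutral).
def clippy_utility(share):
    return share

coin_flip = [(0.5, 1.0), (0.5, 0.0)]  # no deal: the winner takes the whole universe
humanity_deal = [(1.0, 0.4)]          # deal: humanity gets 40% in both branches
clippy_deal = [(1.0, 0.6)]            # deal: Clippy gets 60% in both branches

humanity_gain = (expected_utility(humanity_deal, humanity_utility)
                 - expected_utility(coin_flip, humanity_utility))
clippy_gain = (expected_utility(clippy_deal, clippy_utility)
               - expected_utility(coin_flip, clippy_utility))

print(f"humanity gains {humanity_gain:.3f}, Clippy gains {clippy_gain:.3f}")
```

Under these assumed utilities both gains are positive, so both sides prefer the deal to the coin flip. With a less concave utility for humanity the 40% share might not be enough, which is why the exact split is a bargaining question.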

Now both FAI and Clippy would benefit if the following sentence S about computer programs was true: “FAI holds up its end of the deal if and only if Clippy holds up its end”. The key part is that S can be made true, if both sides implement a certain decision procedure that has no effect unless the other side does the same. Namely, FAI should implement the decision procedure “hold up my end of the deal if and only if S is provable in less than a million steps”, and Clippy should do the same. That will make S provable by a bounded variant of Löb’s theorem, so both sides will find the proof and hold up their ends of the deal.

Note that the proof of S relies on both sides implementing the decision procedure. If one side tries to cheat, S simply becomes unprovable and both sides know the deal isn’t happening.

References: Rolf Nelson came up with AI deterrence in 2007, I came up with the proof-based mechanism in 2010, then MIRI took it further in Modal Combat.

I’m not sure it can be assumed that the deal is profitable for both parties. The way I understand risk aversion is that it’s a bug, not a feature; humans would be better off if they weren’t risk averse (they should self-modify to be risk neutral if and when possible, in order to be better at fulfilling their own values).

I was using risk aversion to mean simply that some resource has diminishing marginal utility to you. The von Neumann-Morgenstern theorem allows such utility functions just fine. An agent using one won’t self-modify to a different one.

For example, let’s say your material needs include bread and a circus ticket. Both cost a dollar, but bread has much higher utility because without it you’d starve. Now you’re risk-averse in money: you strictly prefer a 100% chance of one dollar to a 60% chance of two dollars and 40% chance of nothing. If someone offers you a modification to become risk-neutral in money, you won’t accept that, because it leads to a risk of starvation according to your current values.
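The bread-and-circus example can be checked numerically. The utility numbers below (10 for bread, 1 for the circus ticket) are assumptions for illustration only; the comment says only that bread has much higher utility.

```python
# Assumed utilities (illustrative numbers, not taken from the thread).
BREAD_UTILITY = 10.0
CIRCUS_UTILITY = 1.0

def utility_of_money(dollars):
    """Spend on the most important need first: bread, then the circus ticket."""
    utility = 0.0
    if dollars >= 1:
        utility += BREAD_UTILITY
    if dollars >= 2:
        utility += CIRCUS_UTILITY
    return utility

# Option 1: one dollar for certain.
sure_dollar = utility_of_money(1)

# Option 2: 60% chance of two dollars, 40% chance of nothing.
gamble = 0.6 * utility_of_money(2) + 0.4 * utility_of_money(0)
gamble_expected_dollars = 0.6 * 2 + 0.4 * 0

print(f"sure dollar: {sure_dollar}, gamble: {gamble:.1f}, "
      f"gamble's expected dollars: {gamble_expected_dollars:.1f}")
```

The gamble has more expected money (1.2 dollars versus 1) but less expected utility, so an expected-utility maximizer with these values turns it down without being irrational.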

By analogy with that, it’s easy to see why humanity is risk-averse w.r.t. how much of the universe they get. In fact I’d expect most utility functions as complex as ours to be risk-averse w.r.t. material resources, because the most important needs get filled first.

Uhm. That makes sense. I guess I was operating under the definition of risk aversion that makes people give up risky bets just because the alternative is a less risky bet, even if the safer bet actually has lower expected utility than the risky one. As far as I know, that’s the most used meaning of risk aversion. Isn’t there another term to disambiguate between concave utility functions and straightforward irrationality?

I suspect you may be thinking of the thing where people prefer e.g. (A1) a 100% chance of winning 100€ (how do I make a dollar sign?) to (A2) a 99% chance of winning 105€, but at the same time prefer (B2) a 66% chance of winning 105€ to (B1) a 67% chance of winning 100€. This is indeed irrational, because it means you can be exploited. But depending on your utility function, it is not necessarily irrational to prefer both A1 to A2 and B1 to B2.
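As a rough check of the claim about these four gambles, the sketch below searches for a utility assignment under which an expected-utility maximizer prefers A1 to A2 and also B2 to B1. It assumes the normalization u(0€) = 0 and u(105€) = 1, which loses no generality when winning 105€ is better than winning nothing, since vNM utilities are only defined up to a positive affine transformation. The function name is hypothetical.

```python
def prefers_a1_and_b2(u100, u105):
    """True if an expected-utility maximizer with u(0) = 0 strictly prefers
    A1 to A2 and, at the same time, B2 to B1."""
    a1, a2 = 1.00 * u100, 0.99 * u105
    b1, b2 = 0.67 * u100, 0.66 * u105
    return a1 > a2 and b2 > b1

# Assumed normalization: u(105) = 1, with u(100) scanned over [0, 1].
grid = [i / 10000 for i in range(10001)]
violations = [u100 for u100 in grid if prefers_a1_and_b2(u100, 1.0)]

print(violations)  # empty: no utility on the grid produces the A1-and-B2 pattern
```

The reason is that preferring A1 requires u(100) > 0.99 u(105), while preferring B2 requires u(100) < (0.66/0.67) u(105), which is about 0.985 u(105), and the two conditions cannot hold together. Preferring both A1 and B1 is consistent, for example with u(100) = 0.995 under the same normalization.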

You’re right, the “irrational” kind of risk aversion is also very important. It’d be nice to have a term to disambiguate between the two, but I don’t know any. Sorry about the confusion, I really should’ve qualified it somehow :-/ Anyway I think my original comment stands if you take it to refer to “rational” risk aversion.

Probably you should have simply said something similar to “increasing portions of physical space have diminishing marginal returns to humans”.

I vehemently disagree. Expected utility is an a priori rational measure only if the following hold:

1) Your assignment of probabilities is accurate.

2) You are facing an iterated decision problem.

3) The empirical probability mass function of the iterated decision problem doesn’t vary between different encounters of the problem.

If these conditions don’t hold, then EU is vulnerable to Pascal mugging.

Risk aversion is irrational iff you accept EU as the perfect measure of rational choice—I haven’t seen an argument for EU that justifies it in singleton (one-shot) decision problems.

That’s mostly wrong. The vNM theorem applies just fine to one-shot situations and to subjective probabilities. And Pascal’s mugging only applies to utility functions that allow vast utilities.

I am not an EU-maximiser. Explaining my decision theory would take a few thousand words, so you’ll have to wait for that, but I’ll offer an intuition pump below. Show that I can be Dutch booked or otherwise money pumped.

(I’ll use “–” instead of “_” because the editor is crap).

Suppose that the following are true about me (if you reject them, then suppose they are true about another agent).

I have an unbounded utility function.

Utility grows linearly in some quantity X (e.g. number of lives saved) for me (this is not necessary, but makes the intuition pump easier).

Consider the following decision problem; let’s call it π–4:

A = {a–1, a–2}

S = {s–1, s–2}

O = {(a–1, s–1) := 5X, (a–1, s–2) := 1X, (a–2, s–1) := 1X, (a–2, s–2) := ack(10)}

P(s–1) = 1 - (1*10^-10)

P(s–2) = 1*10^-10

What would you pick on π–4?

1) If you faced it just once?

2) If you faced it in an iterated scenario an unknown number of times?

[Thinking space]

My answers:

1) a–1

2) a–2

It doesn’t matter how high the payoff of (a–2,s–2) was, I would not choose it in scenario 1, but I would choose it in scenario 2.
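For comparison, here is a small sketch of what a plain expected-utility calculation says about π–4 when utility is linear in X. Since ack(10) is far too large to compute directly, the payoff of (a–2, s–2) is left as a parameter and the code solves for the break-even value instead. Exact rational arithmetic is used so that the tiny probability is not lost to rounding; all names are hypothetical.

```python
from fractions import Fraction

P_S2 = Fraction(1, 10**10)  # P(s-2) from the problem statement
P_S1 = 1 - P_S2             # P(s-1)

def expected_utilities(big_payoff):
    """Expected utility of a-1 and a-2, in units of X, assuming utility linear in X.
    big_payoff stands in for the (a-2, s-2) outcome, i.e. ack(10) in the original."""
    eu_a1 = P_S1 * 5 + P_S2 * 1
    eu_a2 = P_S1 * 1 + P_S2 * big_payoff
    return eu_a1, eu_a2

# Setting EU(a-1) == EU(a-2) and solving for the payoff V gives
# V = 4 * P(s-1) / P(s-2) + 1.
break_even = 4 * P_S1 / P_S2 + 1

print(break_even)  # 39999999997
```

Any payoff above roughly 4 × 10^10 X makes a–2 the expected-utility choice, in the one-shot case as much as in the iterated one. Assuming ack(10) denotes a value of the Ackermann function, it would exceed that threshold by an enormous margin, which is why picking a–1 in scenario 1 departs from maximizing expected utility with a utility function linear in X.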

If you insist I’m equivalent to an agent who maximises utility then you imply:

It is pretty clear to me that I simply don’t maximise expected utility.

Dutch-book me.

What you described is compatible with EU maximization, except the part where you claim your utility to be linear in X. That seems like a wrong claim. The ultimate source of truth when determining an agent’s utility function is the agent’s preferences among actions. (The vNM theorem takes preferences among actions as given, and hacks together a utility function describing them.) And your preferences among actions imply a utility function that’s nonlinear in X.

How does non-linearity lead to me choosing different options in single vs iterated problems?

I’m fine with saying I maximise expected utility (I interpret that as meaning it is possible to construct an expected-utility-maximising agent with some preferences who would always choose the same strategy I do), but I’m not sure this is the case.

To offer insight into my utility function:

Only one state of the world would manifest. If I expect not to see that state of the world, I ignore it, irrespective of the payoff of that state. You could interpret this as a bounded utility function. However, in iterated problems I might consider that state, so my utility function isn’t bounded.

I’m trying to maximise utility, and not expected utility. In problems with pathological (very unequal) probability distributions, I may completely ignore a certain set of states. This is because in a given singleton problem, I expect that state to not occur. I don’t care about other Everett branches, so some of the EU arguments also don’t move me.

DagonGod, you are clearly not getting the point here, which is that the vN-M theorem that defines utility is not compatible with you declaring values of your utility function. If you do that, you are no longer talking about the same concept of “utility”.

The concept of a utility function is only relevant insomuch as you can model rational decision makers as possessing a utility function that they try to maximise in some way. I do possess a utility function (not necessarily in the vNM sense, as I don’t maximise expected utility, and maximising expected utility is implicit in the definition of vNM utility; this is a point of contention for me). If I make choices that don’t maximise expected utility, then you must be able to demonstrate that I am irrational in some way (without special pleading to my failure to maximise EU). Either that, or maximising expected utility is not the perfect performance measure for rational choice.

I’m not an expert on decision theory, but my understanding (of FDT) is that there is no reason for the AI to cooperate with the paperclip maximizer (cooperate how?) because there is no scenario in which the paperclip maximizer treats the friendly AI differently based on it cooperating in counter-factual worlds. For it to be a question at all, it would require that

1) the paperclip maximizer is not a paperclip maximizer but a different kind of unfriendly AI

2) this unfriendly AI is actually launched (but may be in an inferior position)

I think there could be situations where it should cooperate. As I understand it, updateless/functional decision theories may say yes, while causal and evidential would say no.

“1) the paperclip maximizer is not a paperclip maximizer but a different kind of unfriendly AI”

Being a paperclip maximizer is about values, not about decision theory. You can want to maximize paperclips but still use some kind of acausal decision theory that will cooperate with decision makers that would cooperate with paperclippers, as in cousin_it’s response.

That seems true, thanks for the correction.

Depends on what value the FAI places on human flourishing in hypothetical alternate realities I guess. If it’s focused on the universe it’s in then there’s no reason to waste half of it on paperclips. If it’s trying to help out the people living in a universe where the paperclip maximizer got activated then it should cooperate. I guess a large part of that is also about whether it determines there really are parallel universes or not to be concerned about.

Just to be clear, I’m imagining counterfactual cooperation to mean the FAI building vaults full of paperclips in every region where there is a surplus of aluminium (or a similar metal). In the other possibility branch, the paperclip maximizer (which thinks identically) reciprocates by preserving semi-autonomous cities of humans among the mountains of paperclips.

If my understanding above is correct, then yes, I think these two would cooperate IF this type of software agent shares my perspective on acausal game theory and branching timelines.

This is an interesting reformulation of Counterfactual Mugging. In the case where the cooperation of the paperclip maximiser is provable, I don’t see it as any different from a Counterfactual Mugging taking place before the AI comes into existence. The only way I see this becoming more complicated is when the AI tries to blackmail you in the counterfactual world.