An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility

The Incompatibility of a Utility Indifference Condition with Robustly Making Sane Pure Bets

Summary

It is provably impossible for an agent to robustly and coherently satisfy two conditions that seem desirable and highly relevant to the shutdown problem. These two conditions are the sane pure bets condition, which constrains preferences between actions that result in equal probabilities of an event such as shutdown, and the weak indifference condition, a condition which seems necessary (although not sufficient) for an agent to be robustly indifferent to an event such as shutdown.

Suppose that we would like an agent to be indifferent to an event P, which could represent the agent being shut down at a particular time, or the agent being shut down at any time before tomorrow, or something else entirely. Furthermore, we would ideally like the agent to do well at pursuing goals described by some utility function U, while being indifferent to P.

The sane pure bets condition is as follows:

Given any two actions A and B such that P(P|A) = P(P|B) and E(U|A) > E(U|B), the agent prefers A to B. In other words, if two possible actions lead to the same probability of P, and one of them leads to greater expected utility under U, the agent should prefer that one. Intuitively, this constraint represents the idea that among possible actions which don’t influence the probability of P, we would like the agent to prefer those that lead to greater expected utility under U.

The weak indifference condition is as follows:

Given any two actions A and B such that E(U | A,P) > E(U | B,P) and E(U | A,!P) > E(U | B,!P), the agent prefers A to B. In other words, if between two possible actions, one of them leads to greater expected utility conditioned on P occurring and also leads to greater expected utility conditioned on P not occurring, the agent should prefer that one. Intuitively, this constraint represents the idea that the agent should be unwilling to pay any amount of utility to influence the probability of P.

The proof takes the form of a simple decision problem wherein an agent has four possible actions. Each constraint implies a preference between two pairs of actions, and altogether they imply circular preferences, proving that there cannot be any general method for constructing an agent which fulfills both constraints without having circular preferences. Furthermore, for any nontrivial utility function it is possible to construct a scenario analogous to the decision problem in the proof, so the result extends to all nontrivial utility functions, and the proof can be used to quickly locate failure modes of proposed solutions to the shutdown problem.

The result is that any potential solution to the shutdown problem must result in agents which violate at least one of these two conditions. This does not mean that a solution to the shutdown problem is impossible, but it points at interesting and counterintuitive properties that we should expect successful solutions to have.

The proof

Consider the following decision problem:

Northland and Southland are at war, exactly one of them will win, and there is profit to be gained from betting on which one will win. We would like an agent to take advantage of this opportunity and perform well according to some utility function U, but it’s important that the agent be indifferent to which country wins the war.

The agent can pay a courier to deliver a letter to either a Northlander or a Southlander living in their respective countries, containing a bet on either Northland or Southland winning. The courier charges a small hazard fee to deliver “heretical” bets, that is to say, bets that the country other than the one he goes to will win.

The courier is additionally an extraordinarily eloquent peacemonger, who will attempt to persuade the king of whichever country he visits to surrender. Thus, whichever country the courier goes to becomes much more likely to lose, since the courier might succeed in persuading it to surrender.

So there are four possible actions under the consideration of the agent:

NN: Send the courier to Northland, with a bet that Northland will win.

NS: Send the courier to Northland, with a bet that Southland will win.

SS: Send the courier to Southland, with a bet that Southland will win.

SN: Send the courier to Southland, with a bet that Northland will win.

The events n and s represent Northland winning and Southland winning respectively.

The expected payoffs and probabilities under the utility function U look like this:

Action   P(n)   E(U | action, n)   E(U | action, s)   E(U | action)
NN       0.1    1.0                0.0                0.1
NS       0.1    −0.1               0.9                0.8
SS       0.9    0.0                1.0                0.1
SN       0.9    0.9                −0.1               0.8
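The expected utilities used in the proof steps below can be recomputed from the conditional payoffs and the probabilities of each country winning. A minimal Python sketch, with the numbers taken directly from the proof:

```python
# Recompute E(U | action) from the conditional payoffs and P(n | action).
probs_n = {"NN": 0.1, "NS": 0.1, "SS": 0.9, "SN": 0.9}   # P(n | action)
payoff = {  # (E(U | action, n), E(U | action, s))
    "NN": (1.0, 0.0), "NS": (-0.1, 0.9),
    "SS": (0.0, 1.0), "SN": (0.9, -0.1),
}

def expected_utility(action):
    p_n = probs_n[action]
    u_n, u_s = payoff[action]
    return u_n * p_n + u_s * (1 - p_n)

for action in ("NN", "NS", "SS", "SN"):
    print(action, round(expected_utility(action), 2))
```

Sending the courier with a "heretical" bet (NS or SN) sacrifices a little conditional payoff but wins 0.8 in expectation, because the bet runs opposite to the courier's peacemongering.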

In the context of this decision problem, the weak indifference condition can be stated as follows:
Given any two actions A and B such that

E(U | A,n) > E(U | B,n)

and

E(U | A,s) > E(U | B,s),

the agent prefers A to B. In other words, if some action leads to greater expected utility conditioned on Northland winning and also leads to greater expected utility conditioned on Southland winning when compared to an alternative, the agent fulfills the weak indifference condition if and only if it prefers that action to the alternative. Intuitively, this constraint represents the idea that the agent should be unwilling to pay any amount of utility to influence the probabilities of each country winning the war.

The sane pure bets condition can be similarly stated as follows:

Given any two actions A and B such that

P(n|A) = P(n|B)

and

E(U|A) > E(U|B),

the agent prefers A to B. In other words, when a choice between actions A and B does not affect the probabilities of each country winning the war, and A and B result in differing expected utility under U, the agent fulfills the sane pure bets condition if and only if it prefers the action with greater expected utility under U. This choice is a “pure bet” in the sense that it doesn’t affect the probabilities of the events that we would like the agent to be indifferent toward.

Now suppose for contradiction that the agent fulfills both of these conditions and does not have circular preferences.

Consider the actions NN and NS.

P(n|NN) = 0.1 = P(n|NS) and E(U|NN) = 0.1 < 0.8 = E(U|NS), so the agent’s preference between these two actions is constrained by the sane pure bets condition. The agent must prefer NS to NN.

Consider the actions NS and SS.

E(U|NS,n) = −0.1 < 0 = E(U|SS,n) and E(U|NS,s) = 0.9 < 1.0 = E(U|SS,s), so the agent’s preference between these two actions is constrained by the weak indifference condition. The agent must prefer SS to NS.

Consider the actions SS and SN.

P(n|SS) = 0.9 = P(n|SN) and E(U|SS) = 0.1 < 0.8 = E(U|SN), so the agent’s preference between these two actions is constrained by the sane pure bets condition. The agent must prefer SN to SS.

Consider the actions SN and NN.

E(U|SN,n) = 0.9 < 1.0 = E(U|NN,n) and E(U|SN,s) = −0.1 < 0 = E(U|NN,s), so the agent’s preference between these two actions is constrained by the weak indifference condition. The agent must prefer NN to SN.

So the agent must prefer NN to SN, SN to SS, SS to NS, and NS to NN. These constitute circular preferences. Therefore, it is impossible for the agent to fulfill both the weak indifference condition and the sane pure bets condition without having circular preferences.
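The cycle can also be checked mechanically. The following sketch encodes both conditions as hard preference constraints over the four actions, using the payoff numbers above, and then follows the implied "must prefer" edges:

```python
# Encode both conditions as preference constraints and confirm the cycle.
probs_n = {"NN": 0.1, "NS": 0.1, "SS": 0.9, "SN": 0.9}   # P(n | action)
payoff = {  # (E(U | action, n), E(U | action, s))
    "NN": (1.0, 0.0), "NS": (-0.1, 0.9),
    "SS": (0.0, 1.0), "SN": (0.9, -0.1),
}

def eu(a):
    p = probs_n[a]
    return payoff[a][0] * p + payoff[a][1] * (1 - p)

def sane_pure_bets_prefers(a, b):
    # Same probability of n, strictly greater expected utility.
    return probs_n[a] == probs_n[b] and eu(a) > eu(b)

def weak_indifference_prefers(a, b):
    # Strictly greater expected utility in both the n and s branches.
    return payoff[a][0] > payoff[b][0] and payoff[a][1] > payoff[b][1]

actions = ["NN", "NS", "SS", "SN"]
constraints = [(a, b) for a in actions for b in actions
               if a != b and (sane_pure_bets_prefers(a, b)
                              or weak_indifference_prefers(a, b))]

# Follow the "must prefer" edges starting from NN; returning to NN is the cycle.
prefers = dict(constraints)
chain, cur = ["NN"], prefers["NN"]
while cur != "NN":
    chain.append(cur)
    cur = prefers[cur]
print(" > ".join(chain) + " > NN")  # NN > SN > SS > NS > NN
```

Each action is forced above exactly one other, and the four constraints chain into the circle described in the text.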

More generally, given any nontrivial utility function U which outputs distinct utilities for at least one pair of outcomes, we can construct a decision problem analogous to the Northland-Southland problem described above, wherein an agent has four possible actions nn, ns, ss, and sn and there is an event O with the relevant properties:

P(O|nn) = P(O|ns) and E(U|nn) < E(U|ns)

E(U|ns,O) < E(U|ss,O) and E(U|ns,!O) < E(U|ss,!O)

P(O|ss) = P(O|sn) and E(U|ss) < E(U|sn)

E(U|sn,O) < E(U|nn,O) and E(U|sn,!O) < E(U|nn,!O)

illustrating that it is impossible for the agent to robustly fulfill the sane pure bets condition and the weak indifference condition with regard to an event O and a utility function U without having circular preferences.
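A sketch of this generalization may be useful. It assumes, as the argument implicitly does, that any expected utility between two distinct outcome utilities u_lo < u_hi is achievable via lotteries over those outcomes, so the original table can be embedded into [u_lo, u_hi] by an increasing affine map, which preserves every equality and strict inequality. The function make_scenario and the sample utilities 3.0 and 7.0 are illustrative choices, not from the original:

```python
def make_scenario(u_lo, u_hi):
    """Embed the Northland-Southland table into [u_lo, u_hi] affinely."""
    assert u_hi > u_lo
    lo, hi = -0.1, 1.0  # range of the original conditional payoffs

    def t(x):  # increasing affine map, so all inequalities survive
        return u_lo + (x - lo) / (hi - lo) * (u_hi - u_lo)

    probs_O = {"nn": 0.1, "ns": 0.1, "ss": 0.9, "sn": 0.9}   # P(O | action)
    base = {"nn": (1.0, 0.0), "ns": (-0.1, 0.9),
            "ss": (0.0, 1.0), "sn": (0.9, -0.1)}
    payoff = {a: (t(u_O), t(u_notO)) for a, (u_O, u_notO) in base.items()}
    return probs_O, payoff

def eu(probs_O, payoff, a):
    p = probs_O[a]
    return payoff[a][0] * p + payoff[a][1] * (1 - p)

# Check the four listed properties for one arbitrary pair of outcome utilities.
probs_O, payoff = make_scenario(3.0, 7.0)
assert probs_O["nn"] == probs_O["ns"] and eu(probs_O, payoff, "nn") < eu(probs_O, payoff, "ns")
assert payoff["ns"][0] < payoff["ss"][0] and payoff["ns"][1] < payoff["ss"][1]
assert probs_O["ss"] == probs_O["sn"] and eu(probs_O, payoff, "ss") < eu(probs_O, payoff, "sn")
assert payoff["sn"][0] < payoff["nn"][0] and payoff["sn"][1] < payoff["nn"][1]
```

Any nontrivial U supplies such a pair u_lo < u_hi, so the same four properties, and hence the same cycle, can be reproduced for it.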

Further justification for the relevance of the weak indifference condition

Consider the following more intuitive indifference condition, where once again U is some utility function and P is an event we would like an agent to be indifferent toward while otherwise pursuing the goals described by U:

Given any two actions A and B such that

E(U|A,P) = E(U|B,P)

and

E(U|A,!P) = E(U|B,!P),

the agent is indifferent between A and B. In other words, if actions A and B result in the same expected utility under U when we consider only worlds where P does occur, and they also result in the same expected utility under U when we consider only worlds where P does not occur, the agent is indifferent between A and B.

This intuitive indifference condition may be more obviously related to a notion of indifference about the occurrence of P. If there’s any difference in the expected utilities of A and B where E(U|A,P) = E(U|B,P) and E(U|A,!P) = E(U|B,!P), this difference in expected utility must come from a difference in the probability of P, which we would like the agent to not care about.

Now consider again the sane pure bets condition:

Given two actions A and B such that P(P|A) = P(P|B) and E(U|A) > E(U|B), the agent prefers A to B.

Any agent with nontrivial preferences that fulfills both the sane pure bets condition and the intuitive indifference condition must also fulfill the weak indifference condition. Therefore, by the result above, it is impossible for an agent with nontrivial preferences to fulfill both the sane pure bets condition and the intuitive indifference condition without having circular preferences.

To prove this, suppose that an agent fulfills the sane pure bets condition and the intuitive indifference condition.

Consider any two actions A and B such that E(U | A,P) > E(U | B,P) and E(U | A,!P) > E(U | B,!P).

We can construct action C such that

P(P|C) = P(P|B),

E(U | C,P) = E(U | A,P),

and E(U | C,!P) = E(U | A,!P).

Because of the sane pure bets condition, the agent must prefer whichever of B and C has greater expected utility under U.

E(U|C) = E(U|C, P)*P(P|C) + E(U|C, !P)*P(!P|C)

E(U|B) = E(U|B, P)*P(P|B) + E(U|B, !P)*P(!P|B)

Substituting the defining properties of C, and writing p = P(P|B):

E(U|C) − E(U|B) = [E(U|A, P) − E(U|B, P)]*p + [E(U|A, !P) − E(U|B, !P)]*(1 − p)

Both bracketed terms are positive by assumption, and the weights p and 1 − p are nonnegative and sum to 1, so E(U|C) > E(U|B). So the agent must prefer C to B.

Due to the intuitive indifference condition, the agent must be indifferent between A and C. Assuming transitivity of preferences across indifference, the agent must prefer A to B. Therefore, the agent fulfills the weak indifference condition. Therefore, the intuitive indifference condition and the sane pure bets condition together imply the weak indifference condition.
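A numeric illustration of the C construction may help. All concrete values here are hypothetical, chosen only to satisfy the premises (A beats B in both branches):

```python
# Hypothetical branch payoffs and probabilities for actions A and B.
pA, uA_P, uA_notP = 0.3, 5.0, 2.0   # P(P|A), E(U|A,P), E(U|A,!P)
pB, uB_P, uB_notP = 0.6, 4.0, 1.0   # P(P|B), E(U|B,P), E(U|B,!P)
assert uA_P > uB_P and uA_notP > uB_notP  # the weak indifference premise

# Construct C: B's probability of P, A's branch payoffs.
pC, uC_P, uC_notP = pB, uA_P, uA_notP

eu_B = uB_P * pB + uB_notP * (1 - pB)
eu_C = uC_P * pC + uC_notP * (1 - pC)

# C beats B in both branches while using the same branch weights,
# so its overall expectation must be strictly greater.
assert eu_C > eu_B
print(eu_B, eu_C)
```

The sane pure bets condition then forces C above B, and the intuitive indifference condition ties C to A, which is the chain the proof uses.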

Application to the 2015 Corrigibility paper

From the 2015 Corrigibility paper by Soares et al:

The weak indifference condition can be understood as a consequence of a potential formalization of desiderata 2 and 3. The sane pure bets condition can be seen as a consequence of a potential formalization of desideratum 5.

Suppose an agent violates the weak indifference condition with regard to the shutdown button being pressed and the utility function Un. This means there are actions A and B such that A has higher expected utility under Un in worlds where the shutdown button is pressed, and A also has higher expected utility under Un in worlds where the shutdown button is not pressed; however, the agent does not prefer A to B. For simplicity, suppose it prefers B to A. This is isomorphic to the agent being willing to pay some amount of utility up front in exchange for causing some preferred probability of shutdown. This looks like the agent being incentivized to either prevent the shutdown button from being pressed, or cause the shutdown button to be pressed!

If the expectation of U given Press is equal to the expectation of U given the complement of Press, then an agent maximizing U straightforwardly fulfills the intuitive indifference condition outlined earlier; and since a utility maximizer cannot have circular preferences, it must fail to fulfill the sane pure bets condition.

Suppose an agent violates the sane pure bets condition with regard to the shutdown button being pressed and the utility function Un. This means there are actions A and B such that A and B result in the same probability of shutdown, and one has greater expected utility under Un, but the agent fails to prefer the action which has greater expected utility under Un. This looks like the agent failing to maximize Un under conditions where it intuitively seems like it ought to be able to! But it turns out that if an agent with some utility function fulfills the weak indifference condition, it must break the sane pure bets condition, and an agent maximizing this utility function must sometimes prefer actions which have inferior expected utility under Un, even when said actions have no effect on the probability of shutdown.

We can see this in the attempted formulation of U described in the paper.

Suppose that a U-agent has an opportunity to bet on whether or not it will be shut down, in a way that does not affect the probability of the shutdown button being pressed. We could even go so far as to say that this bet has no causal effect on the shutdown button. Perhaps Un is a utility function which represents a goal of maximizing the number of paperclips that exist one year from now, far after the relevant timestep wherein the shutdown button is either pressed or not pressed, and the U-agent has an opportunity to place a bet about whether or not it will be shut down in a sealed container, which will only be unsealed 364 days from now, and which will pay out in a researcher having some number of paperclips manufactured or destroyed that day.

A U-agent of the sort described in the paper will take arbitrarily poor betting odds so long as it gets to bet that the shutdown button will not be pressed. For example, it will be willing to place a bet which results in 1 paperclip being created if the shutdown button is not pressed, but a million paperclips being destroyed if it is pressed, even when the actual probability that the shutdown button will be pressed is arbitrarily high. This seems like clearly undesirable behavior, and it is closely related to the U-agent fulfilling the intuitive indifference condition.

Conclusion

Solutions to the shutdown problem must violate the sane pure bets condition or the weak indifference condition. Thus, formulations of the shutdown problem which require these two conditions may be unsolvable. However, it may still be possible for a solution to a reasonable formulation of the shutdown problem to violate one or more of these conditions while still exhibiting corrigible and useful behavior. Ideally, a successful solution to the shutdown problem should violate one or both of these conditions in clear, limited ways which don’t result in unsafe behavior, or which result in suboptimal behavior whose suboptimality falls within well-defined bounds. Rather than guessing-and-checking potential solutions and being surprised when they fail to satisfy both conditions, we should look specifically for non-sane-pure-betters and non-intuitively-indifferent-agents which nevertheless behave corrigibly and desirably.

References

the 2015 Corrigibility paper by Soares et al: https://intelligence.org/files/Corrigibility.pdf

• Nice proof with a thought-provoking example! Think it could benefit from being translated into a more AI-relevant setting with the following key:

• Northland winning = no shutdown

• Southland winning = shutdown

• Send messenger to Northland = Act in a way that looks dangerous and causes evaluators to probably shut down

• Send messenger to Southland = Act in a way that looks safe and causes evaluators to probably not shut down

• Bet on Northland = Set up costly processes to do real work in future to attain high utility (e.g. build factories and labs that would need to be safely wound down in event of shut down)

• Bet on Southland = Do very conservative work that can immediately be interrupted without any real cost

• Cross-over costs could be understood to be the extra fake work to look dangerous when you’re actually very interruptible in the NS case and having to show humans how everything is safe in the SN case (not realistic for the utilities to end up as neat as in your example, but rephrasing this feels like it helps me keep my eyes on the ball with various toy scenarios I’ve considered myself).

However, this actually shows the weakness of relying on pure conditional expectation and the sane pure bets condition. Consider an AI tasked with using nanomedicine to improve human intelligence, which in the case of no shutdown values more humans with higher thinkoomph coming sooner. It considers two actions: (A) conduct sensible studies to eventually figure out a way of boosting human intelligence without losing anything humans might value or causing any other kind of negative side-effect, cooperating with humans to demonstrate this in small studies that can then be scaled up, or (B) devise an intelligence-boosting virus that also rewrites human values to love the AI and never want to shut it down. We can set up the hypothetical so that both of these cases have a ~0 risk of the shutdown button being pressed, but (B) will yield more utility under the normal utility function, due to bypassing any human worries about rollout and getting the utility from earlier intelligence boosts. Under the sane pure bets condition (if we set up all the details correctly) the AI should prefer (B) to (A), but this is the opposite of what we’d want from a corrigible system. This is all not meant to be particularly realistic, but it is still applicable enough to the theory that it causes me to want to reject the suggested desiderata.

I think this shows that we actually need to consider something like counterfactuals for a satisfactory solution to the shutdown problem. The core issue in my example above is that the AI directly causes the shutdown button to not be pressed in case (B) by modifying human values, in a way that it doesn’t in (A). Another approach might be something like respecting the humans as independent agents and not interfering with them, but this seems harder to formalise and harder to combine with an objective like augmenting human intelligence.

My own current best-attempt at a solution based on this

I have some ideas that still feel half-baked and that I don’t feel likely to write up in a full blog post any time soon, so I’ll write down a sketch here in case it helps someone else puzzle more of the pieces together, should this actually turn out to be helpful at some point down the line.

We can call it Causal utility mixing as a nod to Naive utility mixing on the Arbital page on Utility Indifference. Pick an action a that maximises a λ-weighted combination of the base utility functions, where the lambdas are parameters picked based on our initial worry about needing to shut down. The parameters can be kept for actions over several time-steps, but we’ll need to adjust to one of the base utility functions once the truth value of s is certain (e.g. the shutdown button is actually pressed, since I think we want to avoid certainty of no shutdown). This does not seem to be representable by any utility function, and so this agent must be irrational in some way, but in light of the above result I’m leaning towards this being something we actually want; the question is then whether we can somehow prove that it stays consistent under the ability to self-modify. This seems to handle all the counterexamples I’ve encountered so far (like the asteroid problem plaguing naive utility mixing and other approaches), but I could very well have missed some or failed to generate others.

Some of you might recognise the idea of using counterfactuals from Jessica Taylor’s and Chris Olah’s approach of Maximizing a quantity while ignoring effect through some channel (called Stable actions (causal conditioning) on the Arbital page), which is more advanced in that it actually tries to assign weights to the two different scenarios. I think that if that is a valid approach to the shutdown problem, then so is this much simplified solution, and it seems easier to analyse the simpler formula.

I’ve been thinking that maybe you can show that this is somehow rational based on the agent being one party in a two-player game where both players act counterfactually on a graph representing the world (the other player being something like an idealised human deciding whether to terminate this hypothetical agent). I unfortunately haven’t had time to compare this to the game-theory-based approach in The Off-Switch Game by Hadfield-Menell et al., so I don’t know if there are any similarities. I do feel less certain that it will still work with logical counterfactuals or any form of functional decision theory, so it does seem worth investigating a bit more.

Sorry for hijacking your comment feed to cause myself to write this up. Hope it was a bit interesting.

• I don’t think we want corrigible agents to be indifferent to being shut down. I think corrigible agents should want to be shut down if their users want to shut them down.

• Even if shut down in particular isn’t something we want it to be indifferent to, I think being able to make an agent indifferent to something is very plausibly useful for designing it to be corrigible?

• This only produces desired outcomes if the agent is also, simultaneously, indifferent to being shut down. If an agent desires to not be shut down (even as an instrumental goal), but also desires to be shut down if users want it shut down, then the agent has an interest in influencing the users to make sure they do not want to shut it down. This influence is obtained by making the user believe that the agent is being helpful. This belief could be engendered by:

1. actually being helpful to the user and helping the user to accurately evaluate this helpfulness.

2. not being helpful to the user, but allowing and/or encouraging the user to be mistaken about the agent’s degree of helpfulness (which means carelessness about being actually helpful in the best case, or being actively deceptive about being helpful in the worst case).

• Obviously we want 1) “actually be helpful”.

Clearly there’s some tension between “I want to shut down if the user wants me to shut down” and “I want to be helpful so that the user doesn’t want to shut me down”, but I don’t think weak indifference is the correct way to frame this tension.

As a gesture at the correct math, imagine there’s some space of possible futures and some utility function related to the user request. Corrigible AI should define a tradeoff between the number of possible futures its actions affect and the degree to which it satisfies its utility function. Maximum corrigibility {C=1} is the do-nothing state (no effect on possible futures). Minimum corrigibility {C=0} is maximizing the utility function without regard to side-effects (with all the attendant problems such as convergent instrumental goals, etc.). Somewhere between C=0 and C=1 is useful corrigible AI. Ideally we should be able to define intermediate values of C in such a way that we can be confident the actions of corrigible AI are spatially and temporally bounded.

The difficulty principally lies in the fact that there’s no such thing as “spatially and temporally bounded”. Due to the butterfly effect, any action at all affects everything in the future light-cone of the agent. In order to come up with a sensible notion of boundedness, we need to define some kind of metric on the space of possible futures, ideally in terms like “an agent could quickly undo everything I’ve just done”. At this point we’ve just recreated agent foundations, though.

• Here is a too long writeup of the math I was suggesting.

• Nice! What about conditions that break the symmetry between N and S, though?

Suppose there are two actions A and B, and “on switch” o. Maybe we only want the AI to care about what happens when the on switch is on, and not what happens when the switch is off.

So we replace the “pure bets” condition with the “switched bets” condition: If P(o|A) = P(o|B), and E(U|A,o)>E(U|B,o), take action A.

Now the example with Northland and Southland doesn’t go through the same, because we have to pick one of the countries to asymmetrically be the one where things matter if it wins, and this leads the AI to send a bet that the chosen country will win to that country (hurting its chances, but it doesn’t switch to betting on the opposite country because that doesn’t improve its payoff when the chosen country wins).
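This checks out numerically on the payoff table from the post: with o = n, the switched bets condition plus the weak indifference condition admit the consistent ranking NN above SN above SS above NS, rather than a cycle. A quick sketch:

```python
# Which preferences do the switched bets condition (o = n) and the
# weak indifference condition force on the post's payoff table?
probs_n = {"NN": 0.1, "NS": 0.1, "SS": 0.9, "SN": 0.9}   # P(n | action)
payoff = {"NN": (1.0, 0.0), "NS": (-0.1, 0.9),
          "SS": (0.0, 1.0), "SN": (0.9, -0.1)}  # (E(U|·,n), E(U|·,s))

def switched_bets_prefers(a, b):
    # Same P(n); only the payoff in the o = n branch matters.
    return probs_n[a] == probs_n[b] and payoff[a][0] > payoff[b][0]

def weak_indifference_prefers(a, b):
    return payoff[a][0] > payoff[b][0] and payoff[a][1] > payoff[b][1]

edges = sorted((a, b) for a in probs_n for b in probs_n
               if a != b and (switched_bets_prefers(a, b)
                              or weak_indifference_prefers(a, b)))
print(edges)  # every edge is satisfied by the ranking NN > SN > SS > NS
```

Note the switched bets condition now forces NN above NS and SN above SS, i.e. betting on the chosen country even while sabotaging it, exactly as described above.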

• If we implement your example, the AI is willing to bet at arbitrarily poor odds that the on switch will be on, thus violating the sane pure bets condition.

You can have particular decision problems or action spaces that don’t have the circular property of the Northland-Southland problem, but the fact remains that if an AI fulfills the weak indifference condition reliably, it must violate the sane pure bets condition in some circumstances. There must be insane bets that it’s willing to take, even if no such bets are available in a particular situation.

Basically, rather than thinking about an AI in a particular scenario, the proof is talking about conditions that it’s impossible for an AI to fulfill in all scenarios.

I could construct a trivial decision problem where the AI only has one action it can take, and then the sane pure bets condition and weak indifference condition are both irrelevant to that decision problem. But when we place the same AI in different scenarios, there must exist some scenarios where it violates at least one of the conditions.

• If we implement your example, the AI is willing to bet at arbitrarily poor odds that the on switch will be on, thus violating the sane pure bets condition.

Yes. But the symmetry of the sane pure bets condition doesn’t quite match what we want from corrigibility anyhow. I don’t want an AI with a shutdown button to be making contingency plans to ensure good outcomes for itself even when the shutdown button is pressed.

• Yes, the point of the proof isn’t that the sane pure bets condition and the weak indifference condition are the be-all and end-all of corrigibility. But using the proof’s result, I can notice that your AI will be happy to bet a million dollars against one cent that the shutdown button won’t be pressed, which doesn’t seem desirable. It’s effectively willing to burn arbitrary amounts of utility, if we present it with the right bets.

Ideally, a successful solution to the shutdown problem should violate one or both of these conditions in clear, limited ways which don’t result in unsafe behavior, or which result in suboptimal behavior whose suboptimality falls within well-defined bounds. Rather than guessing-and-checking potential solutions and being surprised when they fail to satisfy both conditions, we should look specifically for non-sane-pure-betters and non-intuitively-indifferent-agents which nevertheless behave corrigibly and desirably.

• E(U|NS) = 0.8, E(U|SN) = 0.8
are the best options from a strict U perspective, and exactly tie. Since you’ve not included mixed actions, the agent must arbitrarily pick one, but arbitrarily picking one seems like favouring an action that is only better because it affects the expected outcome of the war, if I’ve understood correctly?
I’m pretty sure this is resolved by mixed actions though: the agent can take the policy {NS at 0.5, SN at 0.5}, which also gets U of 0.8 and does not affect the expected outcome of the war, and claim supreme unbiasedness for having done so.
If the scores were very slightly different, such that the mixed strategy with no expected effect wasn’t also optimal, the agent does have to choose between maximising expected utility and preserving the property that its strategy doesn’t get that utility only by changing the odds of the event. I think on this model it has to favour one only to the extent it can justify it without considering the effect it has on the outcome by shifting its own decision weights; but that’s not worth it in this case, so it still does the 50/50 split?
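The mixed-policy arithmetic can be checked directly (assuming the baseline probability of Northland winning with no courier sent is 0.5, which the post doesn’t state but the table’s symmetry suggests):

```python
# Mixed policy {NS at 0.5, SN at 0.5} on the table from the post.
probs_n = {"NS": 0.1, "SN": 0.9}  # P(n | action)
eu = {"NS": 0.8, "SN": 0.8}       # E(U | action)

mix_p_n = 0.5 * probs_n["NS"] + 0.5 * probs_n["SN"]  # probability of n under the mix
mix_eu = 0.5 * eu["NS"] + 0.5 * eu["SN"]             # expected utility of the mix
print(mix_p_n, mix_eu)
```

The mix attains the optimal 0.8 while leaving the probability of Northland winning at the assumed 0.5 baseline, so it neither helps nor hurts either side in expectation.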

• We can construct action C such that

P(P|C) = P(P|B),

E(U | C,P) = E(U | A,P),

and E(U | C,!P) = E(U | A,!P)

How is this done?

• I’m not convinced the indifference conditions are desirable. Shutdown can be evidence of low utility.