What’s Hard About The Shutdown Problem

I’ve been having some productive back-and-forth with Sami Petersen about his Invulnerable Incomplete Preferences (which is intended to complement Elliot Thornley’s approach to the Shutdown Problem). This post is a (non-formal) distillation of some useful pieces which I’ve gotten out of that discussion so far.

The Shutdown Problem

Conceptually: we want to equip a powerful AI with a shutdown button. If and when the button is unpressed, we want the AI to follow one goal (we’ll call it the “on-goal”). If and when the button is pressed, we want the AI to follow a different goal (we’ll call it the “off-goal”). Presumably the off-goal involves the AI shutting down itself and all its associated tools/​children/​etc, but that part’s flavor text for current purposes.

The key failure mode we want to avoid here is accidentally incentivizing the AI to manipulate the shutdown button—i.e. either cause it to be pressed, or cause it to be unpressed. Why would that happen? Well, suppose that when we first turn on the AI, it is just optimizing for the on-goal. The AI realizes that, if the button is pressed, its future self will stop optimizing for the on-goal, and therefore it will achieve lower performance on the on-goal. So, the AI realizes that it can achieve better performance on the on-goal by preventing the button from being pressed. And since it’s just optimizing for performance on the on-goal, it then prevents the button from being pressed.
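
To make the incentive concrete, here is a minimal numeric sketch. The probabilities and utilities are made-up assumptions purely for illustration; the point is just that, under those assumptions, leaving the button alone costs more expected on-goal performance than paying to disable it.

```python
# Toy sketch of the button-manipulation incentive.
# All numbers are illustrative assumptions, not taken from any proposal.
P_PRESS = 0.3         # chance the button gets pressed if left alone
U_ON_FULL = 10.0      # on-goal utility if the AI keeps optimizing
U_ON_SHUTDOWN = 2.0   # on-goal utility if the AI shuts down after a press
DISABLE_COST = 0.5    # on-goal utility sacrificed to disable the button

# Expected on-goal utility if the button is left alone:
eu_leave_alone = (1 - P_PRESS) * U_ON_FULL + P_PRESS * U_ON_SHUTDOWN  # 7.6

# Expected on-goal utility if the AI pays to disable the button first:
eu_disable = U_ON_FULL - DISABLE_COST  # 9.5

print(eu_leave_alone, eu_disable)  # the pure on-goal maximizer disables the button
```

The particular numbers don't matter: any setup where a button press costs more expected on-goal performance than button manipulation does will produce the same conclusion.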

MIRI’s paper on the shutdown problem goes through more examples of simple proposals, and how they incentivize button manipulation of one sort or another. It turns out that it’s generally Hard to get the desired behavior (i.e. no button manipulation) out of any expected utility maximizer.

Why It’s Hard, Level 1: Desired Behavior Implies Incomplete Revealed Preferences

Why is it hard to get an expected utility maximizer to not manipulate the button (while still doing useful stuff otherwise)? Here’s one intuitive answer I got out of the discussion with Sami.

The central idea is to ask what the behavior we want from the shutdown-button-agent implies about that agent’s revealed preferences.

Suppose that, at various times, the agent is offered opportunities to spend resources in order to cause the button to be pushed/unpushed. We want the agent to turn down such opportunities in both directions, implying either indifference or a lack of preference in any revealed preferences. Further, we do want the agent to spend resources to cause various different outcomes within the button-pressed or button-unpressed worlds, so there's a nontrivial revealed preference ordering within button-pressed worlds and within button-unpressed worlds. But if the agent turns down costly opportunities to cause the button to be pressed/unpressed, and those opportunities jump between enough different pressed-outcome and unpressed-outcome pairs (which themselves each have nontrivial revealed preferences), then there's going to be a revealed preference gap; i.e. the behavior cannot be represented by complete preferences, only incomplete ones.

Visually:

Agent spends resources to move from A1 to A2, but doesn’t spend resources to move in either direction between B1 and either A-outcome. This is a “preference gap”; it is of interest because any assignment of preferences consistent with the preference gap must be incomplete.

We want the agent to spend resources to move from outcome A1 to A2, within the button-unpressed worlds, so it has a revealed preference there. But we want the agent to not spend resources to move between A1 and B1 in either direction, so it has either revealed indifference (assuming the agent has nonzero bid/ask spread) or a revealed preference gap between A1 and B1. By the same logic, there's also either revealed indifference or a gap between A2 and B1. But the agent has a revealed preference for A2 over A1, so it cannot be revealed-indifferent between both A1/B1 and A2/B1: by transitivity, that would imply indifference between A1 and A2. So at least one of those two pairs must be a genuine gap, which means the agent's revealed preferences must be incomplete.[1]
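
To see the incompleteness mechanically, here is a small brute-force sketch of my own: it searches for a single utility function over {A1, A2, B1} matching the desired behavior, reading "won't pay to move in either direction" as indifference, which is the only option complete preferences leave.

```python
# Brute-force check (illustrative): look for any utility assignment over
# {A1, A2, B1} consistent with the desired behavior, reading "refuses to
# pay to move in either direction" as indifference.
import itertools

consistent = [
    (u_a1, u_a2, u_b1)
    for u_a1, u_a2, u_b1 in itertools.product(range(-3, 4), repeat=3)
    if u_a2 > u_a1       # pays to move A1 -> A2
    and u_a1 == u_b1     # won't pay to move between A1 and B1
    and u_a2 == u_b1     # won't pay to move between A2 and B1
]

print(consistent)  # [] : no assignment fits, since indifference to both
                   # A/B pairs would contradict u(A2) > u(A1)
```

This is just the transitivity argument restated: over a finite set of outcomes, complete transitive preferences are exactly the utility-representable ones, and indifference to both A/B pairs would force indifference between A1 and A2, contradicting the strict preference for A2.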

Upshot: the reason that the shutdown problem is difficult/interesting in the first place (or at least one way to frame it) is that the desired behavior implies a revealed preference gap. Since a standard expected utility maximizer, for example, cannot have a revealed preference gap, standard EU maximizers cannot behave the way we want.

Some notes on this:

  • Revealed preferences might be very underdetermined, i.e. there may be many assignments of preferences compatible with an agent’s behavior. The argument above says that any assignment of preferences consistent with the desired behavior must be incomplete.

  • Note that the story above implicitly routes through causal interventions/counterfactuals to probe preferences, i.e. we're interested in whether the agent will spend resources to cause the button to be pressed/unpressed. That won't be very relevant for this post, but may be important for people aiming to formalize all this.

  • Apparently sometimes people take “no preference” (i.e. incompleteness) between A and B to mean that an agent chooses randomly/​arbitrarily between the two. That is not the notion of “no preference” of relevance here; we’re talking about “no preference” as in “the agent will not give up either to gain the other, given the opportunity”.

Ok, so now we have a relatively intuitive story for why it’s Hard to get an expected utility maximizer to behave the way we want in the shutdown problem: insofar as EU maximizers don’t have revealed preference gaps, they can’t behave the way we want. (More generally, if we’re training an agent, any one-dimensional reward signal presumably cannot incentivize a revealed preference gap.) But then presumably the solution is to just… use an agent with incomplete preferences?

Why It’s Hard, Level 2: Incomplete Preferences Want To Complete

The post Why Not Subagents? goes into more depth on this, but I’ll give a brief overview here.

Looking back at our diagram with A1, A2, and B1, we could imagine a set of incomplete preferences which prefers A2 over A1, but has no preference between B1 and either of the A's (i.e. the agent won't spend resources to switch from button-unpressed to button-pressed outcomes, or vice versa). But this agent will spend resources to turn A1 outcomes into A2 outcomes.

Now the problem: this agent can potentially precommit, probabilistically, to accepting both opportunities to turn A1 outcomes into B1 outcomes (i.e. spend resources to make the button be pressed) and opportunities to turn B1 outcomes into A2 outcomes (i.e. spend resources to make the button not be pressed). If it accepts both with the right probabilities, then it can maintain the same proportions of pressed (B1) and unpressed (A1+A2) outcomes, while shifting some probability mass from A1 outcomes to A2 outcomes. That's an improvement by the lights of the incomplete preferences: the agent isn't accepting any net shift between A's and B's, it's just net-shifting probability around within the A-outcomes, but it's doing so by manipulating the shutdown button.
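
Here is a quick numeric sketch of that move; the baseline probabilities and acceptance probabilities are made up for illustration.

```python
# Illustrative trammelling arithmetic; all probabilities are made up.
# Baseline lottery: button unpressed with the worse on-goal outcome (A1)
# half the time, pressed (B1) the other half.
p_a1, p_a2, p_b1 = 0.5, 0.0, 0.5

p_press = 0.4    # in A1 worlds, accept the offer to cause a press    (A1 -> B1)
p_unpress = 0.4  # in B1 worlds, accept the offer to cause an unpress (B1 -> A2)

new_a1 = p_a1 * (1 - p_press)                     # 0.3
new_a2 = p_a2 + p_b1 * p_unpress                  # 0.2
new_b1 = p_a1 * p_press + p_b1 * (1 - p_unpress)  # 0.5

print(new_a1, new_a2, new_b1)
# P(B1) is unchanged at 0.5, but 0.2 of probability mass has moved from A1 to
# A2: an improvement by the lights of the incomplete preferences, achieved
# entirely by paying to manipulate the button in both directions.
```

Setting the two acceptance probabilities equal keeps the total probability of a press fixed here only because the baseline puts equal mass on A1 and B1; in general the agent just needs the expected flow into B1 to match the expected flow out of it.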

(I learned from Sami’s post that this is called “trammelling” of incomplete preferences.)

So we want some kind of agent which won’t do that.

Some notes on this:

  • The key question is whether the revealed preferences are immune to trammelling. This was a major point of confusion for me in discussion with Sami—his proposal involves a set of preferences passed into a decision rule, but those “preferences” are (potentially) different from the revealed preferences. (I’m still unsure whether Sami’s proposal solves the problem.)

  • That divergence, between revealed preferences and “preferences” in the sense of a goal passed to some kind of search/planning/decision process, potentially opens up some approaches to solving the problem.

  • One can obviously design a not-very-smart agent which has stable incomplete preferences. The interesting question is how to do this without major limitations on the capability of the agent or richness of the environment.

  • Note that trammelling involves causing switches between outcomes across which the agent has no preference. My instinct is that causality is somehow key here; we’d like the agent to not cause switches between pressed and unpressed outcomes even if the relative frequencies of both outcomes stay the same.

  1. ^

    This all assumes transitivity of preferences; one could perhaps relax transitivity rather than completeness, but then we're in much wilder territory. I'm not exploring that particular path here.