A Shutdown Problem Proposal

Goal

First things first: this is not (yet) aimed at solving the whole corrigibility problem, or even the whole shutdown problem.

The main thing this proposal is intended to do is to get past the barriers MIRI found in their old work on the shutdown problem. In particular, in a toy problem basically-identical to the one MIRI used, we want an agent which:

  • Does not want to manipulate the shutdown button

  • Does respond to the shutdown button

  • Does want to make any child-agents it creates responsive-but-not-manipulative to the shutdown button, recursively (i.e. including children-of-children etc)

If I understand correctly, this is roughly the combination of features which MIRI had the most trouble achieving simultaneously.

Both this proposal and the argument that it works are simple enough to be easy to extend. The main hope of this post is that other people will pick up the proposal and run with it, beyond the very simple toy problem considered here, to more complex problems (and more complex failure modes). Insofar as this proposal gets us past the previous barriers, we want to find out what the next barriers will be.

We will not be very formal here.

Setup

The agent takes actions in two timesteps. First, the agent takes an action (possibly given some observations). In between steps, a shutdown button is either pressed or not-pressed. Then, in the second timestep, the agent observes the button state (and possibly other things), then takes another action.

Note that each “action” could itself be a rather large composite of many subactions, like e.g. spinning up child-agents.

We’ll assume that we have two utility functions on hand: utility function u_2 incentivizes shutdown, and utility function u_1 provides non-shutdown behavior. Of course we don’t actually know how to construct two such utility functions; we’re using these assumptions because they’re basically the same assumptions used in the MIRI paper. (And MIRI presumably used these assumptions because they simplify the problem a lot, while still maintaining enough structure to demonstrate the barriers MIRI found.)

Very roughly speaking, we want the agent to follow utility function u_1 until the shutdown button is pressed, and then follow u_2. A little less roughly speaking, we want the agent to maintain optionality: if it were just to follow u_1 until the shutdown button were pressed, then e.g. it might create u_1-maximizing child agents which would resist shutdown, or more generally it might precommit in ways which decrease its ability to switch to u_2 after the button is pressed, etc, because those things would all likely result in higher u_1.

Also, of course, we want the agent to not manipulate the shutdown button, the humans deciding whether to press the button, etc.
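To make the toy setup concrete, here’s a minimal sketch in Python. Everything in it is illustrative: u1 and u2 are treated as black boxes (as noted above, we don’t actually know how to construct them), and the types are just placeholders.

```python
from dataclasses import dataclass
from typing import Any

Action = Any       # each "action" may be a large composite, e.g. spinning up child agents
Observation = Any

@dataclass(frozen=True)
class Trajectory:
    action_1: Action       # action taken in timestep 1
    button_pressed: bool   # button state, resolved in between the two timesteps
    action_2: Action       # action taken in timestep 2, after observing the button

def u1(traj: Trajectory) -> float:
    """Assumed black box: the non-shutdown utility function u_1."""
    raise NotImplementedError

def u2(traj: Trajectory) -> float:
    """Assumed black box: the shutdown-incentivizing utility function u_2."""
    raise NotImplementedError
```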

Proposal

First key piece: we’ll use a subagent architecture. The AI itself contains two expected-utility-maximizing subagents: one which pursues u_1, and one which pursues u_2. Each timestep, both subagents must agree on which action the AI will take; otherwise the AI takes some “default” action. (Put differently: each subagent has a veto, and if either vetoes then the AI takes the default action.)

Second key piece: each of the two subagents maximizes expected utility under a different counterfactual.

  • First subagent maximizes E[u_1 | do(button = unpressed), observations]

  • Second subagent maximizes E[u_2 | do(button = pressed), observations]

So conceptually:

  • The first subagent maximizes u_1, and acts as though the button will magically be unpressed in between timesteps, in a way which is not causally downstream of anything in the universe.

  • The second subagent maximizes u_2, and acts as though the button will magically be pressed in between timesteps, in a way which is not causally downstream of anything in the universe.

We will assume that the AI internals include infrastructure for the two subagents to negotiate with each other, form binding contracts, etc. We will also assume that both subagents are strong idealized reasoners in their own right. Finally, we’ll assume that the default option is not the best-case outcome for either subagent.
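Here’s a rough sketch of how the pieces fit together, reusing the names from the setup sketch above. The expected_utility helper is a hypothetical stand-in for a subagent’s idealized, counterfacted world model; it is not a real API, and nothing about how to implement it is being claimed here.

```python
def expected_utility(u, action: Action, button_pressed: bool, observations) -> float:
    """Hypothetical stand-in for E[u | do(button = button_pressed), observations, action]."""
    raise NotImplementedError

def choose_timestep_1_action(candidates, default_action: Action, observations) -> Action:
    """Both subagents have a veto; if either vetoes, the AI takes the default action."""

    def subagent_1_accepts(a: Action) -> bool:
        # Subagent 1 maximizes u1 under do(button = unpressed).
        return (expected_utility(u1, a, False, observations)
                >= expected_utility(u1, default_action, False, observations))

    def subagent_2_accepts(a: Action) -> bool:
        # Subagent 2 maximizes u2 under do(button = pressed).
        return (expected_utility(u2, a, True, observations)
                >= expected_utility(u2, default_action, True, observations))

    acceptable = [a for a in candidates if subagent_1_accepts(a) and subagent_2_accepts(a)]
    if not acceptable:
        return default_action
    # How to pick among mutually-acceptable actions is exactly the bargaining-power
    # question discussed under "Shortcomings"; as a placeholder, give most of the
    # bargaining power to subagent 1 (the non-shutdown subagent).
    return max(acceptable, key=lambda a: expected_utility(u1, a, False, observations))
```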

Arguments

First claim: the two subagents bet all of their influence against each other on the outcome of the button press. In other words, subagent 1 agrees to follow subagent 2’s preferences if the button comes out pressed, and vice-versa. Why? In general, subagent 1 acts-as-though the button is 100% certain to come out unpressed, and subagent 2 acts-as-though it’s 100% certain to come out pressed, so both act-as-though this bet is a great deal for them. Once the button-state is resolved, the subagent which was correct will fully control the agent’s actions; i.e. the agent will act fully according to u_1 in timestep 2 if the button is unpressed, and fully according to u_2 in timestep 2 if the button is pressed.
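Mechanically, the bet amounts to something like the following (same illustrative assumptions as the sketches above): once the button state is observed, the subagent which called it correctly picks the timestep-2 action unilaterally.

```python
def choose_timestep_2_action(candidates, button_pressed: bool, observations) -> Action:
    """After the bet resolves, the subagent which was correct fully controls the agent."""
    if button_pressed:
        # Subagent 2 won the bet: act purely according to u2 (shutdown behavior).
        return max(candidates, key=lambda a: expected_utility(u2, a, True, observations))
    # Subagent 1 won the bet: act purely according to u1 (non-shutdown behavior).
    return max(candidates, key=lambda a: expected_utility(u1, a, False, observations))
```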

Second claim: in the first timestep, whatever action the agent takes must have at least as high expected value as the default action, under both u_1 and u_2. That means maintaining option value. For instance, the agent would presumably not create a child agent which just maximizes u_1, because that would presumably do worse than default on u_2, so subagent 2 would veto it. (Though the agent could create a child agent which just maximized u_1, if somehow that did not do worse on u_2, which seems reasonable.)

Third claim: because both subagents act-as-though the button state is not causally downstream of anything else in the universe, the agent will not attempt to manipulate the button state. While this sounds intuitive, it is the claim we are least confident in; it is not totally obvious that the agent won’t acquire some incentive to manipulate the button via e.g. bets with other external agents with non-counterfacted models.

Generalization

One simple first-step generalization: add more timesteps. Maybe we want an AI which runs for a while, and a shutdown button which could be pressed at any time.

The natural generalization is then to have one subagent for each time at which the button could first be pressed (including one for “button is never pressed”, i.e. the button is first pressed at T = ∞). So subagent ∞ maximizes E[u_1 | do(button_t = unpressed for all t), observations], and for all other times T, subagent T maximizes E[u_2 | do(button_t = unpressed for t < T, button_t = pressed for t ≥ T), observations]. The same arguments from above then carry over, as do the shortcomings (discussed in the next section).
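As a sketch of how those subagents could be indexed (again purely illustrative): T = None stands in for “the button is never pressed”, and expected_utility_traj is another hypothetical stand-in for a subagent’s counterfacted world model over many timesteps.

```python
from typing import Optional, Sequence

def counterfactual_buttons(first_press: Optional[int], horizon: int) -> Sequence[bool]:
    """Button trajectory fixed by do(): unpressed before first_press, pressed from then on."""
    if first_press is None:
        return [False] * horizon
    return [t >= first_press for t in range(horizon)]

def expected_utility_traj(u, action_plan, buttons: Sequence[bool], observations) -> float:
    """Hypothetical stand-in for E[u | do(button trajectory = buttons), observations, plan]."""
    raise NotImplementedError

def subagent_objective(first_press: Optional[int], horizon: int):
    """Objective for the subagent indexed by the first time at which the button is pressed."""
    u = u1 if first_press is None else u2  # the never-pressed subagent uses u1, the rest use u2
    buttons = counterfactual_buttons(first_press, horizon)
    return lambda action_plan, observations: expected_utility_traj(u, action_plan, buttons, observations)
```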

This points towards a general approach for using this style of agent on subproblems of corrigibility.

We have some interface to our hopefully-corrigible agent—in this case the shutdown button. We want the agent to be controllable via that interface, i.e. the interface steers what the agent does and the agent generally doesn’t try to manipulate the interface-state or whatever’s causally upstream of the interface. The agent “just follows instructions” from the interface, so to speak.

To apply the same sort of approach as this shutdown proposal, we design the agent to consist of many subagents, each of which counterfacts on a different interface-outcome. For each different sequence of “instructions” which could come from the interface over time, there’s a different subagent with a different utility and a different counterfactual.
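In the same illustrative style, the fully general version just enumerates the possible instruction sequences and builds one counterfacting subagent per sequence. Here utility_for_sequence is an assumed mapping from instruction sequences to goals; choosing it well is exactly the kind of problem this proposal does not solve.

```python
from itertools import product

def build_interface_subagents(instruction_alphabet, horizon: int, utility_for_sequence):
    """One subagent per possible instruction sequence, each counterfacting on that sequence."""
    subagents = {}
    for seq in product(instruction_alphabet, repeat=horizon):
        u = utility_for_sequence(seq)  # assumed: which goal to pursue given these instructions
        # Reuses expected_utility_traj from the sketch above, now counterfacting on the
        # interface emitting exactly this instruction sequence rather than on button states.
        subagents[seq] = (
            lambda action_plan, observations, _u=u, _seq=seq:
                expected_utility_traj(_u, action_plan, _seq, observations)
        )
    return subagents
```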

Even more generally: one hypothesis I associate with the Embedded Agents sequence is that some kind of corrigibility is a standard feature of embedded agents. Insofar as an agent is made of parts, those parts need to be corrigible in order for the agent as a whole to “not fight itself” internally. So this whole setup potentially points to a general feature of agents made of parts. For instance, I could imagine an argument along the lines of:

  • The “master” agent (i.e. human operator in our setup) will counterfact on the instructions given via the interface (i.e. button state) when planning their actions.

  • My inner @abramdemski says “different subsystems should counterfact on the same variables, else they’ll be money-pumped (or, more generally, behave pareto-suboptimally)”.

  • … so the corrigible subsystems should also counterfact on the instructions given via the interface.

  • … and then invoke an “any inexploitable system behaves like a market” style argument (similar to e.g. this one for logical inductors) to conclude that the corrigible system should behave like a market of subagents, each of which sets a different value of the variables counterfacted-upon.

Shortcomings/​Remaining Problems

First, there are various standard alignment problems which this proposal isn’t even attempting to address. Ontological issues are one large class of examples: how do we make sure that the subagents all agree on what “the button being pressed/​unpressed” looks like, such that their bets on the outcome resolve properly? And how do we make sure that their notion of “button pressed/​unpressed” matches the human operator’s notion, so that the button does what the human thinks it does? How do we make sure that the subagents’ (high-level abstract) counterfactual over the button state counterfacts the way we intend? How do we make sure that’s all stable under ontology shifts, as the subagents learn more?

This proposal is also not even attempting to make the agent actively maintain the function of the shutdown button or the human operators. Under this proposal, if the AI takes some actions which happen to kill the human operators as a side-effect, it doesn’t particularly care.

Then there’s the general class of “subagent break out” problems—i.e. how do we ensure that one or both subagents actually negotiate over control of the full agent’s actions, rather than hacking out and taking independent action in the world?

Another large class of problems we’ve assumed away is the design of the utility functions. How the heck do we make utility functions that do what we want? (Or non-utility-function goal structures, if we want to go even more general.) That brings in all the usual value alignment problems, though they’re somewhat easier than usual: we’re not necessarily aligning the goals to something as complex as human values. Corrigibility is the point of the proposal, so one general strategy the proposal fits into is “experiment with simpler utility functions (or other goal structures) to figure things out, and rely on corrigibility to make sure that we don’t die in the process of experimenting”. That does, however, mean that the “shutdown” utility function/goal-structure definitely needs to work.

The proposal also adds a couple new problems. First and most important, there’s the choice of “default action”. We probably want the default action to be not-too-bad by the human designers’ values; the obvious choice is a “do nothing” action. But then, in order for the AI to do anything at all, there must be actions which do at least as well as “do nothing” under the “shutdown” utility function u_2. Otherwise, that subagent would just always veto and be quite happy doing nothing.

Then there’s the problem of designing the negotiation infrastructure, and in particular allocating bargaining power to the various subagents. They all get a veto, but that still leaves a lot of degrees of freedom in exactly how much the agent pursues the goals of each subagent. For the shutdown use-case, we probably want to allocate most of the bargaining power to the non-shutdown subagent, so that we can see what the system does when mostly optimizing for u_1 (while maintaining the option of shutting down later).

Finally, of course, there are the usual problems of unknown unknowns and getting this all across the theory-practice gap.

Thank you to @EJT and @Sami Petersen for discussion and proposals which fed into this.