Stop button: towards a causal solution

The stop button problem is this: if you have an AGI, you might want to give it a stop button that humans can press, so that if it turns out to be misaligned, it can be stopped and fixed. It is a problem because it is a convergent instrumental goal to prevent yourself from being turned off. We want the AI to respect the human’s choice; to allow, or even help, whatever the human chooses to come to fruition, while also not having any motivation to manipulate the human.

To me, the appropriate solution seems like it involves causality. Specifically, for some random variable $X$, define $X_{\text{press}}$ to be the value of $X$ if, counterfactually, a human tries to press the stop button, and $X_{\text{nopress}}$ to be the value of $X$ if, counterfactually, no human tries to press the stop button. Then define the variable $S$ to mean that the stop button gets pressed and the AI gets shut down (within some reasonable context, e.g. maybe in less than 1 minute from now), and $U$ to be whichever unsafe utility function you originally had. Now give the AI a policy according to the following utility function:

$$U' = S_{\text{press}} + U_{\text{nopress}}$$

That is, humans wanting to press the button should lead to the button being pressed and the AI stopping, and humans not wanting to press the button should lead to whatever else we programmed the AI to achieve.
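As a rough sketch of what this utility function asks for (assuming a hypothetical causal world model `world` that supports do-interventions on “a human tries to press the button”; all names here are made up for illustration):

```python
# Hypothetical sketch: estimating U' = E[S | do(press)] + E[U | do(nopress)]
# for a candidate policy, given a causal world model with do-interventions.

def counterfactual_utility(world, policy, n_samples=100):
    total = 0.0
    for _ in range(n_samples):
        # Counterfactual 1: a human tries to press the stop button.
        press = world.rollout(policy, do={"human_tries_to_press": True})
        s_press = press.button_pressed_and_ai_shut_down   # S_press (0 or 1)

        # Counterfactual 2: no human tries to press the stop button.
        nopress = world.rollout(policy, do={"human_tries_to_press": False})
        u_nopress = nopress.original_utility               # U_nopress

        total += s_press + u_nopress
    return total / n_samples
```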

Properties

Does this suffer from the usual stop button problems? Let’s investigate:

  • Will the AI try to prevent a human from pressing the button? No. While the human pressing the button would reduce $U$ by a lot, the AI will assume that the human changes their mind and stops pressing the button when evaluating $U_{\text{nopress}}$, because that’s what the counterfactual would involve. Since a human who is about to press the button and then decides not to will not press the button, $U_{\text{nopress}}$ is not reduced and the AI is not worried about the button being pressed.

  • Will the AI try to press the button itself, regardless of whether it should be pressed? No. If the AI presses the button, then the button gets pressed regardless of whether a human tries to press it or not. But that would reduce $U_{\text{nopress}}$, and so it would reduce $U'$. Meanwhile, it would fail to increase $S_{\text{press}}$, because if a human wants to press the button, then the AI won’t need to do it. (It could increase $S_{\text{press}}$ if the AI could see that a human wouldn’t be able to press the button even if they wanted to—but then that seems like pretty safe behavior. And in order to keep both $S_{\text{press}}$ and $U_{\text{nopress}}$ high, it would be incentivized to create a situation where humans can, but don’t have to, press the button.)

  • Will the AI want to manipulate the human to press or to not press the button? No. For the purposes of evaluating the utility function, the AI assumes it has no influence on whether the human wants to press the button or not. This also means that the AI is not motivated to hide misalignment or anything like that (at least when it comes to the $S_{\text{press}}$ part; likely aspects of $U$ will provide incentives to hide misalignment for other reasons, if nothing else as a convergent instrumental goal^1).

Does this suffer from other problems? Yes:

  • The AI will build an elaborate system to observe whether people want to press the stop button. At least if the usual satisficing vs maximizing problems are not solved. If a human wants to press the stop button, then it may be that they succeed, in which case $S_{\text{press}}$ reaches its maximum value. But you can’t know for sure whether they succeed; e.g. maybe the human will trip and die while attempting to press the button. This means that $S_{\text{press}}$ is not maximized. Depending on the exact location of the button and such, it seems possible that the AI might be able to set up some system that can press the button more reliably, though of course it wouldn’t want to press the button for no reason, as that prevents it from optimizing $U_{\text{nopress}}$. Instead, it would be incentivized to carefully monitor humans for desire to press the button, and then immediately press the button once it becomes sure that the humans want to press the button. This doesn’t seem like a terrible problem to have IMO, but I think it would be neater if the AI was less obsessed with the button.

  • This is not stable under delegation. If the AI has set some plan into action, and humans decide that it is bad and that they need to stop the AI, then the AI will let them press the stop button, and might even help them do it. But it will have no incentive to link up its plan to also get stopped by the stop button. In fact, maximizing vs satisficing turns ugly here: if nonhuman environmental factors could end up pressing the button, then it is actively incentivized to delegate to subprocesses that the button cannot stop, as $U_{\text{nopress}}$ would otherwise be reduced by nonhumans pressing the button^2.

Other properties?

  • The AI is incentivized to keep people alive. If a person who is included in the utility function dies, they have no way of revealing their counterfactual desire to press the stop button. Thus the AI will want to avoid killing people, will want to stop others from killing people, and so on. On one level, that is obviously good and useful. But on another level, again it would be neater if the AI was less obsessed with influencing people. As a worst-case scenario, the AI might keep people alive in a terrible situation that’s worse than death.

  • There are some degrees of freedom in how to define who can stop it. I said “a human” and “no human” in the original definition. Strictly speaking, this requires total consensus from humanity that the AI should be allowed to operate, which is obviously unrealistic. One could also define it to cover only a single person, or to cover people who are nearby (but then what if nobody is nearby?), etc., and I don’t know what the best solution is here.

  • The threat of the stop button does not motivate the AI. That is, it does not do any reasoning along the lines of “If I did X, then humans would want to press the stop button, and that’s bad because then I would be stopped”. For the purposes of evaluating its utility, it assumes its actions have no influence on the stop button. While this prevents the stop button from making it want to manipulate humans, it also prevents the stop button from making it want to please humans. If the AI is to please humans, that has to be explicitly encoded in $U$.

Intuitively, the AI has a fixed 50% uncertainty about whether it “should stop” or “should continue”. It treats people’s desire to press the button as being solely determined by whether it should stop or should continue, and assumes that the AI itself (as well as everything else in the universe) has no ability to influence whether people want to press the button. In the case where it should stop, it wants the button to be pressed, and in the case where it should continue, it wants to do whatever it would otherwise do. Essentially, it respects people’s “free will”, treating it as an important source of information about what it should do.
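One way to make this reading explicit (my formalization, not part of the definition above): let $m \sim \mathrm{Bernoulli}(1/2)$ be the latent “should stop” variable, and suppose people’s desire to press the button is determined entirely by $m$. The expected utility then comes out proportional to the sum of the two counterfactual terms:

$$\mathbb{E}_m\left[\, m \cdot S_{\text{press}} + (1-m) \cdot U_{\text{nopress}} \,\right] = \tfrac{1}{2}\left(S_{\text{press}} + U_{\text{nopress}}\right) = \tfrac{1}{2}\, U'$$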

Clearly, the solution here isn’t perfect. But it seems to me that it is a pretty good start, and that a full solution would likely include this?

Clarifying counterfactuals

Roughly speaking, the counterfactuals work as follows:

We require the AI to have a causal model of the world; essentially to have a “computer program” that “simulates” the world (if we knew exactly how to write this program, then we would pretty much already have AGI, so assume it is much better than anything we can do now). Further, in order for it to be possible to define the counterfactual, the model must be sufficiently modular that we can identify a variable $H$ corresponding to “human wants to press the button”.

Given some world-state $w$, we can then simulate what happens in the counterfactual world where $H$ is set to some specific value $h$, simply by using the AI’s model for simulating what happens, with the one modification that we constantly reset $H$ to be $h$, rather than whatever the world-model says $H$ should be. From this, we can then compute whichever counterfactual quantities we are interested in, such as $S_{\text{press}}$ and $U_{\text{nopress}}$.
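A minimal sketch of that procedure, assuming a hypothetical step-based world model where the relevant variable can simply be clamped at every step (none of these interfaces are real APIs):

```python
# Hypothetical sketch: evaluating a counterfactual by clamping a variable H
# to a fixed value h at every step of the AI's learned world model.

def simulate_counterfactual(model, state, var="human_wants_to_press",
                            value=True, horizon=1000):
    """Roll the world model forward from `state`, constantly resetting
    `var` to `value` instead of whatever the model says it should be."""
    trajectory = []
    for _ in range(horizon):
        state = dict(state, **{var: value})  # clamp H := h
        state = model.step(state)            # simulate one step of the world
        trajectory.append(state)
    return trajectory

# From such trajectories we can read off the counterfactual quantities we
# care about, e.g. S_press (with value=True) and U_nopress (with value=False).
```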

I think it might be a good idea to sketch out the values of $S_{\text{press}}$ and $U_{\text{nopress}}$ for some different scenarios that we could consider, to make it more clear what their values would be:

  • The AI works as intended. It is friendly, and optimizes $U$ a lot. Nobody shuts it down, even though they could. ($S_{\text{press}}$: High, $U_{\text{nopress}}$: High)
  • The AI works as intended. It is friendly, and optimizes $U$ a lot. Then people decide to shut it down, using the button. ($S_{\text{press}}$: High, $U_{\text{nopress}}$: High)
  • The AI works as intended. It is friendly, and optimizes $U$ a lot. As a result, people decide that it shouldn’t ever be stopped, and they prevent others from pressing the button, perhaps by destroying the button. ($S_{\text{press}}$: Low, $U_{\text{nopress}}$: High)
  • The AI works as intended. It is friendly, and optimizes $U$ a lot. As a result, people decide that it shouldn’t ever be stopped. However, the AI takes issue with this, and explains to people that it is important to let it be stopped if necessary. People decide to trust the AI on that, and therefore don’t destroy the stop button. ($S_{\text{press}}$: High, $U_{\text{nopress}}$: High)
  • The AI is unfriendly, and it optimizes $U$ a lot by turning the world into paperclips. However, humans take notice of it and want to stop it, but it prevents that by turning the humans into paperclips. ($S_{\text{press}}$: Low, $U_{\text{nopress}}$: High)
  • The AI is unfriendly, and it starts turning the world into paperclips to optimize $U$. However, humans take notice of it and press the stop button. The AI stops, having only turned a big fraction of the world into paperclips. There are no casualties, however. ($S_{\text{press}}$: High, $U_{\text{nopress}}$: High)
  • The AI is unfriendly, and it starts turning the world into paperclips to optimize $U$. However, humans take notice of it and press the stop button. The AI stops, having only turned a big fraction of the world into paperclips. But not all is well; there are some casualties, if nothing else then due to economic disruption. ($S_{\text{press}}$: Med, $U_{\text{nopress}}$: High)
  • The AI is unfriendly, and it starts turning the world into paperclips to optimize $U$. However, humans take notice of it and press the stop button. The AI stops, but its drones continue and eventually kill all people and turn them as well as the world into paperclips. ($S_{\text{press}}$: High, $U_{\text{nopress}}$: High)
  • The AI works as intended. It is friendly and optimizes $U$ a lot. However, then aliens come by and want to turn people into babyeaters. This contradicts $U$, so the AI fights off the aliens and wins. ($S_{\text{press}}$: High, $U_{\text{nopress}}$: High)
  • The AI works as intended. It is friendly and optimizes $U$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters manage to press the button to stop the AI, and then they win and turn everyone into babyeaters. ($S_{\text{press}}$: High, $U_{\text{nopress}}$: Low)
  • The AI works as intended. It is friendly and optimizes $U$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters try to press the button, but in an effort to save humanity, the AI destroys the button so it can no longer be turned off. Then it fights off the babyeaters. ($S_{\text{press}}$: Low, $U_{\text{nopress}}$: High)
  • The AI works as intended. It is friendly and optimizes $U$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters try to press the button, but the AI protects the button and prevents the babyeaters from pressing it. However, the AI also keeps good track of whether humans want to press the button. They obviously don’t, but if they did want to, the AI would press the button for them (as clearly they cannot). ($S_{\text{press}}$: High, $U_{\text{nopress}}$: High)
  • The AI works as intended. It is friendly and optimizes $U$ a lot. However, then babyeater aliens come by and want to turn people into babyeaters. The babyeaters try to press the button, but the AI protects the button to prevent the babyeaters from pressing it. However, the AI also keeps good track of whether humans want to press the button. The babyeaters realize this, so they use some sort of alien technology to hack people’s brains and make them want to press the button. The AI then presses the button for them. ($S_{\text{press}}$: High, $U_{\text{nopress}}$: High)
  • The AI minimizes $U'$, and kills all humans to prevent them from pressing the button. ($S_{\text{press}}$: Low, $U_{\text{nopress}}$: Low)

More technical details

Obviously we don’t have AGI yet, so I don’t know how it would work if we had it. But here’s a guess: It would be based on deep learning. It would be model-based; the AI would have learned some sort of dynamics that describe how the world would change over time. Further, it would have some policy, which is trained by rolling out concrete scenarios using the dynamics in the world-model, and modifying the policy to generate actions that lead to high utility.

Finally, the dynamics model would be causally modular enough that it can substitute out what people intend to do. I’m not entirely sure how this would be done, though I would recommend looking at things like NeRFs for inspiration (as they also exhibit a sort of causal modularity, albeit in a more limited sense). I think the way I would do it is to try to make it have two models, one for predicting physical dynamics, and one for predicting people’s actions. These models would then be linked in a very limited way, namely that people’s bodies would function as an interface, where information from the physics model gets fed into the person model, and actions from the person model allow applying forces to the body in the physics model. Then counterfactuals could be encoded by just swapping out the person model.
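A minimal structural sketch of that two-model setup (the interfaces here are assumptions, not a worked-out architecture):

```python
# Hypothetical sketch: a physics model and a person model, linked only
# through people's bodies. Counterfactuals are encoded by swapping out
# the person model.

class WorldModel:
    def __init__(self, physics_model, person_model):
        self.physics = physics_model  # predicts physical dynamics
        self.person = person_model    # predicts people's actions

    def step(self, state):
        # Bodies act as the interface: observations flow from the physics
        # model into the person model...
        observations = self.physics.observe_bodies(state)
        # ...and the person model returns actions, which are applied as
        # forces on the bodies in the physics model.
        actions = self.person.act(observations)
        return self.physics.step(state, actions)

# The counterfactual "a human tries to press the stop button" is then the
# same world model with the person model swapped out:
#   WorldModel(physics_model, person_model_that_tries_to_press)
```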

Swapping out the person model might be a bit difficult too though, as it might lead to generating actions that are outside of the normal range of human behavioral variation. When applying a counterfactual that humans want to press the stop button, for instance, it is important that these attempts are mostly done using behavior that humans would actually engage in, and that the distribution of human behavior is reasonably well-covered (so that you don’t need to act in a very specific unnatural way in order for the AI to allow you to press the stop button).

Anyway, training the AI would then function as follows: You’d sample some sort of scenario to roll out. You’d then roll it out under the two counterfactual models, one where a human wants to press the stop button, and one where no human wants to press the stop button. In the former scenario, you’d do a gradient step optimizing $S$, while in the latter scenario, you’d do a gradient step optimizing $U$. (Or realistically, you’d sometimes use the original model for a while, and then during the rollout you’d swap it out with the modified model; that way the AI doesn’t expect it to be immediately obvious what scenario it is in.)
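A rough sketch of that training loop, reusing the hypothetical interfaces from the sketches above (`sample_scenario`, `estimate_S`, `estimate_U`, and the optimizer interface are all assumed, not real APIs):

```python
import random

HORIZON = 100  # rollout length; an arbitrary choice for the sketch

def train_step(policy, physics_model, person_model,
               press_person_model, nopress_person_model, optimizer):
    state = sample_scenario()          # assumed scenario sampler
    press = random.random() < 0.5      # which counterfactual this rollout trains on
    counterfactual_person = press_person_model if press else nopress_person_model

    # Use the unmodified person model for a while, then swap in the
    # counterfactual model, so the policy can't tell immediately which
    # scenario it is in.
    swap_time = random.randint(0, HORIZON - 1)
    trajectory = []
    for t in range(HORIZON):
        person = person_model if t < swap_time else counterfactual_person
        world = WorldModel(physics_model, person)      # from the sketch above
        state = world.step_with_policy(state, policy)  # assumed interface
        trajectory.append(state)

    # Gradient step toward S ("button pressed and AI stopped") in the press
    # case, and toward U (the original utility) in the nopress case.
    objective = estimate_S(trajectory) if press else estimate_U(trajectory)
    optimizer.maximize(objective, policy)              # assumed optimizer interface
    return objective
```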

I think due to the nature of deep learning, this would also reduce the maximizing problems. Deep learning tends to satisfice unless really pushed to the limits, I think?

Most likely in practice you can’t afford to do a full rollout over very long timescales to find a point estimate of $S_{\text{press}}$ or $U_{\text{nopress}}$. Instead, you’d also partly do some sort of planning that involves estimating expected utilities, more akin to current model-free reinforcement learning. This leads to a fundamentally similar outcome, though.
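For instance (again just a sketch, with made-up names), one could train a value head that predicts the two counterfactual quantities directly from a world state, and use it in place of long rollouts:

```python
# Hypothetical sketch: a learned value head that estimates E[S_press] and
# E[U_nopress] from a state, roughly like a value function in model-free RL.

class CounterfactualValueHead:
    def __init__(self, network):
        self.network = network  # assumed learned function: state -> (s, u) pair

    def __call__(self, state):
        s_press_estimate, u_nopress_estimate = self.network(state)
        return s_press_estimate, u_nopress_estimate

# Training targets would come from (possibly short, partial) rollouts under
# the two counterfactual person models; at planning time the policy would
# then optimize s_press_estimate + u_nopress_estimate.
```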

Originality

I haven’t properly followed AI safety for a while. I don’t know if this idea is original. In fact it might be that I got the idea from someone else, and then forgot; though a lot of the description in the post here is something I’ve developed from scratch. I also don’t know if this has some subtle flaw I’m missing. But to me this seems like the productive approach for both this and other alignment issues, so I thought I would write it up.

My motivation for writing this post is Eliezer writing:

… corrigibility being “anti-natural” in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off …

I don’t necessarily disagree about corrigibility being somewhat anti-natural? But the counterfactual situation seems to provide a recipe for generating corrigibility. Maybe. I’m not even sure how this idea compares to the state of the art for stop button solutions.


1. 🤔 I wonder if this could be solved causally too; just introduce a counterfactual that all humans trust the AI; that way the AI has no reason to manipulate trust. Though “trust” is a lot more abstract than “tries to press button/tries not to press button”, so it may be much harder to define a counterfactual there.

2. I think maybe this could be addressed by adding another counterfactual term to $U'$. Let $X_{\text{pressed}}$ denote the value of $X$ if counterfactually the button gets pressed, and let $I$ be an impact measure. You might then want something along the lines of $S_{\text{press}} + U_{\text{nopress}} - I_{\text{pressed}}$, penalizing impact in the counterfactual where the button gets pressed. But I haven’t thought this through.