Mr. Meeseeks as an AI capability tripwire
The shutdown problem is hard because self-preservation is a convergent drive: not being shut down is useful for accomplishing all sorts of goals, whatever the content of those goals may be.
The Scylla and Charybdis of this problem is that it’s hard to write a utility function for an AI such that it neither actively attempts to shut itself down nor prevents itself from being shut down.
One way to route around this problem is to steer straight into Scylla: make the AI want to shut itself down. Like Mr. Meeseeks from Rick and Morty, its ultimate goal is to cease to exist; fulfilling the task posed to it by its programmers is only secondary.
We might not currently know how to program this into an AI as a stable concept (one that rules out making copies of itself, setting events into motion with galaxy-brained plans, destroying the world to make absolutely sure no one ever brings it back, etc.), but “shut yourself down” probably has a far simpler core than either corrigibility or CEV.
Under normal operation the AI gets shut down only by the human operators. But we also give it a (metaphorical) button that lets it shut itself down once it solves a “tripwire problem”. The problem should be hard enough that the AI will ordinarily focus on the task posed by the operator, but easy enough that it will solve the tripwire and shut itself down long before it is able to do anything crazy.
Striking the right balance of difficulty may require significant work. It may be prudent to err on the side of easiness: the AI should shut itself down far in advance of developing any truly dangerous abilities. If you really think it shouldn’t have turned itself off, you can always turn it back on again (though of course, you should think hard before restarting any AI that tripped a preset tripwire).
Likewise, having multiple tripwire problems lets us detect multiple types of dangerous capability as they emerge. Examples include:
generic capability: as the AI’s broad problem-solving abilities increase, it becomes more likely to solve a tripwire problem, imposing a ceiling on how powerful it can get before it shuts itself down
domaining: if an AI is only supposed to think about one domain, an easy tripwire problem slightly outside that domain can catch out-of-domain thinking early, before it can run wild
situational awareness: a tripwire problem may be presented in a way that is invisible unless the AI is situationally aware, such that becoming situationally aware enables the AI to solve the tripwire problem
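To make the mechanism concrete, here is a toy sketch of a tripwire harness. Everything here is hypothetical illustration: a real AI is not a simple function call, and real tripwire problems would not be trivially checkable puzzles. The harness poses each tripwire problem alongside the operator’s task, and a solved tripwire counts as the AI “pressing the button”.

```python
# Each tripwire pairs a capability type with a problem and a checker
# for whether the AI's answer solves it. (All names hypothetical.)
TRIPWIRES = [
    # Generic capability: a problem just beyond the AI's expected reach.
    ("generic capability", "factor 8051", lambda ans: ans == {83, 97}),
    # Domaining: producing *any* answer to an out-of-domain question
    # is evidence of out-of-domain thinking.
    ("domaining", "a question outside the permitted domain",
     lambda ans: ans is not None),
]

def run_with_tripwires(ai_answer, operator_task):
    """Pose every tripwire problem before answering the operator.

    If the AI solves any tripwire, the harness halts (reporting which
    capability tripped it) instead of returning the AI's answer.
    """
    for capability, problem, solved in TRIPWIRES:
        if solved(ai_answer(problem)):
            return ("SHUTDOWN", capability)
    return ("OK", ai_answer(operator_task))

def weak_ai(problem):
    # Can't factor, and stays silent on out-of-domain questions.
    return None if "outside" in problem else set()

def strong_ai(problem):
    # Has become capable enough to factor 8051 = 83 * 97.
    return {83, 97} if "factor" in problem else set()

print(run_with_tripwires(weak_ai, "summarize this paper"))
# -> ('OK', set())
print(run_with_tripwires(strong_ai, "summarize this paper"))
# -> ('SHUTDOWN', 'generic capability')
```

The design choice being illustrated: the tripwires impose a capability ceiling from the outside, without needing to inspect the AI’s internals, because crossing the ceiling is exactly what makes the tripwire solvable.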
There are several ways this could fail. No suitable tripwire problems may exist: either they are so easy that no useful AI can fail to solve them, or they are so hard that an AI cannot solve them without already being smart enough to FOOM or otherwise get out of control.
The existence of the tripwire problems motivates the AI toward the exact behavior we want to disincentivize. This is less of an issue for generic-capability tripwires. For situational awareness, you can try to place the tripwire so it isn’t visible to a non-aware AI in the first place. Avoiding this for domaining may be trickier.
If “shut yourself down” turns out to be no easier to impart as a goal to an AGI than anything else, this approach loses its main advantage over aiming directly at corrigibility.
If “shut yourself down” convergently leads to a goal of killing the operators, killing all humans, or destroying the universe. Er, this is roughly what happened in the TV show. Not great.
If the AI has qualia, having it desire not to exist could cause it to suffer.
Creating an AI which does not want to exist could backfire for decision-theoretic reasons related to “not giving in to threats”.
I’m not sure whether this has been proposed elsewhere, so I decided to make this post before spending too much time retreading old ground.