Why Instrumental Goals are not a big AI Safety Problem

For concreteness, I’ll focus on the “off button” problem: the claim that an AI will not let you turn it off. Why not? The AI will have some goal, and ~whatever that goal is, the AI will be better able to achieve it if it is “on” and able to act. Therefore, ~all AIs will resist being turned off.

Why is this wrong? First, I’ll offer an empirical argument. No actually existing AIs exhibit this problem, even the most agentic and capable ones. Go-playing AIs are superhuman in pursuing their “goal” of winning a Go game, but they do not offer any resistance to being turned off. Language models have a seemingly impressive world model, and are fairly capable at their “goal” of predicting the next token, but they do not resist being turned off. (Indeed, it’s not even clear that language models are “on” in any sense, since different executions of the model share no state, and any individual execution is short-lived.) Automated trading systems make stateful, economically significant decisions, but they do not resist being turned off (in fact, being easy to turn off is an important design consideration).

So that’s the empirical case. How about a thought experiment? Suppose you scaled up a game-playing AI with 1000x more data/compute/model-size. It would get much better at achieving its “goal” of playing the game well. Would it start to resist being turned off? Obviously it would not. It’s not even going to be able to represent the idea of being “turned off”, since it’s just thinking very hard about e.g. Go.

So what about the theory? How come theory says ~all AIs should resist being turned off, and yet ~no actually existing AIs (or even powered-up versions of existing AIs) do resist being turned off? The reason is that “agents with a goal” is not a good way of thinking about AIs.

It’s a tempting way to think about AIs, because it is a decent (although imperfect) way of thinking about humans. And “reward functions” do look an awful lot like goals. But there are some key differences:

1) Training defines the “world” that the AI is thinking about. This was the problem with the super-Go-playing AI: we just told it to think hard about the world of Go. So even though it is superhuman, even godlike, *in that domain*, it hasn’t thought about the real world. It can’t even interact with the real world, since it takes input and produces output only in the very specific format it was trained on.

2) AI is composable. Unlike a human, you can run arbitrary code (or another AI) at *any point* inside an AI. You can add AI-driven checks that proposed actions are safe. You could have many different action-proposer AIs, trained on different datasets or reward functions, totally independently and without considering how they will interact with the other AI-pieces. There is no reason a heterogeneous agent like this will be well-modeled as having a singular goal, even if some of the pieces that compose it might be (see the sketch below).
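
To make the composition concrete, here is a minimal sketch in Python. All of the names (`Action`, `Proposer`, `run_agent`, `no_touching_the_off_switch`) are made up for illustration, not any real library’s API; the point is only the shape: independently built proposers suggest actions, and separate hard-coded checkers veto anything unsafe before it runs.

```python
# Sketch of a "heterogeneous" agent assembled from independent pieces.
# Every name here is a hypothetical placeholder, not a real library API.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    description: str
    score: float  # the proposer's own estimate of how good this action is


# Each proposer could be a separately trained model with its own dataset
# and reward function; here they are just stand-in callables.
Proposer = Callable[[str], Action]
Checker = Callable[[Action], bool]


def run_agent(observation: str,
              proposers: List[Proposer],
              checkers: List[Checker]) -> Optional[Action]:
    """Collect candidate actions, discard any that fail a check,
    and return the best surviving candidate (or nothing)."""
    candidates = [propose(observation) for propose in proposers]
    safe = [a for a in candidates if all(check(a) for check in checkers)]
    return max(safe, key=lambda a: a.score) if safe else None


# Example checker: a dumb hard-coded rule, written with no knowledge of
# (or coordination with) the proposers.
def no_touching_the_off_switch(action: Action) -> bool:
    return "off switch" not in action.description.lower()
```

Nothing in this loop requires the proposers to share a goal with the checkers, or even to know that the checkers exist.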

OK, but what if we did build an agentic AI that interacted with the real world and was a single system? I agree, then you might have a problem. But you probably wouldn’t want to build one. Why not?

1) Training an AI on the entire world is extremely wasteful compared to training it on some narrowly defined problem. If you want to “run a company” well, thinking about how your desk chair is made up of atoms that obey the laws of physics is likely to be a distraction compared to studying case studies of how similar companies acted in the past. Even thinking about how to circumvent the “off” button is a distraction compared to just focusing on your job. Humans can’t narrow their focus like this; AIs can, and that is both a competitive advantage and a contribution to safety. (You could even monitor your AIs to make sure they’re staying on task; consider the contract-drafting em.)

2) Composability is a very powerful way to improve safety, and it is ubiquitous in existing computer systems. A software “off switch” is the most obvious example (e.g. “turn yourself off after you get this message”). Heartbeating (“turn yourself off if you haven’t talked to ‘home base’ in a few seconds”) is also very common. Finance has “risk checks” (before taking any action, sanity-check it against some simple hard-coded rules). Composability also allows you to define pieces that address narrowly defined problems, which is helpful as argued in (1). A sketch of these patterns appears below.
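
For illustration, here is a minimal sketch of how these three patterns (software off switch, heartbeat, risk checks) could wrap an arbitrary action-proposing component. The names (`run_with_safeguards`, `execute`, the five-second timeout) are hypothetical, not taken from any existing system.

```python
# Sketch: composing an off switch, a heartbeat, and a risk check around a
# "policy" component that knows nothing about any of them.
import time
from typing import Callable

HEARTBEAT_TIMEOUT_S = 5.0  # give up if home base has been silent this long


def run_with_safeguards(policy: Callable[[], str],
                        risk_check: Callable[[str], bool],
                        shutdown_requested: Callable[[], bool],
                        last_heartbeat_time: Callable[[], float]) -> None:
    while True:
        # Software off switch: stop as soon as we're told to.
        if shutdown_requested():
            return
        # Heartbeat: stop if we've lost contact with home base.
        if time.time() - last_heartbeat_time() > HEARTBEAT_TIMEOUT_S:
            return
        action = policy()
        # Risk check: only execute actions that pass simple hard-coded rules.
        if risk_check(action):
            execute(action)


def execute(action: str) -> None:
    # Stand-in for whatever side effect the action actually has.
    print(f"executing: {action}")
```

The safeguards are ordinary code wrapped around the policy; the policy never sees them, and its output has no effect until they have run.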

So even agentic AIs probably won’t have “goals”. Some of their subcomponents may have goals, but those will probably be too narrow to translate into instrumental goals for the agent as a whole. That’s good, because instrumental goals would be dangerous, so when thinking about and designing AIs we should try to emphasize these important ways in which AIs are less goal-oriented than humans.