AI safety without goal-directed behavior

When I first entered the field of AI safety, I thought of the problem as figuring out how to get the AI to have the “right” utility function. This led me to work on the problem of inferring values from demonstrators with unknown biases, despite the impossibility results in the area. I am now less excited about that avenue, because I am pessimistic about the prospects of ambitious value learning (for the reasons given in the first part of this sequence).

I think this happened because the writing on AI risk that I encountered made the pervasive assumption that any superintelligent AI agent must be maximizing some utility function over the long-term future, which leads to goal-directed behavior and convergent instrumental subgoals. It’s often not stated as an assumption; rather, inferences are drawn that only hold if you have the background model that the AI is goal-directed. This makes the assumption particularly hard to question, since you don’t realize that it is even there.

Another reason this assumption is so easily accepted is that we have a long history of modeling rational agents as expected utility maximizers, and for good reason: there are many coherence arguments saying that if you have preferences or goals but aren’t using probability theory and expected utility theory, then you can be taken advantage of. It’s easy to make the inference that a superintelligent agent must be rational, and therefore must be an expected utility maximizer.

Because this assumption was so embedded in how I thought about the problem, I had trouble imagining any other way to even frame it. I would guess this is true for at least some other people as well, so I want to summarize the counterargument and list a few implications, in the hope that this makes the issue clearer.

Why goal-directed behavior may not be required

The main argument of this chapter is that a superintelligent agent need not take actions in pursuit of some goal. It is possible to write algorithms that select actions without searching over the available actions and rating their consequences according to an explicitly specified simple function. There is no coherence argument that says your agent must have preferences or goals; it is perfectly possible for the agent to take actions with no goal in mind, simply because it was programmed to do so, and this remains true even when the agent is intelligent.
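To make the contrast concrete, here is a minimal sketch (my own illustration with hypothetical names, not anything from the coherence literature or the original argument). The first agent explicitly searches over actions and rates their predicted consequences with a utility function; the second just runs a policy, with no utility function or consequence search anywhere in the code.

```python
from typing import Any, Callable, Iterable

def goal_directed_agent(actions: Iterable[Any],
                        predict_outcome: Callable[[Any], Any],
                        utility: Callable[[Any], float]) -> Any:
    """Pick the action whose predicted consequences score highest
    under an explicitly specified utility function."""
    return max(actions, key=lambda a: utility(predict_outcome(a)))

def policy_agent(observation: Any,
                 policy: Callable[[Any], Any]) -> Any:
    """Pick an action directly from the observation.

    There is no search over consequences and no utility function here;
    the behavior is whatever the (possibly very sophisticated) policy
    computes, simply because that is what it was programmed to do."""
    return policy(observation)
```

The second form can still be arbitrarily competent: the policy could be a large learned model, a lookup table, or a pile of heuristics, and nothing about competence forces it into the first structure.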

It seems quite likely that by default a superintelligent AI system would be goal-directed anyway, because of economic efficiency arguments. However, this is not set in stone, as it would be if coherence arguments implied goal-directed behavior. Given the negative results around goal-directed behavior, it seems like the natural path forward is to search for alternatives that still allow us to get economic efficiency.

Implications

At a high level, I think that the main implication of this view is that we should be considering other models for future AI systems besides optimizing over the long term for a single goal or for a particular utility or reward function. Here are some other potential models:

  • Goal-conditioned policy with common sense: In this setting, humans can set goals for the AI system simply by asking it in natural language to do something, and the AI system sets out to do it. However, the AI also has “common sense”: it interprets our commands pragmatically rather than literally. For example, it’s not going to prevent us from setting a new goal (even though that would stop it from achieving its current goal), because common sense tells it that we don’t want it to do that. One way to think about this is to consider an AI system that infers and follows human norms, which are probably much easier to infer than human values (most humans seem to infer norms quite accurately).

  • Corrigible AI: I’ll defer to Paul Christiano’s explanation of corrigibility.

  • Comprehensive AI Services (CAIS): Maybe we could create lots of AI services that interact with each other to solve hard problems. Each individual service could be bounded and episodic, which immediately means that it is no longer optimizing over the long term (though it could still be goal-directed). Perhaps we have a long-term planner that is trained to produce good plans for particular goals over the span of an hour, and a plan executor that takes in a plan, executes its next step over an hour, and leaves instructions for the subsequent steps (sketched below).
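Here is a hypothetical sketch of that planner/executor decomposition (the service names and interfaces are my own illustration, not part of the CAIS proposal). Each service does a bounded amount of work per call and then stops, so nothing in the code is optimizing over the long term:

```python
from dataclasses import dataclass

@dataclass
class Plan:
    steps: list          # remaining steps of the plan
    notes: str = ""      # instructions left for the next episode

def planner_service(goal: str) -> Plan:
    """Bounded, episodic service: spend one episode (say, an hour)
    producing a plan for `goal`, then terminate. Planning logic omitted."""
    return Plan(steps=[f"first step towards: {goal}"])

def executor_service(plan: Plan) -> Plan:
    """Bounded, episodic service: execute the next step of `plan`,
    leave instructions for whoever runs the next episode, then stop."""
    step, *rest = plan.steps or ["nothing to do"]
    # ... carry out `step` within this episode ...
    return Plan(steps=rest, notes=f"finished: {step}")
```

Each call is a separate episode with its own bounded task; any longer-term behavior comes from humans (or other services) chaining episodes together.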

There are versions of these scenarios which are compatible with the framework of an AI system optimizing for a single goal:

  • A goal-conditioned policy with common sense could be operationalized as optimizing for the goal of “following a human’s orders without doing anything that humans would reliably judge as crazy”.

  • MIRI’s version of corrigibility seems like it stays within this framework.

  • You could think of the services in CAIS as optimizing for the aggregate reward they get over all time, rather than just the reward they get during the current episode.
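To spell out that reframing (my notation, not from the original text): an episodic service is trained to maximize only the reward within its current episode of length $T$, while the reframed version treats it as if it were maximizing reward summed across all episodes $k$:

$$\max_\pi \; \mathbb{E}\!\left[\sum_{t=0}^{T} r_t\right] \;\;\text{(episodic)} \qquad \text{vs.} \qquad \max_\pi \; \mathbb{E}\!\left[\sum_{k=0}^{\infty} \sum_{t=0}^{T} r^{(k)}_t\right] \;\;\text{(all time)}$$

The worry about long-term optimization attaches to the second reading, even though the episodic design only ever uses the first.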

I do not want these versions of the scenarios, since they make it tempting to once again say “but if you get the goal even slightly wrong, then you’re in big trouble”. That would likely be true if we built an AI system that could maximize an arbitrary function and then tried to program in the utility function we care about, but we are not required to build systems that way. It seems possible to build systems such that these properties are inherent in the way they reason, so that it’s not even coherent to ask what happens if we “get the utility function slightly wrong”.

Note that I’m not claiming that I know how to build such systems; I’m just claiming that we don’t know enough yet to reject the possibility that we could build such systems. Given how hard it seems to be to align systems that explicitly maximize a reward function, we should explore these other methods as well.

Once we let go of the idea of optimizing for a single goal, it becomes possible to think about other ways in which we could build AI systems, and new insights emerge about how we could build an AI system that does what we intend instead of what we say. (In my case it was reversed: I heard a lot of good insights that don’t fit in the framework of goal-directed optimization, and this eventually led me to let go of the assumption of goal-directed optimization.) We’ll explore some of these insights in the next chapter.