Against the Backward Approach to Goal-Directedness

Introduction: Forward and Backward Approaches

I first started thinking about deconfusing goal-directedness after reading Rohin’s series of four posts on the subject. My goal was to make sense of his arguments related to goal-directedness, and to understand whether alternatives were possible and/or viable. I thus thought of this research as quite naturally following two complementary approaches:

  • A forward approach, starting from the intuitions about goal-directedness and trying to find a satisfactory formalization from a philosophical standpoint.

  • A backward approach, starting from the arguments on AI risk using goal-directedness (like Rohin’s), and trying to find what about goal-directedness made these arguments work.

In the end, both approaches would meet in the middle and inform each other, hopefully settling whether the cluster of concepts around goal-directedness was actually relevant for the arguments that rely on it.

The thing is, I became less and less excited about the backward approach over time, to the point that I don’t work on it anymore. I sincerely feel like most of the value will come from nailing the forward approach (with additional constraints mentioned below).

Yet I never wrote anything about my reasons for this shift, if only because I never made this approach explicit, except with my collaborators Michele Campolo and Joe Collman. Since Daniel Kokotajlo pushed for what is essentially the backward approach in a comment on our Literature Review on Goal-Directedness, I believe this is the perfect time to do so.

Trying the Backward Approach

How do we start to investigate goal-directedness through the backward approach? Through the arguments about AI risks that rely on goal-directedness. Let’s look at the arguments for convergent instrumental subgoals from Omohundro’s The Basic AI Drives, which require goal-directedness, as mentioned in Rohin’s Coherence arguments do not imply goal-directed behavior. This becomes clear through the definition of AI used by Omohundro:

To say that a system of any design is an “artificial intelligence”, we mean that it has goals which it tries to accomplish by acting in the world.

The first convergent instrumental subgoal is self-improvement:

One kind of action a system can take is to alter either its own software or its own physical structure. Some of these changes would be very damaging to the system and cause it to no longer meet its goals. But some changes would enable it to reach its goals more effectively over its entire future. Because they last forever, these kinds of self-changes can provide huge benefits to a system. Systems will therefore be highly motivated to discover them and to make them happen. If they do not have good models of themselves, they will be strongly motivated to create them though learning and study. Thus almost all AIs will have drives towards both greater self-knowledge and self-improvement.

What feature of goal-directedness is used in this argument? That a goal-directed system will do things that clearly improve its abilities to reach its goal. Well… disentangling that from the intuitions around goal-directedness might prove difficult.

Let’s look at the second convergent instrumental subgoal, rationality:

So we’ll assume that these systems will try to self-improve. What kinds of changes will they make to themselves? Because they are goal directed, they will try to change themselves to better meet their goals in the future. But some of their future actions are likely to be further attempts at self-improvement. One important way for a system to better meet its goals is to ensure that future self-improvements will actually be in the service of its present goals. From its current perspective, it would be a disaster if a future version of itself made self-modifications that worked against its current goals. So how can it ensure that future self-modifications will accomplish its current objectives? For one thing, it has to make those objectives clear to itself. If its objectives are only implicit in the structure of a complex circuit or program, then future modifications are unlikely to preserve them. Systems will therefore be motivated to reflect on their goals and to make them explicit.

In an ideal world, a system might be able to directly encode a goal like “play excellent chess” and then take actions to achieve it. But real world actions usually involve tradeoffs between conflicting goals. For example, we might also want a chess playing robot to play checkers. It must then decide how much time to devote to studying checkers versus studying chess. One way of choosing between conflicting goals is to assign them real-valued weights. Economists call these kinds of real-valued weightings “utility functions”. Utility measures what is important to the system. Actions which lead to a higher utility are preferred over those that lead to a lower utility.

Once again, the property on which these arguments rely is that goal-directed systems try to get better at accomplishing their goals.

The same can be said for all convergent instrumental subgoals in the paper and, as far as I know, for every argument about AI risks that uses goal-directedness. In essence, the backward approach finds that what the arguments use is the very property that the forward approach is trying to formalize. This means in turn that we should work on deconfusing goal-directedness starting from the philosophical intuitions instead of focusing on the arguments for AI risks, because these arguments depend completely on the intuitions themselves.

Extended Forward Approach

Of course, arguments on AI risks have a role to play. What we want, in the end, is to find out whether they hold or not. So the properties of goal-directedness should help clarify these arguments, or at least relate to them.

The model I’m working with is thus (where the arrows capture the successive steps): Definition of goal-directedness → Test against philosophical intuitions → Test against AI risk arguments. I don’t think it’s especially new either; my impression is that Rohin has a similar model, although he might put more importance on the last step than I do at this point in the research.

I am also not pretending that the only way to make the AI risk arguments mentioned above work is through formalization of the cluster of intuitions around goal-directedness. There might be another really important concept that ties these arguments together, without any link to goal-directedness. It’s just that at the moment, there seems to be a confluence of arguments around this concept and the intuitions linked with it. I’m taking the bet that following this lead is the best way we have right now to poke at these arguments and make them cleaner or break them.