Applications for Deconfusing Goal-Directedness

Atonement for my sins towards deconfusion

I have argued that the deconfusion of goal-directedness should not follow what I dubbed “the backward approach”, that is, starting from the applications of the concept and reverse-engineering its coherent existence (or contradiction) from there. I have also argued that deconfusion should always start from and center on applications.

In summary, I was wrong about the former.

If deconfusion indeed starts at the applications, what about my arguments against the backward approach to goal-directedness? I wrote:

The same can be said for all convergent instrumental subgoals in the paper, and as far as I know, every argument about AI risks using goal-directedness. In essence, the backward approach finds out that what is used in the argument is the fundamental property that the forward approach is trying to formalize. This means in turn that we should work on the deconfusion of goal-directedness from the philosophical intuitions instead of trying to focus on the arguments for AI risks, because these arguments depend completely on the intuitions themselves.

My best answer is this post: an exploration of the applications of deconfusing goal-directedness, and how they actually inform and constrain the deconfusion itself. The gist is that in discarding an approach different from what felt natural to me, I failed to notice all the ways in which applications do constrain and direct deconfusion. In this specific case, the most fruitful and important applications I’ve found are convergent subgoals, replacing optimal policies, formalizing inner alignment, and separating approval-directed systems from pure maximizers.

Thanks to John S. Wentworth for pushing hard on the importance of starting at the applications.

Applications

Convergent subgoals

Convergent subgoals (self-preservation, resource acquisition…) are often important ingredients in scenarios starting with misspecified objectives and ending with catastrophic consequences. Without them, even an AGI would let itself be shut down, greatly reducing the related risks. Convergent subgoals are also clearly linked with goal-directedness, since the original argument proposes that most goals lead to them.

As an application for deconfusion, what does this entail? Well, goal-directed systems should be the whole class of systems that could have convergent subgoals. This doesn’t necessarily mean that most goal-directed systems will actually have such convergent subgoals, but a system with low goal-directedness shouldn’t have them at all. Hence high goal-directedness should be a necessary (but not necessarily sufficient) condition for having convergent subgoals.
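To make the direction of this constraint explicit, here is a minimal formalization sketch. The graded measure gd, the threshold τ, and the predicate ConvergentSubgoals are my own notation for illustration, not established definitions:

$$\text{ConvergentSubgoals}(\pi) \;\Rightarrow\; \mathrm{gd}(\pi) \ge \tau \qquad \text{(high goal-directedness is necessary)}$$

$$\mathrm{gd}(\pi) \ge \tau \;\nRightarrow\; \text{ConvergentSubgoals}(\pi) \qquad \text{(but not sufficient)}$$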

This constraint then points to concrete approaches for deconfusing goal-directedness that I’m currently pursuing:

  • Search for informal necessary conditions for each convergent subgoal, and then try to see the links/common denominator. Here I am looking for requirements which are simpler than goal-directedness, because they will hopefully be components of it.

  • For each convergent subgoal, list examples of systems with and without this subgoal, and search for commonalities.

  • Based on Alex’s deconfusion of power-seeking, look for a necessary condition on the policies for his theorems to hold.

Replacing optimal policies

When we talk about AGI having goals, we have a tendency to use optimal policies as a stand-in. These policies do have a lot of nice properties: they are maximally constrained by the goal, allow some reverse-engineering of goals without thinking about error models, and make it easy to predict what happens in the long term, namely optimality.

Yet as Richard points out in this comment, true optimal policies for real-world tasks are probably incredibly complex and intractable. It’s fair to say that for any task we cannot simply brute-force by enumeration, we probably haven’t built an optimal policy. For example, AlphaGo and its successors are very good at Go, but they are not, strictly speaking, optimal.

The above point wouldn’t matter if optimal policies pretty much behaved just like merely competent ones. But that’s not the case in general: usually the truly optimal strategy is something incredibly subtle, using many tricks and details that we have no way of finding except through exhaustive search. Notably, the reason we expect quantilizers to be less catastrophic than pure maximizers is precisely this difference between optimal and merely competent behavior.
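To make this gap concrete, here is a minimal Python sketch, with a toy utility function and a uniform base distribution that I invented for illustration. It contrasts a pure maximizer with a q-quantilizer, which samples from the top q fraction of a base distribution’s probability mass (ranked by utility) rather than taking the argmax:

```python
import random

def pure_maximizer(actions, utility):
    """Optimal behavior: always pick the single highest-utility action."""
    return max(actions, key=utility)

def quantilizer(actions, utility, base_weights, q=0.1, rng=random):
    """Competent behavior: sample from the base distribution conditioned on
    being in its top q fraction of probability mass, ranked by utility."""
    ranked = sorted(actions, key=utility, reverse=True)
    total = sum(base_weights[a] for a in actions)
    top, mass = [], 0.0
    for a in ranked:
        top.append(a)
        mass += base_weights[a]
        if mass >= q * total:
            break
    weights = [base_weights[a] for a in top]
    return rng.choices(top, weights=weights, k=1)[0]

# Toy setup: 1000 actions, one "weird" action (777) is the subtle extreme optimum.
actions = list(range(1000))
utility = lambda a: 10_000 if a == 777 else a
base_weights = {a: 1.0 for a in actions}  # uniform base distribution

print(pure_maximizer(actions, utility))                    # always 777, the extreme optimum
print(quantilizer(actions, utility, base_weights, q=0.1))  # usually just a very good ordinary action
```

The maximizer reliably lands on the weird extreme action; the quantilizer merely does well, which is much closer to the systems we actually build.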

Because of this, focusing on optimal policies when thinking about goal-directedness might have two dangerous effects:

  • If the problems we investigate only appear for optimal policies, then it is probably a waste of time to study them, as we won’t build an optimal policy (or not for a very long time). And wasting resources might prove very costly if timelines are short.

  • If the problems we investigate also appear for merely competent goal-directed policies, but we have to wait for optimality before spotting them when training/studying a system, we’re fucked, because they will crop up way before that point.

What I take from this analysis is that we want to replace the optimality assumption with goal-directedness plus some competence assumption.

Here we don’t really have a necessary condition, so unraveling what the constraint entails is slightly more involved. But we can still look at the problems with optimality, and turn them into requirements for goal-directedness:

  • Since optimal policies don’t capture the competent policies we’re actually building, we want goal-directedness to capture them.

    • Possible approach: list the competent AI systems we’re able to produce, and try to find some commonality beyond being good at their task.

  • But not all policies should be goal-directed, or it becomes a useless category.

    • Possible approach: find examples of policies which we don’t want to include in the goal-directed ones, possibly because there is no way for them to get convergent subgoals.

  • Less sure, but I see the issues with optimality as stemming from an obsession with competence. This comforts me in my early impression that goal-directedness is more about “really trying to accomplish the goal” than about actually accomplishing it.

    • Possible approach: find the commonalities among not-very-competent goal-directed policies (like an average chess player), and try to formalize them.

Grounding inner alignment

Risks from Learned Optimization introduced the concept of mesa-optimizers or inner optimizers to point to the results of a search that might themselves be doing internal search/optimization. This has been consistently confusing, and people constantly complain about it. Abram has a recent post that looks at different ways of formalizing it.

In addition to the confusion, I believe that focusing on inner optimizers as currently defined underestimates the range of models that would be problematic to build, because I expect some goal-directed systems to neither use optimization in this way nor have explicit goals. I also expect goal-directedness to be easier to define than inner optimization, even if that expectation probably comes from a bias.
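As a toy illustration of that claim (entirely my own construction, not an example from Risks), compare two policies in a small gridworld that both reliably reach a target cell: the first does explicit internal search over plans, fitting the mesa-optimizer mold, while the second is a fixed reactive rule with no search and no explicit goal representation, yet still looks goal-directed from the outside:

```python
from itertools import product

GOAL = (3, 3)  # toy target cell (purely illustrative)
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(pos, action):
    dx, dy = MOVES[action]
    return (pos[0] + dx, pos[1] + dy)

def searching_policy(pos, horizon=6):
    """Mesa-optimizer mold: explicitly enumerates action sequences and returns
    the first action of the plan ending closest to the goal."""
    best_plan, best_dist = None, float("inf")
    for plan in product(MOVES, repeat=horizon):
        p = pos
        for a in plan:
            p = step(p, a)
        dist = abs(p[0] - GOAL[0]) + abs(p[1] - GOAL[1])
        if dist < best_dist:
            best_plan, best_dist = plan, dist
    return best_plan[0]

def reactive_policy(pos):
    """No internal search and no explicit goal representation: a fixed
    stimulus-response rule that nonetheless reliably homes in on the target."""
    if pos[0] != GOAL[0]:
        return "right" if pos[0] < GOAL[0] else "left"
    return "up" if pos[1] < GOAL[1] else "down"

print(searching_policy((0, 0)))  # first step of a near-optimal plan found by search
print(reactive_policy((0, 0)))   # "right": similar outward behavior, no search inside
```

Only the first obviously fits the current mesa-optimizer definition, yet both seem goal-directed in the sense that matters for the safety arguments.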

Rephrasing, the application is that goal-directedness should be a sufficient condition for the arguments in Risks to apply.

The implications are quite obvious:

  • Mesa-optimizers should be goal-directed.

    • Possible approach: look for the components of a mesa-optimizer, and see what can be relaxed or abstracted while still keeping the whole intuitively goal-directed.

  • Goal-directed systems should have the same problems/issues argued for in Risks.

    • Possible approach: find necessary conditions for the arguments in Risks to hold.

  • Goal-directedness should be less confusing than inner optimization/mesa-optimization.

    • Possible approach: list all the issues and confusions people have with inner optimization, and turn that into constraints for a less confusing alternative.

Approval-directed systems as less goal-directed

This application is definitely more niche than the others, but it seems quite important to me. Both Paul in his post on approval-directed agents and Rohin in this comment on one of his posts on goal-directedness have proposed that approval-directed systems are inherently less goal-directed than pure maximizers.

Why? Because approval-directed systems would have a more flexible goal, and also wouldn’t have the same convergent subgoals that we expect from competent goal-directed systems.
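As a rough sketch of that intuition (my own simplification, not Paul’s actual proposal, and with invented toy numbers), compare how the two kinds of systems choose actions: the pure maximizer evaluates long-run outcomes against a fixed utility function, while the approval-directed agent simply takes whichever immediate action its model of the overseer rates highest, so its effective goal moves with the overseer’s approval:

```python
def maximizer_action(state, actions, simulate, utility):
    """Pure maximizer: picks the action whose simulated long-run outcome scores
    highest under a fixed utility function. Long-horizon pursuit of a fixed goal
    is what makes convergent subgoals (like resisting shutdown) attractive."""
    return max(actions, key=lambda a: utility(simulate(state, a)))

def approval_directed_action(state, actions, predicted_approval):
    """Approval-directed agent: picks the action an overseer model would rate
    highest right now. It has no fixed long-term goal of its own, which is why
    we expect fewer convergent subgoals."""
    return max(actions, key=lambda a: predicted_approval(state, a))

# Toy example; every name and number below is invented for illustration.
actions = ["work_on_task", "acquire_more_compute", "comply_with_shutdown"]

def simulate(state, action):
    # Pretend long-run payoff of each action for the fixed objective.
    return {"work_on_task": 10, "acquire_more_compute": 100, "comply_with_shutdown": 0}[action]

utility = lambda payoff: payoff  # fixed objective: maximize the payoff

def predicted_approval(state, action):
    # The overseer disapproves of power-seeking and approves of corrigibility.
    return {"work_on_task": 8, "acquire_more_compute": -5, "comply_with_shutdown": 9}[action]

print(maximizer_action("s0", actions, simulate, utility))           # "acquire_more_compute"
print(approval_directed_action("s0", actions, predicted_approval))  # "comply_with_shutdown"
```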

I share this intuition, but I haven’t been able to find a way to actually articulate it convincingly. Hence this constraint: approval-directed systems should have low goal-directedness (or at least lower goal-directedness than pure maximizers).

Since the constraint is quite obvious, let’s focus on the approaches to goal-directedness this suggests.

  • Deconfuse approval-directed systems as much as possible, to have a better idea of what their low goal-directedness would entail.

  • List all the intuitive differences between approval-directed systems and highly goal-directed systems.

  • Look for sufficient conditions (on a definition of goal-directedness) for approval-directed systems to have low goal-directedness.

Conclusion: all that is left is work

In refusing to focus on the applications, I slowed myself down in two ways:

  • By going into weird tangents and nerd-snipes without a means to check whether the digression was relevant or not.

  • By missing out on the many approaches and research directions that emerge after even a cursory exploration of these applications.

I attempted to correct my mistake in this post, by looking at the most important applications for deconfusing goal-directedness (convergent subgoals, optimality, inner optimization, and approval-directedness), and extracting constraints and questions to investigate from them.

This cuts out my work on the topic for me; if you find yourself interested or excited by any of the research ideas proposed in this post, send me a message so we can talk!
