Labs lack the tools to course-correct

This post is part of the sequence Against Muddling Through.

In an earlier post, I quoted Will MacAskill:

Most directions I can fire a gun don’t hit the target; that doesn’t tell you much about how likely I am to hit the target if I’m aiming at it.

What does tell you how likely someone is to hit a target? I’d say you need to know two things: how hard the target is for them to hit, and how much aiming power they can muster.

In earlier posts, I explained why I thought alignment is an unusually hard target and somewhat-tongue-in-cheek compared it to aiming a musket at the Moon. I’m going to continue in the same vein, taking the metaphor a bit more seriously, as it feels like a good metaphor for communicating my current intuitions.

Training on human data may help, but it doesn’t help enough. It’s like aiming at the Moon with a rocket launcher instead of a musket.

More from Will:

The big question, for me, is why we should be so confident that any work we do now (including with AI assistance, including if we’ve bought extra time via control measures and/or deals with misaligned AIs) is insufficient to solve alignment, such that the only thing that makes a meaningful difference to x-risk, even in expectation, is a global moratorium. I’m still not seeing the case for that.

I’ve encountered various flavors of this idea, and it seems worth addressing. There are indeed a lot of exciting things we can learn by studying current models.

In Moon-launch terms: Slowly scaling up the size of the rocket, and adding little thrusters and a radio transmitter to the rocket, also get you closer to the vague shape of a thing that might one day land on the Moon. The ability to observe and steer a rocket is useful. So is the ability to launch multiple test rockets in atmospheric conditions.

But if you point a rocket straight at the Moon and launch it, the rocket will not reach the Moon. The actual trajectory is unclear, but it is nevertheless practically guaranteed to go somewhere other than the Moon. Steering thrusters will not fix this problem; once launched, the rocket does not hold enough fuel to allow you to course-correct. Studying the behavior of the rocket in flight will not fix this problem; atmospheric maneuvering is not the same as maneuvering in space. Inside the metaphor, if you don’t understand the physics of hard vacuum and orbital mechanics, no amount of tinkering in Earth’s atmosphere will suffice.[1]

Outside the metaphor, labs are deliberately attempting to start recursive self-improvement loops using AIs which are not yet aligned.

This seems like the start of a trajectory that is practically guaranteed to land somewhere other than alignment. It seems possible, perhaps even likely, that the plan fails before this point; it seems nearly certain that the plan fails after it.

Some propose that they will be able to observe and steer this process, and thereby course-correct. I highly doubt this claim. Researchers can monitor AIs, but they do not appear to have sufficient understanding of the underlying processes to interpret the results of such monitoring, nor tools powerful and precise enough to correct the inevitable compounding errors of direction.

Some propose that they will be able to study the misaligned AIs. Perhaps much can indeed be learned this way! But this is a terrible reason to rush. If you need to do a bunch of complicated technical research to determine the proper trajectory of a steering effort, you should do the vast majority of that research before the trajectory is embarked upon. Otherwise you fail.[2]

An institution that is competent to attempt a challenge like this doesn’t plan to do the important work mid-flight. It maps out the specific trajectory to the target in advance of launch. Only then does it add multiple layers of redundant instruments, procedures, and failsafes. The vast majority of the work involved in a successful high-stakes endeavor like this takes place before the point of no return. Even the contingencies are designed in advance of launch based on a detailed understanding of the fundamental forces and expected variance involved, plus a very generous appreciation for Murphy’s law.

Current AI research does not meet this standard. Today, no one has a very precise concept of where the target even is, and none of the proposed techniques in machine learning seem able to muster anything close to the necessary aiming power even if they did.

The metaphorical equivalent of orbital mechanics has yet to be invented. Goodness is a difficult target, and it will take a miraculous feat of science, engineering, and philosophy just to locate the metaphorical Moon.

I’ll conclude this sequence by summarizing my current best guess about alignment using weak AGI.

Afterword: Confidence and possible cruxes

[Prediction poll, 10%–90% probability scale] All or most alignment work needs to be done before AIs automate all or most AI R&D.

Although this post is mostly an extended metaphor and intuition pump cribbing shamelessly from prior work, there are a couple of key claims to highlight. “The metaphorical equivalent of orbital mechanics has yet to be invented” is my main crux here, and it’s one of the main reasons I think AI labs are not currently capable of even a project that is merely Moon-landing-hard.

Also, implicit in the metaphor is the notion that there’s a point of no return. In real space launches, failures are expensive, but not existential. I think most (though not all) prosaic alignment proponents would agree there’s a point of no return somewhere on the path to ASI, beyond which either we got it right or we are dead. To me, the point where AIs are doing basically all of the research themselves feels like a strong candidate for such a point, akin to a rocket launch. At that point, I expect the trajectory of the future is mostly determined by the work that was done before, not the work that comes after. I’m curious which, if any, points-of-no-return others may have in mind.

I expect many to disagree with the headline claim. One possible counterargument is that current research like auditing agents indicates some chance of course-correction during recursive self-improvement. If this does prove a sticking point for many, it likely merits a deep dive in a subsequent post.

  1. ^

    To make the implicit explicit: “Space” here represents a regime in which the AI is smart enough that it could probably kill us, and “atmosphere” is everything else. Both in the metaphor and (I claim) in real life, the fact that the first space-rocket may experience a smooth transition between atmosphere and space does not negate the fact that the two regimes are importantly different.

  2. ^