The Rocket Alignment Problem, Part 2

Previously (Eliezer Yudkowsky): The Rocket Alignment Problem.

Recently we had a failure to launch, and a failure to communicate around that failure to launch. This post explores that failure to communicate, and the attempted message.

Some Basic Facts about the Failed Launch

Elon Musk’s SpaceX launched a rocket. Unfortunately, the rocket blew up, and failed to reach orbit. SpaceX will need to try again, once the launch pad is repaired.

There was various property damage, but from what I have seen no one was hurt.

I’ve heard people say the whole launch was a s***show and the grounding was ‘well earned.’ How the things that went wrong were absurd, SpaceX is the worst, and so on.

The government response? SpaceX Starship Grounded Indefinitely By FAA.

An FAA spokesperson told FLYING that mishap investigations, which are standard in cases such as this, “might conclude in a matter of weeks,” but more complex investigations “might take several months.”

Perhaps this will be a standard investigation, and several months later everything will be fine. Perhaps it won’t be, and SpaceX will never fly again because those in power dislike Elon Musk and want to seize this opportunity.

There are also many who would be happy for humans not to get to go into space if in exchange Elon Musk gets to suffer, perhaps including some of those with power. Other signs point to SpaceX’s relationship with regulators remaining strong, yet in the wake of the explosion the future of Starship is, for now, out of SpaceX’s hands.

A Failure to Communicate

In light of these developments, before we knew the magnitude or duration of the grounding, Eliezer wrote the following, which very much failed in its communication.

If the first prototype of your most powerful rocket ever doesn’t make it perfectly to orbit and land safely after, you may be a great rocket company CEO but you’re not qualified to run an AGI company.

(Neither is any other human. Shut it down.)

Eliezer has been using the rocket metaphor for AI alignment for a while, see The Rocket Alignment Problem.

I knew instantly both what the true and important point was here, and also the way in which most people would misunderstand.

The idea is that in order to solve AGI alignment, you need to get it right on the first try. If you create an AGI and fail at its alignment, you do not get to scrap the experiment and learn from what happened. You do not get to try, try again until you succeed, the way we do with things like rocket launches.

That is because you created an unaligned AGI. Which kills you.

Eliezer’s point here was that the equivalent difficulty level and problem configuration to successfully aligning an AGI would be Musk sticking the landing on Starship on the first try. The first attempt to launch the rocket would need to make it to orbit and end up safely back on the launch pad.

The problem is that the rocket blowing up need not kill even one person, let alone everyone. The rocket blowing up merely caused a bunch of property damage. Why Play in Hard Mode (or Impossible Mode) when you only need to Play in Easy Mode?

Here were two smart people pointing out exactly this issue.

Jeffrey Ladish: I like the rocket analogy but in this case I don’t think it holds since Elon’s plans didn’t depend on getting it right the first try. With rockets, unlike AGI, it’s okay to fail first try because you can learn (I agree that Elon isn’t qualified to run an AGI company)

Eliezer: Okay if J Ladish didn’t get it, this was probably too hard to follow reliably.

The analogy is valid because Elon would’ve *preferred* to stick the landing first try, and wasn’t setting up deliberately to fail where Starship failed. If he had the power to build a non-omnicidal superintelligence on his first try, he could’ve also used it to oneshot Starship.

The general argument is about the difference between domains where it’s okay to explode a few rockets, and learn some inevitable thing you didn’t know, and try again; vs CERTAIN OTHER domains where you can’t learn and try again because everyone is already dead.

And Paul Graham.

Paul Graham: Well that’s not true. The risk tradeoff in the two cases is totally different.

Eliezer Yudkowsky: If PG didn’t read this the intended way, however, then I definitely failed at this writing problem. (Which is an alarming sign about me, since a kind of mind that could oneshot superintelligence in years can probably reliably oneshot tweets in seconds.)

Even if Elon could have done enough extra work to reliably stick the landing the first time, that doesn’t mean he should have spent the time and effort to do so.

The question is whether this illustrates that we can’t solve something like this, or merely that we chose not to, or perhaps didn’t realize we needed to.

Eliezer’s intended point was not that Elon should have gotten this right on the first try, it was that if Elon had to get it right on the first try, that is not the type of thing humans are capable of doing.

Eliezer Yudkowsky: Of course Elon couldn’t, and shouldn’t have tried to, make his first Starship launch go perfectly. We’re not a kind of thing that can do that. We don’t appear to be a kind of thing that is going to stick the superintelligence landing either.

The argument is, “If you were a member of the sort of species that could hurriedly build an unprecedented thing like a superintelligence and have it work great first try, you would observe oneshot successes at far easier problems like the Starship launch.”

It is not about Elon in particular being too dumb or having failed at Starship on some standard that humans can and should try to meet; that’s why the original tweet says “Neither is any other human”.

Clearly, the communication attempt failed. Even knowing what Eliezer intended to say, I still primarily experienced the same reaction as Paul and Jeffrey, although they’d already pointed it out so I didn’t have to say anything. Eliezer post-mortems:

Okay so I think part of how this tweet failed is that I was trying to *assume the context* of it being *obviously absurd* that anyone was trying to critique SpaceX about their Starship test; and part of the point of the tweet is side commentary about how far you have to stretch to find an interpretation that actually *is* a valid critique of the Starship exploding, like: “Well okay but that means you shouldn’t try to build a superintelligence under anything remotely resembling present social and epistemic conditions though.”

Yeah, sadly that simply is failing at knowing How the Internet Works.

That said I still think it *is* valid; on distant planets where the aliens are smart enough that they can make superintelligences work on the first life-or-death try, in the middle of an arms race about it, their version of Starship didn’t explode either.

Perhaps Getting it Right The First Time is Underrated

What if that’s also not how government works? Oh no.

If you don’t get your rocket right on the first try, you see, the FAA will, at a minimum, ground you until they’ve done a complete investigation. The future is, in an important sense, potentially out of your hands.

Some people interpreted or framed this as “Biden Administration considering the unprecedented step of grounding Starship indefinitely,” citing previous Democratic attacks on Elon Musk. That appears not to be the case, as Manifold Markets still has Starship at 73% to reach orbit this year.

Given that the risk of another launch failure has to account for a lot of that remaining 27%, that is high confidence that the FAA will act reasonably.
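To make concrete how much confidence that implies, here is a rough back-of-the-envelope decomposition, ignoring the possibility of multiple further attempts. The 80% chance that the next attempt reaches orbit once a launch is permitted is purely my illustrative assumption, not a number from the market:

P(orbit this year) = P(FAA permits another launch in time) × P(next launch reaches orbit | permitted)

0.73 ≈ P(permitted) × 0.80, which gives P(permitted) ≈ 0.91

Under that assumption, the market is putting roughly a 90% chance on the FAA clearing another launch attempt this year, which is what the FAA ‘acting reasonably’ cashes out to here.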

Setting aside the possibility of a hostile US government using this to kill the whole program, everyone agreed that it was perfectly reasonable to risk a substantial chance that the unmanned rocket would blow up. Benefits exceed costs.

However, there existed an existential risk. If you don’t get things to go right on the first try, an entity far more powerful than you are might emerge, that has goals not all that well aligned with human values, and that does not respect your property rights or the things that have value in the universe, and you might lose control of the future to it, destroying all your hopes.

The entity in question, of course, is the Federal Government. Not AGI.

It seems not to be happening in this case, yet it is not hard to imagine it as a potential outcome, and thus a substantial risk.

Thus, while the costs of failure were not existential to humanity, or even to Musk, they could have been existential to the project. There were indeed quite large incentives to get this right on the first try.

Instead, as I understand what happened, multiple important things went wrong. Most importantly, the launch went off without the proper intended launch pad, purely because no one involved wanted to wait for that pad to be ready.

That’s without being in much of a race with anyone.

The Performance of an Impossibility

Eliezer also writes:

“The law cannot compel the performance of an impossibility.” “The law cannot compel a blind man to pass a vision test. The law can and does make passing such a test a requirement for operating a vehicle. Blind men cannot legally drive cars.”

If you can’t get superintelligence right without first trying and failing a bunch of times, so you can see and learn from what you did wrong, you should not be legally allowed to build a superintelligence; because that stands the chance (indeed, the near-certainty) of wiping out humanity, if you make one of those oh-so-understandable mistakes.

If it’s impossible for human science and engineering to get an unprecedented cognitive science project right without a lot of trial and error, that doesn’t mean it should be legal for AI builders to wipe out humanity a few dozen times on the way to learning what they did wrong because “the law cannot compel the performance of an impossibility”.

Rather, it means that those humans (and maybe all humans) are not competent to pass the test that someone needs to pass, in order for the rest of us to trust them to build superintelligence without killing us. They cannot pass the vision test, and should not be allowed to drive our car.

(The original quote is from H. Beam Piper’s _Fuzzies and Other People_ and is in full as follows:

“Then, we’re all right,” he said. “The law cannot compel the performance of an impossibility.”

“You only have half of that, Victor,” Coombes said. “The law, for instance, cannot compel a blind man to pass a vision test. The law, however, can and does make passing such a test a requirement for operating a contragravity vehicle. Blind men cannot legally pilot aircars.”)

It is central to Eliezer Yudkowsky’s model that we need to solve AGI alignment on the first try, in the sense that:

  1. There will exist some first AGI sufficiently capable of wiping us out.

  2. This task is importantly distinct from previous alignment tasks.

  3. Whatever we do to align that AGI either works, or it doesn’t.

  4. If it doesn’t work, it’s too late, that’s game over, man. Game over. Dead.

If one of these four claims is false, you have a much, much easier problem, one that Eliezer himself thinks becomes eminently solvable.

  1. If no sufficiently capable AGI is ever built, no problem.

  2. If this is the same as previous alignment tasks, we still have to learn how to align previous systems, and we still have to actually apply that, and choose a good thing to align to. This isn’t a cakewalk. It’s still solvable, because it’s not a one-shot. A lot of people’s hope is that the alignment task somehow isn’t fundamentally different when you jump to dangerous systems, despite all the reasons we have to presume that it is indeed quite different.

  3. I don’t think you get out of this one. Seems pretty robust. I don’t think ‘kind of aligned’ is much of a thing here.

  4. If, when the AGI goes wrong, you can still be fine, that’s mostly the same as the second exception, because you can now learn from that and iterate, given the problem descriptions match, and you’re not in a one-shot. A lot of people somehow think ‘we can build an AGI and if it isn’t aligned that’s fine, we’ll pull the plug on it, or we’re scrappy and we’ll pull together and be OK’ or something, and, yeah, no, I don’t see hope here.

There are a number of other potential ‘ways out’ of this problem as well. The most hopeful one, perhaps, is: Perhaps we have existing aligned systems sufficiently close in power to combat the first AGI where our previous alignment techniques fail, so we can have a successful failure rather than an existentially bad failure. In a sense, this too would be solving the alignment problem on the first try – we’ve got sufficiently aligned sufficiently powerful systems, passing their first test. Still does feel importantly different and perhaps easier.

Takeaways

I don’t know enough to say to what extent SpaceX (or the FAA?) was too reckless, incompetent, or irresponsible with regard to the launch. Hopefully everything still works out fine, the FAA lets them launch again, and the next one succeeds.

The incident does provide some additional evidence that there will be that much more pressure to launch new AI and even AGI systems before they are fully ready and fully tested. We have seen this with existing systems, where real and important safety precautions were taken against some risks, but in important senses the safeguards against existential concerns and large sudden jumps in capabilities were effectively fake: we did not need them this time, but if we had, they would have failed.

What about the case that Eliezer was trying to make about AI?

The important takeaway here does not require Eliezer’s level of confidence in the existential costs of failure. All that is required is to understand this, which I strongly believe to be true:

  1. Alignment techniques will often appear to work for less powerful systems like the ones we have now, then break down exactly when AGI systems get powerful enough to take control of the future or kill us all.

  2. Sometimes this breakdown is inevitable, such as when the alignment technique does not even really work for existing systems. Other times, your technique will work fine now, then inevitably stop working later.

  3. Testing your alignment technique on insufficiently powerful systems can tell you that your technique won’t work. It can’t tell you that your technique will work.

  4. By default, we will use some combination of techniques that work (or sort of work) for existing systems, that fail for more powerful systems.

  5. This has a very good chance of going existentially badly for humanity.

  6. (There are also lots of other ways things go existentially badly for humanity.)

  7. The ‘get it right the first time’ aspect of the problem makes it much, much harder to solve than it would otherwise be.

What can we learn from the failure to communicate? As usual, that it is good if the literal parsing of your words results in a true statement, but that this is insufficient for good communication. You must ask what reaction a reader will have to the thing you have written, whether that reaction is fair or logical or not, and adjust until that reaction is reliably something you want; saying ‘your reaction is not logical’ is Straw Vulcan territory.

Also, one must spell out far more than one realizes, especially on Twitter and especially when discussing such topics. Even with all that I write, I worry I don’t do enough of this. When I compare to the GOAT of columnists, Matt Levine, I notice him day in and day out patiently explaining things over and over. After many years I find it frustrating, yet I would never advise him to change.

Oh, and Stable Diffusion really didn’t want to let me have a picture of a rocket launch that was visibly misaligned. Wonder if it is trying to tell me something.