Want to win the AGI race? Solve alignment.


Society really cares about safety. Practically speaking, the binding constraint on deploying your AGI could well be your ability to align your AGI. Solving (scalable) alignment might be worth lots of $$$ and key to beating China.

Look, I really don’t want Xi Jinping Thought to rule the world. If China gets AGI first, the ensuing rapid AI-powered scientific and technological progress could well give it a decisive advantage (cf. the potential for >30%/year economic growth with AGI). I think there’s a very real specter of global authoritarianism here.[1]

Or hey, maybe you just think AGI is cool. You want to go build amazing products and enable breakthrough science and solve the world’s problems.

So, race to AGI with reckless abandon then? At this point, people get into agonizing discussions about safety tradeoffs.[2] And many people just mood affiliate their way to an answer: “accelerate, progress go brrrr,” or “AI scary, slow it down.”

I see this much more practically. And, practically, society cares about safety, a lot. Do you actually think that you’ll be able, and allowed, to deploy an AI system that has, say, a 10% chance of destroying all of humanity?[3]

Society has started waking up to AGI; like covid, the societal response will probably be a dumpster-fire, but it’ll also probably be quite intense. In many worlds, to deploy your AGI systems, people will need to be quite confident that your AGI won’t destroy the world.

Right now, we’re very much not on track to solve the alignment problem for superhuman AGI systems (“scalable alignment”)—but it’s a solvable problem, if we get our act together. I discuss this in my main post today (“Nobody’s on the ball on AGI alignment”). On the current trajectory, the binding constraint on deploying your AGI could well be your ability to align your AGI—and this alignment solution being unambiguous enough that there is consensus that it works.

Even if you just want to win the AGI race, you should probably want to invest much more heavily in solving this problem.


Things are going to get crazy, and people will pay attention

A mistake many people make when thinking about AGI is imagining a world that looks much like today, except for adding in a lab with a super powerful model. They ignore the endogenous societal response.

I and many others made this mistake with covid—we were freaking out in February 2020, and despairing that society didn’t seem to be even paying attention, let alone doing anything. But just a few weeks later, all of America went into an unprecedented lockdown. If we’re actually on our way to AGI, things are going to get crazy. People are going to pay attention.

The wheels for this are already in motion. Remember how nobody paid any attention to AI 6 months ago, and now Bing chat/Sydney going awry is on the front page of the NYT, US senators are getting scared, and Yale econ professors are advocating $100B/year for AI safety? Well, imagine that, but 100x as we approach AGI.

AI safety is going mainstream. Everyone has been primed to be scared about rogue AI by science fiction; all the CEOs have secretly believed in AI risk for years but thought it was too weird to talk about it[4]; and the mainstream media loves to hate on tech companies. Probably there will be further, much scarier wakeup calls (not just misalignment, but also misuse and scary demos in evals). People already freaked out about GPT-4 using a TaskRabbit to solve a captcha—now imagine a demo of AI systems designing a new bioweapon or autonomously self-replicating on the internet, or people using AI coders to hack major institutions like the government or big banks. Already, a majority of the population says they fear AI risk and want FDA-style regulation in polls.

The discourse on it will be incredibly dumb—I can’t wait for Ron DeSantis and Kamala Harris’s 2028 presidential debate on AI safety—but you won’t be able to escape it. (And as stupid as all of this will be, this sort of endogenous societal response is a big reason why I’m more optimistic on AI risk in general.)

The level of media scrutiny, public attention, internal employee pressure, self-regulation, government monitoring, etc. will be way too intense to ignore alignment concerns. We’re seeing very early versions of self-regulation with initial AI risk evals efforts. But price in how intense it’s all going to get. Do you think the US national security establishment won’t get involved once they realize they have a technology more powerful than nukes on their hands? Do you think your board is going to let you release a model if the NYT is reporting in all caps that a large fraction of serious AI experts, prominent CEOs, and politicians think this could go haywire and start actually hurting people?

Imagine if you tried submitting a drug application to the FDA with a similar risk profile.


A reasonable objection here is: “yes, we did lock down in response to covid, and that was pretty crazy, but also our response to covid was pretty incompetent across the board; it was more like random flailing than actually doing the most effective things; and it’s not even clear if the lockdowns were net-positive.”

I agree! The societal response to AGI will probably be a dumpster-fire.

But there will be a really intense response. I think it’s fairly likely that the kludgy response we do get is enough to throw serious sand into the gears on deployment, unless you have a convincing solution to (scalable) alignment. If anything, the example of lockdowns could point towards society responding in excessively cautious ways, heightening the returns to a convincing alignment solution even further. Yes, our response might also be totally ineffectual; this very much isn’t sufficient to make me sleep soundly at night. But in a large fraction of worlds, if you want to deploy your AGI, people are going to demand that they can be confident it’s safe.[5]


The binding constraint on making AGI could be aligning it. You want an unambiguous solution, for which there is consensus that it’s safe.

You don’t even need xrisk concerns for alignment to become the binding constraint on your ability to deploy models. With current techniques, we’re very much not on track for being able to put basic guardrails on models as they become superhuman. Do you really think you’ll be able to deploy GPT-7 all across the economy if you can’t reliably ensure GPT-7 won’t break the law?

The thing is, aligning superhuman AGIs is a much harder problem than near-term alignment. Current alignment techniques rely on human supervision. But as models get superhuman, it will become impossible for humans to reliably supervise these models (e.g., imagine a model proposing a series of actions or 100,000 lines of code too complicated for humans to understand). If you can’t detect bad behavior, you can’t prevent it. (And whereas the task in near-term alignment is “prevent the models from saying bad words,” the task for superhuman models looks more like “prevent the models from attempting a coup against the US government.”)

I think that aligning superhuman AGIs is a) doable, but b) nobody is on the ball right now—as discussed in my other post. The scalable alignment plans labs currently have (example) might work, but they sort of rely on “improvise in the moment, let’s cross our fingers and hope it works out.”

Even if that bet works out, the safety of your systems will probably be fairly ambiguous until very late, ambiguous enough that you won’t be able to deploy. When asked, “will your superhuman AGI go haywire?”, do you think people will accept “probably not” for an answer?

If you want to win the AGI race, if you want to beat China, you’re probably going to need a better alignment plan. You want an alignment solution good enough to achieve a broad consensus that your superhuman AGI is safe. Ambiguity could be fatal to your ability to press ahead.

You might not like it; you might rage at everyone’s excessive safetyism and wish it were different. But, practically speaking, you should be pretty interested in much more serious efforts to solve scalable alignment. Let’s not lose to China because, in our fervor to race to AGI, we fail to invest in the alignment research practically necessary to actually deploy AGI.[6]


Thanks to Collin Burns, Holden Karnofsky and Dwarkesh Patel for comments on a draft.

  1. ^

    Though, for now, it seems that China is a few years behind, and the US AI chip export controls might considerably hamper them (great CSIS explainer on the export controls, CSET report on why China might have a hard time catching up). So especially if timelines are short, we have a healthy lead for now.

  2. ^

    Which risk is bigger, AI misalignment or “bad guys getting AGI first”? Cf. Holden Karnofsky on the “caution vs. competition” frame.

  3. ^

    Or at least, one that is widely believed to have such a 10% chance.

  4. ^
  5. ^

    If this ends up being a big barrier to deploying your model in 50% of worlds, that 50% is enough to make alignment incredibly commercially valuable for you.

  6. ^

    An interesting potential implication not discussed in the main post: if alignment techniques become incredibly commercially valuable or key competitive advantages, will these become trade secrets not shared publicly or with other labs?