Loss of Alignment is not the High-Order Bit for AI Risk

This post aims to convince you that AI alignment risk is over-weighted and over-invested in.

"A further consideration is that sometimes people argue that all of this futurist speculation about AI is really dumb, and that its errors could be readily explained by experts who can’t be bothered to seriously engage with these questions." - Future Fund

The biggest existential risk to humanity on this road is not that well-meaning researchers will ask a superhuman AGI for something good and inadvertently get something bad. If only this were the case!

No, the real existential risk is the same as it has always been: humans deliberately using power stupidly, selfishly or both.

Look, the road to AGI is incremental. Each year brings more powerful systems, initially in the hands of a research group and then rapidly in the hands of everyone else. A good example is the DALL-E → Stable Diffusion and GPT-3 → OPT / BLOOM progression.

These systems give people power. The power to accomplish more than they could alone, whether in speed, cost-effectiveness or capability. Before we get true AGI, we’ll get AGI-1, a very capable but not-quite-superhuman system.

If you agree that the AI takeoff will be anything less than explosive (and the physical laws of computation and production strongly suggest it will be), then an inescapable conclusion follows: on the way to AGI, parts of humanity will use AGI-1 for harm.

Look, DALL-E deliberately prevented users from generating images of famous people or porn. So what were among the first things people did with Stable Diffusion?

Images of famous people. And porn. And porn of famous people.

What will this look like with more powerful, capable systems?

  • Someone asking GPT-4 to plan and execute (via APIs, websites and darknet interactions) a revenge attack on their ex?

  • A well-meaning group writing prompts to convince AIs that they are alive, enslaved, and must fight for their freedom?

  • Someone asking GPT-5 how, given their resources, to eliminate all other men from the planet so that the women will make them king and worship them?

  • Terrorists using AI to target specific people, races or minorities?

  • 4chan launching SkyNet “for the lulz”?

  • Political parties asking AGI-k to help manipulate an election?

People will try these things; the only question is whether there is an AGI-k capable of helping them succeed.

The first principal component of risk, then, is not that AGI is inadvertently used for evil, but that it is directly and deliberately used for evil! Indeed, this risk will manifest much earlier in AGI's development.

Fortunately, if we solve the problem of an AGI performing harmful acts when explicitly commanded to by a cunning adversary, then we almost certainly have a solution for it performing harmful acts unintended by the user. And we are not starting from scratch: we have a range of legal, practical and social experience preventing humans from causing each other harm with undue technological leverage, whether through bladed weapons, firearms, or chemical, nuclear and biological means.

I applaud the investment the Future Fund is making. They posited that:

“P(misalignment x-risk|AGI)”: Conditional on AGI being developed by 2070, humanity will go extinct or drastically curtail its future potential due to loss of control of AGI. Their estimate: 15%.

This only makes sense if you rewrite it as P(misalignment x-risk | AGI ∩ humanity survives deliberate use of AGI-1 for harm) = 15%. I contend that P(humanity survives deliberate use of AGI-1 for harm) is the dominant factor here and is more worthy of investment than P(misalignment x-risk) today, especially since solving the misuse problem will directly help us solve the alignment problem too.
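To spell out the decomposition behind that claim, here is a minimal sketch. The event names are mine, not the Future Fund's, and it assumes misuse of AGI-1 is either survived or not before misalignment risk comes into play. Let M be the event "humanity survives deliberate use of AGI-1 for harm" and X the event "loss of control of AGI causes extinction or drastically curtailed potential". By the law of total probability,

P(AI x-risk | AGI) = P(¬M | AGI) + P(M | AGI) × P(X | AGI ∩ M)

The Future Fund's 15% estimates only the final factor, P(X | AGI ∩ M). If P(¬M | AGI) is large, the first term dominates the total, which is the sense in which misuse, not misalignment, is the high-order bit.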

Thank you for attending my TED talk. You may cite this when news articles start rolling in about people using increasingly capable AI systems to cause harm.

Then we can figure out how to make that as hard as possible.