The Fusion Power Generator Scenario

Suppose, a few years from now, I prompt GPT-N to design a cheap, simple fusion power generator—something I could build in my garage and use to power my house. GPT-N succeeds. I build the fusion power generator, find that it works exactly as advertised, share the plans online, and soon the world has easy access to cheap, clean power.

One problem: at no point did it occur to me to ask “Can this design easily be turned into a bomb?”. Had I thought to prompt it with the question, GPT-N would have told me that the design could easily be turned into a bomb. But I didn’t think to ask, so GPT-N had no reason to mention it. With the design in wide use, it’s only a matter of time until people figure it out. And so, just like that, we live in a world where anyone can build a cheap thermonuclear warhead in their garage.

This scenario highlights a few key constraints which I think are under-appreciated in alignment today.

Sharing Information is Irreversible

I’ve heard people say that we can make AI safe(r) by restricting the AI’s action space to things which we can undo. Problem is, sharing information is irreversible; once the cat is out of the bag, there’s no getting it back in. And since essentially any observable action shares information, an AI restricted to reversible actions can’t do anything observable at all; there’s not much point in an AI like that. (One could design an AI to “move in mysterious ways”, but I have trouble imagining that it ends up safer that way.)

This is a problem when information itself is dangerous, e.g. knowledge of how to build a thermonuclear warhead in one’s garage.

Humans Are Not Safe

Two key properties of humans:

  • We do not have full introspective understanding of our own wants

  • We do not have the processing power to fully understand the consequences of changes

Sometimes, we get something we thought we wanted, and find out that we don’t want it after all. Either we misunderstood our own wants, or we misunderstood the full implications of the change.

Most of the time, this isn’t that huge an issue. We lose some money and/or time, but we move on.

But if a human is capable of making large, irreversible changes to the world, then the problem becomes more serious. A human with access to powerful AI—even something as conceptually simple as GPT-N—is capable of making large, irreversible changes, but does not have the processing power to fully understand the implications of those changes. In general, a human won’t even know the right questions to ask. So, if a system’s safety relies on a human asking the right questions, then the system is not safe.

In particular, this is relevant to the HCH family of alignment schemes (e.g. IDA), as well as human-imitating AI more broadly.

Corollary: Tool AI Is Not Inherently Safe

Tool AI, in particular, relies primarily on human operators for safety. Just like a tablesaw is safe if-and-only-if the operator uses it safely, tool AI is safe if-and-only-if the operator uses it safely.

With a tablesaw, that’s usually fine. It’s pretty obvious what sorts of things will lead to bad outcomes from a tablesaw. But the big value-proposition of powerful AI is its ability to reason about systems or problems too complicated for humans—which are exactly the systems/problems where safety issues are likely to be nonobvious. If we’re going to unlock the full value of AI at all, we’ll need to use it on problems where humans do not know the relevant safety issues. So: if the system’s safety relies on a human using it safely, then it’s not safe.

If you want a concrete, evocative analogy: picture a two-year-old playing on top of a tablesaw.

That said, people are designing tablesaws which auto-stop when skin contacts the blade. In general, a system’s designers may understand the relevant safety issues better than its operators. And since the first AGIs will be built by humans, any approach to AI safety ultimately relies on the designers asking at least some of the right questions upfront; that need can’t be avoided. But needing the designers to ask the right questions once is still a lot better than needing every user to ask the right questions every time they use the system.

(This perspective ties in nicely with AI alignment as interface design: if an interface offers an easy-to-overlook way to cut your hand off, and relies on users not doing so, then that’s a design problem.)

Safe tool AI could potentially be built, but safety won’t happen by itself any more than it would for other kinds of AI.

Generalization: Great Power, Great Responsibility

Finally, note that none of this is an issue if GPT-N can’t design fusion power generators (or garage warheads) at all. In general, it is easy to come up with designs for probably-safe AIs which just can’t do anything all that impressive. The greater an AI’s capabilities, the more precisely and reliably it needs to be aligned to human values.

In particular, the “capabilities” relevant here are an AI’s abilities to reason about systems too complicated for humans or solve problems too complicated for humans. It’s the complexity that matters; the inability of humans to fully understand all the implications of the AI’s reasoning/solutions is exactly what makes humans unreliable judges of safety. So, the greater the complexity of systems/problems an AI can handle, the more important it is for that AI to have its own model of what-humans-want, and to align its solutions with that model.