The Fusion Power Generator Scenario

Suppose, a few years from now, I prompt GPT-N to design a cheap, simple fusion power generator: something I could build in my garage and use to power my house. GPT-N succeeds. I build the fusion power generator, find that it works exactly as advertised, share the plans online, and soon the world has easy access to cheap, clean power.

One problem: at no point did it occur to me to ask "Can this design easily be turned into a bomb?". Had I thought to prompt it with the question, GPT-N would have told me that the design could easily be turned into a bomb. But I didn't think to ask, so GPT-N had no reason to mention it. With the design in wide use, it's only a matter of time until people figure it out. And so, just like that, we live in a world where anyone can build a cheap thermonuclear warhead in their garage.

This scenario highlights a few key constraints which I think are under-appreciated in alignment today.

Sharing Information Is Irreversible

I've heard people say that we can make AI safe(r) by restricting the AI's action space to things which we can undo. Problem is, sharing information is irreversible; once the cat is out of the bag, there's no getting it back into the bag. And if an AI can't share information, there's very little it can do; there's not much point in an AI which can't do anything observable at all. (One could design an AI to "move in mysterious ways", but I have trouble imagining that it ends up safer that way.)

This is a problem when information itself is dangerous, e.g. knowledge of how to build a thermonuclear warhead in one's garage.

Humans Are Not Safe

Two key properties of humans:

  • We do not have full introspective understanding of our own wants

  • We do not have the processing power to fully understand the consequences of changes

Sometimes, we get something we thought we wanted, and find out that we don't want it after all. Either we misunderstood our own wants, or misunderstood the full implications of the change.

Most of the time, this isn't that huge an issue. We lose some money and/or time, but we move on.

But if a human is capable of making large, irreversible changes to the world, then the problem becomes more serious. A human with access to powerful AI (even something as conceptually simple as GPT-N) is capable of making large irreversible changes, yet lacks the processing power to fully understand the implications of those changes. In general, a human won't even know the right questions to ask. So, if a system's safety relies on a human asking the right questions, then the system is not safe.

In particular, this is relevant to the HCH family of alignment schemes (e.g. IDA), as well as human-imitating AI more broadly.

Corollary: Tool AI Is Not Inherently Safe

Tool AI, in particular, relies primarily on human operators for safety. Just like a tablesaw is safe if-and-only-if the operator uses it safely, tool AI is safe if-and-only-if the operator uses it safely.

With a tablesaw, that's usually fine. It's pretty obvious what sorts of things will lead to bad outcomes from a tablesaw. But the big value-proposition of powerful AI is its ability to reason about systems or problems too complicated for humans, which are exactly the systems/problems where safety issues are likely to be nonobvious. If we're going to unlock the full value of AI at all, we'll need to use it on problems where humans do not know the relevant safety issues. So: if the system's safety relies on a human using it safely, then it's not safe.

If you want a concrete, evocative analogy: picture a two-year-old playing on top of a tablesaw.

That said, people are designing tablesaws which auto-stop when skin contacts the blade. In general, a system's designers may understand the relevant safety issues better than the operators. Indeed, since the first AGIs will be built by humans, any approach to AI safety ultimately relies on human designers asking the right questions. Point is: we can't avoid the need for designers to ask (at least some of) the right questions upfront. But needing the designers to ask the right questions once is still a lot better than needing every user to ask the right questions every time they use the system.

(This perspective ties in nicely with AI alignment as interface design: if an interface offers an easy-to-overlook way to cut your hand off, and relies on users not doing so, then that's a design problem.)
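
To make the interface-design point a bit more concrete, here is a minimal, purely hypothetical sketch. Nothing in it is a real API: `query_model`, `SAFETY_QUESTIONS`, and `assisted_design` are made-up names, and the model call is just a placeholder. The point it illustrates is that the safety questions live in the tool's design, so no individual user has to remember to ask them:

```python
# Hypothetical sketch only: query_model is a placeholder, not a real API.
SAFETY_QUESTIONS = [
    "Can this design easily be turned into a weapon?",
    "What failure modes could cause large-scale or irreversible harm?",
    "What hazards would a non-expert builder be likely to overlook?",
]


def query_model(prompt: str) -> str:
    """Stand-in for a call to some powerful model; returns a canned string here."""
    return f"[model response to: {prompt!r}]"


def assisted_design(request: str) -> dict:
    """Answer a design request, and always run the designer-chosen safety questions too."""
    design = query_model(request)
    safety_review = {
        question: query_model(f"{request}\n\nQuestion: {question}")
        for question in SAFETY_QUESTIONS
    }
    return {"design": design, "safety_review": safety_review}


if __name__ == "__main__":
    result = assisted_design("Design a cheap, simple fusion power generator.")
    print(result["design"])
    for question, answer in result["safety_review"].items():
        print(f"- {question}\n  {answer}")
```

A real checklist would need to be far better than three hard-coded questions, but the design choice is the point: the questions get asked because the designer asked them once, not because every operator happens to think of them.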

Safe tool AI could potentially be built, but safety won't happen by itself any more than it would for other kinds of AI.

Generalization: Great Power, Great Responsibility

Finally, note that none of this is an issue if GPT-N can't design fusion power generators (or garage warheads) at all. In general, it is easy to come up with designs for probably-safe AIs which just can't do anything all that impressive. The greater an AI's capabilities, the more precisely and reliably it needs to be aligned to human values.

In particular, the "capabilities" relevant here are an AI's ability to reason about systems, or solve problems, too complicated for humans. It's the complexity that matters; the inability of humans to fully understand all the implications of the AI's reasoning/solutions is exactly what makes humans unreliable judges of safety. So, the greater the complexity of systems/problems an AI can handle, the more important it is for that AI to have its own model of what-humans-want, and to align its solutions with that model.