Disentangling arguments for the importance of AI safety

I recently attended the 2019 Beneficial AGI conference organised by the Future of Life Institute. I’ll publish a more complete write-up later, but I was particularly struck by how varied attendees’ reasons for considering AI safety important were. Before this, I’d observed a few different lines of thought, but interpreted them as different facets of the same idea. Now, though, I’ve identified at least six distinct serious arguments for why AI safety is a priority. By distinct I mean that you can believe any one of them without believing any of the others, although of course the particular categorisation I use is rather subjective, and there’s a significant amount of overlap. In this post I give a brief overview of my own interpretation of each argument (note that I don’t necessarily endorse them myself). They are listed roughly from most specific and actionable to most general. I finish with some thoughts on what to make of this unexpected proliferation of arguments. Primarily, I think it increases the importance of clarifying and debating the core ideas in AI safety.

  1. Maximisers are dangerous. Superintelligent AGI will behave as if it’s maximising the expectation of some utility function, since doing otherwise can be shown to be irrational. Yet we can’t write down a utility function which precisely describes human values, and optimising very hard for any other function will lead to that AI rapidly seizing control (as a convergent instrumental subgoal) and building a future which contains very little of what we value (because of Goodhart’s law and the complexity and fragility of values). We won’t have a chance to notice and correct misalignment because an AI which has exceeded human level will improve its intelligence very quickly (either by recursive self-improvement or by scaling up its hardware), and then prevent us from modifying it or shutting it down.

    1. This was the main thesis advanced by Yudkowsky and Bostrom when founding the field of AI safety. Here I’ve tried to convey the original line of argument, although some parts of it have been strongly critiqued since then. In particular, Drexler and Shah have disputed the relevance of expected utility maximisation (the latter suggesting the concept of goal-directedness as a replacement), while Hanson and Christiano disagree that AI intelligence will increase in a very fast and discontinuous way.

    2. Most of the arguments in this post originate from or build on this one in some way. This is particularly true of the next two arguments; nevertheless, I think that there’s enough of a shift in focus in each to warrant separate listings.

  2. The target loading problem. Even if we knew exactly what we wanted a superintelligent agent to do, we don’t currently know (even in theory) how to make an agent which actually tries to do that. In other words, if we were to create a superintelligent AGI before solving this problem, the goals we would ascribe to that AGI (by taking the intentional stance towards it) would not be the ones we had intended to give it. As a motivating example, evolution selected humans for their genetic fitness, yet humans have goals which are very different from just spreading their genes. In a machine learning context, while we can specify a finite number of data points and their rewards, neural networks may then extrapolate from these rewards in non-humanlike ways.

    1. This is a more general version of the “inner optimiser problem”, and I think it captures the main thrust of the latter while avoiding the difficulties of defining what actually counts as an “optimiser”. I’m grateful to Nate Soares for explaining the distinction, and arguing for the importance of this problem.

  3. The prosaic alignment problem. It is plausible that we will build “prosaic AGI”, which replicates human behaviour without requiring breakthroughs in our understanding of intelligence. Shortly after such AIs reach human level (or possibly even before), they will become the world’s dominant economic actors. They will quickly come to control the most important corporations, earn most of the money, and wield enough political influence that we will be unable to coordinate to place limits on their use. Due to economic pressures, corporations or nations that slow down AI development and deployment in order to focus on aligning their AI more closely with their values will be outcompeted. As AIs exceed human-level intelligence, their decisions will become too complex for humans to understand or provide feedback on (unless we develop new techniques for doing so), and eventually we will no longer be able to correct the divergences between their values and ours. Thus the majority of the resources in the far future will be controlled by AIs which don’t prioritise human values. This argument was explained in a blog post by Paul Christiano.

    1. More generally, aligning multiple agents with multiple humans is much harder than aligning one agent with one human, because value differences might lead to competition and conflict even between agents that are each fully aligned with some humans. (As my own speculation, it’s also possible that having multiple agents would increase the difficulty of single-agent alignment: for example, the question “what would humans want if I didn’t manipulate them” would no longer track our values if we would counterfactually be manipulated by a different agent.)

  4. The human safety problem. This line of argument (which Wei Dai has recently highlighted) claims that no human is “safe” in the sense that giving them absolute power would produce good futures for humanity in the long term, and therefore that building AI which extrapolates and implements the values of even a very altruistic human is insufficient. A prosaic version of this argument emphasises the corrupting effect of power, and the fact that morality is deeply intertwined with social signalling; however, I think there’s a stronger and more subtle version. In everyday life it makes sense to model humans as mostly rational agents pursuing their goals and values. However, this abstraction breaks down badly in more extreme cases (e.g. addictive superstimuli, unusual moral predicaments), implying that human values are somewhat incoherent. One such extreme case is running my brain for a billion years, after which it seems very likely that my values will have shifted or distorted radically, in a way that my original self wouldn’t endorse. Yet if we want a good future, this is the process which we require to go well: a human (or a succession of humans) needs to maintain broadly acceptable and coherent values for astronomically long time periods.

    1. An obvious response is that we shouldn’t entrust the future to one human, but rather to some group of humans following a set of decision-making procedures. However, I don’t think any currently-known institution is actually much safer than individuals over the sort of timeframes we’re talking about. Presumably a committee of several individuals would have lower variance than just one, but as that committee grows you start running into well-known problems with democracy. And while democracy isn’t a bad system, it seems unlikely to be robust on the timeframe of millennia or longer. (Alex Zhu has made the interesting argument that the problem of an individual maintaining coherent values is roughly isomorphic to the problem of a civilisation doing so, since both are complex systems composed of individual “modules” which often want different things.)

    2. While AGI amplifies the human safety problem, it may also help solve it if we can use it to decrease the value drift that would otherwise occur. Also, while it’s possible that we need to solve this problem in conjunction with other AI safety problems, it might be postponable until after we’ve achieved civilisational stability.

    3. Note that I use “broadly acceptable values” rather than “our own values”, because it’s very unclear to me which types or extent of value evolution we should be okay with. Nevertheless, there are some values which we definitely find unacceptable (e.g. having a very narrow moral circle, or wanting your enemies to suffer as much as possible) and I’m not confident that we’ll avoid drifting into them by default.

  5. Misuse and vulnerabilities. These might be catastrophic even if AGI always carries out our intentions to the best of its ability:

    1. AI which is superhuman at science and engineering R&D will be able to invent very destructive weapons much faster than humans can. Humans may well be irrational or malicious enough to use such weapons even when doing so would lead to our extinction, especially if they’re invented before we improve our global coordination mechanisms. It’s also possible that we invent some technology which destroys us unexpectedly, either through bad luck or carelessness. For more on the dangers from technological progress in general, see Bostrom’s paper on the vulnerable world hypothesis.

    2. AI could be used to disrupt political structures, for example via unprecedentedly effective psychological manipulation. In an extreme case, it could be used to establish very stable totalitarianism, with automated surveillance and enforcement mechanisms ensuring an unshakeable monopoly on power for leaders.

    3. AI could be used for large-scale projects (e.g. climate engineering to prevent global warming, or managing the colonisation of the galaxy) without sufficient oversight or verification of robustness. Software or hardware bugs might then induce the AI to make unintentional yet catastrophic mistakes.

    4. People could use AIs to hack critical infrastructure (including the other AIs which manage the aforementioned large-scale projects). In addition to exploiting standard security vulnerabilities, hackers might induce mistakes using adversarial examples or ‘data poisoning’.

  6. Argument from large impacts. Even if we’re very uncertain about what AGI development and deployment will look like, it seems likely that AGI will have a very large impact on the world in general, and that further investigation into how to direct that impact could prove very valuable.

    1. Weak version: development of AGI will be at least as big an economic jump as the industrial revolution, and therefore affect the trajectory of the long-term future. See Ben Garfinkel’s talk at EA Global London 2018. Ben noted that to consider work on AI safety important, we also need to believe the additional claim that there are feasible ways to positively influence the long-term effects of AI development, something which may not have been true for the industrial revolution. (Personally, my guess is that since AI development will happen more quickly than the industrial revolution, power will be more concentrated during the transition period, and so influencing its long-term effects will be more tractable.)

    2. Strong version: development of AGI will make humans the second most intelligent species on the planet. Given that it was our intelligence which allowed us to control the world to the large extent that we do, we should expect that entities which are much more intelligent than us will end up controlling our future, unless there are reliable and feasible ways to prevent it. So far we have not discovered any.
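
As a toy illustration of the Goodhart's law worry in the first argument, here is a minimal Python sketch. Everything in it is hypothetical: `true_value`, `proxy`, and `optimise` are made-up stand-ins chosen so that the proxy tracks what we care about under mild optimisation and diverges under strong optimisation. It says nothing about how a real AI system would behave, and it does not capture the claims about seizing control or resisting shutdown.

```python
# Toy Goodhart's law illustration. All functions and numbers are hypothetical.

def true_value(x: int) -> float:
    """What we actually care about: rises at first, then falls (peaks at x = 5)."""
    return x - x ** 2 / 10

def proxy(x: int) -> float:
    """The measurable stand-in handed to the optimiser: 'more x is always better'."""
    return float(x)

def optimise(objective, search_limit: int) -> int:
    """Return the integer in [0, search_limit] that maximises the objective.

    A larger search_limit stands in for a more capable optimiser.
    """
    return max(range(search_limit + 1), key=objective)

# Optimise the proxy with increasing amounts of optimisation power.
for limit in (2, 5, 10, 20):
    x = optimise(proxy, limit)
    print(f"limit={limit:>2}  proxy={proxy(x):>5.1f}  true={true_value(x):>6.1f}")
```

With small search limits the proxy and the true value rise together; once the optimiser can push past the peak of `true_value`, further gains on the proxy come at the expense of the thing it was meant to measure.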

What should we think about the fact that there are so many arguments for the same conclusion? As a general rule, the more arguments support a statement, the more likely it is to be true. However, I’m inclined to believe that quality matters much more than quantity: it’s easy to make up weak arguments, but you only need one strong one to outweigh all of them. And this proliferation of arguments is (weak) evidence against their quality: if the conclusions of a field remain the same but the reasons given for holding those conclusions change, that’s a warning sign for motivated cognition (especially when those beliefs are considered socially important). This problem is exacerbated by a lack of clarity about which assumptions and conclusions are shared between arguments, and which aren’t.
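
One way to make the quality-versus-quantity point concrete is a rough Bayesian sketch in Python. It treats each argument as evidence with a likelihood ratio and multiplies those ratios into the prior odds. The prior of 0.1 and the ratios 1.2 and 10 are invented for illustration, and the multiplication assumes the arguments are independent, which is generous given how much the arguments above overlap.

```python
from math import prod

def posterior_probability(prior: float, likelihood_ratios: list[float]) -> float:
    """Update a prior probability using likelihood ratios treated as independent."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1 + posterior_odds)

PRIOR = 0.1  # hypothetical starting credence in the conclusion

# Six weak arguments (each only slightly favours the conclusion)
# versus a single strong argument.
six_weak = posterior_probability(PRIOR, [1.2] * 6)
one_strong = posterior_probability(PRIOR, [10.0])

print(f"six weak arguments:  {six_weak:.3f}")
print(f"one strong argument: {one_strong:.3f}")
```

Under these made-up numbers the six weak arguments move the credence to roughly 0.25, while the single strong argument moves it to roughly 0.53. If the weak arguments share assumptions, their combined ratio is smaller than the simple product, which is one reason it matters to know which assumptions are shared.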

On the other hand, superintelligent AGI is a very complicated topic, and so perhaps it’s natural that there are many different lines of thought. One way to put this in perspective (which I credit to Beth Barnes) is to think about the arguments which might have been given for worrying about nuclear weapons, before they had been developed. Off the top of my head, there are at least four:

  1. They might be used deliberately.

  2. They might be set off accidentally.

  3. They might cause a nuclear chain reaction much larger than anticipated.

  4. They might destabilise politics, either domestically or internationally.

And there are probably more which would have been credible at the time, but which seem silly now due to hindsight bias. So if there’d been an active anti-nuclear movement in the 1930s or early 1940s, the motivations of its members might well have been as disparate as those of AI safety advocates today. Yet the overall concern would have been (and still is) totally valid and reasonable.

I think the main takeaway from this post is that the AI safety community as a whole is still confused about the very problem we are facing. The only way to dissolve this tangle is to have more communication and clarification of the fundamental ideas in AI safety, particularly in the form of writing which is made widely available. And while it would be great to have AI safety researchers explaining their perspectives more often, I think there is still a lot of explicatory work which can be done regardless of technical background. In addition to analysis of the arguments discussed in this post, I think it would be particularly useful to see more descriptions of deployment scenarios and corresponding threat models. It would also be valuable for research agendas to highlight which problem they are addressing, and the assumptions they require to succeed.

This post has benefited greatly from feedback from Rohin Shah, Alex Zhu, Beth Barnes, Adam Marblestone, Toby Ord, and the DeepMind safety team. All opinions are my own.