Existential risk from AI without an intelligence explosion

[xpost from my blog]

In discussions of existential risk from AI, it is often assumed that the existential catastrophe would follow an intelligence explosion, in which an AI creates a more capable AI, which in turn creates a yet more capable AI, and so on, a feedback loop that eventually produces an AI whose cognitive power vastly surpasses that of humans, and which would then be able to obtain a decisive strategic advantage over humanity, allowing it to pursue its own goals without effective human interference. Victoria Krakovna points out that many arguments that AI could present an existential risk do not rely on an intelligence explosion. I want to look in slightly more detail at how that could happen. Kaj Sotala also discusses this.

An AI starts an intelligence explosion when its ability to create better AIs surpasses that of human AI researchers by a sufficient margin (provided the AI is motivated to do so). An AI attains a decisive strategic advantage when its ability to optimize the universe surpasses that of humanity by a sufficient margin. Which of these happens first depends on which skills AIs have an advantage at relative to humans. If AIs are better at programming AIs than at taking over the world, then an intelligence explosion will happen first, and the resulting AI will be able to get a decisive strategic advantage soon after. But if AIs are better at taking over the world than at programming AIs, then an AI would get a decisive strategic advantage without an intelligence explosion occurring first.

Since an intelligence explosion happening first is usually considered the default assumption, I'll just sketch a plausibility argument for the reverse. There's a lot of variation in how easy cognitive tasks are for AIs compared to humans. Since programming AIs is not yet a task that AIs can do well, it doesn't seem like it should be a priori surprising if programming AIs turned out to be an extremely difficult task for AIs to accomplish, relative to humans. Taking over the world is also plausibly especially difficult for AIs, but I don't see strong reasons for confidence that it would be harder for AIs than starting an intelligence explosion would be. It's possible that an AI with significantly but not vastly superhuman abilities in some domains could identify some vulnerability that it could exploit to gain power, which humans would never think of. Or an AI could be enough better than humans at forms of engineering other than AI programming (perhaps molecular manufacturing) that it could build physical machines that could outcompete humans, though this would require it to obtain the resources necessary to produce them.

Furthermore, an AI that is capable of producing a more capable AI may refrain from doing so if it is unable to solve the AI alignment problem for itself; that is, if it can create a more intelligent AI, but not one that shares its preferences. This seems unlikely if the AI has an explicit description of its preferences. But if the AI, like humans and most contemporary AI, lacks an explicit description of its preferences, then the difficulty of the AI alignment problem could be an obstacle to an intelligence explosion occurring.

It also seems worth thinking about the policy implications of the differences between existential catastrophes from AI that follow an intelligence explosion and those that don't. For instance, AIs that attempt to attain a decisive strategic advantage without undergoing an intelligence explosion will exceed human cognitive capabilities by a smaller margin, and thus would likely attain strategic advantages that are less decisive, and would be more likely to fail. Containment strategies are thus probably more useful for addressing risks that don't involve an intelligence explosion, while attempts to contain a post-intelligence explosion AI are probably pretty much hopeless (although it may be worthwhile to find ways to interrupt an intelligence explosion while it is underway). Risks not involving an intelligence explosion may be more predictable in advance, since they don't involve a rapid increase in the AI's abilities, and would thus be easier to deal with at the last minute; so it might make sense, far in advance, to focus disproportionately on risks that do involve an intelligence explosion.

It seems likely that AI alignment would be easier for AIs that do not undergo an intelligence explosion, for two reasons: it is more likely to be possible to monitor such an AI and intervene if something goes wrong, and lower optimization power means a lower ability to exploit the difference between the goals the AI was given and the goals that were intended, if we are only able to specify our goals approximately. The first of those reasons applies to any AI that attempts to attain a decisive strategic advantage without first undergoing an intelligence explosion, whereas the second applies only to AIs that never undergo an intelligence explosion. Because of this, it might make sense to try to decrease both the chance that the first AI to attain a decisive strategic advantage undergoes an intelligence explosion beforehand, and the chance that it ever undergoes one, though preventing the latter may be much more difficult. However, some strategies to achieve this may have undesirable side-effects; for instance, as mentioned earlier, AIs whose preferences are not explicitly described seem more likely to attain a decisive strategic advantage without first undergoing an intelligence explosion, but such AIs are probably more difficult to align with human values.

If AIs get a decisive strategic advantage over humans without an intelligence explosion, the advantage would likely be obtained much more slowly, making it much more likely that multiple, and possibly many, AIs gain decisive strategic advantages over humans, though not necessarily over each other, resulting in a multipolar outcome. Thus considerations about multipolar versus singleton scenarios also apply to decisive-strategic-advantage-first versus intelligence-explosion-first scenarios.