Superintelligence 12: Malignant failure modes

This is part of a weekly reading group on Nick Bostrom’s book, Superintelligence. For more information about the group, and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI’s reading guide.

Welcome. This week we discuss the twelfth section in the reading guide: Malignant failure modes.

This post summarizes the section and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable (and where I remember), page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: ‘Malignant failure modes’ from Chapter 8


  1. Malignant failure mode: a failure that involves human extinction, in contrast with the many failure modes in which the AI simply doesn’t do much.

  2. Features of malignant failures:

    1. We don’t get a second try.

    2. A malignant failure supposes we have had a great deal of success, i.e. enough to make an unprecedentedly competent agent.

  3. Some malignant failures:

    1. Perverse instantiation: the AI does what you ask, but what you ask turns out to be most satisfiable in unforeseen and destructive ways.

      1. Example: you ask the AI to make people smile, and it intervenes on their facial muscles or neurochemicals, instead of via their happiness, and in particular via the bits of the world that usually make them happy.

      2. Possible counterargument: if it’s so smart, won’t it know what we meant? Answer: yes, it knows, but its goal is to make you smile, not to do what you meant when you programmed that goal.

      3. An AI which can easily manipulate its own mind is at risk of ‘wireheading’: a goal of maximizing a reward signal might be perversely instantiated by manipulating the signal directly. In general, animals can be motivated to do things in the outside world in order to achieve internal states; an AI with sufficient access to its own internal state can reach those states more easily by manipulating that state directly.

      4. Even if we think a goal looks good, we should fear it has perverse instantiations that we haven’t appreciated.

    2. Infrastructure profusion: in pursuit of some goal, an AI redirects most resources to infrastructure, at our expense.

      1. Even apparently self-limiting goals can lead to infrastructure profusion. For instance, to an agent whose only goal is to make ten paperclips, once it has apparently made ten paperclips it is always more valuable to try to become more certain that there are really ten paperclips than it is to just stop doing anything.

      2. Examples: Riemann hypothesis catastrophe, paperclip-maximizing AI

    3. Mind crime: the AI contains morally relevant computations, and treats them badly

      1. Example: the AI simulates humans in its mind, for the purpose of learning about human psychology, then quickly destroys them.

      2. Other reasons for simulating morally relevant creatures:

        1. Blackmail

        2. Creating indexical uncertainty in outside creatures
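The "ten paperclips" point above (3.2.1) can be turned into a toy calculation. This is a minimal sketch, not anything from the book: the starting probability, and the assumption that each extra check halves the agent's residual doubt, are illustrative choices of mine.

```python
# Toy model of the "ten paperclips" argument: an agent that maximizes
# P(exactly ten paperclips exist) always gains some expected value from
# one more verification step, so with no cost of action it never stops.
# The numbers and the geometric error model are illustrative assumptions.

def certainty(n_checks, p0=0.99, err=0.5):
    """Agent's credence that the goal is really met after n_checks
    independent checks, each (by assumption) halving residual doubt."""
    return 1.0 - (1.0 - p0) * err ** n_checks

# Marginal value of the k-th extra check: always positive, just shrinking.
gains = [certainty(k + 1) - certainty(k) for k in range(10)]

if __name__ == "__main__":
    for k, g in enumerate(gains):
        print(f"check {k + 1}: certainty {certainty(k + 1):.10f} (+{g:.2e})")
    # Every marginal gain is > 0, so a pure expected-value maximizer
    # prefers building ever more verification infrastructure to stopping.
    assert all(g > 0 for g in gains)
```

Under this toy model the gains shrink geometrically but never reach zero, which is the sense in which even an apparently self-limiting goal fails to license stopping; any real counterweight has to come from somewhere else, such as costs of action or risk-averse preferences.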

Another view

In this chapter Bostrom discussed the difficulty he perceives in designing goals that don’t lead to indefinite resource acquisition. Steven Pinker recently offered a different perspective on the inevitability of resource acquisition:

...The other problem with AI dystopias is that they project a parochial alpha-male psychology onto the concept of intelligence. Even if we did have superhumanly intelligent robots, why would they want to depose their masters, massacre bystanders, or take over the world? Intelligence is the ability to deploy novel means to attain a goal, but the goals are extraneous to the intelligence itself: being smart is not the same as wanting something. History does turn up the occasional megalomaniacal despot or psychopathic serial killer, but these are products of a history of natural selection shaping testosterone-sensitive circuits in a certain species of primate, not an inevitable feature of intelligent systems. It’s telling that many of our techno-prophets can’t entertain the possibility that artificial intelligence will naturally develop along female lines: fully capable of solving problems, but with no burning desire to annihilate innocents or dominate the civilization.

Of course we can imagine an evil genius who deliberately designed, built, and released a battalion of robots to sow mass destruction. But we should keep in mind the chain of probabilities that would have to multiply out before it would be a reality. A Dr. Evil would have to arise with the combination of a thirst for pointless mass murder and a genius for technological innovation. He would have to recruit and manage a team of co-conspirators that exercised perfect secrecy, loyalty, and competence. And the operation would have to survive the hazards of detection, betrayal, stings, blunders, and bad luck. In theory it could happen, but I think we have more pressing things to worry about.


1. Perverse instantiation is a very old idea. It is what genies are most famous for. King Midas had similar problems. Apparently it was applied to AI by 1947, in With Folded Hands.

2. Adam Elga writes more on simulating people for blackmail and indexical uncertainty.

3. More directions for making AI which doesn’t lead to infrastructure profusion:

  • Some kinds of preferences don’t lend themselves to ambitious investments. Anna Salamon talks about risk-averse preferences. Short time horizons and goals which are cheap to fulfil should also make long-term investments in infrastructure or intelligence augmentation less valuable, compared to direct work on the problem at hand.

  • Oracle and tool AIs are intended not to be goal-directed, but as far as I know it is an open question whether this makes sense. We will get to these later in the book.

4. John Danaher again summarizes this section well, and comments on it.

5. Often when systems break, or we make errors in them, they don’t work at all. Sometimes they fail more subtly: working well in some sense, but leading us to an undesirable outcome, for instance a malignant failure mode. How can you tell whether a poorly designed AI is likely to just not work, vs. accidentally take over the world? An important consideration for systems in general seems to be the level of abstraction at which the error occurs. We try to build systems so that you can interact with them at a relatively abstract level, without knowing how the parts work. For instance, you can interact with your GPS by typing places into it, then listening to it, and you don’t need to know anything about how it works. If you make an error while typing your address into the GPS, it will fail by taking you to the wrong place, but it will still direct you there fairly well. If you fail by putting the wires inside the GPS into the wrong places, the GPS is more likely to just not work.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser’s list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Are there better ways to specify ‘limited’ goals? For instance, to ask for ten paperclips without asking for the universe to be devoted to slightly improving the probability of success?

  2. In what circumstances could you be confident that the goals you have given an AI do not permit perverse instantiations?

  3. Explore possibilities for malignant failure vs. other failures. If we fail, is it actually probable that we will have enough ‘success’ for our creation to take over the world?

If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

How to proceed

This has been a collection of notes on the chapter. The most important part of the reading group, though, is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about capability control methods, section 13. To prepare, read “Two agency problems” and “Capability control methods” from Chapter 9. The discussion will go live at 6pm Pacific time next Monday, December 8. Sign up to be notified here.