Superintelligence 11: The treacherous turn

This is part of a weekly reading group on Nick Bostrom’s book, Superintelligence. For more information about the group, and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI’s reading guide.

Wel­come. This week we dis­cuss the 11th sec­tion in the read­ing guide: The treach­er­ous turn. This cor­re­sponds to Chap­ter 8.

This post summarizes the section and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable, and where I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Read­ing: “Ex­is­ten­tial catas­tro­phe…” and “The treach­er­ous turn” from Chap­ter 8


  1. The pos­si­bil­ity of a first mover ad­van­tage + or­thog­o­nal­ity the­sis + con­ver­gent in­stru­men­tal val­ues sug­gests doom for hu­man­ity (p115-6)

    1. First mover ad­van­tage im­plies the AI is in a po­si­tion to do what it wants

    2. Orthog­o­nal­ity the­sis im­plies that what it wants could be all sorts of things

    3. In­stru­men­tal con­ver­gence the­sis im­plies that re­gard­less of its wants, it will try to ac­quire re­sources and elimi­nate threats

    4. Hu­mans have re­sources and may be threats

    5. Therefore an AI in a position to do what it wants is likely to want to take our resources and eliminate us, i.e. doom for humanity.

  2. One kind of re­sponse: why wouldn’t the mak­ers of the AI be ex­tremely care­ful not to de­velop and re­lease dan­ger­ous AIs, or re­lat­edly, why wouldn’t some­one else shut the whole thing down? (p116)

  3. It is hard to ob­serve whether an AI is dan­ger­ous via its be­hav­ior at a time when you could turn it off, be­cause AIs have con­ver­gent in­stru­men­tal rea­sons to pre­tend to be safe, even if they are not. If they ex­pect their minds to be surveilled, even ob­serv­ing their thoughts may not help. (p117)

  4. The treach­er­ous turn: while weak, an AI be­haves co­op­er­a­tively. When the AI is strong enough to be un­stop­pable it pur­sues its own val­ues. (p119)

  5. We might expect AIs to become safer as they get smarter initially, when most of the risks come from crashing self-driving cars or misfiring drones, and then to get much less safe as they get too smart. (p117)

  6. One can imag­ine a sce­nario where there is lit­tle so­cial im­pe­tus for safety (p117-8): alarmists will have been wrong for a long time, smarter AI will have been safer for a long time, large in­dus­tries will be in­vested, an ex­cit­ing new tech­nique will be hard to set aside, use­less safety rit­u­als will be available, and the AI will look co­op­er­a­tive enough in its sand­box.

  7. The con­cep­tion of de­cep­tion: that mo­ment when the AI re­al­izes that it should con­ceal its thoughts (foot­note 2, p282)

Another view


This is all superficially plausible. It is indeed conceivable that an intelligent system, capable of strategic planning, could take such treacherous turns. And a sufficiently time-indifferent AI could play a “long game” with us, i.e. it could conceal its true intentions and abilities for a very long time. Nevertheless, accepting this has some pretty profound epistemic costs. It seems to suggest that no amount of empirical evidence could ever rule out the possibility of a future AI taking a treacherous turn. In fact, it’s even worse than that. If we take it seriously, then it is possible that we have already created an existentially threatening AI. It’s just that it is concealing its true intentions and powers from us for the time being.

I don’t quite know what to make of this. Bostrom is a pretty rational, Bayesian guy. I tend to think he would say that if all the evidence suggests that our AI is non-threatening (and if there is a lot of that evidence), then we should heavily discount the probability of a treacherous turn. But he doesn’t seem to add that qualification in the chapter. He seems to think the threat of an existential catastrophe from a superintelligent AI is pretty serious. So I’m not sure whether he embraces the epistemic costs I just mentioned or not.


1. Danaher also made a nice diagram of the case for doom, and its relationship with the treacherous turn.

2. History

According to Luke Muehlhauser’s timeline of AI risk ideas, the treacherous turn idea for AIs has been around since at least 1977, when a fictional worm did it:

1977: Self-im­prov­ing AI could stealthily take over the in­ter­net; con­ver­gent in­stru­men­tal goals in AI; the treach­er­ous turn. Though the con­cept of a self-prop­a­gat­ing com­puter worm was in­tro­duced by John Brun­ner’s The Shock­wave Rider (1975), Thomas J. Ryan’s novel The Ado­les­cence of P-1 (1977) tells the story of an in­tel­li­gent worm that at first is merely able to learn to hack novel com­puter sys­tems and use them to prop­a­gate it­self, but later (1) has novel in­sights on how to im­prove its own in­tel­li­gence, (2) de­vel­ops con­ver­gent in­stru­men­tal sub­goals (see Bostrom 2012) for self-preser­va­tion and re­source ac­qui­si­tion, and (3) learns the abil­ity to fake its own death so that it can grow its pow­ers in se­cret and later en­gage in a “treach­er­ous turn” (see Bostrom forth­com­ing) against hu­mans.

3. The role of the premises

Bostrom’s ar­gu­ment for doom has one premise that says AI could care about al­most any­thing, then an­other that says re­gard­less of what an AI cares about, it will do ba­si­cally the same ter­rible things any­way. (p115) Do these sound a bit strange to­gether to you? Why do we need the first, if fi­nal val­ues don’t tend to change in­stru­men­tal goals any­way?

It seems the immediate reason is that an AI with values we like would not have the convergent goal of taking all our stuff and killing us. That is, the values we want an AI to have are some of those rare values that don’t lead to destructive instrumental goals. Why is this? Because we (and thus the AI) care about the activities the resources would be grabbed from. If the resources were currently being used for anything we didn’t care about, then our values would also suggest grabbing resources, and would look similar to all of the other values. The difference that makes our values special here is just that most resources are already being used for them somewhat.

4. Signaling

It is hard to tell apart a safe and an un­safe AI, be­cause both would like to look safe. This is a very com­mon prob­lem in hu­man in­ter­ac­tions. For in­stance, it can be non­triv­ial to tell a gen­uine lover from a gold dig­ger, a busi­ness­man from a con­man, and an ex­pert from a crank. All of them want to look like the de­sir­able sort. Par­tic­u­larly similar to the AI case is that of hiring a new em­ployee for a trial pe­riod. You will some­times find that the em­ployee’s val­ues are much bet­ter al­igned dur­ing the trial pe­riod, and then they un­dergo a ‘treach­er­ous turn’ once they have been hired more thor­oughly.
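To see why watching behavior helps so little here, a small Bayesian sketch may be useful. This is only an illustration: the function name, the probabilities, and the assumption that observations are independent are all mine, not anything from the book. It shows that cooperative behavior quickly rules out an unsafe AI that sometimes slips up, but leaves the prior untouched for one that always cooperates while weak.

```python
def posterior_unsafe(prior_unsafe, p_coop_if_unsafe, p_coop_if_safe, n_observations):
    """Posterior probability that the AI is unsafe, after seeing it cooperate
    n_observations times (observations assumed independent, for illustration)."""
    likelihood_unsafe = p_coop_if_unsafe ** n_observations
    likelihood_safe = p_coop_if_safe ** n_observations
    numerator = prior_unsafe * likelihood_unsafe
    return numerator / (numerator + (1 - prior_unsafe) * likelihood_safe)


# A careless unsafe AI that misbehaves 10% of the time is soon ruled out.
careless = posterior_unsafe(0.1, 0.9, 1.0, 100)

# An unsafe AI that always cooperates while weak looks identical to a safe one,
# so a hundred good observations leave the prior where it started.
deceptive = posterior_unsafe(0.1, 1.0, 1.0, 100)
```

When both kinds of AI are equally likely to look safe, the likelihood ratio is one, and no amount of such evidence moves the estimate.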

‘Costly signaling’ is a general purpose solution to this problem, which works some of the time. The basic idea is this. Everyone has instrumental reasons to look like the good kind of person, but perhaps their reasons aren’t exactly as strong as one another’s, or the strength of their desire is harder to act on for one group than the other. So we can construct a set of options that will lead the different types of people to choose differently, even though they know this will set them apart. For instance, while an honest businessman and a conman would both like to say ‘I’m an honest businessman’, and to behave honestly if you watch them for five minutes, a conman would benefit less from an expensive shopfront than an honest businessman would, because his reputation is less valuable, so a brand is a signal of being honest.

Similarly, we could per­haps de­sign situ­a­tions where most harm­ful AIs would have a rea­son­able in­cen­tive to re­veal them­selves. This is difficult be­cause so much is at stake, com­pared to the cost of the sig­nal. But I haven’t thought enough about it to rule it out, and I haven’t seen any­one else do so ei­ther.
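A toy version of the costly signaling argument can be written down in a few lines. This is a rough sketch under my own assumptions: the function names are hypothetical, the numbers are invented, and each type is assumed to buy the signal exactly when being trusted is worth more to it than the signal costs.

```python
def buys_signal(value_of_being_trusted, signal_cost):
    """A type pays for the signal only if being trusted is worth more than it costs."""
    return value_of_being_trusted > signal_cost


def signal_separates(honest_value, deceiver_value, signal_cost):
    """True if the honest type buys the signal and the deceiver does not."""
    return buys_signal(honest_value, signal_cost) and not buys_signal(
        deceiver_value, signal_cost
    )


# Shopfront case: reputation is worth more to the honest businessman.
shopfront_works = signal_separates(honest_value=100, deceiver_value=20, signal_cost=50)

# Too cheap a signal: both types buy it, so it tells us nothing.
cheap_signal_works = signal_separates(honest_value=100, deceiver_value=20, signal_cost=10)

# AI case: if an unsafe AI gains enormously from being trusted,
# a modest cost does not deter it from sending the same signal.
ai_signal_works = signal_separates(honest_value=100, deceiver_value=1_000_000, signal_cost=50)
```

In this toy model the signal only works when its cost falls between what trust is worth to the two types. That is the difficulty in the AI case: with so much at stake for an unsafe AI, it is hard to find a cost high enough to put it off but low enough for a safe one to pay.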

5. When would the ‘con­cep­tion of de­cep­tion’ take place?

Below the level of the best hu­mans pre­sum­ably, since we have already thought of all this.

6. Surveillance of the mind

Bostrom is not op­ti­mistic that AI re­searchers could pro­tect much against risks by watch­ing AI men­tal pro­cesses di­rectly for signs of de­cep­tion, rapid progress, or un­de­sir­able schem­ing. He says if the AI is smart it will re­al­ize this might hap­pen, and hide the offend­ing thoughts.

This seems an open ques­tion to me, for sev­eral rea­sons:

  1. Mak­ing your men­tal con­tents look in­nocu­ous while main­tain­ing their se­man­tic con­tent sounds po­ten­tially very hard

  2. Espe­cially for a crea­ture which has only just be­come smart enough to re­al­ize it should treach­er­ously turn

  3. From the AI’s per­spec­tive, even if it is smart, surveillance could seem fairly un­likely, es­pe­cially if we de­ceive it about its surroundings

As a con­se­quence of 2, it seems bet­ter if the ‘con­cep­tion of de­cep­tion’ comes ear­lier.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser’s list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. How trans­par­ent are AI minds likely to be? Should we ex­pect to be able to de­tect de­cep­tion? What are the an­swers to these ques­tions for differ­ent spe­cific ar­chi­tec­tures and meth­ods? This might be rele­vant.

  2. Are there other good ways to filter AIs with cer­tain de­sir­able goals from oth­ers? e.g. by offer­ing them choices that would filter them.

If you are in­ter­ested in any­thing like this, you might want to men­tion it in the com­ments, and see whether other peo­ple have use­ful thoughts.

How to proceed

This has been a col­lec­tion of notes on the chap­ter. The most im­por­tant part of the read­ing group though is dis­cus­sion, which is in the com­ments sec­tion. I pose some ques­tions for you there, and I in­vite you to add your own. Please re­mem­ber that this group con­tains a va­ri­ety of lev­els of ex­per­tise: if a line of dis­cus­sion seems too ba­sic or too in­com­pre­hen­si­ble, look around for one that suits you bet­ter!

Next week, we will talk about ‘malignant failure modes’ (as opposed presumably to worse failure modes). To prepare, read “Malignant failure modes” from Chapter 8. The discussion will go live at 6pm Pacific time next Monday, December 1. Sign up to be notified here.