Superintelligence 13: Capability control methods

This is part of a weekly reading group on Nick Bostrom’s book, Superintelligence. For more information about the group, and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI’s reading guide.

Wel­come. This week we dis­cuss the thir­teenth sec­tion in the read­ing guide: ca­pa­bil­ity con­trol meth­ods. This cor­re­sponds to the start of chap­ter nine.

This post summarizes the section and offers a few relevant notes and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable, and where I remember them, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Read­ing: “Two agency prob­lems” and “Ca­pa­bil­ity con­trol meth­ods” from Chap­ter 9


Summary

  1. If the default outcome is doom, how can we avoid it? (p127)

  2. We can di­vide this ‘con­trol prob­lem’ into two parts:

    1. The first prin­ci­pal-agent prob­lem: the well known prob­lem faced by a spon­sor want­ing an em­ployee to fulfill their wishes (usu­ally called ‘the prin­ci­pal agent prob­lem’)

    2. The sec­ond prin­ci­pal-agent prob­lem: the emerg­ing prob­lem of a de­vel­oper want­ing their AI to fulfill their wishes

  3. How to solve the second problem? We can’t rely on behavioral observation (as seen in week 11). Two other options are ‘capability control methods’ and ‘motivation selection methods’. We see the former this week, and the latter next week.

  4. Ca­pa­bil­ity con­trol meth­ods: avoid­ing bad out­comes through limit­ing what an AI can do. (p129)

  5. Some ca­pa­bil­ity con­trol meth­ods:

    1. Box­ing: min­i­mize in­ter­ac­tion be­tween the AI and the out­side world. Note that the AI must in­ter­act with the world to be use­ful, and that it is hard to elimi­nate small in­ter­ac­tions. (p129)

    2. Incentive methods: set up the AI’s environment such that it is in the AI’s interest to cooperate. For example, a social environment with punishment or social repercussions often achieves this for contemporary agents. One could also design a reward system, perhaps with cryptographic rewards (so that the AI could not wirehead) or heavily discounted rewards (so that long-term plans are not worth the short-term risk of detection). (p131)

      • An­thropic cap­ture: an AI thinks it might be in a simu­la­tion, and so tries to be­have as will be re­warded by simu­la­tors (box 8; p134)

    3. Stunting: limit the AI’s capabilities. This may be hard to do to a degree that avoids danger and still leaves the AI useful. One option is to limit the AI’s information. However, a strong AI may infer much from apparently little access to information. (p135)

    4. Tripwires: test the system without its knowledge, and shut it down if it crosses some boundary. This might be combined with ‘honey pots’ that tempt undesirable AIs into taking an action that would reveal them. Tripwires could test behavior, ability, or content. (p137)
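
The point about heavily discounted rewards can be made concrete with a toy calculation. The sketch below is my own illustration, not a model from the book: it assumes a hypothetical agent that earns a fixed reward per step while it cooperates, and that can instead pursue a long-term plan with a large payoff after some delay, with a fixed chance per step of being detected and cut off from all further reward. Every name and number here (`cooperate_value`, `defect_value`, the payoff, the delay, the detection probability) is an assumption chosen for illustration.

```python
def cooperate_value(reward_per_step: float, discount: float) -> float:
    """Discounted value of collecting reward_per_step forever (geometric series)."""
    return reward_per_step / (1.0 - discount)


def defect_value(reward_per_step: float, payoff: float, delay_steps: int,
                 detection_prob: float, discount: float) -> float:
    """Expected discounted value of a hypothetical long-term defection plan.

    Assumptions (illustrative only): the agent keeps earning reward_per_step
    while undetected; each step it is detected with probability detection_prob,
    after which it earns nothing; if it survives delay_steps, it gets payoff.
    """
    survive = 1.0 - detection_prob
    total = 0.0
    for t in range(delay_steps):
        # Reward at step t counts only if the agent has not been caught yet.
        total += (survive * discount) ** t * reward_per_step
    # The big payoff arrives only if the plan runs undetected to the end.
    total += (survive * discount) ** delay_steps * payoff
    return total


# Hypothetical numbers: a huge payoff that takes 20 steps to reach.
plan = dict(reward_per_step=1.0, payoff=10_000.0, delay_steps=20, detection_prob=0.1)

for discount in (0.5, 0.99):
    stay = cooperate_value(plan["reward_per_step"], discount)
    go = defect_value(discount=discount, **plan)
    choice = "defect" if go > stay else "cooperate"
    print(f"discount={discount}: cooperate={stay:.3f}, defect={go:.3f} -> {choice}")
```

Under these assumed numbers, the steeply discounting agent (0.5) prefers to cooperate, while the patient one (0.99) prefers the long-term plan. This says nothing about whether a real AI’s preferences could be shaped this way; it only shows why the discount rate matters for the argument.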

Another view

Brian Clegg re­views the book mostly fa­vor­ably, but isn’t con­vinced that con­trol­ling an AI via merely turn­ing it off should be so hard:

I also think a cou­ple of the fun­da­men­tals aren’t cov­ered well enough, but pretty much as­sumed. One is that it would be im­pos­si­ble to con­tain and re­strict such an AI. Although some effort is put into this, I’m not sure there is enough thought put into the ba­sics of ways you can pull the plug man­u­ally – if nec­es­sary by shut­ting down the power sta­tion that pro­vides the AI with elec­tric­ity.

Kevin Kelly also ap­par­ently doubts that AI will sub­stan­tially im­pede efforts to mod­ify it:

...We’ll re­pro­gram the AIs if we are not satis­fied with their perfor­mance...

...This is an en­g­ineer­ing prob­lem. So far as I can tell, AIs have not yet made a de­ci­sion that its hu­man cre­ators have re­gret­ted. If they do (or when they do), then we change their al­gorithms. If AIs are mak­ing de­ci­sions that our so­ciety, our laws, our moral con­sen­sus, or the con­sumer mar­ket, does not ap­prove of, we then should, and will, mod­ify the prin­ci­ples that gov­ern the AI, or cre­ate bet­ter ones that do make de­ci­sions we ap­prove. Of course ma­chines will make “mis­takes,” even big mis­takes – but so do hu­mans. We keep cor­rect­ing them. There will be tons of scrutiny on the ac­tions of AI, so the world is watch­ing. How­ever, we don’t have uni­ver­sal con­sen­sus on what we find ap­pro­pri­ate, so that is where most of the fric­tion about them will come from. As we de­cide, our AI will de­cide...

This may be re­lated to his view that AI is un­likely to mod­ify it­self (from fur­ther down the same page):

3. Re­pro­gram­ming them­selves, on their own, is the least likely of many sce­nar­ios.

The great fear pumped up by some, though, is that as AI gain our con­fi­dence in mak­ing de­ci­sions, they will some­how pre­vent us from al­ter­ing their de­ci­sions. The fear is they lock us out. They go rogue. It is very difficult to imag­ine how this hap­pens. It seems highly im­prob­a­ble that hu­man en­g­ineers would pro­gram an AI so that it could not be al­tered in any way. That is pos­si­ble, but so im­prac­ti­cal. That hob­ble does not even serve a bad ac­tor. The usual scary sce­nario is that an AI will re­pro­gram it­self on its own to be un­alter­able by out­siders. This is con­jec­tured to be a self­ish move on the AI’s part, but it is un­clear how an un­alter­able pro­gram is an ad­van­tage to an AI. It would also be an in­cred­ible achieve­ment for a gang of hu­man en­g­ineers to cre­ate a sys­tem that could not be hacked. Still it may be pos­si­ble at some dis­tant time, but it is only one of many pos­si­bil­ities. An AI could just as likely de­cide on its own to let any­one change it, in open source mode. Or it could de­cide that it wanted to merge with hu­man will power. Why not? In the only ex­am­ple we have of an in­tro­spec­tive self-aware in­tel­li­gence (ho­minids), we have found that evolu­tion seems to have de­signed our minds to not be eas­ily self-re­pro­grammable. Ex­cept for a few yo­gis, you can’t go in and change your core men­tal code eas­ily. There seems to be an evolu­tion­ary dis­ad­van­tage to be­ing able to eas­ily muck with your ba­sic op­er­at­ing sys­tem, and it is pos­si­ble that AIs may need the same self-pro­tec­tion. We don’t know. But the pos­si­bil­ity they, on their own, de­cide to lock out their part­ners (and doc­tors) is just one of many pos­si­bil­ities, and not nec­es­sar­ily the most prob­a­ble one.


Notes

1. What do you do with a bad AI once it is under your control?

Note that ca­pa­bil­ity con­trol doesn’t nec­es­sar­ily solve much: box­ing, stunt­ing and trip­wires seem to just stall a su­per­in­tel­li­gence rather than provide means to safely use one to its full ca­pac­ity. This leaves the con­trol­led AI to be over­taken by some other un­con­strained AI as soon as some­one else isn’t so care­ful. In this way, ca­pa­bil­ity con­trol meth­ods seem much like slow­ing down AI re­search: helpful in the short term while we find bet­ter solu­tions, but not in it­self a solu­tion to the prob­lem.

However, this might be too pessimistic. An AI whose capabilities are under control might either be almost as useful as an uncontrolled AI that shares your goals (if interacted with in the right way), or at least be helpful in getting to a more stable situation.

Paul Chris­ti­ano out­lines a scheme for safely us­ing an un­friendly AI to solve some kinds of prob­lems. We have both blogged on gen­eral meth­ods for get­ting use­ful work from ad­ver­sar­ial agents, which is re­lated.

2. Cryp­to­graphic boxing

Paul Chris­ti­ano de­scribes a way to stop an AI in­ter­act­ing with the en­vi­ron­ment us­ing a cryp­to­graphic box.

3. Philo­soph­i­cal Disquisitions

Dana­her again sum­ma­rizes the chap­ter well. Read it if you want a differ­ent de­scrip­tion of any of the ideas, or to re­fresh your mem­ory. He also pro­vides a table of the meth­ods pre­sented in this chap­ter.

4. Some rele­vant fiction

That Alien Mes­sage by Eliezer Yudkowsky

5. Con­trol through so­cial integration

Robin Hanson argues that it matters more that a population of AIs is integrated into our social institutions, and that they keep the peace among themselves through the same institutions by which we keep the peace among ourselves, than whether they have the right values. He thinks this is why you trust your neighbors, rather than because you are confident that they have the same values as you. He has several followup posts.

6. More mis­cel­la­neous writ­ings on these topics

LessWrong wiki on AI box­ing. Arm­strong et al on con­trol­ling and us­ing an or­a­cle AI. Ro­man Yam­polskiy on ‘leakproofing’ the sin­gu­lar­ity. I have not nec­es­sar­ily read these.

In-depth investigations

If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser’s list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

  1. Choose any con­trol method and work out the de­tails bet­ter. For in­stance:

    1. Could one con­struct a cryp­to­graphic box for an un­trusted au­tonomous sys­tem?

    2. In­ves­ti­gate steep tem­po­ral dis­count­ing as an in­cen­tives con­trol method for an un­trusted AGI.

  2. Are there other ca­pa­bil­ity con­trol meth­ods we could add to the list?

  3. De­vise uses for a mal­i­cious but con­strained AI.

  4. How much pres­sure is there likely to be to de­velop AI which is not con­trol­led?

  5. If existing AI methods made unexpected progress and were heading for human-level capability soon, what precautions should we take now?

If you are in­ter­ested in any­thing like this, you might want to men­tion it in the com­ments, and see whether other peo­ple have use­ful thoughts.

How to proceed

This has been a collection of notes on the chapter. The most important part of the reading group, though, is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

Next week, we will talk about ‘mo­ti­va­tion se­lec­tion meth­ods’. To pre­pare, read “Mo­ti­va­tion se­lec­tion meth­ods” and “Synop­sis” from Chap­ter 9. The dis­cus­sion will go live at 6pm Pa­cific time next Mon­day 15th De­cem­ber. Sign up to be no­tified here.