Superintelligence 23: Coherent extrapolated volition

This is part of a weekly reading group on Nick Bostrom’s book, Superintelligence. For more information about the group, and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI’s reading guide.

Welcome. This week we discuss the twenty-third section in the reading guide: Coherent extrapolated volition.

This post summarizes the section, offers a few relevant notes, and suggests ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.

There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable and I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).

Reading: “The need for...” and “Coherent extrapolated volition” from Chapter 13


Summary

  1. Problem: we are morally and epistemologically flawed, and we would like to make an AI without locking in our own flaws forever. How can we do this?

  2. Indirect normativity: offload cognitive work to the superintelligence by specifying our values indirectly and having it transform them into a more usable form.

  3. Principle of epistemic deference: a superintelligence is more likely to be correct than we are on most topics, most of the time. Therefore, we should defer to the superintelligence where feasible.

  4. Coherent extrapolated volition (CEV): a goal of fulfilling what humanity would agree that it wants, if given much longer to think about it, in more ideal circumstances. CEV is a popular proposal for what we should design an AI to do.

  5. Virtues of CEV:

    1. It avoids the perils of specification: it is very hard to specify explicitly what we want without causing unintended and undesirable consequences. CEV specifies the source of our values, instead of what we think they are, which appears to be easier.

    2. It encapsulates moral growth: there are reasons to believe that our current moral beliefs are not the best (by our own lights), and that we would revise some of them if we thought about it. Specifying our values now risks locking in wrong values, whereas CEV effectively gives us longer to think about our values.

    3. It avoids ‘hijacking the destiny of humankind’: it allows the responsibility for the future of mankind to remain with mankind, instead of with, perhaps, a small group of programmers.

    4. It avoids creating a motive for modern-day humans to fight over the initial dynamic: a commitment to CEV would mean the creators of AI would not have much more influence over the future of the universe than others, reducing the incentive to race or fight. This is even more so because a person who believes their views are correct should be confident that CEV will come to reflect their views, so they do not even need to split the influence with others.

    5. It keeps humankind ‘ultimately in charge of its own destiny’: it allows for a wide variety of arrangements in the long run, rather than necessitating paternalistic AI oversight of everything.

  6. CEV as described here is merely a schematic. For instance, it does not specify which people are included in ‘humanity’.

    Another view

    Part of Olle Häggström’s extended review of Superintelligence expresses a common concern—that human values can’t be faithfully turned into anything coherent:

    Human values exhibit, at least on the surface, plenty of incoherence. That much is hardly controversial. But what if the incoherence goes deeper, and is fundamental in such a way that any attempt to untangle it is bound to fail? Perhaps any search for our CEV is bound to lead to more and more glaring contradictions? Of course any value system can be modified into something coherent, but perhaps not all value systems can be so modified without sacrificing some of their most central tenets? And perhaps human values have that property?

    Let me offer a candidate for what such a fundamental contradiction might consist in. Imagine a future where all humans are permanently hooked up to life-support machines, lying still in beds with no communication with each other, but with electrodes connected to the pleasure centres of our brains in such a way as to constantly give us the most pleasurable experiences possible (given our brain architectures). I think nearly everyone would attach a low value to such a future, deeming it absurd and unacceptable (thus agreeing with Robert Nozick). The reason we find it unacceptable is that in such a scenario we no longer have anything to strive for, and therefore no meaning in our lives. So we want instead a future where we have something to strive for. Imagine such a future F1. In F1 we have something to strive for, so there must be something missing in our lives. Now let F2 be similar to F1, the only difference being that that something is no longer missing in F2, so almost by definition F2 is better than F1 (because otherwise that something wouldn’t be worth striving for). And as long as there is still something worth striving for in F2, there’s an even better future F3 that we should prefer. And so on. What if any such procedure quickly takes us to an absurd and meaningless scenario with life-support machines and electrodes, or something along those lines? Then no future will be good enough for our preferences, so not even a superintelligence will have anything to offer us that aligns acceptably with our values.

    Now, I don’t know how serious this particular problem is. Perhaps there is some way to gently circumvent its contradictions. But even then, there might be some other fundamental inconsistency in our values—one that cannot be circumvented. If that is the case, it will throw a spanner in the works of CEV. And perhaps not only for CEV, but for any serious attempt to set up a long-term future for humanity that aligns with our values, with or without a superintelligence.


    Notes

    1. While we are on the topic of critiques, here is a better list:

      1. Human values may not be coherent (Olle Häggström above, Marcello; Eliezer responds in section 6, question 9)

      2. The values of a collection of humans in combination may be even less coherent. Arrow’s impossibility theorem suggests reasonable aggregation is hard, but it only applies if values are ordinal, which is not obvious.

      3. Even if human values are complex, this doesn’t mean complex outcomes are required—maybe with some thought we could specify the right outcomes, and don’t need an indirect means like CEV (Wei Dai)

      4. The moral ‘progress’ we see might actually just be moral drift that we should try to avoid. CEV is designed to allow this change, which might be bad. Ideally, the CEV circumstances would be optimized for deliberation and not for other forces that might change values, but perhaps deliberation itself can’t proceed without our values being changed (Cousin_it)

      5. Individuals will probably not be a stable unit in the future, so it is unclear how to weight different people’s inputs to CEV. Or, to be concrete, what if Dr Evil can create trillions of emulated copies of himself to go into the CEV population? (Wei Dai)

      6. It is not clear that extrapolating everyone’s volition is better than extrapolating a single person’s volition, which may be easier. If you want to take into account others’ preferences, then your own volition is fine (it will do that), and if you don’t, then why would you be using CEV?

      7. A purported advantage of CEV is that it makes conflict less likely. But if a group is disposed to honor everyone else’s wishes, they will not conflict anyway, and if they aren’t disposed to honor everyone’s wishes, why would they favor CEV? CEV doesn’t provide any additional means to commit to cooperative behavior. (Cousin_it)

      8. More in Coherent Extrapolated Volition, section 6, question 9

    2. Luke Muehlhauser has written a list of resources you might want to read if you are interested in this topic. It suggests these main sources:

    He also discusses some closely related philosophical conversations:

    • Reflective equilibrium. Yudkowsky’s proposed extrapolation works analogously to what philosophers call ‘reflective equilibrium.’ The most thorough work here is the 1996 book by Daniels, and there have been lots of papers, but this genre is only barely relevant for CEV...

    • Full-information accounts of value and ideal observer theories. This is what philosophers call theories of value that talk about ‘what we would want if we were fully informed, etc.’ or ‘what a perfectly informed agent would want’, as CEV does. There’s some literature on this, but it’s only marginally relevant to CEV...

    Muehlhauser later wrote at more length about the relationship of CEV to ideal observer theories, with Chris Williamson.

    3. This chapter is concerned with avoiding locking in the wrong values. One might wonder exactly what this ‘locking in’ is, and why AI will cause values to be ‘locked in’ while having children, for instance, does not. Here is my take: there are two issues—the extent to which values change, and the extent to which one can personally control that change. At the moment, values change plenty and we can’t control the change. Perhaps in the future, technology will allow the change to be controlled (this is the hope with value loading). Then, if anyone can control values, they probably will, because values are valuable to control. In particular, if an AI can control its own values, it will avoid having them change. Thus in the future, values will probably be controlled, and will not change. It is not clear that we will lock in values as soon as we have artificial intelligence—perhaps an artificial intelligence will be built whose implicit values change randomly—but if we are successful we will control values, and thus lock them in, and if we are even more successful we will lock in values that are actually desirable for us. Paul Christiano has a post on this topic, which I probably pointed you to before.

    4. Paul Christiano has also written about how to concretely implement the extrapolation of a single person’s volition, in the indirect normativity scheme described in box 12 (p. 199-200). You probably saw it then, but I draw it to your attention here because the extrapolation process is closely related to CEV and is concrete. He also has a recent proposal for ‘implementing our considered judgment’.
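    The aggregation worry in critique 2 above can be made concrete with the classic Condorcet cycle: three voters, each with a perfectly coherent (transitive) ranking over three options, whose pairwise-majority aggregate is nonetheless cyclic. The sketch below uses made-up voters and options purely for illustration; it is not drawn from Bostrom or Arrow's own presentation.

```python
# Condorcet cycle: each voter's ranking is transitive, but the
# group's pairwise-majority preference is A > B > C > A — a cycle.
voters = [
    ["A", "B", "C"],  # voter 1 prefers A > B > C
    ["B", "C", "A"],  # voter 2 prefers B > C > A
    ["C", "A", "B"],  # voter 3 prefers C > A > B
]

def majority_prefers(x, y):
    """True if a strict majority of voters ranks x above y."""
    wins = sum(1 for ranking in voters if ranking.index(x) < ranking.index(y))
    return wins > len(voters) / 2

# Collect every pairwise majority preference among the options.
edges = [(x, y) for x in "ABC" for y in "ABC"
         if x != y and majority_prefers(x, y)]

print(edges)  # [('A', 'B'), ('B', 'C'), ('C', 'A')] — a majority cycle
```

So even if each individual's values are internally coherent, a natural aggregation rule can yield a collective preference with no best option, which is the flavor of difficulty Arrow's theorem generalizes (for ordinal preferences).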

    In-depth investigations

    If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser’s list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.

    1. Specify a method for instantiating CEV, given some assumptions about available technology.

    2. In practice, to what degree do human values and preferences converge upon learning new facts? To what degree has this happened in history? (Nobody values the will of Zeus anymore, presumably because we all learned the truth of Zeus’ non-existence. But perhaps such examples don’t tell us much.) See also philosophical analyses of the issue, e.g. Sobel (1999).

    3. Are changes in specific human preferences (over a lifetime or many lifetimes) better understood as changes in underlying values, or as changes in instrumental ways to achieve those values (driven by belief change or additional deliberation)?

    4. How might democratic systems deal with new agents being readily created?

    If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.

    How to proceed

    This has been a collection of notes on the chapter. The most important part of the reading group, though, is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!

    Next week, we will talk about more ideas for giving an AI desirable values. To prepare, read “Morality models” and “Do what I mean” from Chapter 13. The discussion will go live at 6pm Pacific time next Monday, 23 February. Sign up to be notified here.