Superintelligence 23: Coherent extrapolated volition
This is part of a weekly reading group on Nick Bostrom’s book, Superintelligence. For more information about the group, and an index of posts so far, see the announcement post. For the schedule of future topics, see MIRI’s reading guide.
Welcome. This week we discuss the twenty-third section in the reading guide: Coherent extrapolated volition.
This post summarizes the section, and offers a few relevant notes, and ideas for further investigation. Some of my own thoughts and questions for discussion are in the comments.
There is no need to proceed in order through this post, or to look at everything. Feel free to jump straight to the discussion. Where applicable, and where I remember, page numbers indicate the rough part of the chapter that is most related (not necessarily that the chapter is being cited for the specific claim).
Reading: “The need for...” and “Coherent extrapolated volition” from Chapter 13
Problem: we are morally and epistemologically flawed, and we would like to make an AI without locking in our own flaws forever. How can we do this?
Indirect normativity: offload cognitive work to the superintelligence, by specifying our values indirectly and having it transform them into a more usable form.
Principle of epistemic deference: a superintelligence is more likely to be correct than we are on most topics, most of the time. Therefore, we should defer to the superintelligence where feasible.
Coherent extrapolated volition (CEV): a goal of fulfilling what humanity would agree that they want, if given much longer to think about it, in more ideal circumstances. CEV is a popular proposal for what we should design an AI to do.
Virtues of CEV:
It avoids the perils of specification: it is very hard to specify explicitly what we want, without causing unintended and undesirable consequences. CEV specifies the source of our values, instead of what we think they are, which appears to be easier.
It encapsulates moral growth: there are reasons to believe that our current moral beliefs are not the best (by our own lights) and we would revise some of them, if we thought about it. Specifying our values now risks locking in wrong values, whereas CEV effectively gives us longer to think about our values.
It avoids ‘hijacking the destiny of humankind’: it allows the responsibility for the future of mankind to remain with mankind, instead of perhaps a small group of programmers.
It avoids creating a motive for modern-day humans to fight over the initial dynamic: a commitment to CEV would mean the creators of AI would not have much more influence over the future of the universe than others, reducing the incentive to race or fight. This is even more so because a person who believes that their views are correct should be confident that CEV will come to reflect their views, so they do not even need to split the influence with others.
It keeps humankind ‘ultimately in charge of its own destiny’: it allows for a wide variety of arrangements in the long run, rather than necessitating paternalistic AI oversight of everything.
CEV as described here is merely a schematic. For instance, it does not specify which people are included in ‘humanity’.
Part of Olle Häggström’s extended review of Superintelligence expresses a common concern—that human values can’t be faithfully turned into anything coherent:
Human values exhibit, at least on the surface, plenty of incoherence. That much is hardly controversial. But what if the incoherence goes deeper, and is fundamental in such a way that any attempt to untangle it is bound to fail? Perhaps any search for our CEV is bound to lead to more and more glaring contradictions? Of course any value system can be modified into something coherent, but perhaps not all value systems can be so modified without sacrificing some of their most central tenets? And perhaps human values have that property?
Let me offer a candidate for what such a fundamental contradiction might consist in. Imagine a future where all humans are permanently hooked up to life-support machines, lying still in beds with no communication with each other, but with electrodes connected to the pleasure centra of our brains in such a way as to constantly give us the most pleasurable experiences possible (given our brain architectures). I think nearly everyone would attach a low value to such a future, deeming it absurd and unacceptable (thus agreeing with Robert Nozick). The reason we find it unacceptable is that in such a scenario we no longer have anything to strive for, and therefore no meaning in our lives. So we want instead a future where we have something to strive for. Imagine such a future F1. In F1 we have something to strive for, so there must be something missing in our lives. Now let F2 be similar to F1, the only difference being that that something is no longer missing in F2, so almost by definition F2 is better than F1 (because otherwise that something wouldn’t be worth striving for). And as long as there is still something worth striving for in F2, there’s an even better future F3 that we should prefer. And so on. What if any such procedure quickly takes us to an absurd and meaningless scenario with life-support machines and electrodes, or something along those lines? Then no future will be good enough for our preferences, so not even a superintelligence will have anything to offer us that aligns acceptably with our values.
Now, I don’t know how serious this particular problem is. Perhaps there is some way to gently circumvent its contradictions. But even then, there might be some other fundamental inconsistency in our values—one that cannot be circumvented. If that is the case, it will throw a spanner in the works of CEV. And perhaps not only for CEV, but for any serious attempt to set up a long-term future for humanity that aligns with our values, with or without a superintelligence.
1. While we are on the topic of critiques, here is a better list:
The values of a collection of humans in combination may be even less coherent. Arrow’s impossibility theorem suggests reasonable aggregation is hard, but this only applies if values are ordinal, which is not obvious.
Even if human values are complex, this doesn’t mean complex outcomes are required—maybe with some thought we could specify the right outcomes, and don’t need an indirect means like CEV (Wei Dai)
The moral ‘progress’ we see might actually just be moral drift that we should try to avoid. CEV is designed to allow this change, which might be bad. Ideally, the CEV circumstances would be optimized for deliberation and not for other forces that might change values, but perhaps deliberation itself can’t proceed without our values being changed (Cousin_it)
Individuals will probably not be a stable unit in the future, so it is unclear how to weight different people’s inputs to CEV. Or to be concrete, what if Dr Evil can create trillions of emulated copies of himself to go into the CEV population? (Wei Dai)
It is not clear that extrapolating everyone’s volition is better than extrapolating a single person’s volition, which may be easier. If you want to take into account others’ preferences, then your own volition is fine (it will do that), and if you don’t, then why would you be using CEV?
A purported advantage of CEV is that it makes conflict less likely. But if a group is disposed to honor everyone else’s wishes, they will not conflict anyway, and if they aren’t disposed to honor everyone’s wishes, why would they favor CEV? CEV doesn’t provide any additional means to commit to cooperative behavior. (Cousin_it)
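The first critique in this list mentions Arrow’s impossibility theorem. As a rough illustration of the kind of incoherence at stake, here is a minimal sketch of the classic Condorcet paradox, which is related to Arrow’s theorem but is not the theorem itself. The three voters, the options A, B and C, and the function names are all hypothetical, chosen only for illustration. Each voter has a perfectly consistent ordinal ranking, yet pairwise majority voting yields a cycle.

```python
from itertools import combinations

# Hypothetical ordinal rankings (best option first) for three illustrative voters.
rankings = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]


def majority_prefers(x, y, rankings):
    """Return True if a strict majority of voters rank x above y."""
    votes_for_x = sum(r.index(x) < r.index(y) for r in rankings)
    return votes_for_x > len(rankings) / 2


def pairwise_majorities(rankings):
    """Map each pair of options to the pairwise majority winner (None on a tie)."""
    options = sorted(rankings[0])
    result = {}
    for x, y in combinations(options, 2):
        if majority_prefers(x, y, rankings):
            result[(x, y)] = x
        elif majority_prefers(y, x, rankings):
            result[(x, y)] = y
        else:
            result[(x, y)] = None
    return result


winners = pairwise_majorities(rankings)
for pair, winner in winners.items():
    print(pair, "majority winner:", winner)
```

In this toy population A beats B and B beats C, but C beats A, so the group ‘preference’ is not transitive even though every individual’s is. As the critique notes, this style of argument assumes purely ordinal values. If values carry intensities, other aggregation methods become available.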
Yudkowsky, Metaethics sequence
Yudkowsky, ‘Coherent Extrapolated Volition’
Muehlhauser also discusses some closely related philosophical conversations:
Reflective equilibrium. Yudkowsky’s proposed extrapolation works analogously to what philosophers call ‘reflective equilibrium.’ The most thorough work here is the 1996 book by Daniels, and there have been lots of papers, but this genre is only barely relevant for CEV...
Full-information accounts of value and ideal observer theories. This is what philosophers call theories of value that talk about ‘what we would want if we were fully informed, etc.’ or ‘what a perfectly informed agent would want’ like CEV does. There’s some literature on this, but it’s only marginally relevant to CEV...
Muehlhauser later wrote at more length about the relationship of CEV to ideal observer theories, with Chris Williamson.
3. This chapter is concerned with avoiding locking in the wrong values. One might wonder exactly what this ‘locking in’ is, and why AI will cause values to be ‘locked in’ while having children, for instance, does not. Here is my take: there are two issues, namely the extent to which values change, and the extent to which one can personally control that change. At the moment, values change plenty and we can’t control the change. Perhaps in the future, technology will allow the change to be controlled (this is the hope with value loading). Then, if anyone can control values they probably will, because values are valuable to control. In particular, if AI can control its own values, it will avoid having them change. Thus in the future, values will probably be controlled, and will not change. It is not clear that we will lock in values as soon as we have artificial intelligence, since perhaps an artificial intelligence will be built whose implicit values randomly change. But if we are successful we will control values, and thus lock them in, and if we are even more successful we will lock in values that are actually desirable for us. Paul Christiano has a post on this topic, which I probably pointed you to before.
4. Paul Christiano has also written about how to concretely implement the extrapolation of a single person’s volition, in the indirect normativity scheme described in box 12 (p199-200). You probably saw it then, but I draw it to your attention here because the extrapolation process is closely related to CEV and is concrete. He also has a recent proposal for ‘implementing our considered judgment’.
If you are particularly interested in these topics, and want to do further research, these are a few plausible directions, some inspired by Luke Muehlhauser’s list, which contains many suggestions related to parts of Superintelligence. These projects could be attempted at various levels of depth.
Specify a method for instantiating CEV, given some assumptions about available technology.
In practice, to what degree do human values and preferences converge upon learning new facts? To what degree has this happened in history? (Nobody values the will of Zeus anymore, presumably because we all learned the truth of Zeus’ non-existence. But perhaps such examples don’t tell us much.) See also philosophical analyses of the issue, e.g. Sobel (1999).
Are changes in specific human preferences (over a lifetime or many lifetimes) better understood as changes in underlying values, or changes in instrumental ways to achieve those values? (driven by belief change, or additional deliberation)
How might democratic systems deal with new agents being readily created?
If you are interested in anything like this, you might want to mention it in the comments, and see whether other people have useful thoughts.
How to proceed
This has been a collection of notes on the chapter. The most important part of the reading group though is discussion, which is in the comments section. I pose some questions for you there, and I invite you to add your own. Please remember that this group contains a variety of levels of expertise: if a line of discussion seems too basic or too incomprehensible, look around for one that suits you better!
Next week, we will talk about more ideas for giving an AI desirable values. To prepare, read “Morality models” and “Do what I mean” from Chapter 13. The discussion will go live at 6pm Pacific time next Monday 23 February. Sign up to be notified here.