Apparently this is being read by major philosophers now, which is good on the one hand; on the other hand, here's a really quick review of historical context:
The background problem here is that we want an effective decision procedure of bounded complexity which can actually be implemented in sufficiently advanced Artificial Intelligences.
The first difficulty is the “effective” part. Suppose you want to build a chess-playing program. A philosophy undergrad wisely informs you that you ought to instruct your chess-playing program to make “good moves”. You reply that you need a more “effective” specification of what a good move is, so that you can get your program to do it. The undergrad tells you that a good move is one which is wise, highly informed, which will not later be revealed to be a bad move, and so on. What you actually need here is something along the lines of “A good move is one which, when combined with the other player’s moves, results in a board state which the following computable predicate verifies as ‘winning’”. Once you realize the other player is trying to perform a symmetric but opposed procedure, you can model the chessboard’s future using search trees. Pragmatically, you’re still a long way off from beating Kasparov. But given unbounded finite computing power you could play perfect chess. In turn, this means you’re able to get started on the problem of approximating good moves, now that you have an effectively specified definition of maximally good moves, even though you can’t evaluate the latter definition using available computing power.
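To make the chess case concrete, here is a minimal, self-contained sketch, using trivial Nim as a stand-in game (real chess search would run far too long to demonstrate), of how a computable "winning" predicate plus tree search yields an effectively specified, if wildly expensive, definition of a maximally good move. All the function names here are invented for the illustration:

```python
def minimax(stones, to_move, me):
    """Perfect-play value of the position for player `me` (+1 win, -1 loss)."""
    if stones == 0:
        # Computable terminal predicate: whoever just took the last stone won,
        # so the player now to move (facing zero stones) has lost.
        return -1 if to_move == me else 1
    values = [minimax(stones - take, 1 - to_move, me)
              for take in (1, 2, 3) if take <= stones]
    # Each side optimizes in its own favor: the opponent's "symmetric but
    # opposed procedure" is just min() from our point of view.
    return max(values) if to_move == me else min(values)

def good_move(stones, me):
    """The effectively specified 'maximally good move': the move leading to
    the child position with the best perfect-play value. The same definition
    is well-defined for chess; it just can't be evaluated with available
    computing power."""
    return max((take for take in (1, 2, 3) if take <= stones),
               key=lambda take: minimax(stones - take, 1 - me, me))

print(good_move(5, me=0))  # -> 1: leaves 4 stones, a lost position for the opponent
```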
A lot of the motivation in CEV is that we’re trying to describe a beneficial AI in terms that allow beneficial-ness to actually be computed or approximated. The AI observes a human and builds up an abstract predictive model of how that human makes decisions—this is an in-principle straightforward problem the way that playing perfect chess is straightforward; Solomonoff induction ideally says how to build good predictive models. What should the AI do with this predictive model, though? An accurate model will accurately predict that the human will choose to drink the glass of bleach, but in an intuitive sense, it seems like we’d want the AI to give the human water.
But suppose we can idealize this decision model in a way which separates terminal values from empirical beliefs. Then we can substitute the AI's world-model for the human's world-model and re-run the decision model. If the AI is much more intelligent than us, this takes care of the bleach-vs.-water case: the AI's world-model knows the glass contains bleach, and the human's terminal values, run against that more accurate world-model, favor drinking water rather than bleach.
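As a toy illustration only: suppose, contrary to fact, that the human decision model really did factor cleanly into beliefs and terminal values. Then the substitution step might look like the following sketch, where all names and payoff numbers are invented for the example and nothing here is from the CEV document itself:

```python
human_beliefs = {"glass_contents": "water"}   # the human's mistaken belief
ai_beliefs    = {"glass_contents": "bleach"}  # the AI's more accurate world-model
human_values  = {"drink_water": 10, "drink_bleach": -100, "do_nothing": 0}

def decide(beliefs, values):
    """Re-run the factored decision model: choose the action whose predicted
    outcome the agent's terminal values score highest."""
    def outcome(action):
        if action == "drink_glass":
            return "drink_" + beliefs["glass_contents"]
        return "do_nothing"
    return max(["drink_glass", "do_nothing"], key=lambda a: values[outcome(a)])

print(decide(human_beliefs, human_values))  # 'drink_glass': predicts the mistake
print(decide(ai_beliefs, human_values))     # 'do_nothing': beliefs swapped, values kept
```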
This is the basic paradigm of CEV—build up predictively accurate abstract models of a human decision process, then manipulate them in effectively specified ways to 'construe a volition'. (I would ordinarily say 'extrapolate', but the paper above gives a specific definition of 'extrapolate' that sounds more like surgery followed by prediction, rather than a general 'look over this accurate human decision model and do X with it'.)
The appeal of Rawls's reflective equilibrium / Ideal Advisor models is that they describe a construal procedure that sounds effectively computable and approximable: add more veridical knowledge to the decision process (the AI's knowledge, in the case where the AI is smarter than us), run the decision process for a longer time, and allow the model more veridical knowledge of itself and possibly even some set of choices for modifying itself. Similarly, the appeal of Bostrom's parliament is not so much that it sounds like a plausible ultimate metaethical theory but that it gives us an effective-sounding procedure for resolving multiple possible volitions (even within a single person) into a coherent output.
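For concreteness, here is a cartoon of a parliament-style resolution step, assuming each candidate volition can be represented as a utility function with a credence-weight. Bostrom's actual proposal involves negotiating delegates rather than a single weighted vote, so take this only as a sketch of why the procedure sounds effective; all names and numbers are invented:

```python
candidate_volitions = [
    # (credence-weight, utility function over outcomes) -- illustrative only
    (0.5, {"career": 8, "family": 5, "travel": 2}),
    (0.3, {"career": 2, "family": 9, "travel": 4}),
    (0.2, {"career": 3, "family": 3, "travel": 9}),
]

def parliament_choice(outcomes):
    """Resolve the conflicting volitions into one coherent output by
    weight-averaging their utilities: an effectively specified procedure,
    whatever its merits as ultimate metaethics."""
    def score(outcome):
        return sum(w * u[outcome] for w, u in candidate_volitions)
    return max(outcomes, key=score)

print(parliament_choice(["career", "family", "travel"]))  # -> 'family'
```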
More generally, CEV is a case of what Bostrom termed an ‘indirect normativity’ strategy. If we think values are complex—see e.g. William Frankena’s list of terminal values not obviously reducible to one another—a robust strategy would involve trying to teach the AI how to look at humans and absorb and idealize values from them, so as to avoid the problem of accidentally leaving out one value.
The motivation for indirect normativity—for delving into metaethics rather than giving a superintelligent AI a laundry list of cool-sounding wishes—is that we want to pick something close enough to a correct core metaethical structure that it will compactly cover everything human beings want, ought to want, or might later regret asking for, without relying on the ability of human programmers to visualize the outcome in advance. (“I wish you’d get me that glass!” cough cough dies)
Most of the empirical challenge in CEV would stem from the fact that a predictively accurate model of human decisions would be a highly messy structure, and ‘construing a volition’ suitable for coherent advice isn’t a trivial problem. (It sounds to me on a first reading like neither ‘idealization’ nor ‘extrapolation’ as defined in the above document would be sufficient for this. Any rational agent needs a coherent utility function, but getting one out of a messy, accurate predictive human model is not as simple as conducting a point surgery and extrapolating forward in time, nor as simple as supposing infinite knowledge.)
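One small, concrete illustration of the difficulty: a predictively accurate model can encode pairwise choices that no utility function rationalizes at all, as when revealed preferences form a cycle. The toy check below (all names invented) shows the kind of incoherence that point surgery or added knowledge alone wouldn't resolve:

```python
# Observed pairwise choices: 'a' beaten 'b', 'b' beaten 'c', 'c' beaten 'a'.
observed_choices = {("a", "b"): "a", ("b", "c"): "b", ("c", "a"): "c"}

def has_cycle(choices):
    """True if revealed preferences contain a strict cycle, in which case no
    utility function u can satisfy u(x) > u(y) whenever x is chosen over y."""
    prefers = {(x, y) if winner == x else (y, x)
               for (x, y), winner in choices.items()}
    closed, changed = set(prefers), True
    while changed:  # transitive closure of the (tiny) preference relation
        changed = False
        for x, y in list(closed):
            for y2, z in list(closed):
                if y == y2 and (x, z) not in closed:
                    closed.add((x, z))
                    changed = True
    return any(x == z for x, z in closed)

print(has_cycle(observed_choices))  # True: a > b > c > a admits no coherent utility
```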
To compete with CEV in its intended ecological niche (useful advice to (designers of) sufficiently advanced AIs) means looking for alternative theories of how to produce reliable epistemic advice about what-to-do in the presence of messy human values, with enough indirection to automatically cover the use-cases we didn’t think to ask for or might later regret. Such theories must also come close enough to being effectively specified that AI programmers can implement them (though perhaps as something requiring development work to imbue in an AI, rather than as a direct computer program).
A lot of what you’ve said sounds like you’re just reiterating what Luke says quite clearly near the beginning: Ideal Advisor theories are “metaphysical”, and CEV is epistemic, i.e. Ideal Advisor theories are usually trying to give an account of what is good, whereas, as you say, CEV is just about trying to find a good effective approximation to the good. In that sense, this article is comparing apples to oranges. But the point is that some criticisms may carry over.
[EDIT: this comment is pretty off the mark, given that I appear to be unable to read the first sentence of comments I’m replying to. “historical context” facepalm]