CEV-inspired models

Stuart_Armstrong7 Dec 2011 18:35 UTC

10 points

I’ve been involved in a recent thread where discussion of coherent extrapolated volition came up. The general consensus was that CEV might—or might not—do certain things, probably, maybe, in certain situations, while ruling other things out, possibly, and that certain scenarios may or may not be the same in CEV, or it might be the other way round, it was too soon to tell.

Ok, that’s an exaggeration. But any discussion of CEV is severely hampered by our lack of explicit models. Even bad, obviously incomplete models would be good, as long as we can get useful information as to what they would predict. Bad models can be improved; undefined models are intuition pumps for whatever people feel about them—I dislike CEV, and can construct a sequence of steps that takes my personal CEV to wanting the death of the universe, but that is no more credible than someone claiming that CEV will solve all problems and make lots of cute puppies.

So I’d like to ask for suggestions of models that formalise CEV to at least some extent. Then we can start improving them, and start making CEV concrete.

To start it off, here’s my (simplistic) suggestion:

Volition

Use revealed preferences as the first ingredient for individual preferences. To generalise, use hypothetical revealed preferences: the AI calculates what the person would decide in these particular situations.

Extrapolation

Whenever revealed preferences are non-transitive or non-independent, use the person’s stated meta-preferences to remove the issue. The AI thus calculates what the person would say if asked to resolve the transitivity or independence (for people who don’t know about the importance of resolving them, the AI would present them with a set of transitive and independent preferences, derived from their revealed preferences, and have them choose among them). Then (wave your hands wildly and pretend you’ve never heard of non-standard reals, lexicographical preferences, refusal to choose and related issues) everyone’s preferences are now expressible as utility functions.

Coherence

Normalise each existing person’s utility function and add them together to get your CEV. At the FHI we’re looking for sensible ways of normalising, but one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.

What links here?

Stuart_Armstrong7 Dec 2011 18:35 UTC

10 points

43 comments1 min readLW link Archive

Coherent Extrapolated Volition

jimrandomh 7 Dec 2011 20:57 UTC
14 points
0

At the FHI we’re looking for sensible ways of normalising, but one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.

Unfortunately, this utility function isn’t game-theoretically stable; if you expect your utility to be analyzed this way, you have an incentive to modify your utility function to clip or flatten the ends, to make your utility have a steeper gradient around the amount of utility you expect to receive.
- steven0461 7 Dec 2011 21:42 UTC
  6 points
  0
  Parent
  That seems like it may be true of every scheme that doesn’t consider the causal origins of people’s utility functions. Does something like Gibbard-Satterthwaite apply?
  - Vladimir_Nesov 8 Dec 2011 11:32 UTC
    1 point
    0
    Parent
    Not specifically causal origins (there’s evolution), instead I suppose there might be a way of directly negating most effects resulting from such strategic considerations (that is, decisions that were influenced by their expected effect on the decision in question that wants to negate that effect).
    - Stuart_Armstrong 8 Dec 2011 14:19 UTC
      −1 points
      0
      Parent
      “Easy” way of doing this: see what would have happened if everyone believed that you were using a model where liars don’t prosper (ie random dictator), but actually use a Pareto method.
  - cousin_it 8 Dec 2011 1:42 UTC
    0 points
    0
    Parent
    Considering the causal origins of people’s utility functions is a nice hack, thanks for pointing it out! How far back do we need to go, though? Should my children benefit if I manipulate their utility function genetically while they’re in the womb?
    
    Another way to aggregate utility functions is by simulated bargaining, but it’s biased in favor of rich and powerful people.
    - Vladimir_Nesov 8 Dec 2011 16:24 UTC
      2 points
      0
      Parent
      
      How far back do we need to go, though?
      
      As far as needed to understand (the dependence of current agent’s values on (the dependence of (expected benefit from value extraction) on current agent’s values)). (Sorry, adding parens was simpler!)
      
      This involves currently confusing “benefit” (to whom?) and assumed-mistaken “expected” (by whom?), supposedly referring to aspects of past agents (that built/determined the current agent) deciding on the strategic value bargaining. (As usual, ability to parse the world and see things that play the roles of elements of agents’ algorithms seems necessary to get anything of this sort done.)
    - Larks 8 Dec 2011 2:02 UTC
      1 point
      0
      Parent
      If I’m rich it’s because I delayed consumption, allowing others to invest the capital that I had earned. Should we not allow these people some return on their investment?
      
      To be clear, I’m not very sure the answer is yes; but nor do I think it’s clear that ‘wealth’ falls into the category of ‘things that should not influence CEV’, where things like ‘race’, ‘eye colour’ etc. live.
      - cousin_it 8 Dec 2011 2:18 UTC
        7 points
        0
        Parent
        Fair point about delayed gratification, but you may also be rich because your parents were rich, or because you won the lottery, or because you robbed someone. Judging people by their bargaining power conflates all those possible reasons.
        Larks 8 Dec 2011 22:36 UTC
        0 points
        0
        Parent
        No; if you didn’t delay gratification you’d spend the money quickly, regardless of how you got it.
        cousin_it 8 Dec 2011 23:05 UTC
        1 point
        0
        Parent
        The funniest counterexample I know is Jefri Bolkiah =)
        Kaj_Sotala 9 Dec 2011 10:24 UTC
        0 points
        0
        Parent
        If you didn’t delay gratification and had expensive tastes, you’d spend the money quickly, regardless of how you got it.
        
        Even if everyone did have expensive tastes, people who started off with less money would need to delay their gratification more. A very poor person might need to delay gratification an average of 80% of the time, since they couldn’t afford almost anything. A sufficiently rich person might only need to delay gratification 10% of the time without running into financial trouble. So if you wanted to reward delaying of gratification, then on average the poorer that a person was, the more you’d want to reward him
    - timtyler 8 Dec 2011 12:44 UTC
      −2 points
      0
      Parent
      
      Another way to aggregate utility functions is by simulated bargaining, but it’s biased in favor of rich and powerful people.
      
      The same rich and powerful people who are most likely to be funding the research, maybe?
      
      Today, to resolve their differences, people mostly just bargain I.R.L.
      
      They do simulate bargains in their heads, but only to help them with the actual bargaining.
- Stuart_Armstrong 8 Dec 2011 9:39 UTC
  3 points
  0
  Parent
  You can’t be Pareto and game-theoretically stable at the same time (I have a nice picture proof of that, that I’ll post some time). You can be stable without being Pareto—we each choose our favoured outcome, and go 50-50 between them. Then no one has an incentive to lie.
  
  Edit: Picture-proof now posted at: http://lesswrong.com/r/discussion/lw/8qv/in_the_pareto_world_liars_prosper/
  - wedrifid 8 Dec 2011 12:57 UTC
    0 points
    0
    Parent
    
    You can be stable without being Pareto—we each choose our favoured outcome, and go 50-50 between them. Then no one has an incentive to lie.
    
    I seem to have an incentive to lie in that scenario.
  - Luke_A_Somers 8 Dec 2011 14:24 UTC
    −1 points
    0
    Parent
    You can estimate where the others’ favoured outcomes and go a ways in the opposite direction to try to balance it out. Of course, if one of you takes this to the second level and the others are honest, then no one is happy except by coincidence (one of the honest people deviated from the mean more than you in the same way, and your overshoot happened to land on them).
Kaj_Sotala 7 Dec 2011 19:38 UTC
8 points
0
Upvoted for trying to say something useful about CEV.

Whenever revealed preferences are non-transitive or non-independent, use the person’s stated meta-preferences to remove the issue.

It seems odd that this is the only step where you’re using meta-preferences: I would have presumed that any theory would start off from giving a person’s approved preferences considerably stronger weight than non-approved ones. (Though since approved desires are often far and non-approved ones near, one’s approved ideal self might be completely unrealistic and not what they’d actually want. So non-approved ones should also be taken into account somehow.)
- steven0461 7 Dec 2011 20:19 UTC
  2 points
  0
  Parent
  What do you mean by “actually want”? You seem to be coming dangerously close to the vomit fallacy: “Humans sometimes vomit. By golly, the future must be full of vomit!”
  - Kaj_Sotala 7 Dec 2011 20:36 UTC
    0 points
    0
    Parent
    
    What do you mean by “actually want”?
    
    Would not actually want X = would not endorse X after finding out the actual consequences of X; would not have X as a preference after reaching reflective equilibrium.
    - steven0461 7 Dec 2011 20:45 UTC
      0 points
      0
      Parent
      Oh I see, by “approved ideal self” you meant something different than “self after reaching reflective equilibrium”. So instead of fiddling around with revealed preferences, why not just simulate the person reaching reflective equilibrium and then ask the person what preferences he or she endorses?
      - torekp 11 Dec 2011 2:49 UTC
        2 points
        0
        Parent
        That was my first thought on reading the “revealed preferences” part of the post. Extrapolation first—then volition.
      - Stuart_Armstrong 8 Dec 2011 13:22 UTC
        0 points
        0
        Parent
        Could be done—but is harder to define (what counts as a reflective equilibrium?) and harder to model (what do you expect your reflective equilibrium?)
moridinamael 7 Dec 2011 22:04 UTC
6 points
0
In a previous thread I suggested starting by explicitly defining something like a CEV for a simple worm. After thinking about it, I think perhaps a norn, or some other simple hypothetical organism might be better. To make the situation as simple as possible, start with a universe where the norn are the most intelligent life in existence.

A norn (or something simpler than a norn) has explicitly defined drives, meaning the utility functions of individual norns could potentially be approximated very accurately.

The biggest weakness of this idea is that a norn, or worm, or cellular automaton, can’t really participate in the process of approving or rejecting the resulting set of extrapolated solutions. For some people, I think this indicates that you can’t do CEV on something that isn’t sentient. It only causes me to wonder, what if we are literally too stupid to even comprehend the best possible CEV that can be offered to us? I don’t think this is unlikely.
- dlthomas 8 Dec 2011 18:07 UTC
  0 points
  0
  Parent
  
  It only causes me to wonder, what if we are literally too stupid to even comprehend the best possible CEV that can be offered to us?
  
  I think this doesn’t matter, if we can
  
  1) successfully define the CEV concept itself,
  
  2) define a suitable reference class,
  
  3) build a superintelligence, and
  
  4) ensure that the superintelligence continues to pursue the best CEV it can find for the appropriate reference class.
  - TheOtherDave 8 Dec 2011 18:27 UTC
    1 point
    0
    Parent
    Well, it would be helpful if we could also:
    2.5) work out a reliable test for whether a given X really is an instance of the CEV concept for the given reference class
    
    Which seems to depend on having some kind of understanding.
    
    Lacking that, we are left with having to trust that whatever the SI we’ve built is doing is actually what we “really want” it to do, even if we don’t seem to want it to do that, which is an awkward place to be.
- Stuart_Armstrong 8 Dec 2011 17:35 UTC
  0 points
  0
  Parent
  You’re the first to suggest something approaching a model on this thread :-)
steven0461 7 Dec 2011 20:17 UTC
6 points
0

one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0

If Alice likes cookies, and Bob likes cookies but hates whippings, this method gives Alice more cookies than Bob. Moreover, the number of bonus cookies Alice gets depends on the properties of whips that nobody ever uses.
- Vladimir_Nesov 8 Dec 2011 16:29 UTC
  3 points
  0
  Parent
  (In general, it’s proper for properties of counterfactuals to have impact on which decisions are correct in reality, so this consideration alone isn’t sufficient to demonstrate that there’s a problem.)
  - steven0461 8 Dec 2011 19:01 UTC
    0 points
    0
    Parent
    It feels intuitively like it’s a problem in this specific case.
- Stuart_Armstrong 8 Dec 2011 13:23 UTC
  2 points
  0
  Parent
  You can restrict to a Pareto boundary before normalising—not as mathematically elegant, but indifferent to effects “that nobody ever wants/uses”.
Shmi 8 Dec 2011 0:05 UTC
3 points
0

Use revealed preferences as the first ingredient for individual preferences. To generalise, use hypothetical revealed preferences: the AI calculates what the person would decide in these particular situations.

There seems to be a feedback loop missing. Provide people with a broad range of choices, let them select a few, provide a range of alternatives within that selection, repeat. Allow for going back a step or ten. That’s what happens IRL when you make a major purchase, like a TV, a car or a house.
Larks 7 Dec 2011 19:36 UTC
2 points
0

one cheap and easy method (with surprisingly good properties) is to take the maximal possible expected utility (the expected utility that person would get if the AI did exactly what they wanted) as 1, and the minimal possible expected utility (if the AI was to work completely against them) as 0.

“if the AI did exactly what they wanted” as opposed to “if the universe went exactly as they wanted” to avoid issues with unbounded utility functions? This seems like it might not be enough if the universe itself were unbounded in the relivant sense.

For example, suppose my utility function is U(Universe) = #paperclips, which is unbounded in a big universe. Then you’re going to normalise me as assigning U(AI becomes clippy) = 1, and U(individual paperclips) = 0.
- Stuart_Armstrong 8 Dec 2011 9:41 UTC
  2 points
  0
  Parent
  
  For example, suppose my utility function is U(Universe) = #paperclips, which is unbounded in a big universe. Then you’re going to normalise me as assigning U(AI becomes clippy) = 1, and U(individual paperclips) = 0.
  
  Yep.
  
  So most likely a certain proportion of the universe will become paperclips.
Giles 8 Dec 2011 18:26 UTC
0 points
0
What about recursive CEV?

Start off with CEV-0. I won’t go into how that is generated, but it will have a lot of arbitrary decisions and stuff that seems vaguely sensible.

Then ask CEV-0 the following questions:
- How should CEV-1 go about aggregating people’s preferences?
- How should CEV-1 deal with non-transitive or non-independent preferences?
- How should CEV-1 determine preferences between outcomes that the subject could never have imagined?
- What should CEV-1 do if people lack the expertise to judge the long-term consequences of their preferences?
- Should CEV-1 consider people’s stated or revealed preferences, or both?
- Should CEV-1 consider preferences of non-human animals, people in comas, etc.?
- How should CEV-1 deal with people who seem to be trying to modify their own preferences in order to game the system? (utility monsters/tactical voting)
… and so on. The answers to these questions then make up CEV-1. And then CEV-1 is asked the same questions to produce CEV-2.

Various different things could happen here. It could converge to a single stable fixed point. It could oscillate. It could explode, diverging wildly from anything we’d consider reasonable. Or its behavior could depend on the initial choice of CEV-0 (e.g. multiple attractive fixed points).

Explosion could (possibly) be avoided by requiring CEV-n to pass some basic sanity checks, though that has problems too (the sanity checks may not be valid, i.e. they just reflect our own biases. Or they may not be enough—they act as constraints on the evolution of the system but it could still end up insane in respects we haven’t anticipated).

Some other problems could be resolved by asking CEV-n how to resolve them.

I’m not sure how to deal with the multiple stable fixed points case. That would seem to correspond to different cultures or special interest groups all trying to push whichever meta-level worldview benefits them the most.
- Stuart_Armstrong 10 Dec 2011 15:43 UTC
  0 points
  0
  Parent
  Once CEV-n becomes a utility function, it will generally (but not always) get stuck there for ever.
crazy88 7 Dec 2011 20:32 UTC
−1 points
0
Sorry if this is answered elsewhere but I thought interpersonal comparisons of utility were generally considered to be impossible.

Is the crucial difference about CEV the fact that it doesn’t attempt to maximise the utility of humanity but rather to extract the volition of humanity by treating each person’s input equally without attempting to claim that utility is being compared between people to do so? Or does CEV involve interpersonal comparison of utility and, if so, why is this not considered problematic?
- AlexSchell 7 Dec 2011 20:50 UTC
  3 points
  0
  Parent
  
  I thought interpersonal comparisons of utility were generally considered to be impossible.
  
  This is true about aggregating ordinal utilities, but doesn’t hold for cardinal utilities (see Arrow’s theorem). If you are talking about comparing utilities (i.e. choosing a normalization method), I’m not aware of a general consensus that this is impossible.
  - Larks 7 Dec 2011 22:31 UTC
    9 points
    0
    Parent
    Economists generally regard interpersonal utility comparisons as impossible; hence the focus on Pareto, and then Kalder-Hicks, optimality. See for example this, though any decent economics textbook will cover it.
    
    The problem, of course, is that utility functions are only defined up to an affine transformation.
    - Stuart_Armstrong 8 Dec 2011 9:36 UTC
      0 points
      0
      Parent
      
      The problem, of course, is that utility functions are only defined up to an affine transformation.
      
      Which is why I normalise them first before adding them up.
- timtyler 8 Dec 2011 12:48 UTC
  0 points
  0
  Parent
  
  Sorry if this is answered elsewhere but I thought interpersonal comparisons of utility were generally considered to be impossible.
  
  Not impossible, just challenging.
- Stuart_Armstrong 8 Dec 2011 9:36 UTC
  0 points
  0
  Parent
  
  Sorry if this is answered elsewhere but I thought interpersonal comparisons of utility were generally considered to be impossible.
  
  It’s hard. You can do it, in many ways, but most of the properties you’d want to have cannot be had. The max-min method of normalisation I mentioned has the most of the intuitive properties (despite not being very intuitive itself).
  - crazy88 9 Dec 2011 7:36 UTC
    0 points
    0
    Parent
    If you have the time, I’d be interested to know what these desirable properties are (or would be happy to read a paper on the topic if you have one to suggest).
    - Stuart_Armstrong 9 Dec 2011 9:02 UTC
      0 points
      0
      Parent
      We’re working on those at the moment, so they’re still in flux; but we’ll put them out there once we’ve firmed them up.
      - crazy88 9 Dec 2011 22:09 UTC
        0 points
        0
        Parent
        Cool, I’ll keep my eye out.