Best utility normalisation method to date?

For some time, others and I have been looking at ways of normalising utility functions, so that we can answer questions like:

  • Suppose that you are uncertain between maximising $u$ and maximising $v$; what do you do?

...without having to worry about normalising $u$ or $v$ (since utility functions are only defined up to positive affine transformations).

I’ve long liked the mean-max normalisation; in this view, what matters is the difference between a utility’s optimal policy and a random policy. So, in a sense, each utility function has an equal shot at moving the outcome away from an expected random policy and towards itself.

The intuition still seems good to me, but the “random policy” is a bit of a problem. First of all, it’s not all that well defined: are we talking about a policy that just spits out random outputs, or one that picks randomly among outcomes? Suppose there are three options: option A (if A is output), option B’ (if B’ is output), or do nothing (any other output). Should we really say that A happens twice as often as B’, just because typing out A at random is twice as likely as typing out B’?

Relatedly, if we add another option C, which is completely equivalent to A for all possible utilities, then this redefines the random policy. There’s also a problem with branching: what if option A now leads to twenty choices later, while B leads to no further choices? Are we talking about twenty-one equivalent choices, or twenty equivalent choices and one other that is as likely as all of them put together? The concept also has some problems with infinite option sets.

A more fundamental problem is that the random policy includes options that neither $u$ nor $v$ would ever consider sensible.

Random dictator policy

These problems can be solved by using the random dictator policy, rather than a random policy, as the default.

Assume we are hesitating between utility functions $u_1$, $u_2$, … $u_n$, with $\pi_i$ the optimal policy for utility $u_i$. Then the random dictator policy is just $\pi_{rd}$, which picks a $\pi_i$ at random and then follows it. So

  • $\pi_{rd} = \frac{1}{n}\sum_{i=1}^{n} \pi_i$.
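To make this concrete, here is a minimal sketch in a one-shot setting with three outcomes; the payoff numbers, and the convention that an optimal policy is a point mass on the favourite outcome, are illustrative assumptions of mine rather than anything in the proposal.

    # Toy one-shot setting: two utility functions over three outcomes A, B, C,
    # written as lists of payoffs [value(A), value(B), value(C)].
    u1 = [1.0, 0.0, 0.8]
    u2 = [0.0, 1.0, 0.8]
    m = len(u1)                          # number of outcomes

    def optimal_policy(u):
        """One optimal policy for u: a point mass on its best outcome."""
        best = max(range(m), key=lambda x: u[x])
        return [1.0 if x == best else 0.0 for x in range(m)]

    policies = [optimal_policy(u1), optimal_policy(u2)]
    n = len(policies)

    # Random dictator policy: pick one of the n dictators uniformly at random,
    # then follow that dictator's optimal policy.  As a distribution over outcomes:
    pi_rd = [sum(pi[x] for pi in policies) / n for x in range(m)]
    print(pi_rd)                         # [0.5, 0.5, 0.0]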

Normalising to the random dictator policy

This is an excellent candidate for replacing the random policy in the normalisation. It is well defined, it would never choose options that all utilities object to, and it doesn’t care about how options are labelled or about how to count them.

Therefore we can present the random dictator normalisation: if you are hesitating between utility functions $u_1$, $u_2$, … $u_n$, then normalise each $u_i$ to $u'_i$ as follows:

  • $u'_i = \dfrac{u_i}{\mathbb{E}_{\pi_i}[u_i] - \mathbb{E}_{\pi_{rd}}[u_i]}$,

where $\mathbb{E}_{\pi_i}[u_i]$ is the expected utility of $u_i$ given its optimal policy, and $\mathbb{E}_{\pi_{rd}}[u_i]$ is its expected utility given the random dictator policy.

Our overall utility to maximise then becomes:

  • $\sum_{i=1}^{n} u'_i$.
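As a sanity check on the formulas, here is a rough sketch of the whole procedure in the same kind of one-shot setting as above (finitely many outcomes, optimal policies as point masses). The function random_dictator_normalise and all the numbers are my own illustration, not part of the proposal itself; it also accepts the probability weights used further below, defaulting to equal credences.

    # A sketch of the random dictator normalisation for a one-shot decision.
    # Each utility is a list of payoffs over a fixed, finite list of outcomes.
    def random_dictator_normalise(utilities, probs=None):
        """Return (normalisation denominators, aggregate normalised utility per outcome)."""
        n, m = len(utilities), len(utilities[0])
        if probs is None:
            probs = [1.0 / n] * n          # equal credence in each utility

        # Optimal policy for u_i: a point mass on (one of) its best outcomes.
        best = [max(range(m), key=lambda x: u[x]) for u in utilities]

        # Random dictator policy, as a distribution over outcomes: pick utility i
        # with probability probs[i], then play u_i's best outcome.
        pi_rd = [0.0] * m
        for i in range(n):
            pi_rd[best[i]] += probs[i]

        denoms = []
        for i, u in enumerate(utilities):
            opt_value = u[best[i]]                             # E_{pi_i}[u_i]
            rd_value = sum(pi_rd[x] * u[x] for x in range(m))  # E_{pi_rd}[u_i]
            if opt_value == rd_value:
                # Singular case: the random dictator policy is already optimal for u_i
                # (the post handles this lexicographically; here we just flag it).
                raise ValueError(f"random dictator policy is optimal for utility {i}")
            denoms.append(opt_value - rd_value)

        # Aggregate: U(x) = sum_i probs[i] * u_i(x) / denoms[i].  With equal credences
        # this is (1/n) * sum_i u'_i(x), which has the same argmax as sum_i u'_i(x).
        aggregate = [sum(probs[i] * utilities[i][x] / denoms[i] for i in range(n))
                     for x in range(m)]
        return denoms, aggregate

    # Outcomes A, B, C: u1 and u2 disagree about A versus B, but both quite like C.
    u1 = [1.0, 0.0, 0.8]
    u2 = [0.0, 1.0, 0.8]
    print(random_dictator_normalise([u1, u2]))
    # ([0.5, 0.5], [1.0, 1.0, 1.6]): the compromise outcome C maximises the
    # normalised sum, even though neither dictator would ever pick it on its own.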

Note that this normalisation has a singularity when $\mathbb{E}_{\pi_i}[u_i] = \mathbb{E}_{\pi_{rd}}[u_i]$. But realise what that means: it means that the random dictator policy is optimal for $u_i$. That means that every single $\pi_j$ is optimal for $u_i$. So, though the explosion in the normalisation means that we must pick a policy that is optimal for $u_i$, this set of optimal policies is actually quite large, and we can use the normalisations of the other $u_j$ to pick from among it (so maximising $u_i$ becomes a lexicographic preference for us).
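To spell out the step from the singularity to “every $\pi_j$ is optimal for $u_i$”: with equal credences,

  • $\mathbb{E}_{\pi_{rd}}[u_i] = \frac{1}{n}\sum_{j=1}^{n}\mathbb{E}_{\pi_j}[u_i]$, with each term $\mathbb{E}_{\pi_j}[u_i] \le \mathbb{E}_{\pi_i}[u_i]$,

so this average can only equal $\mathbb{E}_{\pi_i}[u_i]$ if every single term does, i.e. if every $\pi_j$ is itself optimal for $u_i$.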

Normalising a distribution over utilities

Now suppose that there is a distribution over the utilities: we’re not equally sure of each $u_i$, but instead assign a probability $p_i$ to it. Then the random dictator policy is defined in the obvious way as:

  • $\pi_{rd} = \sum_{i=1}^{n} p_i \pi_i$.

And the normalisation can proceed as before, generating the $u'_i$ and maximising the normalised sum:

  • $\sum_{i=1}^{n} p_i u'_i$.
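Reusing the illustrative random_dictator_normalise sketch from above (again, purely my own toy example), the only change is passing the credences explicitly:

    # Same toy utilities as before, but with credences 0.9 and 0.1 on u1 and u2.
    u1 = [1.0, 0.0, 0.8]
    u2 = [0.0, 1.0, 0.8]
    print(random_dictator_normalise([u1, u2], probs=[0.9, 0.1]))
    # roughly ([0.1, 0.9], [9.0, 0.11, 7.29]): pi_rd is already close to optimal
    # for u1, so u1 gets the small denominator, and outcome A now wins outright.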

Properties

The random dictator normalisation has all the good properties of the mean-max normalisation in this post, namely that the utility is continuous in the data and that it respects indistinguishable choices. It is also invariant under cloning (i.e. adding another option that is completely equivalent to one of the options already there), which the mean-max normalisation is not.
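As a quick illustration of the cloning point, again using the hypothetical random_dictator_normalise sketch from above: duplicating one of the options leaves the normalisation constants and the winning outcome unchanged, whereas a random policy that picks uniformly among options would shift from 1/3 per option to 1/4 per option, changing the mean-max constants.

    # Clone outcome A: append a fourth option identical to A for every utility.
    u1 = [1.0, 0.0, 0.8, 1.0]
    u2 = [0.0, 1.0, 0.8, 0.0]
    print(random_dictator_normalise([u1, u2]))
    # ([0.5, 0.5], [1.0, 1.0, 1.6, 1.0]): same constants as before, and C still wins.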

But note that, unlike all the normalisations in that post, this is not a case of normalising each $u_i$ without looking at the other $u_j$, and only then combining them. Each normalisation of $u_i$ takes the other $u_j$ into account, because of the definition of the random dictator policy.

Problems? Double counting, or the rich get richer

Suppose we are hesitating between utilities $u_1$ (with probability $p_1$) and $u_2$ (with probability $p_2$), where $p_1 > p_2$.

Then $\pi_{rd} = p_1\pi_1 + p_2\pi_2$ is the random dictator policy, and it is likely to be closer to optimal for $u_1$ than for $u_2$.

Because of this, we expect $u_1$ to get “boosted” more by the normalisation process than $u_2$ does (since the normalisation factor is the inverse of the difference in expected utility between $\pi_{rd}$ and the optimal policy).

But then when we take the weighted sum, this advantage is compounded, because the boosted $u'_1$ is weighted $p_1$, versus $p_2$ for the relatively unboosted $u'_2$. It seems that the weight of $u_1$ thus gets double-counted.
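A toy illustration of this compounding (my own numbers, not from the post): suppose $u_1$ and $u_2$ are exactly opposed on two outcomes, each scaled so that its best outcome is worth $1$ and its worst $0$. Then the normalisation denominators are

  • $\mathbb{E}_{\pi_1}[u_1] - \mathbb{E}_{\pi_{rd}}[u_1] = 1 - p_1 = p_2$ and $\mathbb{E}_{\pi_2}[u_2] - \mathbb{E}_{\pi_{rd}}[u_2] = 1 - p_2 = p_1$,

so $u'_1 = u_1/p_2$ and $u'_2 = u_2/p_1$, and the weighted sum becomes

  • $p_1 u'_1 + p_2 u'_2 = \frac{p_1}{p_2}u_1 + \frac{p_2}{p_1}u_2$.

The effective weight ratio between the two utilities is thus $(p_1/p_2)^2$ rather than $p_1/p_2$: the weight enters once through $\pi_{rd}$ and once through the sum.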

A similar phenomenon happens when we split our credence equally between utilities $u_1$, $u_2$, … $u_{10}$, if $u_1$, …, $u_9$ all roughly agree with each other while $u_{10}$ is completely different: the similarity of the first nine utilities seems to give them a double boost effect.
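The hypothetical random_dictator_normalise sketch from above shows this numerically (illustrative numbers only): with nine copies of one utility and one opposed utility, all at equal credence, the effective split between the cluster and the outlier comes out at roughly 81:1 rather than 9:1.

    # Ten equally weighted utilities over two outcomes: nine copies of uA, one uB.
    uA = [1.0, 0.0]
    uB = [0.0, 1.0]
    print(random_dictator_normalise([uA] * 9 + [uB]))
    # roughly (nine denominators of 0.1, then 0.9; aggregate about [9.0, 0.11]),
    # so the cluster ends up with about 81 times the outlier's effective weight.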

There are some obvious ways to fix this (maybe maximise the unweighted sum $\sum_i u'_i$ rather than $\sum_i p_i u'_i$), but the fixes all have problems with continuity, either when one of the $p_i$ tends to $0$, or when it tends to $1$.

I’m not sure how much of a problem this is.